Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 8 min | 1557 words

Reference counting is probably the oldest memory management technique in existence. It is widely used, easy to understand and explain, and in most cases it does the Right Thing.

There are edge cases and nasty scenarios, but for the most part, it works. At least, as long as you are running a single threaded program. Here is what a reference counting scheme looks like:
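
Something along these lines, a minimal single-threaded C sketch (the names here are just for illustration):

```c
#include <stdlib.h>

typedef struct object {
    int   ref_count;   /* starts at 1, owned by the creator */
    void *payload;
} object;

void add_ref(object *o) {
    o->ref_count++;
}

void release(object *o) {
    if (--o->ref_count == 0) {
        free(o->payload);
        free(o);
    }
}
```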

However, the moment that we have any form of concurrency, the simplicity goes right out the window. Consider the call to add_ref vs. release. Both of them need a reference to a valid object. However, what happens if we have the following sequence of events?

  • We have an object whose reference count is set to 1.
  • Thread #1 gets the address of the object.
  • Thread #2 now calls release();
    • There are no more references to the object and it is freed.
  • Thread #1 now calls add_ref();
    • It is passing the address of the recently released object, meaning that we have undefined behavior.

This is a hard problem, because we need to keep the reference count somewhere, and we also need to release the resource as soon as there are no more references to it.

The more generic term for this issue is the ABA problem.

In literature, there are a lot of attempts to solve this issue:

  • Hazard pointers
  • GC
  • Epochs

These are complex solutions to the problem, mostly because we have two distinct steps here. First we get a pointer to the ref counted value, then we need to modify its count, and we need to do that safely in the face of concurrent releases. One way of ensuring that this works is to take a lock around the entire operation (acquire the pointer & add ref) and take the same lock for release. That is a somewhat heavyweight approach.

It is also pretty much completely redundant on any modern system. Any x86/x64 CPU released in the past decade will have support for this assembly instruction:

[instruction listing: cmpxchg16b]

The cmpxchg16b instruction allows us to do an atomic operation on a value that is 128 bits long (16 bytes). That is important, because it means that we can break apart the stages above. We can store both the pointer and its counter in a single location, and operate on them atomically.

This is called the DCAS (double compare and swap), which greatly simplifies the problem, to the point where there is really no reason to want to use anything else.
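
Conceptually, it lets us treat the pointer and its counter as one atomic unit. A C11 sketch of the idea (on x64 this can compile down to cmpxchg16b, typically with -mcx16; on targets without a 16 byte CAS the compiler may fall back to a lock):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void    *ptr;       /* the reference counted object */
    uint64_t counter;   /* number of outstanding references */
} counted_ptr;

/* the pointer and the counter are read and updated as a single unit */
_Alignas(16) _Atomic counted_ptr slot;

bool publish(counted_ptr expected, counted_ptr desired) {
    return atomic_compare_exchange_strong(&slot, &expected, desired);
}
```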

Except… if you care about ARM systems. There is no comparable instruction on ARM, and given how common those machines are, that seems to point us right back to the complexities of hazard pointers and epochs. Of course, ARM has 64 bit atomic instructions, as you can see here, for example:

[instruction listing: ARM 64 bit atomic instructions]

But our pointer is also 64 bits, so that doesn’t really help us that much, does it? Interesting tidbit, however: the pointers we use aren’t actually 64 bits in size. They are really just 48 bits, on both x64 and AArch64. That means that you can address a maximum of 256TB of RAM, but there are no machines that big right now (the biggest that I’m aware of are about 10% of that size, and are expensive). Given that this is a CPU limit, we can probably assume that this isn’t going away soon, and that when it does, ARM will likely have a 128 bit atomic instruction.

That means that a 64 bit instruction gives us 16 bits of free space to work with. But we can do better. Let’s assume that we get our pointers from malloc(), or a similar call. We know that malloc() is required to return data aligned to max_align_t. On 64 bit platforms, that is 16 bytes. In other words, we have a full 20 bits to play with for our own needs, while still preserving the original pointer value. If we are using page aligned pointers, we can use 28(!) bits out of the 64 bits of the pointer value.
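
To make this concrete, here is a small C sketch of that kind of packing, assuming 48 bit virtual addresses and 16 byte aligned allocations (the exact bit split in the real structure is a little different, since it also reserves bits for the empty and error states):

```c
#include <assert.h>
#include <stdint.h>

#define COUNT_BITS 20
#define COUNT_MASK ((1ull << COUNT_BITS) - 1)

/* 48 bit address, low 4 bits always zero -> 44 bits are enough */
static uint64_t pack(void *p, uint32_t count) {
    uint64_t v = (uint64_t)(uintptr_t)p;
    assert((v & 0xF) == 0);        /* 16 byte aligned */
    assert((v >> 48) == 0);        /* fits in 48 bits */
    assert(count <= COUNT_MASK);
    return ((v >> 4) << COUNT_BITS) | count;
}

static void *unpack_ptr(uint64_t packed) {
    return (void *)(uintptr_t)((packed >> COUNT_BITS) << 4);
}

static uint32_t unpack_count(uint64_t packed) {
    return (uint32_t)(packed & COUNT_MASK);
}
```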

Let’s see how we can take that assumption and turn it into something usable. I’m going to use Zig here, because I like the language and it gives us a succinct way to work with native code. The first thing to do is to define the overall structure:

The size of this structure is 12 bytes, and I’m using Zig’s arbitrary bit-width integers to help pack the data into a single u64 value, without having to write all the bit shifts manually. All the other functions that I’m showing here are going to be inside this generic structure. I’m asserting the size and that the T we are working on is a pointer of some kind. You can see that the structure is also able to be marked as an error, so we have three possible states:

  • Empty – no value is stored inside
  • Errored – we’ll just retain the error code
  • Value – we have a value and we keep the reference count in the references field. Note that this is a 19 bit field, so we have a maximum of 512K outstanding references. For pretty much all needs, I believe that this will be sufficient.

In addition to the data itself, we also have the notion of a version field. That one is needed because we want to allow the caller to wait for the value to become available. Let’s see how we can get a value out of this.

What I’m doing here is to get the current value and do some basic checks (do we have a value, was an error registered, etc.). Then we increment the reference count, making sure that we don’t overflow it. Finally, we publish the new value using a cmpxchg call. Note that the whole thing is in a while loop, to ensure that if the cmpxchg fails, we’ll retry. If we were able to update the reference count, we turn the packed value back into its original form and return that to the user. Because we get the pointer value and increment the reference count as a single step, we cannot race with a concurrent release and end up holding a pointer to freed memory.
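
In rough C terms, reusing the packing helpers sketched earlier and leaving out the error and version handling, that flow looks something like this:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* try to acquire a reference; the pointer and its counter move as one
   unit, so a concurrent release cannot slip in between the read and
   the update */
static bool try_acquire(_Atomic uint64_t *slot, void **out) {
    uint64_t current = atomic_load(slot);
    while (true) {
        if (current == 0)                 /* empty, nothing to acquire */
            return false;
        uint32_t count = unpack_count(current);
        if (count == COUNT_MASK)          /* would overflow the counter */
            return false;
        uint64_t updated = pack(unpack_ptr(current), count + 1);
        if (atomic_compare_exchange_weak(slot, &current, updated)) {
            *out = unpack_ptr(current);
            return true;
        }
        /* CAS failed: current now holds the latest value, retry */
    }
}
```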

The tryAcquire() call also has a sibling, which will wait for the value to become available. It looks like this:

If the value does not exist (if there is an error, we’ll return immediately), we’ll wait using the Futex.wait() call. This is why we need the version field; it allows us to wait properly without having to create kernel level objects. Let’s see how we set a new value, potentially concurrently with threads that want to get at it as well.

We have some convenience functions to make it clear what it is that we are actually setting (an error or a value) and then we get to the meat of this structure, the set() call. There isn’t much here, to be honest. We check that the value isn’t already set, and then set it properly as either an error or a value with a single reference (which is for the caller).

We again use a cmpxchg call to ensure that we are safe with regard to multiple threads (and although the usage I have in mind calls for a single writer, not competing ones, it doesn’t cost us anything to make it safer to use). After setting the value, we increment the version field and wake any waiting threads.
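
A minimal C sketch of that idea, ignoring the error state, the version increment and the futex wake:

```c
/* publish a freshly allocated value with a single reference (owned by
   the caller); only succeeds if the slot is currently empty */
static bool set(_Atomic uint64_t *slot, void *value) {
    uint64_t expected = 0;
    return atomic_compare_exchange_strong(slot, &expected, pack(value, 1));
}
```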

You can also see how we validate and pack the pointer value to 44 bits.

All of this works, but we are still missing one part. The whole point of reference counting is to delete the value when it is no longer in use, so where is that code? Let’s take a look:

We take both the value that we are releasing and the destruction function. The value we want to free is there to ensure that the slot wasn’t modified in the meantime, and if there are no more references, we know that we can safely destroy it.
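
In the same simplified C sketch, the release side looks roughly like this:

```c
/* drop a reference; 'value' is what the caller got from try_acquire,
   'destroy' is invoked when the last reference goes away */
static void release_ref(_Atomic uint64_t *slot, void *value,
                        void (*destroy)(void *)) {
    uint64_t current = atomic_load(slot);
    while (true) {
        if (unpack_ptr(current) != value)
            return;                        /* the slot was replaced */
        uint32_t count = unpack_count(current);
        uint64_t updated = (count == 1) ? 0 : pack(value, count - 1);
        if (atomic_compare_exchange_weak(slot, &current, updated)) {
            if (count == 1)
                destroy(value);            /* last reference, free it */
            return;
        }
    }
}
```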

Note that this whole scheme relies on the fact that we are managing the reference count externally to the object itself. In other words, we are assuming that the RefCount value itself is going to be kept alive. In my case, I’m actually intending to use this as a cache, and we’ll keep an array of those values around for a long time. Otherwise, you run the same risk as before: if you have a reference to the RefCount value and someone may release it, you can end up with a reference to released memory.

This technique of pointer packing is only valid if you are using manual memory management. You cannot use it in C#, for example, because objects may move, and even if you pinned the value, the GC will not consider the packed value to be a valid pointer, so it will not work for you. For managed languages, there is actually a much better option: just let go of the value and let the finalizer handle it. In other words, something else (an already existing component) will ensure that there are no live references remaining.

time to read 8 min | 1563 words

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB would go into a high CPU mode, use 100% CPU and stay there for an extended period of time. After a while, it would just return back to normal.

Looking at the details, there was nothing that should cause this scenario. The load on the system didn’t justify it; we are talking about a maximum of under 500 requests / second. RavenDB can handle that much on a Raspberry Pi without straining itself, after all.

The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either our code or the GC is working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:

[graph: CPU utilization spike on the server]

The customer had increased the size of the machine several times, trying to accommodate the load, but the situation was not getting better. In fact, it appeared to be getting worse. This is on a server running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. They did not do so on a synchronized schedule; one node would go into high CPU load, the clients would fail over to the other nodes, and that would usually (but not always) trigger the same situation there. After a short while, things would get back down to normal rates, but that was obviously not a good situation.

After a while, we found something that was absolutely crazy. In the Task Manager, we added all the possible columns and looked at them. Take a look at the following screenshot:

[screenshot: Task Manager with the Page Fault Delta column]

What you see here is the Page Fault Delta, basically, how many page faults happened in the system in the past second for this process. A high number in this column is when you see hundreds of page faults. Thousands of page faults usually mean that you are swapping badly. Hundreds of thousands? I had never seen such a scenario and couldn’t imagine what would cause it.

What is crazier is that this number should have been physically impossible. At a glance, it looks like a lot of reads from the disk, but looking at the disk metrics, we could see that we had very little activity there.

That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?

In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs already in RAM (in the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the memory may be in RAM, it may not be mapped into the process in question.

Consider the following image: we have two processes that mapped the same file to memory. In both processes, the first page is mapped to the same physical page. But the 3rd page in the second process (5678) is not mapped. What do you think will happen if that process accesses this page?

[diagram: two processes mapping the same file, with one page not mapped in the second process]

At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (with a hard page fault requiring us to go to disk).

Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?

The issue with page fault delta was the key to understanding what exactly is going on here. When we looked deeper using ETW, we were able to capture the following trace:

[screenshot: ETW trace, most of the time spent handling page faults in the kernel]

What you can see in the image is that the vast majority of the time is spent handling the page fault on the kernel side, as expected given the information that we have so far. However, the reason for this is that we are contending on an exclusive lock. What lock is that?

We worked with Microsoft to figure out what exactly was going on, and we found that in order to modify the process mapping table, Windows needs to take a lock. That makes sense, since you need to avoid concurrent modification of the page table. However, on Windows 2016, that is a process wide lock. Consider the impact of that. If you have a scenario where a lot of threads want to access pages that aren’t mapped to the process, what will happen?

On each thread, we’ll have a page fault and handle it. If the page fault is a hard page fault, we will issue a read and put the thread to sleep until it completes. But what if this is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if I have a high degree of concurrency? Like 64 concurrent cores that all contend on the same exact lock? You are going to end up with the exact situation above. There is going to be a hotly contended lock and you’ll spend all your time at the kernel level.

The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults?

The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:

[screenshot: ETW stack trace of a FlushFileBuffers call]

When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that is actually going on. However, one of the key aspects that seems to be happening is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior, calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory mapped pages and the page cache, I believe.

To reproduce this scenario, you’ll need to do something similar to:

  • Map a large number of pages (in this case, hundreds of GB)
  • Modify the data in those pages (in our case, write documents, indexes, etc)
  • Call FlushFileBuffers on the data
  • From many threads, access the recently flushed data (each thread ideally accessing a different page)
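
In Win32 terms, a rough sketch of such a repro might look like the following (the file name, size and thread count are made up for illustration, error handling is omitted, and the real scenario involved hundreds of GB of mapped data on a 64 bit build):

```c
#include <windows.h>

#define MAP_SIZE  (4ULL * 1024 * 1024 * 1024)   /* illustrative only */
#define PAGE_SIZE 4096
#define THREADS   64

static volatile char *g_view;

static DWORD WINAPI touch_pages(LPVOID arg) {
    SIZE_T start = (SIZE_T)arg * PAGE_SIZE;
    char sum = 0;
    /* each thread touches different recently flushed pages, generating
       soft page faults that all contend on the same process wide lock */
    for (SIZE_T i = start; i < MAP_SIZE; i += THREADS * PAGE_SIZE)
        sum += g_view[i];
    return sum;
}

int main(void) {
    HANDLE file = CreateFileA("data.bin", GENERIC_READ | GENERIC_WRITE, 0,
                              NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE,
                                        (DWORD)(MAP_SIZE >> 32),
                                        (DWORD)(MAP_SIZE & 0xFFFFFFFF), NULL);
    g_view = (volatile char *)MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);

    /* modify the mapped data, then flush it */
    for (SIZE_T i = 0; i < MAP_SIZE; i += PAGE_SIZE)
        g_view[i] = 1;
    FlushViewOfFile((const void *)g_view, 0);
    FlushFileBuffers(file);   /* trims the modified pages from the working set */

    /* now hammer the flushed pages from many threads at once */
    HANDLE threads[THREADS];
    for (int t = 0; t < THREADS; t++)
        threads[t] = CreateThread(NULL, 0, touch_pages, (LPVOID)(SIZE_T)t, 0, NULL);
    WaitForMultipleObjects(THREADS, threads, TRUE, INFINITE);
    return 0;
}
```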

On Windows 2016, you’ll hit a spin lock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory lock granularity has been increased, so they won’t all contend on the same lock. Indeed, testing on Windows 2019 we weren’t able to reproduce the problem.

The really strange thing here is that we have been using the exact same code and approach in RavenDB for many years; only recently did we see a shift, with most of our customers now running on Linux. That particular access pattern is how we have always been running, and I would expect it to be triggered often.

The annoying thing about this is that this is actually the case of too much of a good thing. Usually RavenDB will scale linearly with the number of cores for reads, the customer in question moved from RavenDB 3.5 to RavenDB 5.2, and they used the same size machine in both cases. RavenDB 5.2 is far more efficient, however. It was able to utilize the cores a lot better and trigger this behavior on a consistent basis. Using RavenDB 3.5, on the other hand, a lot of the CPU time was spent on doing other things, so we didn’t trigger this issue. Indeed, a workaround to improve performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.

The actual solution, however, was to run on Windows 2019, but getting to that answer was a hard problem. We tested pretty much every scenario that we could think of to see what could help us here. And yes, we tested this on Linux, and didn’t see any indication of a similar problem.

time to read 4 min | 606 words

I recently had to discuss the impact of latency a few times, and I found the coffee cup analogy to be an excellent tool to explain exactly what is going on. Consider the humble coffee cup, without which there would be no code.

It is a pretty simple drink, composed of coffee, water and milk. I’ll ignore coffee snobs and the like for now and focus strictly on the process of making a cup of coffee. I found this recipe:

  • 1 cup milk
  • ½ cup cold brewed coffee
  • 2 sweetener

Mix milk, coffee, and sweetener together in a glass until sweetener is dissolved.

If I was writing this in code, I would probably write something like this:
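
Something like the following, as a rough C sketch (the types and helper bodies are placeholders, just to show the shape):

```c
typedef struct { double milk, coffee; int sweetener; } cup;

/* each helper hands back the requested amount; where the ingredient
   actually comes from is hidden behind the abstraction */
static double milk(double cups)    { return cups; }
static double coffee(double cups)  { return cups; }
static int    sweetener(int count) { return count; }

cup make_coffee(void) {
    cup c;
    c.milk      = milk(1.0);      /* 1 cup milk */
    c.coffee    = coffee(0.5);    /* 1/2 cup cold brewed coffee */
    c.sweetener = sweetener(2);   /* 2 sweeteners, mixed until dissolved */
    return c;
}
```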

Simple enough, right? There are just a few details to fill in. How are the coffee() or sweetener() methods implemented?

The nice thing about this code is that it is nicely abstracted; the coffee recipe and the code read almost the same way. However, there is an issue with the actual implementation. We have the go_to_store() method, but we know that this is an expensive operation. To avoid calling it too often, we calculate the amounts that we need to make 20 cups of coffee and make sure that we set the relevant XYZ_AMOUNT_TO_BUY values appropriately.

What do you think will happen on the 21st cup of coffee, however? We run out of coffee, so we’ll go to the store to get some. Once we have it, we can pour the coffee into the cup, but then we need to put the milk in, at which point we’ll discover that we ran out of that too. Off to the store we go, and all the way back. And then there is the sweetener that ran out, so that is the third trip to the store.

Abstraction, in this case, is actively hurting us. We ignore the fact that ingredients may be missing, and that isn’t something that we can afford to do. The cost of going to the store outweighs anything else in the process of making a cup of coffee, and we just did that three times.

In the context of software, of course, we are talking about the issue of making a remote call. For example, sending a separate query to the database for each datum that you need. The cost of the remote call far exceeds any other costs you have in the system.

To solve the coffee cup problem, you’ll need to do something like:
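
Roughly, and again as a C sketch with invented names:

```c
#include <stdbool.h>

static double pantry_milk, pantry_coffee;
static int    pantry_sweetener;

/* one trip, buying everything that is missing at once */
static void go_to_store(bool milk, bool coffee, bool sweetener) {
    if (milk)      pantry_milk      = 20.0;  /* MILK_AMOUNT_TO_BUY */
    if (coffee)    pantry_coffee    = 10.0;
    if (sweetener) pantry_sweetener = 40;
}

void make_coffee(void) {
    /* check everything we need before we start, so we pay the cost of
       the store at most once */
    bool need_milk      = pantry_milk      < 1.0;
    bool need_coffee    = pantry_coffee    < 0.5;
    bool need_sweetener = pantry_sweetener < 2;
    if (need_milk || need_coffee || need_sweetener)
        go_to_store(need_milk, need_coffee, need_sweetener);

    pantry_milk      -= 1.0;
    pantry_coffee    -= 0.5;
    pantry_sweetener -= 2;
    /* mix until the sweetener is dissolved */
}
```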

Abstraction? What abstraction? There are no abstractions here. We are very clearly focused on the things that need to happen to get it working properly. In fact, a better alternative would be to not check that we have enough for the current cup but to schedule a purchase when we notice that we are low.

That, again, intermixes the responsibilities of making the coffee and making sure that we have the ingredients at hand. That is not an actual problem, however. That is something that we are fine with, given the difference in performance that this entails.

In the same manner, when I see people trying to hide remote calls (RPC, database calls, etc.) behind an abstraction layer, I know that it will almost always end in tears. Because if you have what looks like a cheap function call go to the store for you, the end result is that you have to wait a long time for your coffee. Maybe long enough to (gasp) not even have coffee.

On that note, I have a cup of coffee to finish…

time to read 2 min | 251 words

I got an interesting question by email and I thought that it is worth a post. The question was whether RavenDB can handle pivot tasks. Consider the case where I have orders data, and I want to see a summary of product sales on a monthly basis, like so:

[table: product sales summarized per month]

This data was produced using the sample data in RavenDB and the following map/reduce index:

That works, but it gives each individual month on its own row. When using Excel, we can Pivot the whole thing so instead of rows, we’ll get columns. For certain types of data, that makes it much easier to work with. For example, let’s say that I want to compare monthly sales data across different products.

The data we see is the same, it is just the way we process and show it that is different. Let’s see how we can do that in RavenDB. We can do that with a secondary aggregation step in the reduce, like so:

The idea is that the reduce step in RavenDB can have its own complex processing, and the result of this process gives us the following output:

If we use JavaScript indexes, we can even manipulate the data to skip the nested values, the code is nastier (likely a product of my skill in JavaScript, I’ll freely admit), but the results are nice.

[table: the pivoted results]

time to read 3 min | 585 words

I recently ran into a bit of code that made me go: Stop! Don’t you dare go down this path!

The reason I had such a reaction to the code in question is that I have seen where such code leads you, and it is nowhere good. The code in question?
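
The shape of it was roughly this (a C flavored sketch, not the original code):

```c
#include <stddef.h>
#include <unistd.h>

/* swallow every error, log it, sleep and try again - and then hand the
   caller a NULL anyway when the retries run out */
void *query_with_retries(void *(*run_query)(const char *), const char *query,
                         void (*log_error)(const char *)) {
    for (int attempt = 0; attempt < 5; attempt++) {
        void *result = run_query(query);
        if (result != NULL)
            return result;
        log_error(query);   /* written to a log that no one reads */
        sleep(5);           /* block the calling thread while we are at it */
    }
    return NULL;            /* the caller will happily dereference this */
}
```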

This is a pretty horrible thing to do to your system. Let’s count the ways:

  • Queries are happening fairly deep in your system, which means that you’re now putting this sort of behavior in a place where it is generally invisible to the rest of the code.
  • What happens if the calling code also has something similar? Now we have retries on retries.
  • What happens if the code that you are calling has something similar? Now we have retries on retries on retries.
    • You can absolutely rely on the code you are calling to do retries, if only because that is how TCP behaves, but also because there are usually resiliency measures already implemented.
  • What happens if the error actually matters? No exception is thrown in any case, which means that the important information is only written to the log, which no one ever reads.
  • There is no distinction between the types of errors where a retry may help and those where it won’t.
  • What if the query has side effects? For example, you may end up calling a stored procedure multiple times.
  • What happens when you run out of retries? The code will return null, which means that the calling code will likely fail with an NRE.

What is worse, by the way, is that this piece of code is attempting to fix a very specific issue: being unable to reach the relevant database. For example, if you are writing a service, you may run into that on reboot; your service may have started before the database, so you need to retry a few times to let the database load. A better option would be to specify the load order of the services.

Or maybe there was some network hiccup that you had to deal with? That is probably the one case where this will sort of work. But TCP already handles that by resending packets; you are layering retries on top of that, and it is building up to a nasty case.

When there is an error, your application is going to sulk, throw strange errors and refuse to tell you what is going on. There are going to be a lot of symptoms that are hard to diagnose and debug.

To quote Release It!:

Connection timeouts vary from one operating system to another, but they’re usually measured in minutes! The calling application’s thread could be blocked waiting for the remote server to respond for ten minutes!

You added a retry on top of that, and then the system just… stops.

Let’s take a look at the usage pattern, shall we?

That will fail pretty badly (and then cause a null reference exception). Let’s say that this is a service code, which is called from a client that uses a similar pattern for “resiliency”.

Question – what do you think will happen the first time that there is an error?  Cascading failures galore.

In general, unknown errors shouldn’t be handled locally; you don’t have a way to meaningfully handle them here. You should raise them up as far as possible. And yes, showing the error to the user is generally better than just spinning in place without giving the user any feedback whatsoever.

time to read 2 min | 234 words

I ran into a task that I needed to do in Go: given a PFX file, I needed to get a tls.X509KeyPair from it. However, Go doesn’t have support for PFX. RavenDB makes extensive use of PFX in general, so that made things hard for us. I looked into all sorts of options, but I couldn’t find any way to manage that properly. The nearest find was the pkcs12 package, but that only supports some DER formats and cannot handle common PFX files. That was a problem.

Luckily, I know how to use OpenSSL, but while there are countless examples of how to use OpenSSL to convert PFX to PEM and the other way around, all of them assume that you are using it from the command line, which isn’t what we want. It took me a bit of time, but I cobbled together one-off code that does the work. The code has a strange shape, I’m aware, because I wrote it to interface with Go, but it does the job.
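
The gist of the conversion is along these lines, a simplified C sketch using OpenSSL's PKCS#12 API (error handling, the CA chain and the plumbing back to Go are omitted):

```c
#include <openssl/pem.h>
#include <openssl/pkcs12.h>

/* convert an in-memory PFX blob into PEM encoded key + certificate */
int pfx_to_pem(const unsigned char *pfx, int pfx_len, const char *password,
               BIO *key_out, BIO *cert_out) {
    BIO *in = BIO_new_mem_buf(pfx, pfx_len);
    PKCS12 *p12 = d2i_PKCS12_bio(in, NULL);
    BIO_free(in);
    if (p12 == NULL)
        return 0;

    EVP_PKEY *key = NULL;
    X509 *cert = NULL;
    int ok = PKCS12_parse(p12, password, &key, &cert, NULL);
    PKCS12_free(p12);
    if (!ok)
        return 0;

    PEM_write_bio_PrivateKey(key_out, key, NULL, NULL, 0, NULL, NULL);
    PEM_write_bio_X509(cert_out, cert);

    EVP_PKEY_free(key);
    X509_free(cert);
    return 1;
}
```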

Now, from Go, I can run the following:

As you can see, most of the code is there to manage error handling. But you can now convert a PFX to PEM and then pass that to X509KeyPair easily.

That said, this seems just utterly ridiculous to me. There has got to be a better way to do that, surely.

time to read 5 min | 854 words

I needed to pay one of our suppliers. That supplier happens to be living in Europe, while Hibernating Rhinos is headquartered in Israel. That means that I have to send an international money transfer to get them paid.

So far, that isn’t an issue, this is literally something that we have to do multiple times a week and have been doing for the past decade and a half. This time, however…

We ran into a problem. We initiated the payment round as usual and let the suppliers know that the money was transferred. And then I forgot about it; anything from this point on is on the accounting department.

A few days later, we started getting calls telling us that the money we sent didn’t arrive. I called the bank and they checked; it appears that some of the transfers we made hit an internal bank limit. Basically, there are rules in place for how much money you can send out of the country before you have to get the IRS involved. I believe that the issue is about moving money to offshore accounts in order to avoid taxes on it, but that isn’t relevant for the story at hand. What is relevant here is that the bank didn’t process some of the payments in the run (those that hit the limit for the tax block). Some payments went through, but some didn’t.

The issue with such things isn’t so much the block (I can understand why it is there, although I wish there was some earlier notice). The major issue was that we try to have a good payment schedule for our suppliers, meaning that we’ll pay most invoices within a short amount of time. When something like that happens, it means that we have to wade into the bureaucracy of the (international) tax system. That takes time, and in that time, the suppliers aren’t getting paid. Technically, we are okay, we are usually paying far ahead of the invoice due date, but I strongly dislike this.

We used a different source for the funds and paid all the suppliers immediately, then set out to clear the tax hurdles so we could go back to paying in the usual manner. I also paid to expedite the transfers, and the money arrived faster than normal. In all, I would estimate that this meant a delay of just a few days over when the money would normally have arrived.

But that isn’t where the story ends. For most of the suppliers, the original transfers never happened, because of the tax issue. For two of them, however, the money was gone from our account. One of them also confirmed that they received the money, so I expected the second one to get it as well.

They didn’t. But the money was gone from our account. Talking to the bank, the money was sitting in the bank’s own account, waiting for the tax permit before going ahead. I asked the bank to cancel the order, then transferred the money using the alternative route.

Except… that supplier called me, confused. The money appeared in their account twice. Checking with the bank, it was indeed gone from both the original and the alternative accounts. Well, that wasn’t expected. Luckily, this is a supplier that I do regular business with, so we decided that the simplest option was to consider the extra payment credit for future charges. The supplier sent me a(n already marked as) paid invoice, and aside from shaking my head in disbelief, the issue was done.

Except… that supplier called me again, even more confused. Their bank had called them, saying that the originating bank had cancelled the transfer and they needed to send the money back.

Cue: utter confusion.

The key here was that their bank wanted them to transfer the money back to us. I had a very negative reaction to that, because this hit all the hallmarks of a common scam: the overpayment scam. I asked the supplier to do nothing with it, since if the bank needs to move the money back, they can do it directly.

The fear was that the supplier would send the money back, then the bank would also refund the money, resulting in the supplier having no money at all. Did I mention the confusion already?

I talked to my bank and cancelled the cancellation, so hopefully things will stabilize now. This is an ongoing event and I don’t know if we have hit peak kafkianess yet.

As for what happened, I suspect that when I asked to cancel the original transfer, another logic branch was followed. Since the money had already left my account, they had to record that as a cancellation, but that cancellation apparently was sent to the destination bank, along with the money, I guess?

At this point, I don’t even wanna know.

time to read 5 min | 834 words

I wrote a post a couple of weeks ago called: Architecture foresight: Put a queue on that. I got an interesting comment from Mike Tomaras on the post that deserves its own post in reply.

Even though the benefits of an async queue are indisputable, I will respectfully point out that you brush over or ignore the drawbacks.

… redacted, see the real comment for details …

I think we agree that your sync code example is much easier to reason about than your async one. "Well, it is a bit more complex to manage in the user interface", "And you can play games on the front end" hides a lot of complexity in the FE to accommodate async patterns.

Your "At more advanced levels" section presents no benefits really, doing these things in a sync pattern is exactly the same as in async, the complexity is moved to the infrastructure instead of the code.

This is a great discussion, and I agree with Mike that there are additional costs to using the async option compared to the synchronous one. There is a really good reason why pretty much all modern languages have something similar to async/await, after all. And anyone who did any work with Node.js and promises without it knows exactly what the cost is of trying to keep the state of the system across multiple levels of callbacks.

It is important, however, that my recommendation had nothing to do with async directly, although that is the end result. My recommendation had a lot more to do with breaking apart the behavior of the system, so you aren’t expected to give immediate replies to the user.

Consider this: ⏱. When you are processing a user’s request, you have a timer inherent to the operation. That timer can be a real one (how long until the request times out) or it can be a mental one (how long until the user gets bored). That means that you have a very short SLA to run the actual request.

What is the impact of that on your system? You have to provision enough capacity in the system to handle the spikes within the small SLA that you have to work with. That is tough. Let’s assume that you are running a website that accepts comments, and you need to run spam detection on a comment before actually posting it. This seems like a pretty standard scenario, right? It doesn’t require anything particularly specialized.

However, the service you use has a rate limit of 10 comments / sec. That is also something that is pretty common and reasonable. How would you handle something like that if you have a post that suddenly gets a lot of comments? Well, you’ll have something that ensures that you don’t pass the limit, but then the user is sitting there, waiting, and thinking that the request timed out. On the other hand, if you accept the request and place it in a queue, you can show it in the UI as accepted immediately and then process it at leisure.

Yes, this is more complex than just making the call inline, but it also ensures that you have proper separation in your system. The front end submits messages to the backend, which will reply when it is done. By having this separation upfront, as part of your overall design, you get options. You can quickly change how you process things in the backend. Your front end feels fast (which is usually much more important than being fast, mind you).

As for the rate limits and the SLA? In the case of a spam API or similar services, sure, this is obvious. But there are usually a lot of implicit SLAs like that. Your database disk is only able to serve so many writes a second, for example. That isn’t usually surfaced to you as an X writes / sec limit, but it is true nevertheless. And a queue will smooth over any such issues easily. When making the request directly, you have to ensure that you have enough capacity to handle spikes, and that is usually far more expensive.

What is more interesting, in my opinion, is that the queue gives you options that you wouldn’t have otherwise. For example, tracing of all operations (great for audits), retries if needed, easy model for scale out, smoothing out of spikes, etc.

You cannot actually put everything into a queue, of course. The typical example is a login page. You cannot really “let the user log in immediately and process it in the background”. Another example where you don’t want to use asynchronous processing is when you are making a query. There are patterns for async query completion, but they are pretty horrible to work with.

In general, the idea is that whenever there is an operation in the system, you throw it onto a queue. Reads and certain key aspects are things that you’ll need to run directly.

time to read 2 min | 395 words

A user called us to ask how they can move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following:

This is an interesting issue when you think about it from the point of view of a database engine. The distinct clause means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB was that we want to make it hard to accidentally create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB. We need a two stage approach to implement this feature.

First, we’ll introduce a Map/Reduce index, which will aggregate the data on Employee, Company and City. Along the way, it will run the distinct operation on the City, because it will group by it. That gives us a model in which we get the distinct count for free, and in a highly efficient manner. Here is the index in question:

The interesting thing about this index is that querying it will not give us the right results. We don’t want to get the details based on Employee, Company and City. We want just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it will look like:

What this does is to aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals. And here are the results:

The end result is that we are able to do most of the work at indexing time, and query time is left working on already aggregated data. That means that the queries should be much faster and that there is a lot less work for the database to do.

It also isn’t RavenDB’s strong suit. Such queries are typically more in line with OLAP systems, to be honest. If you know what your query patterns look like, you can use this technique to easily handle such queries, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push the data to a reporting system.

time to read 1 min | 149 words

Implementing a unit of work in Python can be an interesting challenge. Consider the following code:

This is about as simple a piece of code as possible, associating a tag with an object, right?

However, this code will fail for the following scenario:

You’ll get a lovely: “TypeError: unhashable type: 'Item'” when you try this. This is because data classes in Python have a complicated relationship with __hash__().

An obvious solution to the problem is to use:

However, id() in Python is not guaranteed to be unique over the lifetime of the program. Consider the following code:

On my machine, running this code gives me:

124597181219840
124597181219840

In other words, the id has been reused. This makes sense, since this is just the pointer to the value. We can fix that by holding on to the object reference, like so:

With this approach, we are able to implement proper reference equality and make sure that we aren’t mixing different values.
