Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,273 | Comments: 50,512

Privacy Policy Terms
filter by tags archive
time to read 8 min | 1507 words

The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported that they got an error from RavenDB that they are unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document had this behavior.

As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

 Dictionary mismatch on Dic #375
   at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary it will (by default) generate an id from that document that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect the wrong dictionary.

So now we know what is going on, but we don’t understand why.

When we wrote this feature, we were quite aware that we’ll not be able to make any sort of sense from the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times. Once inside of RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into that in production, so we have to dig deeper still.

The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents, if we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that and realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky, we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations, otherwise, we need to discard it.

I suspected that the issue was somewhere around that area, we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this isn’t something that obvious.

After spending quite a bit of time on the issue, I decided that the code is perfect, it handled everything properly and taken into account all the right behaviors.

Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea, what if the problem wasn’t during the training process, but afterward?

Well, that doesn’t really matter, does it? RavenDB is a transactional database, if we had a failure after the training process, we’ll have to discard some of the data, for sure, but that would be about it. Unless, what if we have some state that wasn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll do that for only about 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented.  The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.

I mentioned the dictionary id before, but I should clarify that this is the zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary, RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we rollback a transaction, we’ll use the same dictionary id. But that doesn’t matter, of course.

When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work that is involved in that, there is memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).

Given all of this information, here is the sequence of events that we need to get the error in question:

  1. The user enabled documents compression.
  2. The user runs a transaction with at least four commands, which needs to satisfy the following conditions.
  3. A document write as the first action.
  4. Then a write to document whose compression ratio exceeded the expected ratio by over 10%, as a result, RavenDB tried to train a new compression dictionary.
  5. That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
  6. RavenDB persisted the new dictionary and used that to compress the new document.
  7. Another command (in the same transaction) had stored a document in the same collection, now RavenDB will read the new dictionary and store that in a cache.
  8. A third command runs, but this one throws an error (such as optimistic concurrency violation).

At this point, RavenDB will rollback the entire transaction and return the error to the user. Let’s say the user has chosen to submit the same two documents again, shall we?

For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. We will not generate a new compression dictionary, why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.

At this stage, the stage is set, but everything is still okay. The problem will happen on the next time that we’ll trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy, the new document that we wrote will use the new compression dictionary and we are perfectly fine.

The next write for this collection, however, will run into a problem. It will need to use the current (the new one) dictionary when we want to make a write. In order to do that, it will load the value using the cache, but there is already a value for that dictionary in the cache, the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in memory dictionary while the on disk dictionary is different.

If you’ll try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.

On restart, we’ll read the dictionary from disk, where we wrote the new dictionary, at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact, you need to have a dictionary training as part of a multi act transaction which failed after the dictionary training has been successful and wrote additional documents. In a year and a half of production usage and very heavy load, that happened only a couple of times, it seems.

The issue has been fixed, of course and we’ll be rolling it out to both users and cloud customers. We’ll now rollback such in memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused a problem, but a sequence of independent operations and failure modes that we never even considered about this.

time to read 1 min | 199 words

The end of the year is closing fast, and I run into the following metric (below). What you can see here is one of our RavenDB production instances over the past year. We are continuously dogfooding our own software, and there is a clear indication of the results.

What you can see here is the total memory used by RavenDB (production load, fairly constant over time)  for the past year. As we update RavenDB, we benefit from various optimizations, and the trend line is very encouraging.

image

Around August, we had a change that saved us a single allocation in some cases, here is the chance, you can see the impact it had:

image

We also started using a new feature in production around December, and that seems to have an additional memory cost, so we optimized that as well:

image

You can see the new build deployed around the 17th of the month.

time to read 5 min | 912 words

I got into a good discussion about how RavenDB implements some optimizations with transaction handling. The details got big enough (and hopefully interesting enough) that they warrant their own post.

When we are talking about transactions, one of the key factors in the cost of a transaction is the amount of time that it takes to persist that. This is different for local and distributed transactions.

For a local transaction, we can consider the transaction committed if it is durably stored on the disk.

For a distributed transaction, we can consider the transaction committed if it is durably stored on a majority of the disks in the cluster.

That factor right there is the primary reason why a distributed transaction is more expensive. For a local transaction, you are waiting for the disk. For a distributed transaction, you are waiting for multiple disks and the network.

One of the core optimizations for speeding up transactions is the ability to batch things. The cost of writing to the disk is more or less the same, regardless of how much you write (within an order of magnitude or so). In other words, writing 8 KB and writing 32 KB has pretty much the same cost. Writing 1 MB and writing 100 MB does not, but writing 1 MB vs 4 MB isn’t meaningfully different (sequential durable write, which is what we care for in the case of transactions).

The point of this post is how this is actually handled. RavenDB utilizes a process called transaction merging to reduce the number of times that we have to go to the disk. Concurrent transactions will be bundled into the same write call, massively increasing our throughput. To give you some context, without transaction merging, you can peak at a few hundreds transactions per second. With transaction merging, you can jump to high thousands of transactions per second. Here is how this works:

image

RavenDB actually takes this further, in addition to transaction merging, we also apply something we call async commit. Take a look at the following timeline:

image

A transaction is actually composed of two separate steps. First we need to execute whatever commands we have in the transaction, then we have to write the transaction changes to disk.

RavenDB is able to start processing the next transaction as soon as the previous one started the write to the disk. The idea is to parallelize compute and I/O, and we are able to benefit greatly as a result. Note that this is safe to do, since the next transaction won’t be committed until the prior transaction has been durably stored.

How does this work in practice? Whenever we have a new transaction, we add it to a queue. A dedicated thread will merge those transactions and pull them from the queue, running the separate transactions as one big operation. When we run out of pending transactions or hit certain size / timing limits, we’ll commit the merged transaction and start working on the next one while the commit is completing in the background.

There are certain algorithms that try to maximize throughput, such as Nagle. They do that by waiting for additional transactions to arrive before actually going to the disk. RavenDB doesn’t use that approach. If a system is idle and we get a single transaction, we’ll immediately execute and commit it.

But the fact that we don’t explicitly do Nagle doesn’t mean that it isn’t useful. Because we have to wait for the disk, what ends up happening is that under load, we start getting more pending transactions in the queue. Which will then be executed as a merged unit. In other words, RavenDB implements a dynamic batching approach, affected by the actual I/O constraints and the load on the system. If we have independent transactions, we’ll execute them immediately. As the load increases, we’ll start making more merged transactions. This way we can keep a fairly consistent response time even when the load of the system grows by leaps and bounds.

The situation is similar when we are talking about distributed transactions. RavenDB uses the Raft protocol for managing its distributed behavior. I’m going to focus just on the batching aspect of the behavior. RavenDB will send an AppendEntries message to the other members in the cluster every 200 ms or so. However, if we have a new command to send to the cluster, it will go out immediately over the network. An  important factor here is that we are using TCP, and we require acknowledgment from the other side before we send the next message. As a result of those behaviors, we have pretty much the same situation. Depending on the network latency and the processing time, we’ll send more entries in a single roundtrip.

In short, the overall behavior for RavenDB is that we’ll start the operation immediately on the first action (both for disk and network), and then we’ll batch anything that happens while the first operation is in flight and send that as a result.

After over a decade of working in this manner, I can tell that this has proven to be a highly adaptable system that results in the minimum number of knobs to mess with. It favors latency over throughput when there isn’t a lot of activity and shifts toward favoring throughput over latency as the load grows.

time to read 3 min | 420 words

imageA RavenDB user has accidentally deleted a collection. They intended to do something else, probably, but…  They have a backup, but as you can imagine, this is a bad place to be in.

They talked to us and mentioned that they want a feature where deletion in the studio can be locked by a password or even two factor authentication, to prevent such a scenario.

We are not going to implement such a feature. From a technical perspective, this is a pretty easy thing to do, of course. My issue is that it doesn’t make sense for such a feature to exist. Let me dig into the details and explain what the problem is.

Locking deletes behind a password or two factor authentication is a security feature. That has a major impact on all aspects of the design. However, this is about preventing mistakes on the part of the user, not another security capability (this user can do deletes, this one cannot).

As such, this isn’t a security feature, but a UX one. The delete is already asking for confirmation, but it is the sort of thing that you rarely read,  as we all know.

The distinction between a security feature and a UX feature is important. If this is a security feature, that means that I need to prevent doing mass deletes everywhere. As the result of queries, iterating over ids, in patch operations, etc. If this is a UX issue, that is a whole different level.

Looking at other such destructive operations, where the user is allowed to do the operation, but we want to prevent accidents leads me to consider something like this:

image

Where we require the user to perform some action if there is a major risk. That shifts the burden to the user, but it means that we now need to consider how to apply this.

Are we dealing with just mass deletes? What about update queries?

The purpose here isn’t to prevent the user from making the operation, but to have them stop and consider for a moment. The problem is that for common operations, that is something that you would add a significant amount of friction to your daily work.

When working on importing data, for example, it is common to delete the previous run each time that you run (think, development time, every 3 minutes). Adding hurdles along the way is a PITA.

time to read 4 min | 626 words

Our monitoring system pinged us about a problem with a RavenDB cluster running in production. The problem was simple, we saw quite a bit of server restarts for that particular cluster. Looking deeper, it was obvious that the RavenDB instances for the cluster would occasionally run out of memory and crash. The customer, by the way, was unaware of this issue. From their perspective, the RavenDB cluster would switch the primary node for the database in question on a regular basis. On our end, we could see that each node would start using higher and higher memory and end up dying because of that. They would be restarted, of course, and the primacy of the cluster would switch automatically, but that is not a proper way to run.

The problem was figuring out what was going on. It took some time to figure out what exactly was going on. We didn’t see any such behavior on any other customer, but this customer had two factors that affected the outcome. The first is that the database in question is encrypted, which means that RavenDB will need some place to put the decrypted values. The second is that the user is issuing streaming queries that have a lot of results. We were able to reproduce the high memory usage when issuing the same queries, however, we were utterly unable to reproduce the problem when trying to run it on our own machines.

That was… strange, and it took a while to figure out that we need to run on Linux to get the issue. We subjected the system to a very high load on Windows, with no issue. On Linux, it would be quickly apparent that we are consuming more and more memory. We were able to narrow things down to this call:

posix_memalign(&ptr, 4096, 8192);

What we are asking here is an 8KB buffer aligned on 4KB boundary. And we were leaking those like crazy but we couldn’t figure out how. We are pretty careful with manual memory management and we have the tools around to detect leaks. Each and every call to allocate was also freed. The problem is that we aren’t the only ones using the system. Basically, posix_memalign will use the same memory pool as malloc(). The problem is memory fragmentation, basically. The way posix_memalign() works is to issue:

image

Where nb is 8192 bytes, alignment is 4096 bytes and MINSIZE is 32 bytes. We then release the end of the buffer, which ends up being ~4KB or so in most cases. Along with other allocations, that created severe fragmentation in our memory.

We need the memory to be page aligned, because we use that for direct memory access. The memory is typically pooled, so we won’t be allocating and freeing it all the time, but when you use streaming queries, you may be running through a lot of data, so we exceeded the size of the pool. At that point, we would allocate (and free), but we’ll also fragment the memory.

We fixed the issue by using mmap() directly, which will give us page aligned memory and won’t cause us to use more memory than needed. Given that we get page aligned memory with is a multiple of page size, we can be sure that we’ll get reuse of the memory, instead of having to deal with internal fragmentation inside the malloc implementation. With this change, there are no issues, and we are actually slightly faster than before.

The reason we didn’t run into the same problem on Windows, by the way? There we called VirtualAlloc() from the get-go, which will ensure that we have page aligned memory, so no need to deal with fragmentation.

time to read 4 min | 667 words

RavenDB is rarely deployed in isolation, it is typically used in existing systems and is integrated into the overall system. One of the key ways by which this is promoted is the built-in ETL support that we have. RavenDB currently has ETL for Postgres, SQL Server, Oracle, MySQL, Elastic,  OLAP / Date Lake, and other RavenDB instances.

We are looking into adding RavenDB ETL support to queues (RabbitMQ, Kafka, SQS, AQS, etc). That support is the topic of this blog post. I wanted to summarize my thinking about the topic and along the way gather some insight from you about what kind of shape this feature should have.

imageWhen talking about ETL to Queues, we have to deal with two distinct scenarios: receiving and sending. For the other ETL targets in RavenDB, we just send data, but for queues, given that there is a well defined interface for pulling the results, it makes sense to support receiving as well. Let’s consider what it means to be able to receive messages from a queue into RavenDB…

It means that RavenDB will listen to a queue and apply a script to it. That script will be able to insert or modify documents as a result of the message contents. For example, let’s assume that we have the queue defined as in the image on the right. We can write the following script to process messages from the queue.

The script above handles two message types. A recording of a new order or adding a line item to an existing order. It will be invoked by RavenDB whenever it receives a message from the queue. In this way, you can have RavenDB build your domain model directly from the message traffic. Of course, this is a pretty simplistic scenario, there are a lot of more interesting scenarios to explore here.

The second part is when RavenDB will be the one sending messages to the queues. Those messages, naturally, would be generated from the documents in the database. How would that work? We can write a script that would be applied to documents as they change which will output the messages to write to the queue. That is how ETL in general works in RavenDB. For queues, however, the situation is a bit more complex.

When we use ETL to sync data from RavenDB to a relational database, any update of the document will also update the data in the relational database. When we send the data to a queue, what would happen then? Well, we can’t update a message in the queue, that doesn’t make any sort of sense. So we need to consider what is the scenario we have here. One option would be to just send the message each time, every update of a document will generate a new message. Or the author of the ETL script may decide to only send it once, of course.

The scenario that I think is far more likely is to use RavenDB and ETL to Queue as part of a larger scheme. Consider the scenario where you want to use the outbox pattern. In other words, you have a transaction that needs to do a bunch of things, including sending messages on a queue. Instead of trying to create a distributed transaction or carefully coordinate things, you will use this feature. Your transaction will save a Message document alongside any other changes. That relies on RavenDB’s ACID nature to ensure that this happens in an atomic manner.

Then you will be able to utilize the ETL to Queues option to actually send that over to the actual queue, in a reliable manner.

Those two scenarios (send & receive) are the two most likely scenarios for this feature, but the point of this post is to get more feedback from you. What kind of use cases do you think that this will enable? What would you like to be able to do?

time to read 1 min | 195 words

imageConsider the image on the right, where we have three charges on separate months. This is a time series, showing charges over time. We can very easily issue queries that will give us the results of how much we paid in a time period, but what if we wanted to get the cumulative value. How much have I paid so far? Here is how this should look like:

image

However, that is not something that we provide in RavenDB. Luckily, we do provide a very flexible query engine, so we can make it happen anyway. Here is what the query will look like:

Note that we are using a JavaScript function to process the time series and run the computation that we want, and then we return an array, which is translated to multiple results set per document. Here is the result of this query:

image

time to read 2 min | 300 words

I give a lot of talks about performance and in those talks, I tend to emphasize the architectural impact of your choices. There is a huge tendency to focus on micro optimizations to get the performance you need, even though you can usually get orders of magnitude higher performance by making architectural changes.

Good architecture can still benefit from micro optimizations, however, and it is sometimes really surprising to see by how much. During a routine performance review, we identified a particular scenario as a performance issue. Here is the code in question:

This is being triggered when you are using a parameterized query, like this one:

 

And here is the profiler trace for that:

image

That is kind of ridiculous, to be honest. About 18% of the client side query process went into generating the name of the query. Opps, that is not really something that I expected.

And here is the optimized version:

Basically, we prepare, in advance, the most likely names, so we can find them as cheaply as possible. The result for that particular operation is impressive:

So we almost halved the costs that we have here, but what is more interesting is what happens at higher level of the stack…

image

This is the query processing portion, and you can see that the pretty minimal saving of 187 ms in the AddQueryParameter method is translated to a far greater saving down the line. The overall cost went down by almost 30%.

The probable reason is that we are now allocating slightly less, we saved a few allocations for each query parameter, and that in turn translated to a far better overall performance.

time to read 2 min | 260 words

For many business domains, it is common to need to deal with hierarchies or graphs. The organization chart is one such common scenario, as is the family tree. It is common to want to use graph queries to deal with such scenarios, but I find that it is usually much easier to explicitly build your own queries..

Consider the following query, which will give me the entire hierarchy for a particular employee:

I can define my own logic for traversing from the document to the related documents, and I can do whatever I want there. You can also see that I’m including the related documents, here is how this looks like when I execute the query:

image

A single query gives me all the details I need to show the user with one roundtrip to the server.

Let’s go with a more complex example. The above scenario had a single path to follow, which is trivial. What happens if I have a more complex system, such as a family tree? I took the Games of Thrones data (easiest to work with for this demo) and threw that into RavenDB, and then executed the following query:

And that gives me the following output:

image

This is a pretty fun technique to explore, because you can run any arbitrary logic you need, and expressing things in an imperative manner is typically much more straightforward.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Implementing a file pager in Zig (13):
    21 Jan 2022 - Write behind implementation
  2. re (30):
    14 Jan 2022 - Are You Sure You Want to Use MMAP in Your Database Management System?
  3. Production postmortem (33):
    03 Jan 2022 - An error on the first act will lead to data corruption on the second act…
  4. Negative feature response (2):
    20 Dec 2021 - Protect the user from accidental collection deletion
  5. Challenge (63):
    16 Dec 2021 - Find the slow down–answer
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats