Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,270 | Comments: 50,505

Privacy Policy Terms
filter by tags archive
time to read 5 min | 949 words

At long last, we are now at the point where we can write data back to the disk.  Before we can do that, however, we need to figure out what sort of writes we want to allow. The idea that I have in mind for the Pager is to follow the same path as Voron does. In other words, for writes, we can make the following assumptions:

  • There is only a single thread doing writes to the pager.
  • There are no readers for the writes until the write is completed.
  • There is a distinction between writing the data to the pager and writing the data to the disk.

Let’s break those assumptions apart and see what they bring to the table. The fact that we can assume only a single writer thread at any given point is pretty important. It means that the level of complexity that we have to face is greatly reduced. In the same sense, the fact that we don’t need to deal with concurrent readers or any consistency boundary for the data while it is being written will greatly simplify things for us. Finally, we make a distinction between writing to the pager and writing to the disk. Writing to the disk is _slow_, so we want to avoid doing that at any critical areas and push that to the background.

Finally, there is another aspect to consider. Internally, the Pager works with 2MB chunks, but to the outside world, it is using 8KB pages. When we write, we always write at the 8KB pages, not chunks. How would that work for the Pager?

The Pager itself is concurrent, but we only allow a single writer at a time, we can achieve this by centralizing all the write activities in the Writer struct, like so:

For now, I’m going to ignore the fields in the Writer struct, we’ll touch on them in detail later. In order to use the writer, you need to acquire it, write as many pages as you need, then release it. Here is a usage example:

The basic idea is fairly simple. With the writer, we operate at the page boundary to write as many pages as we need, once we are done, the call to flushWrites() persists the data to disk and then we can release the writer. Let’s dig a bit deeper and see how that works, shall we?

The write() call is about as basic as you can get. We use the getPage() function to get the right page, memcpy the data and that is about it, right? There are only two other things here that are important:

  • We record which chunk (the 2MB chunk of memory, out of which we carve the 8KB pages) at the writer’s level, is using the loadedChunksForWrites value.
  • We remember which pages we wrote to using the writtenPages hash table.

This is intentionally bare bones, because that is actually sufficient for our needs. The  fact that we remember which chunks we loaded (and keep a reference to them) will prevent us from reclaiming them, so even though we just wrote to memory, another thread can get the data and start using it without waiting for the disk. Of course, we still need to hit the disk eventually, that is what flushWrites() is about.

There is a lot that is going on here, let’s break it up. We start by allocating a temporary array and copying the keys from the writtenPages hash table to it. We then sort the array. This is done so we’ll be able to process the writes in a sequential manner, which is likely to be faster, even with async I/O. We then scan the list of pages in order, trying to merge writes together. The idea is to issue the minimum number of write calls. Finally, we’ll wait for all the writes to complete. Okay, maybe it isn’t that complex.  There is a bunch of code here, but it is mostly straightforward. Note that we also prepare the writeResults list to accept the results of the write to the disk.

As for writing to the disk, this is done using the PagerRing we previously looked at:

To write a buffer to the disk, we simply get the buffer from the Pager (reusing all the work we did in getPage()), increment the number of outstanding writes and then submit the work for the ring for processing. We setup the completeFlush as the callback function on completion. The PagerRing will call us when it is done writing to the disk. If there is an error, we’ll record it and reduce the number of outstanding writes. If there are no more outstanding writes, we’ll wake any waiters. That part is handled in the waitForAllDiskWritesToComplete().

We start by waiting for the outstanding writes to complete, waiting if needed. Then we can reset the state of the Writer. We start by resetting the written pages and then iterate over all the loaded chunks and release them. After the call, the Pager may decide to remove them from memory. This is fine, since they were already written to disk.

Except… if there was an error. You might have noticed that we are gathering the errors on each individual write operation we send, but we are actually only looking at the first one. For that matter, we clear the state of the Writer regardless if there were errors or not.

In general, an I/O error from the disk is not something that is recoverable. What you can do at this stage is to raise the error higher and run whatever recovery you have on startup.

In the next post, I’m going to be talking about durability and the overall expected performance of the system under this sort of model.

time to read 12 min | 2246 words

I was pointed to this paper on twitter: Are You Sure You Want to Use MMAP in Your Database Management System?

As you can imagine, this is a topic near and dear to my heart. This is especially the case since I am currently writing the Implementing a file pager in Zig posts series. I implemented the same low level mechanics using mmap, using mmap, I have < 100 lines of code and can start building higher level concepts almost immediately. Writing my own pager is currently a 10 posts series and the end doesn’t seem to be in sight.

I’m going to use this post to respond to the article. As a reminder, I’m the founder of RavenDB and I wrote Voron, a mmap based storage engine, and has been running that across hundreds of clients and literally tens of millions of instances in production. I am also writing a book about building a storage engine that uses mmap internally.

The paper itself does a great job of outlining the issue of using mmap as the buffer pool in DBMS. What it doesn’t cover, however, is the alternative. I will touch on specific points from the paper shortly, but I want to point out that the article compares apples to camels in the benchmarks and conclusions. Note that I don’t necessarily disagree with some of the statements, mmap certainly has challenges that you need to deal with, but if you avoid that, you can’t have wave everything that it brings to the table.

When building a database, using mmap has the following advantages, the OS will take care of:

  • Reading the data from disk
  • Concurrency between different threads reading the same data
  • Caching and buffer management
  • Eviction of pages from memory
  • Playing nice with other processes in the machine
  • Tracking dirty pages and writing to disk*

I put an asterisk on the last one because it probably requires your attention as well.

If you aren’t using mmap, on the other hand, you still need to handle all those issues. That is a key point that I believe isn’t addressed in the paper. Solving those issues properly (and efficiently) is a seriously challenging task. Given that you are building a specialized solution, you can probably do better than the generic mmap, but it will absolutely have a cost. That cost is both in terms of runtime overhead as well as increased development time.

The comparison that was made by the paper was done using fio benchmark tool, which is great if you want to test your storage system, but is pretty much irrelevant if you are trying to benchmark a buffer pool. Consider the following:

For the mmap version, we need to compute the address of the page and that is pretty much it. For the manual buffer pool, the list of tasks that we need to handle is long. And some of them require us to be thread safe. For example, if we handed a page to a transaction, we need to keep track of that page status as being in use. We cannot evict this page until the transaction is done with it. That means that we probably need to do atomic reference counting, which can have very high costs. There are other alternatives, of course, but they all have even higher degrees of complexity.

In practice, data access within a database isn’t actually random, even if you are doing random reads. There are pages that are going to almost always be referenced. The root page in the B+Tree is a good example. It is always going to be used. Under atomic reference counting, that page is going to be a bottleneck.

Ignoring such overhead of the buffer pool management means that you aren’t actually comparing equivalent structures. I should also point out that I’m probably forgetting a few other tasks that the buffer pool needs to manage as well, which complicate its life significantly. Here is an example of such a buffer pool implementation from what is effectively a random GitHub repository. You can see what the code is trying to do here. The reason I point to this is that there is a mutex there (and I/O under the lock), which is fairly typical for many buffer pools. And not accounting for the overhead of buffer pool management is seriously skewing the results of the paper.

All of this said, I absolutely agree that mmap can be challenging. The paper outlines 4 different problems, which I want to address.

Problem #1 – Transactional safety

A database needs to know when the data is persisted to disk. When using mmap, we explicitly give up that knowledge. That can be a challenge, but I don’t see that as a seriously different one from not using mmap. Let’s consider the manner in which Postgres is working. It has its own buffer pool, and may modify the pages as a result of a write. Postgres may need to evict modified pages to disk before the transaction that modified them is committed. The overhead of managing that is just… part of the challenge that we need to deal with.

For RavenDB, as the paper points out, we modify the pages outside of the mmap memory. This is actually not done for the reason the paper describes. I don’t actually care if the data is written to memory behind my back. What I care about is MVCC (a totally separate concern than buffer management). The fact that I’m copying the modified data to the side means that I Can support concurrent transactions with far greater ease. In a similar fashion, Postgres handles MVCC using multiple entries for the same row in the same page.

When the transaction commits and older transactions no longer need the old version of the data, I can push the data from the modified buffers to the mmap region. That tends to be fairly fast (given that I’m basically doing memcpy(), which runs at memory speed) unless I have to page data in, more on that later.

The paper mentions the issue of single writer in LMDB, and I wanted to point out that a single writer model is actually far more common (and again, not really related to the buffer pool issue). Off the top of my head, most embedded databases implement a single writer model.

  • LMDB
  • Voron (RavenDB’s storage engine)
  • LevelDB
  • Lucene

The one that I can think that doesn’t have a single writer is RocksDB(where allow_concurrent_memtable_write is for writes to the memtable, not related to file I/O).

The reason this matters is that embedded systems can typically assume that all operations in a transaction will complete as a unit. Compare to Postgres, where we may have a transaction spanning multiple network calls, interleaving writes is a must. If we could avoid such concurrency, that would be far preferable. You can get additional concurrency by having sharding writes, but that is usually not needed.

Problem #2 – I/O Stalls

The paper points out, quite correctly, that not having control over the I/O means that you may incur a page fault at any time. In particular, you may end up blocked on I/O without really noticing. This can be a killer especially if you are currently holding a lock and blocked on page fault. Indeed, I consider this to be the most serious issue that you have to deal with mmap based systems.

In practice, however, the situation isn’t so clear cut. Until quite recently, the state of asynchronous I/O on Linux was quite iffy. Until the arrival of io_uring, certain operations that you expected to be async would block occasionally, ruining your day. The paper mentions that you can use async I/O to issue I/O requests to load the next pages (non sequentially) from the disk when you are performing certain operations. You can do the same with mmap as well, and RavenDB does just that. When you start a scan on a B+Tree, RavenDB will ask the OS to ensure that the memory we are interested in is in memory before we actually get to it. On Linux, this is done with madvise(WILL_NEED) call. That call may be blocking, so we actually have a dedicated thread that is meant to handle such a scenario.  In practice, this isn’t really that different from how you’ll handle it with async I/O.

Another consideration to deal with is the cost of mapping at the kernel level. I’m not talking about the I/O cost, but if you have many threads that are faulting pages, you’ll run into problems with the page table lock. We have run into that before, this is considered an OS level bug, but it obviously has an impact on the database. In practice, however, the overhead of memory management is the same in most cases. If you are reading via mmap or allocating directly, you’ll need to orchestrate things. Note that the same page table lock is also in effect if you are heavily allocating / freeing, since you’re also modifying the process page table.

Problem #3 – Error Handling

Error handling is a serious concern for a database. The paper points out that databases such as SQL Server may run a checksum when reading data from disk. When you use a buffer pool, the boundary of reading from the disk is obvious and you can easily validate the read from the disk. Voron is using mmap exclusively, and we do have checksums. We validate the page from the disk the first time that we access it (there is an internal bitmap that is used for that).  There isn’t a big difference between the two anyway. We only check a given page once per run, because to do otherwise is meaningless. When you use read() to get data from the disk, you have no guarantees that the data wasn’t fetched from a cache along the way. So you may validate the data you read is “correct”, while the on disk representation is broken. For that reason, we only do the check once, instead of each time.

A far greater issue to deal with is I/O errors. What do you do when a read or a write fails? If you are using system calls to manage that, you get a return code and can react accordingly. If you are using mmap, the system will generate a SIGBUS that you’ll have to (somehow) handle.

For a database, dealing with I/O errors has a single correct answer. Crash and then run recovery from scratch. If the I/O system has returned an error, there is no longer any way to know what the state of that is. See: fsync-gate. The only way to recover is to stop, reload everything (apply the WAL, run recovery, etc) and get back into a stable state. SIGBUS isn’t my cup of tea with regards to handling this, but error handling for I/O error isn’t actually something that you do, so just restarting the process ends up more acceptable than you might initially think.

Problem #4 – Performance issues

The paper points out three common reasons for performance issues with mmap usage:

  1. page table contention
  2. single threaded page eviction
  3. TLB shootdowns

The first issue is something that I have run into in the past. It was a bug in the operating system which was fixed. There is no longer a single page table in both Windows and Linux.

The single threaded eviction, on the other hand, is something that we never run into. When using Voron, we map the memory using MAP_SHARED, and most of the time, the memory isn’t dirty. If the system needs memory, it can do that when it assigns a page by just discarding the memory of an unmodified shared page. In this model, we typically see most of the memory as shared, clean. So there isn’t a lot of pressure to evict things, and it can be done on as needed basis.

The TLB shootdown issue is not something that we ever run into as a problem. We have run TB range databases on Raspberry PI with 4GB of RAM and hammered that in benchmarks (far exceeding the memory capacity). The interesting thing here is that the B+Tree nature means that the upper tiers of the tree were already in memory, so we mostly ended up with a single page fault per request. In order to actually observe the cost of TLS Shootdown in a significant manner, you need to have:

  • really fast I/O
  • working set that significantly exceeds memory
  • no other work that needs to be done for processing a request

In practice, if you have really fast I/O, you spent money on that, you’ll more likely get more RAM as well. And you typically need to do something with the data you read, which means that you won’t notice the TLB shootdown as much.

Finally, going back to how I started this post. This assumes that there are no other costs of not using mmap and using direct IO. The benchmark doesn’t account for those extra costs. For example, without mmap, who is doing evictions? In practice, that will lead to the same sort of considerations that you’ll have when dealing with mmap. This is especially the case with TLS shootdown when we start talking about high memory traffic (which likely modifies page allocations for the process, leading to the same scenario).

The paper has been quite interesting to read and it has presented a number of real problems that occur with mmap based systems, but I’m afraid that it doesn’t present the alternatives properly and vastly underestimates both costs and complexity of not using mmap and writing your own buffer pool.

time to read 10 min | 1893 words

Up to this point, we focused on reading data from the disk, we can do that up to a point. Eventually we’ll run out of memory (assuming that the database is bigger than memory, which is a pretty safe assumption). That means that we need to decide what to remove from memory. When we use mmap(), the OS gets to decide that for us. In fact, that is probably the best argument against using mmap(). The additional control we get from knowing how to manage the memory is the chief reason to take the plunge and manage our own memory.

There are a lot of algorithms around managing memory, I really like this one, because it is quite elegant. However, that requires quite a lot of states to be dealt with, especially when working with highly concurrent systems. Instead, I chose to look at the clock sweep algorithm. This is also implemented by PostgreSQL, and it is actually far simpler to work with. The idea is that for each page, we maintain a usage count. Each time that we need to get a page, we’ll increment its usage count (up to a small limit). Each time we need to evict a page, we’ll search for a page that can be evicted and has no recent usage. If it has usages, we’ll decrement that value and repeat until we find something.

Our buffer management isn’t actually dealing with pages, however. We are working with 2MB chunks, instead. The principal is the same, but using bigger aggregates is advantageous given typical memory sizes these days. The first thing that we need to do is to modify the ChunkMetadata. I’m showing only the relevant changes.

The major change here is the introduction of the usages field. That is a 3 bits field (0 .. 8 in range) and we reduced the number of references a chunk can have to about 500 million (should be sufficient, I believe). The idea here is that each time that we call addRef(), we’ll increment the usages count, like so:

Zig has a nice feature that I’m using here, saturating addition. In other words, if the value is incremented beyond its limit, it is clamped to the limit. That means that I don’t have to worry about overflows, etc. I took a look at how this is implemented, and the compiler generates the following assembly for this code (x +| 100) :

This may look daunting, but I’ll break it down. First, we add the 100 to the value, as you can expect (the value is currently in the EAX register). Then we store –1 (value of 0xFFFFFFFF) in the ECX register. Finally, we use the CMOV instruction (the CMOVB in the snipper is a variant of CMOV), telling it to store ECX in EAX if the carry flag is set on the last addition instruction. For fun, this also avoids a branch in the code, which is great for performance.

One of the critical functions that we need to consider here is the behavior of the pager when we are accessing rare pages once, and then never again. A really common scenario is when we are doing a scan of some data for a rare query. For that reason, the usages behavior is a bit more complex than one might imagine. Let’s explore this for a bit before moving on. Look at the following code, I marked the important lines with stars:

When we mark a chunk as loaded, we copy the usages from the current record. That starts out as zero, so for the scenario of accessing rare pages, we’ll have a good reason to evict them soon. However, there is a twist, when we remove a chunk from memory, we also set its usage count to 1. That is an interesting issue. The chunk is not loaded, why does it have a usage count? That is because if we removed it from memory, and we load it again, we want it to start with a higher usage count (and less chance to be evicted). In this manner, we are somewhat simulating the 2Q algorithm.

Now, let’s take a look at the actual reclaiming portion, shall we? In the chunk metadata, we have the following behavior:

If we have a value and no outstanding references, we can reclaim it, if we don’t have a value, we’ll reduce the usage count anyway. The idea is that when we unload a page, we’ll set the usages count to 1. The usage counter of an unloaded page will reset this as time goes by, so if there has been a sweep on an empty chunk that has a usage count, we’ll not count that for the next time we load it. Basically, after unloading a chunk, if we reload it soonish, we’ll keep it around longer the next time. But if it is only rarely loaded, we don’t care and will forget that it was loaded previously.

The process for actually reclaiming a chunk is shown here:

We look at a particular index in the chunks, and check if we can claim that. Following out previous behavior, there are a bunch of options here. We can either have an empty chunk that remembers the previous usage, which we can reduce, or we can actually try to reclaim the chunk. Of course, that isn’t that simple, because even if we found a candidate for reclaiming, we still need to reduce its usages count. Only if it has no usages will we be able to actually start the removal process. Finally, we have the actual scanning of the chunks in the file, shown here:

We scan the file from the provided start index and try to reclaim each chunk in turn. That may reduce the usages count, as we previously discussed. The actual process will continue until we have found a chunk to reclaim. Note that we are always claiming just a single chunk and return. This is because the process is repeatable. We start from a given index and we return the last index that we scanned. If the caller needs to free more than a single chunk, they can call us again, passing the last index that we scanned.

That is why this is called the clock sweep algorithm. We are sweeping through the chunks that we have in the system, reaping them as needed. The code so far is all in the same FileChunks instance, but the Pager actually deals with multiple files. How would that work? We start by adding some configuration options to the pager, telling us how much memory we are allowed to use:

We have both soft and hard limits here, because we want to give the users the ability to say “don’t use too much memory, unless you really have to”.  The problem is that otherwise, users get nervous when they see 99% memory being used and want to keep some free. The point of soft and hard limits is that this gives us more flexibility, rather than setting a lower than needed limit and getting memory errors with GBs of RAM to spare.

In the Pager, we have the loadChunksToTransaction() that we looked at in the previous post. That is where we read the chunk from the file. We are going to modify this method so will reserve the memory budget before we actually allocate it, like so:

As you can see, we reserve the memory budget, then actually allocate the memory (inside markLoading()). If there is a failure, we release the budget allocation and report the error. To manage the memory budget, we need to add a few fields to the Pager:

You can see that releaseChunkMemoryBudget() is pretty trivial. Simply release the memory budget and move on. Things are a lot more complex when we need to reserve memory budget, however. Before we dive into this, I want to talk a bit about the filesRelcaimSweeps field. That is an interesting one. This is where we’ll keep the last position that we scanned in the Pager (across all pages). However, why is that an array?

The answer is simple. The Pager struct is meant to be used from multiple threads at the same time. Under memory pressure, we are likely to need to evict multiple chunks at once. In order to avoid multiple threads scanning the same range of the Pager to find chunks to remove, I decided that we’ll instead have several sweeps at the same time. On startup, we’ll initialize them to a random initial value, like so:

In this manner, under load, each thread is likely to scan an independent portion of the Pager’s memory, which should avoiding competing on the same memory to evict. And with that behind us, let’s see how we can actually use this to evict memory:

There is a lot of code here, and quite a few comments, because this is choke-full of behavior. Let’s dissect this in detail:

We start by checking the memory budget and see if we are below the soft memory limits. We are doing this optimistically, but if we fail, we reset the state and then start to scan the Pager for chunks we can release. We do that by accessing one of the filesRelcaimSweeps values using the current thread id. In this way, different threads are likely to use different values and move them independently.

We find the relevant file for the index and start scanning for chunks to release. We’ll stop the process when one of the following will happen:

  • We released a chunk, in which case we are successful and can return with glory.
  • We didn’t find a chunk, but found candidates (whose usage count is too high to discard).

In the case of the second option, we’ll look for better options before getting back to them and discarding them.

If we are completely unable to find anything to release, we’ll check if we exceed the hard memory limit and error, or just accept the soft limit as, well… soft limit and allocate the budget anyway.

This happens as part of the process for loading chunks in the pager, so we’ll only need to release a single chunk at a time. For that reason, we remember the state so the next operation will start from where we left off. You can think about this as a giant set of hands that are scanning the range of chunks in memory as needed.

There are actually a few things that we can implement that would make this faster. For example, we always scan through all the chunks in a file. We could try to maintain some data structure that will tell us which pages have usage count to consider, but that is actually complex (remember, we are concurrent). There is also another factor to consider. The ChunkMetdata is 64 bits in size, and a FilesChunk struct contains an array of 4096 such values, totaling 32KB in size. It is actually cheaper to scan through the entire array and do the relevant computation on each candidate than try to be smart about it.

I think that this is it for now, this post has certainly gone for quite a while. In the next post in the series, I want to tackle writes. So far we only looked at reads, but I think we have all the relevant infrastructure at hand already, so this should be simpler.

time to read 9 min | 1677 words

We have finally gotten to the point where we can ask the pager for a page of data (reminder, a page in this case is 8KB of data) and get it back. Here is what this looks like:

There are a few things to note here. There is a new concept here, the transaction. We don’t want to have that concept at the pager level, so we just pass the relevant state to the pager to deal with. Basically, any transaction we’ll have will need to have a bit of state for the pager to deal with it. The state we need at this point is minimal, just the list of chunks that are referenced by the pager. You can also see that we provide a timeout for the getPage() call. What is that about?

This leads to the second concern we need to consider. How are we expected to run this? If we call getPage() on a page (actually, a chunk containing this page) that isn’t resident in memory, we’ll need to go to the disk to read it. That can take a while, sometimes a long while. At a glance, that is one of those things that async/await was meant for. Since Zig supports async functions, that is certainly something that we can do, but it is something that I want to be cautious about. Having explicit blocking is far easier to understand and debug, at least for now. This is especially true if we’ll want to consume the pager API from a non Zig target.

That leads to an interesting issue, however. If a call to getPage() can block, how can we avoid blocking the thread. In most cases, we would like to avoid blocking, after all. It would be simple to have tryGetPage() method, which will not block (but schedule the load of the page from disk), and then maybe register for a notification for that. If that sounds like async to you, that is because it is. The problem with this sort of approach is that you need to suspend execution somewhere in the middle of a transaction operation and continue when the data is loaded. Without async/await, you can’t really do that. Well, I mean, you could try, but we have a lot of experience with trying to manage state via callbacks, that isn’t really going to work for anything beyond the simplest systems (see: node.js without async/await).

There is one thing that we can do that would be both simple and effective, however: we can error if the page isn’t in memory. That sounds like a pretty bad idea, no? How would that help us?

Well, the concept is simple. If a transaction attempts to access a page that isn’t resident in memory, we’ll do the following operations:

  • Schedule the chunk the page resides on to load into memory.
  • Return an error from the pager
  • Rollback the transaction
  • Keep the loadedChunks for that transaction active and wait for the chunk to be loaded
  • Re-run the transaction again, now the chunk is in memory and we can proceed further

Each time that we re-run the transaction, we make sure that the chunks it needs are in memory, eventually ensuring that all the required chunks are resident and we don’t need to block.

At the same time, the code to work with the transactions is not going to care about blocking, etc. We need to do the usual error handling, but that is required anyway. There is a single location where we need to deal with callbacks from the pager, so there is a limited blast radius of complexity. For write transactions, for example, this is a very reasonable strategy. We assume that there is only a single thread writing at a given time. A transaction being blocked because it needs to read a page from the disk can stall other pending transactions. By having it abort and retry later, we can keep the line moving. For read operations, on the other hand, that is likely not something that you want to do. If I’m already streaming results to the caller, I can’t just repeat the transaction.

I’m not making any decisions at this point, just considering the various options and implications that we have to deal with at this early level.

Now, let’s look at how the getPage() is actually implemented, shall we?

There is a lot that is going on here, I know. We start by defining a set (a hash map using a 64 bits unsigned integer to a zero-sized value). The way this works with Zig is quite elegant, since we pay no memory cost for the values here.

The majority of the work is done in the loadChunksToTransaction() function, which we’ll examine shortly, but you can see some interesting details already in getPage(). We assume that we have a page loaded, and any range of pages that we ask is always within a single page.

The call to load the chunks actually puts them in the loadedChuncks argument. We verify that we loaded the chunk properly and then we create a slice to return for the caller. Note that we may request more than a single page and it is valid to ask for a range that contains multiple chunks. We validate that the range we return is within the memory range for the current file, we ensured that the chunks for a specific file are consecutive in memory, so we can safely return this pointer across multiple chunks without needing to think about it.

There is another aspect of loadedChunks that we need to discuss. A transaction may use multiple pages from the same chunk, but we only need to load the chunk once. At the same time, we can avoid adding a reference to the chunk multiple times for each loaded page. When we close the transaction, we need to release the reference for these chunks, so we need to keep track of those. With that in mind, let’s see how we actually load the chunks to memory.

As a reminder, we have two actors working together here. The FileChunks is used to store the chunks in memory and the PagerRing is used for parallel I/O.

That is a lot of code to throw at you, I know, let’s dissect it in detail. In this method, we are working on chunks, not pages, and we assume that we may have multiple chunks, that is why we have the while loops. We start by checking if the chunk is already loaded in the transactions’ loadedChunks. If it isn’t, we compute the position of the chunk in the file (the chunk number we get from the caller is the global one, after all) and try to get it from the FileChunks. This is where things get interesting.  When we call tryGet() for the current chunk, we may get an error because of two possible scenarios:

  1. The value is currently being loaded from the disk (some other transaction asked for it, probably). We don’t need to do anything further other than wait for it to show up.
  2. Another transaction tried to load it, but got an error. At this point we just return the error. We don’t try to do anything special here. In general, there may be a lot of errors to consider here. We may have temporary I/O issue, or run out of memory or something that is transient. Or we may have an actual problem at hand (bad sector on disk, corrupted data, etc). Regardless of what we are doing, we aren’t going to try to do any error handling here. We’ll just record the error and any future attempt to access that chunk will also error. The proper way to recover at this point is to restart the pager. This is assuming we have the other components of a database at play here. So we’ll re-run the journal files, apply recovery, etc. In short, any I/O issues like that are critical errors and require a restart of the system to come back to a known state.

If the tryGet() method returned without an error, there are still two options to consider. The call may have returned a value (so we called addRef() on the chunk internally), we can simply add that to the chunks we own and move on. If there isn’t a value in memory, things start to get interesting. At this point we call markLoading(). We are basically racing to be the owners for loading this chunk. If we are successful in this race, we’ll get the buffer back from the FileChunks and can schedule reading the relevant chunk from the disk. You’ll note that we are setting the callback to completeLoad, we’ll look into that shortly. If we aren’t successful (we didn’t get a buffer back), then some other thread was able to get the buffer and will schedule the read for us, so we are done.

After we either ensured that all the chunks are loaded or scheduled them to be loaded, we use getBlocking() to wait for all the relevant chunks to be available. Once that is done, we can safely return and getPage() will complete the process, as we saw earlier.

The only thing that we have to look at is the completeLoad function, which is about as basic as you can get:

Most of the function is about error handling. We register either the fact that we got an error reading from the disk or that we completed the load process and maybe log something. In general, there isn’t really much that we need to do here. The act of calling markLoaded() will release any threads waiting on getBlocking(), after all. So the whole thing comes together quite nicely.

With this done, we are mostly done on the reading side of the pager and this post as well. In my next post, I want to discuss how we should handle eviction of data. So far, we are just reading into memory, never releasing. We need to take care of that as well, of course. Once that is done, we can move to the wonderful topic of handling writes and durability…

time to read 5 min | 970 words

This is my 7th post in this series, and we are now starting to get into the really interesting bits. So far we worked with the individual components, each of them doing just one thing, which makes it hard to see how they are all put together. In this post, we are going to start plugging things together.

As a reminder, the whole point of this series of posts is to explore what we need to do in order to ask for a page (8KB, in our case) from the data file and work with it in memory.  We have most of everything ready, let’s put them back together. The first thing that we need to do is actually tie a file to the FileChunks structure that we previously created. This is as simple as this structure:

We are simply initializing the FileChunks and opening the file, nothing else. Remember that a single pager is going to be responsible for multiple files. Furthermore, all the data structures that we are dealing with now are meant for concurrent use. That means that we should be prepared for this type of code:

When we use a managed language, that is actually fairly simple to work with. In an unmanaged language, those two lines of code are tough. Why is that? Let’s look at the raw data members for the Pager structure, shall we?

Of particular interest to us is the files member. This is the list of files that are being managed by the pager. Each one of them has a maximum size of 8GB in size. The problem is that we may have one thread accessing the list at the same time that another thread wants to increase its size. How would that work?

The simplest option is that we’ll reallocate the array, but that will move it. What would the first thread be doing in that scenario? The good thing from our point of view is that we don’t need to worry about concurrent modifications. There is only a single thread that is allowed to modify the Pager’s state at any given point in time.

Trying to find solutions for this problem leads into a rabid rabbit’s hole. You go into hazard pointers, epoch GC and other fun stuff. Also, the call to getPage() is one of the most important ones in a database, anything that we can do to reduce its cost will be done. As such, we can’t typically use any form of locking. A reader/writer lock can be a killer. Here is a good example for how that can happen.

I thought about how to resolve this, and I decided against a generic solution, instead, let’s look at the actual behavior that we need. The files array is going to be accessed a lot, it has an entry per 8GB of disk space that we take. That means that it isn’t going to be experiencing any rapid growth. It is also only going to grow. We also need to worry only when we grow the physical backing store for this array, if we overprovision and use less than we need, that is perfectly fine. Each element in the array is 8 bytes in size, so if we allocate a single memory page (4KB) we can store 512 file references in it. That represents 4 TB(!) of data, so we can probably just accept the additional cost and allocate it and not think about it.

Databases with > 4TB of disk size do exist, and we don’t want to have this artificial limit on us, do we? Instead, we can use another approach. We’ll start by allocating the array with a minimum size of 8 elements (sufficient until you get to 64GB). But what happens when we reach that size?

What we do here is cheat. We don’t need to free the memory immediately, when we reach the limit of the size of the array, we’ll double its size, copy the data that we have and register it to be freed when we close the Pager. At that point, the caller already needs to ensure that there are no other users of the pager.

Because we copy the values from the old array, but keep it around, old readers may use the old or new arrays, but we don’t actually care. The memory remains valid and accessible. In terms of wasted space, if our database went from a single file to being 128 TB in one run, we’ll have an array with 16,384 elements (whose size is 128KB). Along the way, we had to double the size of the array a dozen times and we “waste” 128KB of unused buffers. This seems like a pretty reasonable cost to significantly reduce the level of complexity of concurrent access. Using this method, we can avoid any sort of synchronization on the read side. That is certainly a plus.

Here are the init() and deinit() calls for the Pager, to complete the picture:

As you can see, we allocate a PagerRing for the pager, which will deal with the actual I/O. The actual disposal of the files array is managed using the pendingFree list. That is a small cost to pay, to reduce the cost of adding a new file. In the deinit() routine, note that there is a distinction between releasing the files themselves (where we close the FilesChunk, release the relevant memory, close the file handle, etc) and releasing the arrays that hold the files themselves.

I’m quite pleased with how this turned out, zero cost for reads (important) and negligible memory cost for most scenarios).

In my next post, I’ll get started with actually reading the data from disk and putting that in the pager. So far, that is a 7 post series, and we haven’t completed the first part. That simple scenario is surprisingly tricky.

time to read 4 min | 742 words

imageOne of my favorite activities is teaching. I love explaining how things work and passing on knowledge. Another good way to pass the time is to learn, which can be a source of great joy and incredible frustration.

Recently I had a conversation with a university student in the Computer Science track talking about databases. As you might imagine, that is a topic that is near and dear to my heart, so I had a lot of interest in hearing from the student what kind of databases they have been exposed to and what impression they had on them.

The student in question was familiar with MySQL from his courses and had run into PostgreSQL in some job interviews. He was very much in the PostgreSQL camp, that isn’t an uncommon reaction, but the reason for that was interesting to me. In his course, they had to setup and configure MySQL from scratch. That was annoying and hard, especially since they used the command line to make queries to the database.

In the PostgreSQL case, however, he was given access to a web front end to the database and was assigned tasks to complete. They could get started right away doing the most important aspects of their task.

When I’m teaching, part of the job is data dump (here are the things you need to know), part of the job is to answer questions, make sure that they understand the material, etc. A crucial part of teaching is getting the students to do things, making sure that the students are actually exercising the new knowledge. In such cases, I noted that I provide them with the baseline and they need to complete just the parts that are missing, the core pieces.

That is pretty much the same thing that the student ran into during their interview.

In retrospect, for teaching, I think that this approach is a serious issue.

One of the most important hurdles that I see for new developers is their ability to deal with problems. Whether it is actually reading the errors, composing some mental model for what is going on or just being able to dig deeper and understand what is happening. I’m not talking about experience, mind. I’m talking about the approach. If most of the tasks that they have dealt with so far were ones which were “fill the missing pieces”, they are likely never had the experience of dealing with everything else. And in many cases, the issues aren’t in the areas that you are thinking, they can be somewhere else completely.

I remember many times where I had to do something, and I ran into a wall. That was incredibly frustrating, especially when the issue was somewhere completely orthogonal to what I’m trying to do. A great example recently was having to figure out how to do cross compilation in GitHub action using GCC. That took a lot of time and all I wanted to do is to just call a single function from native code.

As frustrating as that is, I think that there is a huge amount of value in those kinds of “side quests”. That is especially true when someone is in the early stages of their career. Those are just the sort of hurdles that can teach you not only what is going on at your level but about the ecosystem in general and the layers beneath you.

A great example of lack of such knowledge is a candidate in an interview recently that was asked: “Given a static HTML page and a domain name that was registered, what do you need to setup a website for that page in that domain?” The candidate had no idea what was actually involved (nothing about DNS, routing, servers, etc). They were likely able to write an application using modern practices, but everything about pushing to production, what is a website, what is a domain name or IPs… nope.

And that makes sense, they never had even run into something like that.

On the other hand, I remember building websites using FTP and Apache / IIS in the 90s. It wasn’t fun, but I had very little abstraction to deal with and was exposed to the working of the engine.

And that sort of thing matters, because you will need to understand details such as DNS propagation times and its impact on what your system is doing, for example.

time to read 5 min | 872 words

After implementing the memory management in the previous post, I set out to handle the actual I/O primitives that we need. As a reminder, we are separating the concerns here. We managed memory and reference counting in the previous post and now I want to focus on how we can read and write from the disk in as efficient a manner as possible. Before we get to the guts of the code, I want to explain a bit about the environment that I have in mind. Most of the time, the pager is able to provide the requested page from memory directly. If it can’t do that, it needs to consider the fact that there may be multiple threads that are trying to load that page.  At the same time, while we are loading the page, we want to be free to do other things as well.

I decided to implement the I/O routine using async I/O. Here is the rough sketch of the API I have in mind:

The idea is simple, we use a struct to which we can submit work in an asynchronous manner. At some later point in time, the work will complete and our read or write will be done. At that point we’ll be able to invoke the provided callback for the user. The code above is about as simple as you can manage, it spawns a dedicated thread to manage the I/O and then just issues those operations directly. To save myself some headache, I’m using an eventfd as a synchronization mechanism, I don’t strictly need this, but it will be useful down the road.

In terms of the API, I can now write the following code:

The basic idea is that the BackgroundRing struct doesn’t manage file descriptors or buffers. It is a strict way to manage I/O. The API is pretty piss poor as well, in terms of usability. No one will ever want to write generic I/O routines using this method, but we aren’t trying to do generic I/O, we are trying to ensure usability in a very constrained manner, inside the pager.

About the only nice thing that we do in this implementation is handle partial reads and writes. If we were asked to read more than what we got, we’ll repeat the operation until we get to the end of the file or succeed.

In terms of implementation, as well, this is a really bad idea. We look like we are doing async I/O, but we are actually just passing it all to a background thread that will do the work off a queue. That means that it will be very hard to make full use of the hardware capabilities. But I think that you can guess from the name that I’m not going to leave things like that. I’m going to use the new io_uring API in Linux to handle most of those concerns. That idea is that we’ll allocate a command buffer in the kernel and allow the kernel to handle asynchronous execution of the I/O operations. We still retrain the same rough structure, in that we are going to have a dedicated background thread to manage the commands, however. Amusing enough, the io_uring API is meant to be used from a single thread, since otherwise you’ll need to orchestrate writes to the ring buffer from multiple providers, which is much harder than a single consumer, single producer scenario.

The use of io_uring is also why we are using the eventfd model. We are registering that file descriptor in the io_uring so it will let us know when event completes. This also does double duty as the method that we can use to wake the background thread when we have more work for it to do. The most major change is inside the background worker, of course. Here is how this looks like:

We create  the ring in the init function (see full code listing below) and in the background thread we are simply waiting for an event using the eventfd. When a caller submits some work, we’ll register that on the io_uring and wait for it to complete (also using the eventfd). You can see that I’m handling some basic states (partial reads, full queues, etc). The code itself ends up being pretty small. Once we are done with the operation, we let the user know about the completion.

There are a few things here that are interesting to note. We are actually allowing interleaving of operations, so we may have many outstanding operations at any given point. We aren’t trying to guarantee any kind of ordering between the operations, nor are we providing anything but the most bare bones interface for the caller. Even so, there is quite a bit of power in the manner in which we are working here. We need to complete a few more components first, and then we can bring it all together…

Here is the full listing of the PagerRing code.  In the next post, I want to focus on the actual overall pager, when we are managing multiple files and how we work with all of them together. In particular, we want to understand how we manage the memory budget across all of them.

time to read 8 min | 1495 words

After writing the post about handling chunk metadata, I started thinking about the overall approach. Both the method using compressed pointers and the baseline computation felt… off to me. They were certainly workable, but it was too complex and felt fragile.

I don’t like dealing with a high level of complexity, I would rather put a lot of effort into simplifying the solution. The overall approach may be complex, but the system should be nice to work with. Usually, we can get away with a great deal of simplification if we accept some constraints on what we want to do with the system. For now, I’m going to assume the following constraints:

  • We are using 64 bits OS (and can assume effectively unlimited address space).
  • We want to go with a file pager (instead of the memory mapped one) because I want to be able to control the I/O behavior better.
  • The files we use are limited to 8 GB in size (can use more than a single file, of course).

The last one deserves some additional words. When thinking about a storage solution, accepting a maximum size is generally a bad idea (640KB, anyone?). However, if we decide that our storage solution is going to be composed of files of specific size, we can combine them to reach any size needed.

But why accept this limitation? Why say that a single file will not exceed 8 GB? It turns out that this has several advantages.

Let’s assume that we have a dataset that is 100GB in size, using 8 GB files, that would be 13 files to a total of 104 GB of used disk space. Now we want to delete some of that data. What do we do with the actual used disk space? It is actually quite hard to release disk space back to the operating system if you have a single file. You might need to run compaction of the data, or use advanced API such as hole punching (see FALLOC_FL_PUNCH_HOLE). Advanced API is something that I would like to avoid, too easy to fall into some pitfall that no one else has run into. Working with sparse files (with holes in them) also typically requires you to utilize dedicated tools and can be awkward.  If we split the data into separate files, we can retain most of the same benefits, and give ourselves a simpler environment for the user to work with.

With the 8GB limitation in place, I can choose to manage the paging using the following manner:

The idea is pretty simple. Instead of trying to stitch together the memory for the file, we are going to just allocate a single 8GB range of virtual memory. This can be done using the following command:

This reserves (but does not use) 8GB of address space. We can now allocate ranges from that safely. This is important because if we have a request to two sequential chunks, they will reside in memory right next to one another. Note that we also don’t need to handle any pointers, since we can rely on a stable base address for the whole file. The nice thing about this is that we aren’t actually allocating memory, just reserving it.

Let’s see how that will work? The chunks array is used to control references to the chunks in the file. The chunk metadata is a 64 bits value that has several responsibilities at the same time. It stores the tag of a chunk, which indicate its status (loaded, error, empty, etc) and the number of outstanding references to the chunk. That uses up 34 bits in the value, the rest of the bits are used as a version field, which is incremented on each change. That allows us to avoid the ABA problem. The actual data, of course, is managed using the ptr value.

Here is how we can get a chunk from this struct:

What we are doing here is checking that the value is loaded to memory, and if it is, we increment the reference and then return it. This code runs in a loop, because we assume that multiple threads may run it in the same time. This handles just getting data that is already loaded. If the data isn’t loaded, what will happen? We’ll get a null back. Here is the blocking version of this method:

Just based on those two methods, you should be able to draw some conclusions. If the value isn’t loaded, we’ll always return null, but there is this Loading stage as well, in that case, we may want to wait for it. How is that going to work?

This works using two important functions: markLoading() and markLoaded(), the idea is that we’ll first try to call tryGet() to load a chunk, if there is no value, we need to load it from disk. At that point, remember, there may be multiple threads accessing the relevant chunk. So all of them would be competing on the markLoading function, like so:

The code itself is pretty simple, we are updating the tag of the chunk and try to update it optimistically. We are moving the state of the chunk from Empty to Loading in a thread safe manner. If we are successful in doing so, we know that we are the only thread that owns the loading portion of the chunk. Note that part of the markLoading process is to ask the OS to give us the memory for the chunk (in the range that we previously allocated).

At this point, we can load the data from disk somehow and then we’ll call the markLoaded function, which completes the process:

The idea is that we are splitting the responsibility for managing the chunks references from how we load the data to memory.

In other words, the expected usage of this struct is something like this:

  1. Call tryGet() a page in a given chunk.
  2. If successful, do the work you wanted to do.
  3. If not successful, compete to be the loader for this data by calling markLoading().
  4. If you lost, call getBlocking() to wait for the winner to get the data.
  5. Somehow, load the data from the disk and call markLoaded().
  6. Proceed to make use of the data.

Another important aspect that we have to deal with is when we want to discard the data. Basically, if we filled our memory budget and we need to load a value from the disk, what can we do then? The answer is that we need to evict the data somehow, before we can do that, we need to know what data is currently in use. That is why we have the calls to addRef() and release(). We use those (using atomic operations) to track the usage of the various chunks. When we need to evict data from memory, we’ll need to have some sort of a policy to do so. I’m deferring the actual policy to a later point in time, right now I want to discuss how do we know what we can evict and how that is going to work.

Here is the code to handle eviction, currently implementing a policy of simple scanning (not ideal by a long shot):

In the reclaim method, we are scanning through the chunks. To be able to reclaim a chunk, the following conditions need to hold:

  1. The chunk holds a value.
  2. There are no outstanding references to the chunk, only the pager is holding a reference to the chunk.

Note that in order to do this safely, we have to assume that while we are trying to reclaim a chunk, another thread is trying to use it. This behavior complicates our lives a bit. We handle that by doing  a racy update of the chunk, trying to move it to a loading state. The idea is that the Loading state is meant to be used as a busy signal. While the chunk is in Loading state, the rest of the system knows that it cannot use this and needs to wait. Note that this means that we have the following transitions:


Most of the code that we have in the struct is there to handle concurrency from multiple threads dealing with the system at once, note. The actual behavior is fairly simple. We check if we can reclaim the chunk (no one is looking), we take a lock on by trying to move its state to Loading. Then we can discard the memory by calling mmap on the chunk’s memory with PROT_NONE.

For fun, we are using 2MB chunks because that fits well into huge pages. On a properly setup system, we can significantly reduce the paging metadata overhead inside the kernel by allocating a single 2MB page for each chunk.

You can see the entire implementation here. In the next post, I want to look into handling the I/O portion of reading the data from the disk. After that we’ll talk about how we can implement a proper eviction policy.

time to read 8 min | 1507 words

The topic of this post is a bug in RavenDB, a pretty serious one. The end result is that a user reported that they got an error from RavenDB that they are unable to read a stored document. In some cases, RavenDB needs to read a document on startup, which means that it wasn’t able to start up if that document had this behavior.

As you can imagine, this is one of those issues that gets our full and immediate attention. The error itself gave us a lot of information:

 Dictionary mismatch on Dic #375
   at Voron.Data.Tables.ZstdLib.AssertSuccess(UIntPtr v, CompressionDictionary dictionary)

This is related to RavenDB’s document compression behavior. In order to get a great compression ratio from our documents, we train RavenDB on the recent documents that you have and generate a compression dictionary. The problem at hand is that the compression dictionary we have and the compression dictionary that was actually used are different. As you can see from the error, we are using zstd as the compression algorithm. When zstd generates a dictionary it will (by default) generate an id from that document that is mostly based on the xxhash64 of its content, rounded to 32 bits. You can see the relevant part here. This is pretty nice, since it means that there is a good chance that we’ll detect the wrong dictionary.

So now we know what is going on, but we don’t understand why.

When we wrote this feature, we were quite aware that we’ll not be able to make any sort of sense from the documents if we don’t have the right dictionary. For that reason, we store the dictionaries three times. Once inside of RavenDB itself and twice in ancillary files, which we can use during recovery. This sort of error should be utterly impossible. And yet, we had run into that in production, so we have to dig deeper still.

The primary suspect was the dictionary training portion. One of the things that RavenDB does on a continuous basis is measure the compression ratio of the documents, if we aren’t able to hit a good compression ratio, RavenDB will try to generate a new dictionary from the most recent documents and see if that new dictionary can do better. This can be very helpful in maintaining good compression rates. As your documents change, RavenDB will detect that and realize that it can do better, retrain on the recent data and compress even further. The problem is that this code path is also quite tricky, we first compress the document using the current dictionary, then we try generating a new dictionary and see if compressing with the new dictionary is better. If that is the case, we can install the new dictionary for future operations, otherwise, we need to discard it.

I suspected that the issue was somewhere around that area, we might not be handling the rejection of the new dictionary properly. So I went into the code and started digging, but I found absolutely nothing. The entire process is covered in tests and has been in production for close to 18 months, so this isn’t something that obvious.

After spending quite a bit of time on the issue, I decided that the code is perfect, it handled everything properly and taken into account all the right behaviors.

Clearly the fault was elsewhere. Before setting out to blame the nearest cat (you can never trust those), I had an idea, what if the problem wasn’t during the training process, but afterward?

Well, that doesn’t really matter, does it? RavenDB is a transactional database, if we had a failure after the training process, we’ll have to discard some of the data, for sure, but that would be about it. Unless, what if we have some state that wasn’t transactional? As part of looking at the compression training code, I ran into just such a scenario. Running the training to generate a new compression dictionary is an expensive proposition, so we don’t want to do that often. As such, we’ll do that for only about 1K document changes where we exceed the desired compression ratio by over 10%. How do we know to act every 1K documents? Well, we have a counter that we increment on every change. That value is incremented using Interlocked.Increment() and isn’t part of the transactional state. If the transaction is aborted, the value is still incremented.  The actual value doesn’t matter, mind, only that it is moving forward, so that isn’t an issue.

I mentioned the dictionary id before, but I should clarify that this is the zstd’s dictionary id. Internally, RavenDB uses a different value. That value is simply the sequence number of the dictionary, RavenDB counts the number of generated dictionaries and gives the new dictionary the next available value. That value, by the way, is part of the transaction. If we rollback a transaction, we’ll use the same dictionary id. But that doesn’t matter, of course.

When using compression dictionaries, we need to load them from a buffer. There is quite a bit of work that is involved in that, there is memory allocation, entropy tables to load, etc. In order to save repeated work, RavenDB caches the compression dictionaries (after all, their whole point is to be used repeatedly). That cache can be used by multiple transactions at the same time (two read transactions using the same dictionary will use the same instance).

Given all of this information, here is the sequence of events that we need to get the error in question:

  1. The user enabled documents compression.
  2. The user runs a transaction with at least four commands, which needs to satisfy the following conditions.
  3. A document write as the first action.
  4. Then a write to document whose compression ratio exceeded the expected ratio by over 10%, as a result, RavenDB tried to train a new compression dictionary.
  5. That dictionary had a better compression ratio and was accepted as the new default compression dictionary.
  6. RavenDB persisted the new dictionary and used that to compress the new document.
  7. Another command (in the same transaction) had stored a document in the same collection, now RavenDB will read the new dictionary and store that in a cache.
  8. A third command runs, but this one throws an error (such as optimistic concurrency violation).

At this point, RavenDB will rollback the entire transaction and return the error to the user. Let’s say the user has chosen to submit the same two documents again, shall we?

For the first command, we’ll again discover that the compression ratio (of the old compression dictionary) is insufficient. We will not generate a new compression dictionary, why is that? Remember the counter that we increment using Interlocked? That one was not rolled back, so we’ll need to wait for another 1K documents for the stars to properly align for us. That doesn’t impact correctness in any way, shape or form, however.

At this stage, the stage is set, but everything is still okay. The problem will happen on the next time that we’ll trigger a new dictionary. At that point, we’ll again scan the most recent documents, build a dictionary, etc. However, the dictionary id that RavenDB will use will be identical to the dictionary id that we previously discarded. The data that dictionary was trained on, however, will almost certainly be different. We persist the new dictionary to disk and everyone is happy, the new document that we wrote will use the new compression dictionary and we are perfectly fine.

The next write for this collection, however, will run into a problem. It will need to use the current (the new one) dictionary when we want to make a write. In order to do that, it will load the value using the cache, but there is already a value for that dictionary in the cache, the same dictionary that was discarded. At this point, RavenDB will start compressing documents using the in memory dictionary while the on disk dictionary is different.

If you’ll try to access the document which triggered the new dictionary, you’ll get an error, but documents that were modified later will continue working with no issue. Until you restart, of course.

On restart, we’ll read the dictionary from disk, where we wrote the new dictionary, at this point, all those documents that we wrote will give us the error above. Note that the sequence of events has to be very exact, you need to have a dictionary training as part of a multi act transaction which failed after the dictionary training has been successful and wrote additional documents. In a year and a half of production usage and very heavy load, that happened only a couple of times, it seems.

The issue has been fixed, of course and we’ll be rolling it out to both users and cloud customers. We’ll now rollback such in memory state on a transaction rollback as well, avoiding this issue entirely. It is amazing to me that despite very careful planning, it wasn’t the code itself that caused a problem, but a sequence of independent operations and failure modes that we never even considered about this.


  1. Implementing a file pager in Zig: Write durability and concurrency - 10 hours from now

There are posts all the way to Jan 18, 2022


  1. Implementing a file pager in Zig (11):
    17 Jan 2022 - Writing data
  2. re (30):
    14 Jan 2022 - Are You Sure You Want to Use MMAP in Your Database Management System?
  3. Production postmortem (33):
    03 Jan 2022 - An error on the first act will lead to data corruption on the second act…
  4. Negative feature response (2):
    20 Dec 2021 - Protect the user from accidental collection deletion
  5. Challenge (63):
    16 Dec 2021 - Find the slow down–answer
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats