In my previous post, I talked about how Voron achieves MVCC. Instead of modifying data in place, we copy the page or pages we want to modify to a scratch buffer and modify the copy. When the write transaction completes, we update the Page Translation Table so that any reference to the modified pages goes to the right place in the scratch file.
Note that Voron uses memory-mapped (mmap) files as scratch buffers, so I use the terms scratch buffer and scratch file interchangeably.
That is all well and good, and if you are familiar with how virtual memory works, this is exactly the model. In effect, every transaction gets a snapshot of the entire database as it was when the transaction was opened. Read transactions don’t modify the data, and are guaranteed a stable snapshot of the database. The write transaction can modify the database freely, without worrying about locking or stepping on other transactions.
This is all pretty simple, and the sole cost that we have when committing the transaction is flushing all the dirty pages to disk, and then making an atomic pointer swap to update the Page Translation Table.
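To make the model concrete, here is a minimal in-memory sketch in Python. It is not Voron's code (Voron is written in C#), and every name in it (`Store`, `ReadTx`, `WriteTx`, `modify_page`, the 4096-byte page size) is made up for illustration. The scratch buffer is a plain list rather than a memory-mapped file, flushing dirty pages to disk is left out, and a single reference assignment stands in for the atomic pointer swap.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class Store:
    def __init__(self, num_pages):
        self.data_file = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.scratch = []  # append-only list of page copies (stand-in for the scratch file)
        self.ptt = {}      # page number -> scratch slot; never mutated, only replaced

    def begin_read(self):
        return ReadTx(self, self.ptt)

    def begin_write(self):
        return WriteTx(self)


class ReadTx:
    def __init__(self, store, ptt):
        self._store = store
        self._ptt = ptt  # the table as it was when this transaction opened

    def get_page(self, n):
        slot = self._ptt.get(n)
        if slot is None:
            return bytes(self._store.data_file[n])  # never modified: read the data file
        return bytes(self._store.scratch[slot])     # modified: read the scratch copy


class WriteTx:
    def __init__(self, store):
        self._store = store
        self._base = store.ptt
        self._dirty = {}  # page number -> scratch slot written by this transaction

    def modify_page(self, n):
        if n not in self._dirty:
            slot = self._base.get(n)
            src = self._store.data_file[n] if slot is None else self._store.scratch[slot]
            self._store.scratch.append(bytearray(src))  # copy, never modify in place
            self._dirty[n] = len(self._store.scratch) - 1
        return self._store.scratch[self._dirty[n]]

    def commit(self):
        new_ptt = dict(self._base)
        new_ptt.update(self._dirty)
        self._store.ptt = new_ptt  # stand-in for the atomic pointer swap


store = Store(num_pages=4)
reader = store.begin_read()
writer = store.begin_write()
writer.modify_page(2)[0:5] = b"hello"
writer.commit()
print(reader.get_page(2)[:5])              # the old reader still sees zeroes
print(store.begin_read().get_page(2)[:5])  # a new reader sees b'hello'
```

The important detail is that `commit` builds a new table instead of mutating the old one, so a reader that is holding the previous table keeps a stable view for as long as it lives.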
However, that is only part of the job. If all the data modifications happen in the scratch buffer, what happens to the scratch files?
Voron has a background process that monitors database activity and, based on a policy (size, time, load factor, etc.), routinely writes the data from the scratch files to the data file. This is a somewhat involved process, because we can’t just do it blindly.
Instead, we start by finding the oldest active transaction. We need to know this to make sure that we aren’t overwriting any page that this transaction might visit (which would violate the transaction’s snapshot isolation). Once we have the oldest transaction, we gather all the pages in the Page Translation Table that came from older transactions and write them to the data file. There are a couple of tricks that we use here. It is very common for the same page to be modified multiple times (maybe we updated the same record several times in different transactions), so we’ll have multiple copies of it. But we don’t actually need to copy all of them, only the latest version (up to the oldest active transaction).
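The selection step can be sketched roughly like this. The function name, the `(page_number, tx_id, scratch_slot)` tuple layout, and the choice of a strict "older than" comparison on transaction ids are my assumptions for the example, not Voron's actual data structures.

```python
def pages_to_flush(versions, oldest_active_tx):
    """Pick, for each page, the newest scratch copy that is safe to flush.

    versions: iterable of (page_number, tx_id, scratch_slot) tuples, one per
    copy of a page that currently lives in the scratch file.
    Returns a dict mapping page_number -> scratch_slot.
    """
    latest = {}
    for page, tx_id, slot in versions:
        if tx_id >= oldest_active_tx:
            # Written by a transaction that is not older than the oldest
            # active one, so it has to stay in the scratch file for now.
            continue
        current = latest.get(page)
        if current is None or tx_id > current[0]:
            latest[page] = (tx_id, slot)  # keep only the newest eligible copy
    return {page: slot for page, (_, slot) in latest.items()}


versions = [
    (7, 1, 0),  # page 7, written by tx 1, stored in scratch slot 0
    (7, 3, 1),  # page 7 again, written by tx 3
    (7, 6, 2),  # page 7 again, written by tx 6
    (9, 2, 3),
    (4, 5, 4),
]
print(pages_to_flush(versions, oldest_active_tx=5))  # {7: 1, 9: 3}
```

Page 7 was modified three times, but only the copy from transaction 3 is written: the copy from transaction 1 is superseded, and the copy from transaction 6 is too new.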
The process of copying all the data from the scratch file to the data file can happen concurrently with both read and write transactions. After the flush, we need to update the PTT again (so we open a very short write transaction to do that), and we are done. All the pages that we have copied from the scratch buffer are marked as free and are available for future transactions to use.
Note, however, that we haven’t called fsync on the data file yet. Even though we wrote to the data file, it was a buffered write, which is great for performance but not so much for safety. This is intentional. In my next post, I’ll talk about recovery and safety at length, so for now I’ll just mention that we fsync the data file once a minute or once every 2GB or so. The idea is to give the OS time to do the actual flush in the background before we jump in and demand that it happen.
Another problem that we have with the scratch buffer is that, like any memory allocation routine, it has issues. In particular, it has to deal with fragmentation. We use a power-of-two allocator to reduce fragmentation as much as possible, but certain workloads can fragment the memory in such a way that it is hard or impossible to deal with. To handle that, we keep track of not just the free sections in the scratch buffer, but also the total amount of used memory. If a request cannot be satisfied by the scratch buffer because of fragmentation, even though there is enough free space available in total, we’ll create a new scratch file and use that as our new scratch. The old one will eventually be freed when all read transactions are over and all the data has been flushed away.
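Here is a rough sketch of that fallback, under some loud assumptions: sizes are counted in pages, every scratch file has the same fixed size, and all the names (`ScratchFile`, `ScratchPool`, `try_allocate`) are invented for the example. Releasing the old scratch file once readers are done is not modeled, and since I haven't said here what happens when there simply isn't enough free space, the sketch just raises `MemoryError` in that case.

```python
def next_power_of_two(n):
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


class ScratchFile:
    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.next_unused = 0   # start of the space that has never been handed out
        self.free_lists = {}   # size class (power of two) -> list of start pages
        self.used_pages = 0    # total in use, tracked separately from the free lists

    def try_allocate(self, pages):
        size = next_power_of_two(pages)
        bucket = self.free_lists.get(size)
        if bucket:
            start = bucket.pop()           # reuse a freed section of the same class
        elif self.next_unused + size <= self.total_pages:
            start = self.next_unused       # carve from never-used space
            self.next_unused += size
        else:
            return None                    # no section of this size class available
        self.used_pages += size
        return start, size

    def free(self, start, size):
        self.free_lists.setdefault(size, []).append(start)
        self.used_pages -= size


class ScratchPool:
    def __init__(self, pages_per_file):
        self.pages_per_file = pages_per_file
        self.files = [ScratchFile(pages_per_file)]

    def allocate(self, pages):
        current = self.files[-1]
        result = current.try_allocate(pages)
        if result is None:
            free_pages = current.total_pages - current.used_pages
            if free_pages < pages:
                raise MemoryError("scratch file is genuinely full")
            # Enough free space in total, but too fragmented: switch to a new file.
            current = ScratchFile(self.pages_per_file)
            self.files.append(current)
            result = current.try_allocate(pages)
            if result is None:
                raise MemoryError("request does not fit in a single scratch file")
        start, size = result
        return len(self.files) - 1, start, size


pool = ScratchPool(pages_per_file=8)
a, b, c, d = (pool.allocate(2) for _ in range(4))  # fills file 0 with 2-page sections
pool.files[0].free(a[1], a[2])                     # free two non-adjacent sections:
pool.files[0].free(c[1], c[2])                     # 4 pages free, but split in two
print(pool.allocate(3))                            # (1, 0, 4): served from a new file
```

In the usage example, four pages are free in the first file, which is enough for a three-page request on paper, but no single free section can hold it, so the pool moves on to a fresh file.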
Scratch files are marked as temporary and delete on close, so we don’t actually incur a high I/O cost when we create new ones, and it is typically only under a very high workload of both reads and writes that we see the need to create new scratch files. This tends to be drastically cheaper than trying to do compaction, and it actually works in all cases, while compaction can fail in many cases.
You might have noticed an issue with the whole system. We can only move a page from the scratch file to the data file if it was modified by a transaction that is older than the oldest current transaction. That means that a long-running read transaction can stall the entire process. This is typically only a problem when we are seeing very high write usage together with very long read transactions, which pushes the envelope on the size of the scratch buffer while at the same time not allowing us to clean it.
Indeed, when using Voron, you are typically aware of the need to close transactions in a reasonable timeframe. Within RavenDB, there are very few places where a transaction can span a long time (streaming is pretty much the only case in which we’ll allow it, and it is documented that a very long streaming request will push memory usage on the server up, because we can’t clean up behind the transaction). In practice, even transactions that take multiple minutes are fine under moderate write load, because there is enough capacity to handle it.