I talked about the details of Voron in the previous posts: how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered: the various costs of running Voron on production systems, and in particular its competing I/O requirements.
So what do we have with Voron?
- A (potentially very large) memory mapped data file. Writes are buffered, with an fsync once every minute or every 2GB of writes.
- Scratch files (small memory mapped files), marked as temporary and delete-on-close.
- Journal files, which require durable writes (see the sketch after this list for how these three kinds of files might be opened).
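To make those requirements concrete, here is a minimal sketch of how the three kinds of files might be opened on Linux. It is C rather than the C# we actually use, the file names and sizes are made up, and error handling is omitted:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct voron_files {
    int   journal_fd;   /* durable writes (O_DSYNC)    */
    int   data_fd;      /* buffered, memory mapped     */
    void *data_map;
    int   scratch_fd;   /* temporary, deleted on close */
};

/* Illustrative only: names, sizes and error handling are not real. */
static int open_voron_files(struct voron_files *f, size_t data_size)
{
    /* Journal: every write must be durable, so O_DSYNC makes each
       write return only after the data has reached the device. */
    f->journal_fd = open("0000000001.journal",
                         O_WRONLY | O_CREAT | O_DSYNC, 0644);

    /* Data file: plain buffered I/O through a shared mapping;
       durability comes later, from an explicit flush. */
    f->data_fd = open("voron.data", O_RDWR | O_CREAT, 0644);
    ftruncate(f->data_fd, data_size);
    f->data_map = mmap(NULL, data_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, f->data_fd, 0);

    /* Scratch: temporary; unlink it right away so it vanishes when
       closed. Ideally it never has to be written out at all. */
    f->scratch_fd = open("scratch.0000000001.buffers",
                         O_RDWR | O_CREAT, 0644);
    unlink("scratch.0000000001.buffers");

    return 0;
}
```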
In terms of priorities, we want to give the highest priority to the journal files, then to writing the data file (so that writing to disk happens continuously, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.
On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (FlushViewOfFile on Windows, sync_file_range on Linux), but in practice, when we flush the data to disk we also need durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux. Technically speaking, we could do this in two stages, letting the system write back lazily for a while and then calling FlushFileBuffers / fsync, but we haven't found that to be worth the extra complexity, and the sync_file_range documentation is scary.
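In Linux terms, that flush boils down to roughly the following sketch (illustrative C, not the actual code). The first function is the lazy sync_file_range stage we decided to skip; the second is the durable stage, with a trailing fsync thrown in for file metadata as belt and suspenders:

```c
#define _GNU_SOURCE            /* for sync_file_range */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Optional first stage: ask the kernel to start writing the dirty
   range out, without waiting and without any durability guarantee. */
static void start_writeback(int data_fd, off_t offset, off_t length)
{
    sync_file_range(data_fd, offset, length, SYNC_FILE_RANGE_WRITE);
}

/* The durable stage: write the mapping's dirty pages and wait for
   them, then fsync so file metadata is durable as well. */
static int flush_data_file(void *data_map, size_t length, int data_fd)
{
    if (msync(data_map, length, MS_SYNC) != 0)
        return -1;
    return fsync(data_fd);
}
```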
Another aspect that we need to consider is the fact that we are not alone out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running, so we are typically talking about dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instances, each of which is trying to make use of all the system resources by itself. That kind of disharmony can kill the performance of the server, even though each instance would show the best performance in any benchmark where you run just a single instance.
We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accepts flush requests and makes sure that we aren't loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherent in many high-end systems (RAID, cloud, etc.), which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instances the chance to flush in parallel, with adjustments depending on the current I/O load on the system.
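A minimal sketch of that idea (again in C, with an invented fixed limit; the real scheduler also adjusts based on the observed I/O load) is just a counting semaphore capping how many instances may flush at once:

```c
#include <semaphore.h>

#define MAX_CONCURRENT_FLUSHES 4   /* illustrative, not our actual number */

static sem_t flush_slots;

void flush_scheduler_init(void)
{
    sem_init(&flush_slots, 0, MAX_CONCURRENT_FLUSHES);
}

/* Each Voron instance asks the scheduler to flush instead of doing it
   directly. Under load, callers simply wait for a slot, which is the
   "defer flushing to reduce I/O competition" behavior described above. */
void scheduled_flush(void (*flush_instance)(void *), void *instance)
{
    sem_wait(&flush_slots);        /* wait for a free flush slot        */
    flush_instance(instance);      /* e.g. msync + fsync the data file  */
    sem_post(&flush_slots);        /* release the slot                  */
}
```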
Journal writes, however, happen immediately, have high priority and take precedence over data file writes, because they have an immediate impact on the system.
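Putting the pieces together, a commit in this sketch writes the journal durably right on the commit path and only queues the data file flush; commit_transaction and request_flush are hypothetical names used for illustration:

```c
#include <sys/types.h>
#include <unistd.h>

extern int  journal_fd;                      /* opened with O_DSYNC */
extern void request_flush(void *instance);   /* queues a request with the
                                                shared scheduler above */

int commit_transaction(void *instance, const void *pages, size_t len, off_t pos)
{
    /* Durable journal write: with O_DSYNC, pwrite returns only once
       the bytes are on the device, so we can acknowledge the commit. */
    if (pwrite(journal_fd, pages, len, pos) != (ssize_t)len)
        return -1;

    /* The data file flush is deferred: it is merely requested from the
       scheduler and will run when an I/O slot becomes available. */
    request_flush(instance);
    return 0;
}
```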
We are also experimenting with using the operating system's I/O priorities, but that is a bit hard, because most of those facilities are about reducing I/O priority, which we sort of want, but not that much.
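For reference, the Linux knob here is ioprio_set. A sketch of dropping a flushing thread to the lowest best-effort level (rather than all the way to idle, which is more of a reduction than we want) looks roughly like this; the constants mirror linux/ioprio.h, and since glibc doesn't provide a wrapper the call goes through syscall(2):

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Constants mirroring linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_CLASS_BE     2        /* the normal "best effort" class */
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

/* Drop the calling thread to the lowest best-effort level (7).
   IOPRIO_CLASS_IDLE would be the extreme we want to avoid: idle I/O
   can be starved entirely, and a flush still has to finish eventually. */
int lower_io_priority(void)
{
    return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                   IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7));
}
```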