Voron internals: I/O costs analysis
I talked about the details of Voron in the previous posts, how it handles journaling, MVCC and cleaning up after itself. In this post, I want to focus on another aspect that needs to be considered, the various costs of running Voron on production systems. In particular, the competing I/O requirements.
So what do we have with Voron?
- A (potentially very large) memory mapped data file. Buffered writes and fsync once every 1 minute / 2GB.
- Scratch files (small memory mapped files) marked as temporary and delete on close.
- Journal files requiring durable writes.
In terms of priorities, we want to give high priority to the journal files, then to writing to the data file (so it will happen all the time, not just when we call fsync). Scratch files should only be written to disk under memory pressure, and we should strive to avoid that if possible.
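To make the "fsync once every 1 minute / 2GB" rule for the data file concrete, here is a minimal sketch in Python. It assumes the two limits are combined as "whichever comes first", which the text above does not spell out. The `DataFileSyncState` class and its method names are hypothetical and are not Voron's actual API.

```python
from dataclasses import dataclass

SYNC_INTERVAL_SECONDS = 60            # "once every 1 minute"
SYNC_BYTES_THRESHOLD = 2 * 1024 ** 3  # "2GB" written since the last sync


@dataclass
class DataFileSyncState:
    """Tracks activity since the data file was last synced (hypothetical helper)."""
    last_sync_time: float
    bytes_since_sync: int = 0

    def record_write(self, num_bytes: int) -> None:
        # Buffered writes only bump the counter; nothing is synced here.
        self.bytes_since_sync += num_bytes

    def should_sync(self, now: float) -> bool:
        # Assumption: sync when either limit is reached, whichever comes first.
        return (now - self.last_sync_time >= SYNC_INTERVAL_SECONDS
                or self.bytes_since_sync >= SYNC_BYTES_THRESHOLD)

    def mark_synced(self, now: float) -> None:
        # Reset both limits after a successful sync.
        self.last_sync_time = now
        self.bytes_since_sync = 0
```

The caller passes in the current time, which keeps the policy easy to test and leaves the actual flushing to whoever owns the file.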
On both Windows and Linux, there are ways to ask the system to start flushing the data to disk (Windows uses FlushViewOfFile, Linux uses sync_file_range). In practice, though, when we flush the data to disk we also need to ensure durability, so we call FlushViewOfFile + FlushFileBuffers on Windows and msync(MS_SYNC) on Linux. Technically speaking, we could do this in two stages: let the system write the data out lazily for a while, then call FlushFileBuffers / fsync. We haven’t found the benefit to be worth the added complexity, and the sync_file_range documentation is scary.
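As a rough illustration of the "flush the view, then make it durable" sequence, here is a small Python sketch. It is not Voron's code. It relies on CPython's `mmap.flush()`, which is implemented with msync on Unix and FlushViewOfFile on Windows, followed by `os.fsync()` on the underlying file. The function names and the 4 KB file size are made up for the example.

```python
import mmap
import os
import tempfile


def write_and_flush(path: str, offset: int, payload: bytes) -> None:
    """Write payload through a memory map, then force it to stable storage."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:
            view[offset:offset + len(payload)] = payload
            # Step 1: write the dirty pages of the mapping back to the file.
            view.flush()
        # Step 2: ask the OS to persist the file contents to the device.
        os.fsync(f.fileno())


def demo() -> bytes:
    """Create a small pre-sized file, write into it, and read the bytes back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * 4096)  # a mapping needs a non-empty file
        os.close(fd)
        write_and_flush(path, 100, b"voron")
        with open(path, "rb") as f:
            f.seek(100)
            return f.read(5)
    finally:
        os.remove(path)
```

Calling `demo()` returns the five bytes that were written at offset 100.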
Another aspect that we need to consider is the fact that we are not alone out there. A typical RavenDB database will have multiple Voron instances running, and a typical RavenDB server will have multiple RavenDB databases running. So we are talking about typically having dozens or more Voron instances in a single process. We need to avoid a conflict between all of those instances, each of which is trying to make use of all the system resources by itself. This kind of disharmony can kill the performance of the server, all the while giving the best performance in any benchmark where you are running a single instance.
We solved this by having a single actor responsible for scheduling the flushing of all the Voron instances inside a process. It accepts flush requests and makes sure that we aren’t loading the I/O system too much. This means that we might actually defer flushing to disk under load, but in practice, reducing the I/O competition is going to improve throughput anyway, so that is likely to be better in the end. At the same time, we want to take advantage of the parallelism inherent in many high end systems (RAID, cloud, etc) which can handle a lot of IOPS at the same time. So the policy is to give a certain number of Voron instances the chance to run in parallel, with adjustments depending on the current I/O load on the system.
Journal writes, however, happen immediately, have high priority and should take precedence over data file writes, because they have immediate impact on the system.
We are also experimenting with using the operating system I/O priorities, but that is a bit hard, because most of those are about reducing the I/O priority. Which we sort of want, but not that much.
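The idea of a single, process-wide gate for data file flushes can be sketched as below. This is a simplified illustration and not the real scheduler: it uses a fixed concurrency limit, while the post describes adjusting the limit based on the current I/O load. `FlushCoordinator`, `peak_running` and `demo` are hypothetical names, and the "flush" is simulated with a short sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FlushCoordinator:
    """Process-wide gate limiting how many data-file flushes run at once."""

    def __init__(self, max_concurrent_flushes: int) -> None:
        self._slots = threading.Semaphore(max_concurrent_flushes)
        self._lock = threading.Lock()
        self._running = 0
        self.peak_running = 0  # highest concurrency observed, for inspection

    def flush(self, flush_fn) -> None:
        # Block until a slot is free, so excess requests are deferred.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak_running = max(self.peak_running, self._running)
            try:
                flush_fn()
            finally:
                with self._lock:
                    self._running -= 1


def demo(instances: int = 12, limit: int = 3):
    """Simulate many storage instances requesting a flush at the same time."""
    coordinator = FlushCoordinator(limit)
    flushed = []
    flushed_lock = threading.Lock()

    def make_flush(instance_id: int):
        def fake_flush() -> None:
            time.sleep(0.01)  # stand-in for the real flush + fsync work
            with flushed_lock:
                flushed.append(instance_id)
        return fake_flush

    with ThreadPoolExecutor(max_workers=instances) as pool:
        for i in range(instances):
            pool.submit(coordinator.flush, make_flush(i))
    return coordinator.peak_running, sorted(flushed)
```

Every instance still gets flushed, but never more than `limit` of them at the same moment. Under load, the rest simply wait their turn.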
More posts in "Voron internals" series:
- (13 Sep 2016) The diff is the way
- (07 Sep 2016) Reducing the journal
- (31 Aug 2016) I/O costs analysis
- (30 Aug 2016) The transaction journal & recovery
- (29 Aug 2016) Cleaning up scratch buffers
- (26 Aug 2016) MVCC - All the moving parts
Yes, global IO scheduling can be extremely beneficial. Mostly, when you manage to preserve sequential IO. For random IO usually more is better except for latency.
SQL Server does not schedule IO and the resulting latency is unfathomably bad. A backup operation can easily increase latency 100x.
Reducing critical resource management to a single coordinator is quite nice - competing for I/Os can cause serious stalls. Only downside here: as an administrator I know how the hardware is set up - on a system with several disks I can provide high parallel throughput for I/Os. Setting up multiple databases and distributing them across the disks (manually, by assigning locations) might still let one of these disks (through slow responses) stall I/O for the other disks, right? Providing horizontal scalability for databases (or even a single database) on one server might be quite nice.
In general it's not uncommon to distribute load across multiple disks (on Hyper-V with SAN multiple paths to the storage will be used concurrently, where on the storage multiple LUNs will be used). If there's a critical path on the journal files, a large database server might benefit from distribution [at least per database] of these files across multiple disks (not raid, distinct access paths from filesystem to disk).
Daniel, there is a balance here between the level of complexity and the amount of flexibility to offer. We already allow users to have journals and data on different paths, and we have configuration that controls how many concurrent flushes we'll accept. The data file is a single file, so having it on multiple volumes pretty much requires RAID and hardware level parallelism.