There is a great article discussing how SQLite is handling transactions at fly.io. Which led me to the great documentation on the WAL mode for SQLite. And that led me to think about the differences between how SQLite does it and how Voron does it.
Both SQLite and Voron share asame behavior, they use Copy on Write and make the modifications for the pages in the database on a copy of the data. That means that readers can continue to operate with zero overhead, reading the stable snapshot of their data.
However, SQLite works by copying the data to the WAL file directly and modifying it there. Voron doesn’t use this approach. Instead, we have the notion of scratch space where this is done. Look at the figure below, which showcase the difference between the databases:
In SQLite, any modifications are written to the WAL file and operated on there. When you want to commit a transaction in SQLite, you’ll compute the checksum of all the pages modified in the transaction and write a commit record to the disk, at which point you’ll need to issue an fsync() call.
Voron, on the other hand, will copy the data that is modified in the transaction into scratch space (essentially, just some memory we allocated). On commit, it will not write the data to the WAL. Instead, it will take the following actions:
- Compute a diff of the current state of the page compared to its initial state, writing only the modifications.
- Compress the resulting output.
- Compute a checksum of all the pages that were modified.
- Write the compressed output and the checksum as a single write call.
Voron opens the file with O_DIRECT | O_DSYNC, the idea is that we don’t need to call fsync() at any stage, and we significantly reduce the number of system calls we have to make to commit a transaction.
Other transactions, at the same time, will use the data in the scratch space, not the WAL, to read the current state of pages. Voron also supports MVCC, so you may have multiple copies of the data in memory at once (for different transactions).
Voron is able to significantly reduce the total amount of I/O we have to use for writes, because we only write the changes in the data between page versions and even that is compressed. We typically can safely trade off the additional CPU work in favor of I/O costs and still come up ahead.
Another reason we went with this route is that we use memory mapped files, and on Windows, those aren’t coherent with file I/O calls. That means that mixing reading via mmap() and writing via file I/O (which is what we want to do to avoid fsync() calls) wouldn’t really work. Voron also benefits from not having to deal with multiple processes running at the same time, since it is typically deployed from within a single process.
Finally, the fact that we use scratch spaces separately from the WAL means that we put that somewhere else. You can have a fast empheral disk (common on the cloud) for scratch files, very fast (but small) disk for the WAL journal and standard disk for the actual data. This sort of configuration gives you more choices on how to deal with the physical layout of your data.