Implementing a file pager in Zig: Write durability and concurrency
In the last blog post I presented the manner in which the Pager can write data to disk. Here is a reminder:
We acquire the writer (and a lock on it), call write() for each page we want to modify (passing the buffer that holds its new contents), and finalize the process by calling flushWrites() to actually write the data to disk. As a reminder, we assume that the caller of the Pager is responsible for coordination: while we are writing to a specific page, it is the responsibility of the caller to ensure that there are no reads to that page.
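In code, that flow looks roughly like this. This is an illustrative sketch only: flushWrites() is the call from the previous post, while Pager, acquireWriter() and release() are assumed names standing in for the rest of the API.

```zig
// Sketch of the write flow, not the actual Pager implementation.
fn updatePage(pager: *Pager, page_num: u64, buffer: []const u8) !void {
    var writer = try pager.acquireWriter(); // takes the write lock
    defer writer.release(); // the lock is held until we are done

    // deposit the new contents of the page in the Pager's cache
    try writer.write(page_num, buffer);

    // actually send the pending pages to the disk and wait for completion
    try writer.flushWrites();
}
```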
The API above is intentionally simplistic; it doesn’t give us a lot of knobs to play with. But it is sufficient to do some fairly sophisticated things. One of the interesting observations is that we split the process of updating the data file into discrete steps. There is the part in which we update the in-memory data, which allows other threads to immediately observe it (since they’ll read the new details from the Pager’s cache). Separately, there is the portion in which we write to the disk. The reason I built the API in this manner is that it gives me the flexibility to decide when and how each of those steps actually happens.
Here are some of the things that I can do with the current structure:
- I can decide not to write the data to the disk. If the amount of modified pages is small (very common if I’m continuously modifying the same set of pages) I can skip the I/O costs entirely and do everything in memory.
- Flushing the data to disk can be done in an asynchronous manner. In fact, it is already done in an asynchronous manner, but we are waiting for it to complete. That isn’t actually required.
The way the Pager works, we deposit the writes in the Pager, and at some future point the Pager will persist them to disk. The durability aspect of a database does not rely on the Pager; it is usually a property of the Write Ahead Log.
If I wanted to implement a more sophisticated approach for writing to the disk, I could implement a least recently used cache for the written pages. When the number of modified pages in memory exceeds a certain size, we’ll start writing the oldest ones to disk. That keeps the most used pages in memory and avoids needless I/O. At certain points, we can ask the Pager to flush everything to disk; that gives us a checkpoint, where we can safely trim the Write Ahead Log. A good place to do that is whenever we reach the file size limit of the log and need to create a new one.
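To make the idea concrete, here is a rough sketch of such a write-behind policy. Everything here is hypothetical: the names, the fixed threshold, and the simple first-modified-first-flushed order that stands in for a real LRU.

```zig
const max_dirty = 128; // assumed threshold for this sketch

const DirtyPage = struct {
    page_num: u64,
    buffer: []u8,
};

const WriteBehind = struct {
    // oldest modification first, newest last; a real LRU would also move a
    // page to the back whenever it is touched again
    pages: [max_dirty + 1]DirtyPage = undefined,
    len: usize = 0,

    // Record a modified page. If we crossed the threshold, return the oldest
    // page, which the caller should now write to disk.
    pub fn deposit(self: *WriteBehind, page: DirtyPage) ?DirtyPage {
        self.pages[self.len] = page;
        self.len += 1;
        if (self.len <= max_dirty) return null;

        const oldest = self.pages[0];
        var i: usize = 0;
        while (i + 1 < self.len) : (i += 1) {
            self.pages[i] = self.pages[i + 1];
        }
        self.len -= 1;
        return oldest;
    }
};
```

The checkpoint mentioned above would then simply drain whatever is still held in that structure and flush it to disk in one go.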
So far, by the way, you’ll notice that I’m not actually talking about durability, just writing to the disk. The durability aspect is coming from something we did long ago, but didn’t really pay attention to. Let’s look at how we are opening files, shall we:
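I’m not reproducing the actual snippet from the earlier post here, but the shape of it is roughly this. This is a sketch targeting the Zig 0.9-era std.os API; the flag values are the x86-64 Linux ones, and openDataFile is a made-up helper name.

```zig
const std = @import("std");

const O_RDWR: u32 = 0o2;
const O_CREAT: u32 = 0o100;
const O_DSYNC: u32 = 0o10000; // every write is durable before it completes
const O_DIRECT: u32 = 0o40000; // bypass the OS buffer pool, we have our own

fn openDataFile(path: []const u8) !std.os.fd_t {
    return std.os.open(path, O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0o644);
}
```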
Take a look at the flags that we pass to the open() command. We are asking the OS to use direct I/O (bypassing the buffer pool, since we’ll use our own) as well as the DSYNC write mode. The two together mean that the write will skip any buffering / caching along the way and hit the disk in a durable manner. The fact that we are using async I/O means that we need to ensure that the buffers we write are not modified while we are saving them. As the API currently stands, there is a strong boundary for consistency: we acquire the writer, write whatever pages we need, and flush immediately.
Reaching higher performance levels would require a more complex system. The issue is that in order to do that, we have to give up a measure of control. Instead of knowing exactly when something will happen, we can take a more sophisticated approach, but we’ll need to accept that we don’t really know at which point the data will be persisted.
At this point, however, there is a good reason to ask: do we even need to write durably? If we limit the consistency of the data to specific points requested by the caller (such as when we replace the Write Ahead Log), we can just call fsync() at the appropriate times, no? That would allow us to use buffered writes for most of our I/O.
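For illustration, that alternative would look something like the sketch below, where data_files is a hypothetical list of the descriptors the Pager keeps open.

```zig
const std = @import("std");

// Buffered writes during normal operation, durability only at a checkpoint.
fn checkpoint(data_files: []const std.os.fd_t) !void {
    for (data_files) |fd| {
        // one fsync() call per data file
        try std.os.fsync(fd);
    }
}
```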
I don’t think that this would be a good idea. Remember that we are using multiple files. If we use buffered I/O and fsync(), we’ll need to issue multiple fsync() calls, which can be quite expensive. It also means higher memory usage on the system, because the file system cache will hold copies of data that we have already decided doesn’t need to stay in memory. It is simpler to use direct I/O for the whole thing, after all.
In the next post, I’m going to show how to implement a more sophisticated write-behind algorithm and discuss some of the implications of such a design.
More posts in "Implementing a file pager in Zig" series:
- (24 Jan 2022) Pages, buffers and metadata, oh my!
- (21 Jan 2022) Write behind implementation
- (19 Jan 2022) Write behind policies
- (18 Jan 2022) Write durability and concurrency
- (17 Jan 2022) Writing data
- (12 Jan 2022) Reclaiming memory
- (11 Jan 2022) Reading from the disk
- (10 Jan 2022) Managing the list of files
- (05 Jan 2022) Reading & Writing from the disk
- (04 Jan 2022) Rethinking my approach
- (28 Dec 2021) Managing chunk metadata
- (27 Dec 2021) Overall design
- (24 Dec 2021) Using mmap
- (23 Dec 2021) What do we need?
Comments
Question: what about backups - what is needed to get a consistent backup/snapshot with page writes happening in async manner? Fsync the data to some point in time (in the WAL) and then disable page writes for the duration of backup?
Rafal,
Backup is an entirely separate topic, mostly because you have to consider what kind of backup you want. For example, offline backup is easy: do an orderly shutdown and copy the files. For online backup, you have to consider what is going to happen with running transactions. I assume you want a consistent snapshot of the data. In this case, this isn't something that you do at the pager level, but above it. You need some manner of orchestrating things.
That said, a really simple backup mechanism is to do the following:
- Take a snapshot of the transaction id that was persisted to disk.
- Ensure that the WAL will retain all transactions from that point forward.
- Copy the file (as is, we don't care if there are ongoing writes).
- Copy all the transactions in the WAL from the start of the backup.
On restore, you copy the file, then re-apply all the stored transactions. You are guaranteed to have a consistent result with the _last committed transaction as of the end of the backup_. On large databases, that can be quite important (backing up a TB database can take many hours).