Up to this point, we focused on reading data from the disk, we can do that up to a point. Eventually we’ll run out of memory (assuming that the database is bigger than memory, which is a pretty safe assumption). That means that we need to decide what to remove from memory. When we use mmap(), the OS gets to decide that for us. In fact, that is probably the best argument against using mmap(). The additional control we get from knowing how to manage the memory is the chief reason to take the plunge and manage our own memory.
There are a lot of algorithms around managing memory, I really like this one, because it is quite elegant. However, that requires quite a lot of states to be dealt with, especially when working with highly concurrent systems. Instead, I chose to look at the clock sweep algorithm. This is also implemented by PostgreSQL, and it is actually far simpler to work with. The idea is that for each page, we maintain a usage count. Each time that we need to get a page, we’ll increment its usage count (up to a small limit). Each time we need to evict a page, we’ll search for a page that can be evicted and has no recent usage. If it has usages, we’ll decrement that value and repeat until we find something.
Our buffer management isn’t actually dealing with pages, however. We are working with 2MB chunks, instead. The principal is the same, but using bigger aggregates is advantageous given typical memory sizes these days. The first thing that we need to do is to modify the ChunkMetadata. I’m showing only the relevant changes.
The major change here is the introduction of the usages field. That is a 3 bits field (0 .. 8 in range) and we reduced the number of references a chunk can have to about 500 million (should be sufficient, I believe). The idea here is that each time that we call addRef(), we’ll increment the usages count, like so:
Zig has a nice feature that I’m using here, saturating addition. In other words, if the value is incremented beyond its limit, it is clamped to the limit. That means that I don’t have to worry about overflows, etc. I took a look at how this is implemented, and the compiler generates the following assembly for this code (x +| 100) :
This may look daunting, but I’ll break it down. First, we add the 100 to the value, as you can expect (the value is currently in the EAX register). Then we store –1 (value of 0xFFFFFFFF) in the ECX register. Finally, we use the CMOV instruction (the CMOVB in the snipper is a variant of CMOV), telling it to store ECX in EAX if the carry flag is set on the last addition instruction. For fun, this also avoids a branch in the code, which is great for performance.
One of the critical functions that we need to consider here is the behavior of the pager when we are accessing rare pages once, and then never again. A really common scenario is when we are doing a scan of some data for a rare query. For that reason, the usages behavior is a bit more complex than one might imagine. Let’s explore this for a bit before moving on. Look at the following code, I marked the important lines with stars:
When we mark a chunk as loaded, we copy the usages from the current record. That starts out as zero, so for the scenario of accessing rare pages, we’ll have a good reason to evict them soon. However, there is a twist, when we remove a chunk from memory, we also set its usage count to 1. That is an interesting issue. The chunk is not loaded, why does it have a usage count? That is because if we removed it from memory, and we load it again, we want it to start with a higher usage count (and less chance to be evicted). In this manner, we are somewhat simulating the 2Q algorithm.
Now, let’s take a look at the actual reclaiming portion, shall we? In the chunk metadata, we have the following behavior:
If we have a value and no outstanding references, we can reclaim it, if we don’t have a value, we’ll reduce the usage count anyway. The idea is that when we unload a page, we’ll set the usages count to 1. The usage counter of an unloaded page will reset this as time goes by, so if there has been a sweep on an empty chunk that has a usage count, we’ll not count that for the next time we load it. Basically, after unloading a chunk, if we reload it soonish, we’ll keep it around longer the next time. But if it is only rarely loaded, we don’t care and will forget that it was loaded previously.
The process for actually reclaiming a chunk is shown here:
We look at a particular index in the chunks, and check if we can claim that. Following out previous behavior, there are a bunch of options here. We can either have an empty chunk that remembers the previous usage, which we can reduce, or we can actually try to reclaim the chunk. Of course, that isn’t that simple, because even if we found a candidate for reclaiming, we still need to reduce its usages count. Only if it has no usages will we be able to actually start the removal process. Finally, we have the actual scanning of the chunks in the file, shown here:
We scan the file from the provided start index and try to reclaim each chunk in turn. That may reduce the usages count, as we previously discussed. The actual process will continue until we have found a chunk to reclaim. Note that we are always claiming just a single chunk and return. This is because the process is repeatable. We start from a given index and we return the last index that we scanned. If the caller needs to free more than a single chunk, they can call us again, passing the last index that we scanned.
That is why this is called the clock sweep algorithm. We are sweeping through the chunks that we have in the system, reaping them as needed. The code so far is all in the same FileChunks instance, but the Pager actually deals with multiple files. How would that work? We start by adding some configuration options to the pager, telling us how much memory we are allowed to use:
We have both soft and hard limits here, because we want to give the users the ability to say “don’t use too much memory, unless you really have to”. The problem is that otherwise, users get nervous when they see 99% memory being used and want to keep some free. The point of soft and hard limits is that this gives us more flexibility, rather than setting a lower than needed limit and getting memory errors with GBs of RAM to spare.
In the Pager, we have the loadChunksToTransaction() that we looked at in the previous post. That is where we read the chunk from the file. We are going to modify this method so will reserve the memory budget before we actually allocate it, like so:
As you can see, we reserve the memory budget, then actually allocate the memory (inside markLoading()). If there is a failure, we release the budget allocation and report the error. To manage the memory budget, we need to add a few fields to the Pager:
You can see that releaseChunkMemoryBudget() is pretty trivial. Simply release the memory budget and move on. Things are a lot more complex when we need to reserve memory budget, however. Before we dive into this, I want to talk a bit about the filesRelcaimSweeps field. That is an interesting one. This is where we’ll keep the last position that we scanned in the Pager (across all pages). However, why is that an array?
The answer is simple. The Pager struct is meant to be used from multiple threads at the same time. Under memory pressure, we are likely to need to evict multiple chunks at once. In order to avoid multiple threads scanning the same range of the Pager to find chunks to remove, I decided that we’ll instead have several sweeps at the same time. On startup, we’ll initialize them to a random initial value, like so:
In this manner, under load, each thread is likely to scan an independent portion of the Pager’s memory, which should avoiding competing on the same memory to evict. And with that behind us, let’s see how we can actually use this to evict memory:
There is a lot of code here, and quite a few comments, because this is choke-full of behavior. Let’s dissect this in detail:
We start by checking the memory budget and see if we are below the soft memory limits. We are doing this optimistically, but if we fail, we reset the state and then start to scan the Pager for chunks we can release. We do that by accessing one of the filesRelcaimSweeps values using the current thread id. In this way, different threads are likely to use different values and move them independently.
We find the relevant file for the index and start scanning for chunks to release. We’ll stop the process when one of the following will happen:
- We released a chunk, in which case we are successful and can return with glory.
- We didn’t find a chunk, but found candidates (whose usage count is too high to discard).
In the case of the second option, we’ll look for better options before getting back to them and discarding them.
If we are completely unable to find anything to release, we’ll check if we exceed the hard memory limit and error, or just accept the soft limit as, well… soft limit and allocate the budget anyway.
This happens as part of the process for loading chunks in the pager, so we’ll only need to release a single chunk at a time. For that reason, we remember the state so the next operation will start from where we left off. You can think about this as a giant set of hands that are scanning the range of chunks in memory as needed.
There are actually a few things that we can implement that would make this faster. For example, we always scan through all the chunks in a file. We could try to maintain some data structure that will tell us which pages have usage count to consider, but that is actually complex (remember, we are concurrent). There is also another factor to consider. The ChunkMetdata is 64 bits in size, and a FilesChunk struct contains an array of 4096 such values, totaling 32KB in size. It is actually cheaper to scan through the entire array and do the relevant computation on each candidate than try to be smart about it.
I think that this is it for now, this post has certainly gone for quite a while. In the next post in the series, I want to tackle writes. So far we only looked at reads, but I think we have all the relevant infrastructure at hand already, so this should be simpler.
More posts in "Implementing a file pager in Zig" series:
- (24 Jan 2022) Pages, buffers and metadata, oh my!
- (21 Jan 2022) Write behind implementation
- (19 Jan 2022) Write behind policies
- (18 Jan 2022) Write durability and concurrency
- (17 Jan 2022) Writing data
- (12 Jan 2022) Reclaiming memory
- (11 Jan 2022) Reading from the disk
- (10 Jan 2022) Managing the list of files
- (05 Jan 2022) Reading & Writing from the disk
- (04 Jan 2022) Rethinking my approach
- (28 Dec 2021) Managing chunk metadata
- (27 Dec 2021) Overall design
- (24 Dec 2021) Using mmap
- (23 Dec 2021) What do we need?