In the previous post I outlined some ideas about how to implement a more efficient write behind. The idea is that whenever we write pages to the Pager, we’ll not trigger an immediate write to the disk. Instead, we’ll keep the data in memory and only write to the disk when we hit a certain threshold. In many write scenarios, there are certain pages that are modified a lot (like the root page in a B+Tree) and pages that are modified rarely (a leaf page that got entries and will not be modified again). There is no point in writing the popular page to the disk, we’ll likely get another write to them shortly anyway. That calls to a Least Frequently Modified approach.
We don’t need to use a more complex approach, (like the Clock Sweep algorithm we use for the Pager), because we don’t have to deal with the same scenarios. There are not likely to be cases similar to scans, which throws a lot of complexities of buffer pool implementations. Writes operations are far more predictable in general and follow a pretty strict power law distribution. The task is simple: we have the list of pages that were modified, and at capacity, we’ll select some to send to the disk. The question is how to make that decision.
The simplest option is to go with the least recently used model. That is trivial to implement, The idea is that we have the following sequence of writes (assuming we have a capacity of 4 pages):
1, 2, 3, 2, 3, 1, 2, 3, 4, 2, 1, 2, 3, 4, 4, 2, 1, 6
In this case, the page that will be selected for eviction is #3, since it wasn’t modified the longest. The other alternative is to use least frequently used, in which case we have the following frequency table:
In this case, we’ll want to select page #4 for eviction. Since it is the one least used. (We don't consider #6 because it is the one we just inserted). I can make arguments for both sides, to be frank. It makes sense that the least frequently used is going to be the most relevant, right? The problem is that we need to also account for decaying usage over time. What do I mean by this?
We may have a page that is very hot, it gets used a lot for a certain period of time. After that point, however, it is no longer being written to, but because it was frequently used, it will take a long time to evict from the pool. A good example of such a scenario is when we have a B+Tree and we are inserting values in ascending orders. All the values for the tree are going to be placed in the same page, so if we have a lot of inserts, that page is going to be hot. Once it is full, however, we’ll start using another page as the target and then the page will reside in memory until some other page will have more usage. A good discussion of least frequency used implementation is in this blog post.
A nice way to deal with the issue of decaying priorities over time in an LFU setup is to use the following formula to compute the priority of the pages:
The idea is that we compute the priority of the page based on the last access, so we are very close to the most recently used option. However, note that we compute the distance between accesses to the page. A page that is infrequently accessed will have low usages and a high delta sum. That will reduce its priority. Conversely, a page that is heavily used will have a low delta sum and high usage, so its value will be near the top.
Another option is to go with another clock sweep option. In this case, we use a simple least recently used model, but we keep count of the frequency of usages. In that case, if we have to evict a page that is heavily used, we can reduce its usages and give it another chance. The advantage here is that this is a far simpler model to work with, but gives roughly the same results. Another option we have is to use the midpoint insertion LRU.
There is also another consideration to take. The I/O cost isn’t linear. If I’m writing page #3 to disk, it is basically free from my perspective to write nearby pages. It is the same exact cost, after all, so why not do that?
We’ll need to write our own doubly linked list. The Zig’s standard library only contains a single linked list. It doesn’t take long to write such a data structure, but it is fun to do so. I absolutely get why implementing linked lists used to be such a common practice in interviews:
There isn’t really much to discuss here, to be honest. There is a bunch of code here, but it is fairly simple. I just had to implement a few operations. The code itself is straightforward. It is a lot more interesting when we see it being used to implement the LRU:
The push() method is where it all happens. We have two options here:
- We have a page already inside the LRU. In that case, we increment its usage counter and move it to the front of the list.
- This is a new page, so we have to add it to the LRU. If there is enough capacity, we can just add it to the front of the list and be done with it.
However, things get interesting when we are at capacity. At that point, we actually need to select a page to evict. How can we do that? We scan the end of the list (the oldest page) and check its usage. If it has more than a single usage, we half its usage counter and move it to the front. We continue to work on the tail in this manner. In essence, high usage counter will get reset rather quickly, but this will still give us a fairly balanced approach, with more popular pages remaining in the pool for longer.
When we evict a page, we can return it back to the caller, which can then write it to the disk. Of course, you probably don’t want to just write a single page. We need to check if we have additional pages nearby, so we can consolidate all of them at once to the disk.
I’ll touch on that in my next post.
More posts in "Implementing a file pager in Zig" series:
- (24 Jan 2022) Pages, buffers and metadata, oh my!
- (21 Jan 2022) Write behind implementation
- (19 Jan 2022) Write behind policies
- (18 Jan 2022) Write durability and concurrency
- (17 Jan 2022) Writing data
- (12 Jan 2022) Reclaiming memory
- (11 Jan 2022) Reading from the disk
- (10 Jan 2022) Managing the list of files
- (05 Jan 2022) Reading & Writing from the disk
- (04 Jan 2022) Rethinking my approach
- (28 Dec 2021) Managing chunk metadata
- (27 Dec 2021) Overall design
- (24 Dec 2021) Using mmap
- (23 Dec 2021) What do we need?