The FASTER is a fast key-value store from Microsoft. It’s paper was published a while ago and now that the source is available, I thought that I would take a peek. You might say that I have a professional interest in such a project .
I decided to read the paper first before reading the code, because I want to figure out what isn’t in the paper. The implementation details beyond the abstract are what I find most interesting, to be honest. But if I’m going to read the paper, I’m also going to add some comments on it. Please note that the comments are probably not going to make much sense unless you read the paper.
The Epoch Basic section explains how they use a global epoch counter with a thread local value and a shared table that marks what epochs are visible to each thread. They use this to provide synchronization points, I assume (don’t know yet). This resonates very strongly with how LMDB’s global transaction table operates.
I like the concept of the drain list which is executed whenever an epoch become safe. I would assume that they use that to let the clients know that their operation was accepted and what was its state.
I wasn’t able to figure out what they use the tag for in the hash bucket entry. I think that the way it works is that you have K hash buckets and then use the first K bits to find the appropriate bucket, then scan for a match on the last 15 bits. I’m not really sure how that work with collisions, though. I assume that this will be clearer when I get to the code. I like the two phase scan approach to ensure that you have no conflicts when updating an entry.
The paper keeps repeating the speed numbers of 160 millions ops/sec and talking about 20 millions ops / sec as being “merely”. Just based on my reading so far, I don’t see how this can happen. What is interesting to me is what is the definition of ops. Is it something like incrementing the same (small) set of counters? If that is the case, than the perf numbers both make more sense and are of less interest. Typically when talking about ops / sec in such scenarios we talk about inserts / updates to individual documents / rows / objects. Again, I don’t know yet, but that is my thinking so far.
One thing that I find sad is that this isn’t a durable store. A failure in the middle of operations would cause some data to be completely lost. It seems like they have support for checkpoints, so you don’t lose everything. However, I’m not sure how often that happens and the benchmarks they are talking about were run without it. Interestingly enough, the benchmarks were run without garbage collection. I haven’t gotten to the discussion on that yet, so I’m not exactly what that means Another missing feature here is that there is no support for atomicity. You cannot ensure that two separate operations will run as a single atomic step.
The benchmark machine is 28 cores with 256GB RAM and 3.2 TB NVMe drive. This is a really nice machine, but from the get go I can tell that this is not going to be a fair comparison to most other storage engines. Faster is explicitly designed to work mostly in memory and with high degree of parallelism. This is great, but it gives us some important features (atomic batches and durability, also known as transactions). The data size they tested are:
- 250 million records with 8 bytes keys & 8 bytes values – Just under 4GB in total.
- 250 million records with 8 bytes keys & 100 bytes values – Just under 32GB in total.
I’m not sure why they think that this is going to provide larger than memory setup. In particular, they mention a memory budget of 2GB, but I assume that this is just internal configuration. There is also going to be quite a lot of memory cached in the OS’ page cache, for example, and I don’t see anything controlling that. Maybe I’ll when I’ll go through the code, though.
Okay, the garbage collection they refer to is related to how they compact the log. They use an interesting approach where they just discard it at a some point, and any pointer to the removed section is considered to be deleted automatically. That is likely to be very efficient, assuming that you don’t care about older data.
All in all, I feel that I have a vague understanding on how Faster works and a lot better grasp on what it does and how it is utilized. I’m looking forward to diving into the code.
More posts in "Reviewing FASTER" series:
- (06 Sep 2018) Summary
- (05 Sep 2018) When the data hits the disk
- (04 Sep 2018) Reading data from disk
- (03 Sep 2018) The hash structure
- (31 Aug 2018) Working with the file system
- (30 Aug 2018) Digging into the C++ impl
- (29 Aug 2018) Let’s check these numbers
- (28 Aug 2018) Let’s start with managed code
- (27 Aug 2018) Reading the paper