Here we are starting to get into the interesting bits. How do we actually write to disk. There are two parts of that. The first part is the log file. This is were all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.
Let us see how this actually works. There are two classes which are involved in this manner. leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.
Here is what it looks like:
Pretty simple, overall. There is the buffering requirement, but that is pretty easy overall. Note that this is a C++ interface. There is a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is Posix code that I am reading, and I have to do a lot of lookup into the man pages. The implementation isn’t that interesting, to be fair, and full of mmap files on posix minutia. So I am going to skip it.
I wonder why the choice was map to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Apend() just writes the values to the mmap file, flush is a no op, and Sync() actually ask the file system to write the values to disk and wait on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads) and that gives leveldb the benefit of using the OS memory manager as the buffer.
Now that we got what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick, it has a single public method:
As a remind, those two are used together in the DBImpl::Write() method, like so:
From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:
Let us see if we do a lot here. But I don’t know yet what is going on. From the first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to enter zero length records (no idea what that is for?maybe compactions?).
It appears that we write in blocks of 32Kb at a time. Line 12 – 21 are dealing with how to finalize the block when you have no more space. (Basically fill in with nulls).
Lines 26 – 40 just set the figure out what the type of the record that we are going to work (a full record, all of which can sit in a single buffer, a first record, which is the start in a sequence of items or middle / end, which is obvious).
And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?
I looked at EmitPhysicalRecord to see what we have there and it is nothing much, it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I’ll start looking at the reading portion of the code. But that will be in another post.