I think that the very first thing that we want to do is to actually discover how exactly is leveldb saving the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.
We start from the client code:
This calls the following method:
I included the Delete method as well, because this code teaches us something important, all the modifications calls are always going through the same WriteBatch call. Let us look at that now.
Now we have a lot of code to go through. Let us see what conclusions we can draw from this.
The first 15 lines or so seems to create a new Writer, not sure what that is yet, and register that in a class variable. Maybe it is actually being written on a separate thread?
I am going to switch over and look at that line of thinking .First thing to do is to look at the Writer implementation. This writer looks like this:
So this is just a data structure with no behavior. Note that we have CondVar, whatever that is. Which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deep into this, but it appears like it is similar to the .NET lock variable. Except that there seems to be the ability to associate multiple variables with a single mutex. Which could be a useful way to signal on specific conditions. The basic idea is that you can wait for a specific operation, not just a single variable.
Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, (a data structure merging properties of list & queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look where else this value is being used now. Back now, and it appears that the only place where writers_ are used in in this method or methods that it calls.
What this means in turn is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access. As I understand it. Calls to the Write() method would block on the mutex access, then it waits until its write is the current one (that is what the &w != writers_.front() means. Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.
Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room to the next write. I don’t really follow the code there yet, I’ll ignore that for now and move on to the rest of the Write() method.
In line 18, we get the current sequence number, although I am not sure why that is, I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup, this method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.
The rest of the code is dealing with the actual write to the log / mem table 20 – 45, then updating the status of the other writers we might have modified, as well as starting the writes for existing writers that may have not got into the current batch.
And I think that this is enough for now. We haven’t got to disk yet, I admit, but we did get a lot of stuff done. On my next post, I’ll dig even deeper, and try to see how the data is actually structured, I think that this would be interesting…