Reviewing HyperLevelDB–Compactions
As mentioned earlier, HyperDex has made some changes to LevelDB to make it work faster for their scenarios. I was curious to see what changed, so I took a look at the code.
The fun part is that I can easily check just the things that were changed by HyperDex, I don’t need to go read and understand a completely new engine. I’m currently just looking at the logs, and noting important stuff.
A lot of the work appears to have been concentrated on the notion of compaction. In particular, compaction means that we need to merge multiple files into a single file at a higher level. That is great, until you realize that this may mean that you’ll be writing a lot more data than you though you would, because you keep copying the data from one file to another, as it bubbles deeper into the levels.
The first thing that seemed to have happened is that when you get to a deep enough level, the file sizes that you are allowed to have become much larger, that means that you’ll have more data at higher level and likely reduce the number of times that you need to compact things.
The next step in the process was to create two parallel processes. One to handle the common compaction from memory to level-0 and the other to handle level-0 and down. I assume that this was done to avoid contention between the two. It is far more common to have a memory to level-0 compaction than level-0 down, and having to sync between the two is likely causing some slow down. The next change I noticed was tightening locks. Instead of having a single lock that was used for more writes, there now appears to be more granular locks, so you gain more concurrency under parallel load.
Next, the change was made to the compaction heuristics. I am not sure that I understand yet the changes, especially since I am just going through the commit log now. Once thing I noticed right away is that they removed the seek budget that you had for files. It appears that compactions triggered by reads were a source of too much pain for HyperDex, so it was removed. Note that HyperDex appears to be very interested in a consistent performance for high write scenarios.
Looking at the logs, it appears that there were a lot of strategies attempted for getting a better compaction strategy for what HyperDex is doing. Interestingly enough (and good for them) I see a lot of back & forth, trying something out, then backing out of it, etc. Sometimes over the courses of several weeks or months. A lot of that appears to be more like heuristically setting various things, like number of compacting threads, the various config options, etc. Trying to find the best match.
Another thing that I guess improved performance is that leveldb will intentionally slow down writes when is is doing compaction, or just about to, to reduce the load when heavy background operations are running. HyperLevelDB removed that, so you will have more load on the system, but no artificial governors on the system performance.
But coming back to compactions, it appears that what they ended up doing is to have three levels of background compactions.
- Memory Table –> Level0 files.
- Optimistic Compaction
- Background Compaction
To be honest, I think that at least between the last two there is a lot of code that can be shared, so it is a bit hard to read, but basically. We have an optimistic compaction thread running that waits for enough data to comes through to level 0, at which point it will try to do a low cost, high yield compaction. If that isn’t enough, we have the background compaction that will kick in, but that seems to be for there is really no other alternative.
Given the importance of compaction to performance, it would probably make more sense to split them into a separate concept, but I guess that they didn’t want to diverge too much from the leveldb codebase.
Comments
Really cool, do you think it will be something you find useful for RavenDB or just a learning experience?
Khalid, I don't think that there is any learning that isn't useful.
Comment preview