Ayende @ Rahien

Refunds available at head office

Reviewing Basho’s Leveldb

After taking a look at HyperLevelDB, it is time to see what Basho has changed in leveldb. They were kind enough to write a blog post detailing those changes. Unfortunately, unlike the HyperLevelDB folks, they have been pretty general and focused on their own product (which makes total sense). They have called out the reduction of “stalls”, which may or may not be related to the write delay that leveldb intentionally introduces under load.

Okay, no choice about it, I am going to go over the commit log and see if I can find interesting stuff. The first tidbit that caught my eye is an improvement to the compaction process when you have on-disk corruption. Instead of stopping, it moves the bad data to the “lost” directory and carries on. Note that there is some data loss associated with this, of course, but that won’t necessarily be felt by the users.
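To make the idea concrete, here is a minimal Python sketch of “quarantine and keep going”. It is not Basho’s code: the file format (a 4-byte CRC followed by the payload), the function names, and the file names are all assumptions made up for the illustration. Only the “lost” directory idea comes from the commit log.

```python
import shutil
import tempfile
import zlib
from pathlib import Path


def write_table(path: Path, payload: bytes) -> None:
    """Store a 4-byte CRC32 followed by the payload (hypothetical format)."""
    path.write_bytes(zlib.crc32(payload).to_bytes(4, "little") + payload)


def read_table(path: Path) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    raw = path.read_bytes()
    stored, payload = int.from_bytes(raw[:4], "little"), raw[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError(f"checksum mismatch in {path.name}")
    return payload


def compact(inputs, output: Path, lost_dir: Path):
    """Merge readable inputs into output; move unreadable ones to lost_dir."""
    lost_dir.mkdir(exist_ok=True)
    merged, consumed, quarantined = [], [], []
    for path in inputs:
        try:
            merged.append(read_table(path))
            consumed.append(path)
        except ValueError:
            # Do not abort the compaction: set the bad file aside and move on.
            shutil.move(str(path), str(lost_dir / path.name))
            quarantined.append(path.name)
    write_table(output, b"".join(merged))
    for path in consumed:  # delete inputs only once the output exists
        path.unlink()
    return quarantined


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        good, bad = root / "000001.sst", root / "000002.sst"
        write_table(good, b"alpha")
        write_table(bad, b"beta")
        bad.write_bytes(bad.read_bytes()[:-1] + b"X")  # simulate corruption
        quarantined = compact([good, bad], root / "000003.sst", root / "lost")
        return quarantined, read_table(root / "000003.sst")
```

Calling `demo()` compacts one healthy and one damaged file. The damaged file ends up under `lost`, and its contents are simply missing from the output, which is exactly the data loss mentioned above.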

As a note, I dislike this code formatting:

image

Like HyperLevelDB, Basho made a lot of changes to compaction, apparently for performance reasons:

  • No compactions are triggered by reads; that is too slow.
  • There are multiple threads now handling compactions, with various levels of priorities between them. For example, flushing the immutable mem table is high priority, as is level 0 compaction, but standard compactions can wait.
  • Interestingly, when flushing data from memory to level 0, no compression is used.
  • After those were done, they also added additional logic to enforce locks that would give flushing from memory to disk and from level 0 downward much higher priority than everything else.
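The prioritization described in the list above can be sketched with a priority queue feeding worker threads. This is a rough Python illustration, not how Basho implemented it: the class, the constants, and the job names are hypothetical, and the demo uses a single worker only so that the completion order is deterministic.

```python
import itertools
import queue
import threading

# Lower number = more urgent. The ranking follows the list above; the
# numeric values and names are made up for this sketch.
FLUSH_MEMTABLE, LEVEL0_COMPACTION, STANDARD_COMPACTION = 0, 1, 2
_STOP = float("inf")  # sentinel priority, sorts after every real job


class CompactionScheduler:
    def __init__(self, workers=1):
        self._jobs = queue.PriorityQueue()
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority
        self._lock = threading.Lock()
        self.completed = []
        self._threads = [
            threading.Thread(target=self._run) for _ in range(workers)
        ]

    def submit(self, priority, name):
        self._jobs.put((priority, next(self._seq), name))

    def run_to_completion(self):
        # One stop marker per worker; they are picked up only after all work.
        for _ in self._threads:
            self._jobs.put((_STOP, next(self._seq), None))
        for thread in self._threads:
            thread.start()
        for thread in self._threads:
            thread.join()

    def _run(self):
        while True:
            priority, _, name = self._jobs.get()
            if priority == _STOP:
                return
            # A real worker would flush or compact here; we only record order.
            with self._lock:
                self.completed.append(name)


def demo():
    scheduler = CompactionScheduler(workers=1)
    scheduler.submit(STANDARD_COMPACTION, "compact level 3")
    scheduler.submit(LEVEL0_COMPACTION, "compact level 0")
    scheduler.submit(FLUSH_MEMTABLE, "flush memtable")
    scheduler.submit(STANDARD_COMPACTION, "compact level 4")
    scheduler.run_to_completion()
    return scheduler.completed
```

Even though the standard compaction was submitted first, the memtable flush and the level 0 compaction run ahead of it.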

As an aside, another interesting thing I noticed: Basho also moved closing files and unmapping memory to a background thread. I am not quite sure why that is the case; I wouldn’t expect that to be very expensive.
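The pattern itself is simple enough. Here is a minimal Python sketch of handing file handles to a dedicated thread so the caller never pays for `close()`. The `BackgroundCloser` class is a hypothetical name for this illustration, and the real code (C++, with `munmap` in the mix) surely looks different.

```python
import queue
import tempfile
import threading


class BackgroundCloser:
    """Closes file handles on a dedicated thread instead of the caller's."""

    def __init__(self):
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def close_later(self, handle):
        # Returns immediately; the actual close() happens on the worker.
        self._pending.put(handle)

    def shutdown(self):
        # None is the stop marker; everything queued before it is closed first.
        self._pending.put(None)
        self._thread.join()

    def _run(self):
        while True:
            handle = self._pending.get()
            if handle is None:
                return
            handle.close()


def demo():
    closer = BackgroundCloser()
    handles = [tempfile.TemporaryFile() for _ in range(3)]
    for handle in handles:
        closer.close_later(handle)
    closer.shutdown()
    return [handle.closed for handle in handles]
```

The trade-off is that handles stay open a little longer than strictly needed, in exchange for keeping the close cost off the hot path.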

Next on the list, improving caching. Mostly by taking into account actual file sizes and by introducing a reader/writer lock.
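As an illustration of the “actual file sizes” part, here is a small Python sketch of an LRU cache that charges each entry by its size in bytes rather than counting entries. The class and its names are assumptions for this sketch, and the reader/writer lock is left out because Python’s standard library does not ship one.

```python
from collections import OrderedDict


class SizeAwareCache:
    """LRU cache whose capacity is measured in bytes, not in entries."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, charge in bytes)

    def put(self, key, value, charge):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, charge)
        self.used_bytes += charge
        # Evict least recently used entries until we fit again. An entry
        # larger than the whole cache ends up evicting itself.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, (_, evicted_charge) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_charge

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]


def demo():
    cache = SizeAwareCache(capacity_bytes=100)
    cache.put("a.sst", "table a", charge=60)
    cache.put("b.sst", "table b", charge=30)
    cache.get("a.sst")  # touch a, so b is now the eviction candidate
    cache.put("c.sst", "table c", charge=40)  # 130 bytes: b has to go
    return cache.get("a.sst"), cache.get("b.sst"), cache.used_bytes
```

With a count-based cache, one huge file and one tiny file would cost the same. Charging by size keeps the memory budget honest.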

Like HyperLevelDB, they also went for larger files, although I think that in this case they went for significantly larger files than even HyperLevelDB did. As for throttling, HyperLevelDB did away with write throttling altogether in favor of concurrent writes, while Basho’s leveldb went for a much more complex system of write throttling based on the current load, pending work, etc. The idea is to gain better load distribution overall. (Or maybe they didn’t think about the concurrent write strategy.)
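I did not trace the actual throttling formula, so treat the following purely as a sketch of the general shape of load-based throttling: the delay imposed on each write grows with the backlog of pending work. The function name, the soft and hard limits, and the linear ramp are all my assumptions, not Basho’s algorithm.

```python
def write_delay_seconds(pending_bytes, soft_limit, hard_limit, max_delay=0.1):
    """Return how long a writer should pause, given the compaction backlog.

    Hypothetical policy: no delay below soft_limit, the full max_delay at or
    above hard_limit, and a linear ramp in between.
    """
    if pending_bytes <= soft_limit:
        return 0.0
    if pending_bytes >= hard_limit:
        return max_delay
    fraction = (pending_bytes - soft_limit) / (hard_limit - soft_limit)
    return max_delay * fraction
```

For example, with a soft limit of 100 and a hard limit of 200, a backlog of 150 yields half of the maximum delay. The point of a graduated delay is to slow writers gently as the backlog grows instead of letting them run at full speed and then stall outright.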

I wonder (but didn’t check) if some of the changes were pulled back into the leveldb project, because there is some code here that I am pretty sure duplicates work already done in leveldb. In this case, the retiring of data that has already been superseded.

There is a lot of stuff that appears to relate to maintenance: scanning SST files for errors, perf counters, etc. It also looks like they decided to go to assembly for actually implementing CRC32. In fact, I am pretty sure that the asm is for calling hardware CRC inside the CPU. But I am unable to decipher that.
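For reference, this is roughly what the software side of that checksum looks like, assuming the variant in question is CRC-32C (Castagnoli), which is the one the SSE4.2 `crc32` instruction computes. It is a plain table-driven Python sketch, not the leveldb or Basho code, and the hardware version replaces the per-byte loop with a single instruction per chunk.

```python
CASTAGNOLI_REFLECTED = 0x82F63B78  # CRC-32C polynomial, bit-reflected form


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CASTAGNOLI_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC-32C, processing one byte per step."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The standard check value for this checksum is `crc32c(b"123456789") == 0xE3069283`, which is a handy way to verify any implementation, hardware or software.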

What I find funny is that another change I just ran into is the introduction of a way to avoid copying data when Get()ing data from leveldb. If you’ll recall, I pointed that out as an issue a while ago in my first review of leveldb.

And here is another pretty drastic change. In leveldb, only level 0 can have overlapping files, but Basho changed things so that the first 3 levels can have overlapping files. The idea is that you can do cheaper compactions this way, I am guessing.
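The flip side of cheaper compactions is more work on reads. Here is a Python sketch of how a lookup has to pick candidate files: in an overlapping level every file whose key range covers the key must be checked, while in a sorted level a binary search finds at most one. The `TableFile` type, the function, and the sample data are hypothetical and only illustrate the trade-off.

```python
import bisect
from dataclasses import dataclass


@dataclass
class TableFile:
    name: str
    smallest: str  # smallest key stored in the file
    largest: str   # largest key stored in the file


def candidate_files(levels, key, overlapping_levels):
    """List the files a lookup for key must consult, shallowest level first."""
    candidates = []
    for depth, files in enumerate(levels):
        if depth < overlapping_levels:
            # Ranges may overlap, so every covering file is a candidate.
            # Files are assumed to be listed newest first.
            candidates += [
                f.name for f in files if f.smallest <= key <= f.largest
            ]
        else:
            # Files are sorted and disjoint: binary search on the largest key.
            index = bisect.bisect_left([f.largest for f in files], key)
            if index < len(files) and files[index].smallest <= key:
                candidates.append(files[index].name)
    return candidates


def demo():
    levels = [
        [TableFile("0-new", "a", "m"), TableFile("0-old", "k", "z")],
        [TableFile("1-new", "a", "f"), TableFile("1-old", "c", "p")],
        [TableFile("2-only", "j", "t")],
        [
            TableFile("3-a", "a", "h"),
            TableFile("3-b", "i", "q"),
            TableFile("3-c", "r", "z"),
        ],
    ]
    return candidate_files(levels, "l", overlapping_levels=3)
```

In this made-up layout, a lookup for `"l"` has to consult four files across the three overlapping levels, but only one in the sorted level below them.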

I am aware that this review is a bit of a mess, but I just went over the code and wrote down the notes as I saw them. Overall, I think that I like HyperLevelDB’s changes better, but they have the advantage of using a much later codebase.

Posted By: Ayende Rahien

Comments

06/20/2013 12:48 PM by njy

Reading Oren saying "But I am unable to decipher that" is one of the scariest things I've ever read :-)

06/20/2013 06:45 PM by Alois Kraus

Closing files on another thread is not unreasonable when you profile for CPU consumption. Here is a flame graph I have made for a small sample app that does nothing but read a 200 byte file bytewise in an endless loop. The most expensive thing is indeed opening and closing the file.

http://geekswithblogs.net/akraus1/archive/2013/06/10/153104.aspx

06/21/2013 05:59 AM by Ayende Rahien

Alois, Thanks for that, really cool technique. I wonder if it would make sense in a managed system to just let the files drop to the GC?

06/21/2013 05:18 PM by Karg

Regarding letting the GC finalize unmanaged resources, I recently saw a presentation where Jeffrey Richter suggested just that. The exception was if there would be contention or resource limitation issues with having the resources stick around a bit longer.

As long as you don't have potential for contention (file lock or limited SQL connections, for instance), he says to just allow the GC to finalize the resources.

I still don't know what I think of that, but it's interesting to hear that recommendation from Jeff.

07/01/2013 03:11 PM by MatthewVon

Hardware CRC32c is measured / known to be 10x faster than a software implementation. That is why Intel now puts it inside the CPU. leveldb really depends upon that CRC to verify data is not corrupted ... because it has almost no defense against corrupted data (i.e. segfaults).

07/01/2013 03:07 PM by ________

Here is Matthew's original reply when HyperLevelDB came up on HN: https://news.ycombinator.com/item?id=5835098. It briefly explains why Riak's use case is so different from HyperDex's.

07/01/2013 06:10 PM by MatthewVon

The rationale for some of the more recent changes is presented on the github wiki page for the basho/leveldb repository:

https://github.com/basho/leveldb/wiki

Basho has customers that write 100s of gigabytes of data to a server per day. This data is written across 8 to 64 simultaneously active leveldb databases. The compaction tunings and write throttle are based upon the needs of this high volume, multi-database environment.
