Reviewing LevelDBPart VIII–What are the levels all about?

time to read 3 min | 567 words

When I last left things, we were still digging into the code for Version and figure out that this is a way to handle the list of files, and that it is the table cache that is actually doing IO. For this post, I want to see where continuing to read the Version code will take us.

There are some interesting things going on in here. We use the version to do a lot more. For example, because it holds all of the files, we can use it to check if they are overlaps in key spaces, which leveldb can use to optimize things later on.

I am not sure that I am following everything that is going on there, but I start to get the shape of things.

As I understand things right now. Data is stored in the following levels:

  • MemTable – where all the current updates are going
  • Immutable MemTable – which is frozen, and probably in the process of being written out
  • Level 0 – probably just a dump of the mem table to disk, key ranges may overlap, so you may need to read multiple files to know where a particular value is.
  • Level 1 to 6 – files where keys cannot overlap, so you probably only need one seek to find things. But seeks are still charged if you need to go through multiple levels to find a value.

That make sense, and it explains things like charging seeks, etc.

Now, I am currently going over VersionSet::Builder, which appears to be applying something called VersionEdit efficiently. I don't know what any of those things are yet...

VersionEdit appears to be a way to hold (and maybe store on disk?) data about pending changes to a version. This is currently not really useful because I still don't know how that is used. This means that I need to go deeper into the VersionSet code now.

VersionSet::LogAndApply looks like it is an interesting candidate for behavior.

I also found why we need VersionEdit to be persistable, we write it to the manifest file. Probably as a way to figure out what is going on when we recover. LogAndApply write the VersionEdit to a manifest file, then set the current file to point to that manifest file.

I am currently reading the Recover function, and it seems to be applying things in reverse. It reads the CURRENT file which points to the current manifest, which contains the serialized VersionEdit contents. Those are read in turn, after which we apply the edit to the in memory state.

Toward the end of file, we start seeing things regards compaction. Like:

Iterator* VersionSet::MakeInputIterator(Compaction* c)

Compaction* VersionSet::PickCompaction()

I sort of follow and sort of doesn't follow what the code is doing there. It selects the appropriate set of files that need to be compacted. I don't really get the actual logic, I'll admit, but hopefully that will become better when I read the rest of the code.

Compaction appears to work on level and level+1 files. I assume that it is going to read all of those files, merge them into level+1 and then delete (or mark for deletion) the existing files.

This is now close to midnight, and my eyes started building and the code started to do gymnastics on the screen, so I'll call it a post for now.

More posts in "Reviewing LevelDB" series:

  1. (26 Apr 2013) Part XVIII–Summary
  2. (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
  3. (12 Apr 2013) Part XV–MemTables gets compacted too
  4. (11 Apr 2013) Part XVI–Recovery ain’t so tough?
  5. (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
  6. (09 Apr 2013) Part XIII–Smile, and here is your snapshot
  7. (08 Apr 2013) Part XII–Reading an SST
  8. (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
  9. (04 Apr 2013) Part X–table building is all fun and games until…
  10. (03 Apr 2013) Part IX- Compaction is the new black
  11. (02 Apr 2013) Part VIII–What are the levels all about?
  12. (29 Mar 2013) Part VII–The version is where the levels are
  13. (28 Mar 2013) Part VI, the Log is base for Atomicity
  14. (27 Mar 2013) Part V, into the MemTables we go
  15. (26 Mar 2013) Part IV
  16. (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
  17. (21 Mar 2013) Part II, Put some data on the disk, dude
  18. (20 Mar 2013) Part I, What is this all about?