Reviewing LevelDB: Part X–table building is all fun and games until…

Apr 04 2013

Reviewing LevelDBPart X–table building is all fun and games until…

time to read 3 min | 502 words

It seems like I have been trying to get to the piece that actually write to disk for a very long while.

The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);
builder.Finish();

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:

I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}
/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:

0,19,2
/users/ayende/data
{}
15, 9, 3
metadata
{}

Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

Tweet Share Share 6 comments

Tags:

Comment preview

Comments have been closed on this topic.

Oren Eini

Oren Eini

CEO of RavenDB