Ayende @ Rahien

It's a girl

Reviewing LevelDB: Part X–table building is all fun and games until…

It seems like I have been trying to get to the piece that actually write to disk for a very long while.

image

The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);
builder.Finish();

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:

image

I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}

/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:

0,19,2

/users/ayende/data

{}

15, 9, 3

metadata

{}

Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

Comments

tobi
04/04/2013 11:00 AM by
tobi

This series is like reading "literate programming" piece of code. I really enjoy it.

Chris
04/04/2013 02:23 PM by
Chris

When I count the length of "/users/ayende/", I only get 14.

When you say you can reuse the first 15 characters of the previous entry, do you mean 14 or am I missing something?

Ayende Rahien
04/04/2013 02:25 PM by
Ayende Rahien

Tobi, "literate programming" ? I don't think that I am familiar with the phrase.

Ayende Rahien
04/04/2013 02:26 PM by
Ayende Rahien

Chris, Yes, I looked at the editor to see the length, and it is 1 based, not 0 based.

tobi
04/04/2013 06:20 PM by
tobi

Ayende, it is kind of writing a program in a way that allows it to be read front-to-back. http://en.wikipedia.org/wiki/Literate_programming

Swiss
04/04/2013 03:24 PM by
Swiss

Ayende, literate Programming was introduced by Donald Knuth, a proposal to write code like a book, the code is an integral part of it.

He tought of it as a good book you would read, enjoy and learn from it.

Not Comments in code, but code inside the book.

Comments have been closed on this topic.