Reviewing LevelDBPart X–table building is all fun and games until…
It seems like I have been trying to get to the piece that actually write to disk for a very long while.
The API from the header file looks to be pretty nice. I would assume that you would call it something like this:
TableBuilder builder(options,writableFile); builder.Add(“one”, “bar”); builder.Add(“two”, “foo”); builder.Finish();
There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.
Oh well, maybe I need to actually read the code first?
And immediately we have this:
I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.
There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.
Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.
When considering leveldb for RavenDB, we decided that we would store the data in this manner:
/users/ayende/data = {}
/users/ayende/metadata = {}
Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.
We can arrange things so we would output:
0,19,2
/users/ayende/data
{}
15, 9, 3
metadata
{}
Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.
Now, blocks are held in memory, and by default they are flushed every 4 KB or so.
Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.
Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...
More posts in "Reviewing LevelDB" series:
- (26 Apr 2013) Part XVIII–Summary
- (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
- (12 Apr 2013) Part XV–MemTables gets compacted too
- (11 Apr 2013) Part XVI–Recovery ain’t so tough?
- (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
- (09 Apr 2013) Part XIII–Smile, and here is your snapshot
- (08 Apr 2013) Part XII–Reading an SST
- (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
- (04 Apr 2013) Part X–table building is all fun and games until…
- (03 Apr 2013) Part IX- Compaction is the new black
- (02 Apr 2013) Part VIII–What are the levels all about?
- (29 Mar 2013) Part VII–The version is where the levels are
- (28 Mar 2013) Part VI, the Log is base for Atomicity
- (27 Mar 2013) Part V, into the MemTables we go
- (26 Mar 2013) Part IV
- (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
- (21 Mar 2013) Part II, Put some data on the disk, dude
- (20 Mar 2013) Part I, What is this all about?
Comments
This series is like reading "literate programming" piece of code. I really enjoy it.
When I count the length of "/users/ayende/", I only get 14.
When you say you can reuse the first 15 characters of the previous entry, do you mean 14 or am I missing something?
Tobi, "literate programming" ? I don't think that I am familiar with the phrase.
Chris, Yes, I looked at the editor to see the length, and it is 1 based, not 0 based.
Ayende, literate Programming was introduced by Donald Knuth, a proposal to write code like a book, the code is an integral part of it.
He tought of it as a good book you would read, enjoy and learn from it.
Not Comments in code, but code inside the book.
Ayende, it is kind of writing a program in a way that allows it to be read front-to-back. http://en.wikipedia.org/wiki/Literate_programming
Comment preview