Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 18 | Comments: 72

filter by tags archive

Reviewing LevelDBPart X–table building is all fun and games until…

time to read 3 min | 502 words

It seems like I have been trying to get to the piece that actually write to disk for a very long while.

image

The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);
builder.Finish();

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:

image

I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}

/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:

0,19,2

/users/ayende/data

{}

15, 9, 3

metadata

{}

Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

More posts in "Reviewing LevelDB" series:

  1. (26 Apr 2013) Part XVIII–Summary
  2. (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
  3. (12 Apr 2013) Part XV–MemTables gets compacted too
  4. (11 Apr 2013) Part XVI–Recovery ain’t so tough?
  5. (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
  6. (09 Apr 2013) Part XIII–Smile, and here is your snapshot
  7. (08 Apr 2013) Part XII–Reading an SST
  8. (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
  9. (04 Apr 2013) Part X–table building is all fun and games until…
  10. (03 Apr 2013) Part IX- Compaction is the new black
  11. (02 Apr 2013) Part VIII–What are the levels all about?
  12. (29 Mar 2013) Part VII–The version is where the levels are
  13. (28 Mar 2013) Part VI, the Log is base for Atomicity
  14. (27 Mar 2013) Part V, into the MemTables we go
  15. (26 Mar 2013) Part IV
  16. (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
  17. (21 Mar 2013) Part II, Put some data on the disk, dude
  18. (20 Mar 2013) Part I, What is this all about?

Comments

tobi

This series is like reading "literate programming" piece of code. I really enjoy it.

Chris

When I count the length of "/users/ayende/", I only get 14.

When you say you can reuse the first 15 characters of the previous entry, do you mean 14 or am I missing something?

Ayende Rahien

Tobi, "literate programming" ? I don't think that I am familiar with the phrase.

Ayende Rahien

Chris, Yes, I looked at the editor to see the length, and it is 1 based, not 0 based.

tobi

Ayende, it is kind of writing a program in a way that allows it to be read front-to-back. http://en.wikipedia.org/wiki/Literate_programming

Swiss

Ayende, literate Programming was introduced by Donald Knuth, a proposal to write code like a book, the code is an integral part of it.

He tought of it as a good book you would read, enjoy and learn from it.

Not Comments in code, but code inside the book.

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. RavenDB 3.0 New Stable Release - 13 hours from now
  2. Production postmortem: The industry at large - about one day from now
  3. The insidious cost of allocations - 3 days from now
  4. Buffer allocation strategies: A possible solution - 6 days from now
  5. Buffer allocation strategies: Explaining the solution - 7 days from now

And 3 more posts are pending...

There are posts all the way to Sep 11, 2015

RECENT SERIES

  1. Find the bug (5):
    20 Apr 2011 - Why do I get a Null Reference Exception?
  2. Production postmortem (10):
    01 Sep 2015 - The case of the lying configuration file
  3. What is new in RavenDB 3.5 (7):
    12 Aug 2015 - Monitoring support
  4. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats