Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 5,971 | Comments: 44,508

filter by tags archive

Raven’s StorageReading a Sorted String Table


When reading a SST, we have to deal with values of potentially large sizes. I want to avoid loading anything into managed memory if I can possible avoid it. That, along with other considerations has led me to use memory mapped files as the underlying abstractions for reading from the table.

Along the way, I think that I made the following assumptions:

  • Work in little endian only.
  • Can work in 32 bits, but 64 bits are preferred.
  • Doesn’t put pressure on the .NET memory manager.

In particular, I am worried about users wanting to do store large values, or large number of keys.

As it turned out, .NET’s memory mapped files are a pretty good answer for what I wanted to do. Sure, it is a bit of a pain with regards to how to handle things like WinRT / Silverlight, etc. But I am mostly focused on server side for now. And I got some ideas on how to provide the mmap illusion on top regular streams for platforms that don’t support it.

The fact that SST is written by a single thread, and once it is written it is immutable has drastically simplified the codebase. Although I have to admit that looking at hex dump to figure out that I wrote to the wrong position is a bit of a bother, but more on that later. A lot of the code is basically the leveldb code, tweaked for .NET uses. One important difference that I made was with regards to the actual API.

  • Keys are assumed to be small. Most of the time, less than 2KB in size, and there are optimizations in place to take advantage of that. (It will still work with keys bigger than that, but will consume more memory).
  • In general, through the codebase I tried to put major emphasis on performance and memory use even at this early stage.
  • Values are not assumed to be small.

What does the last one mean?

Well, to start with, we aren’t going to map the entire file into our memory space, to start with, it might be big enough to start fragmenting our virtual address space, but mostly because there is no need. We always map just a single blokc at at time, and usually we never bother to read the values into managed memory, instead just accessing the data directly from the memory mapped file.

Here is an example of reading the key for an entry from the SST:

image

As you can see, we need to read just the key itself to memory, the value itself is not even touched. Also, there is some buffer management going on to make sure that we don’t need to re-allocate buffers as we are scanning through the table.

When you want to get a value, you call CreateValueStream, which gives you a Stream that you can work with. Here is how you use the API:

image

This is actually part of the internal codebase, we are actually storing data inside the SST that will later help us optimize things, but that is something I’ll touch on a later point in time.

Except for the slight worry that I am going to have to change the underlying approach from memory mapped files to streams if I need to run it outside the server/client, this is very cool.

Next on my list is to think on how to implement the memtable, again, without impacting too much on the managed memory.

More posts in "Raven’s Storage" series:

  1. (06 May 2013) Memtables are tough
  2. (03 May 2013) Understanding the SST file format
  3. (02 May 2013) Reading a Sorted String Table
  4. (29 Apr 2013) Building a Sorted String Table

Comments

Frank Quednau

What is the underlying driver to analyzing leveldb and starting to implement it in RavenDB? Do you want to get rid of Esent so you can run on more platforms, or...?

Mircea Chirea

Frank, I doubt Esent is going anywhere, but IIRC Oren is looking at several other storage systems for compatibility with Mono.

Sergey Shumov

I probably missed something but I don't understand why are you using 7BitEncodedInt instead of a regular int?

Thomas Krause

What are the advantages of using memory mapped files instead of a regular FileStream?

Your decode function looks like it's working with an incrementing offset, which is what a Stream already does for free and iterating entries also seems to match that perfectly.

Is there some performance benefit that I'm not aware of?

In case you really need random access to the file, it seems that a simple Seek would work.

And you already know you need to support Streams for Silverlight...

Matt Warren

@Sergey Shumov

I guess it's mostly because that how LevelDB does it, so the SST files will remain compatible. See http://code.google.com/p/leveldb/source/browse/table/block_builder.cc?r=da7990950787257cb312ca562ce5977749afc3e9#16.

And LevelDB does it that way because it can save space depending on the size of the number being stored. See https://developers.google.com/protocol-buffers/docs/encoding#varints for more info.

Ryan Heath

Is it not possible to move the filtername check into iterator.isvalid? Or maybe perform the check when seeking?

Anyway, good job to get leveldb-ish behavior into ravendb.

// Ryan

Ayende Rahien

Thomas, The major thing that you get from mmap files is that the OS is the one that is going to do the caching for you. Note that I am reading a single entry, sure. But I may be reading multiple entries at the same time, concurrently. You can't really do that with Streams. And you won't get as good a benefit from the page cache for the OS.

Ayende Rahien

Ryan, I don't follow your issue here.

Ryan Heath

It's not big but to me it seems the check should go into iterator.isvalid (or an isvalidkey). You have given the iterator all the information already: casinsensitive compare in the constructor, the filtername or key in the Seek method. The code here is very small so probably it isn't a big deal but I have seen mismatched comparison and/or key comparison too often.

// Ryan

Alexei K

Ayende, what tech is underneath this? Last you mentioned leveldb, you still couldn't compile it on Windows. Is this a managed re-implementation?

Ayende Rahien

Alexei, Yes, this is managed re-impl

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. Paying the rent online - 2 days from now

There are posts all the way to Aug 03, 2015

RECENT SERIES

  1. Production postmortem (5):
    29 Jul 2015 - The evil licensing code
  2. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
  3. API Design (7):
    20 Jul 2015 - We’ll let the users sort it out
  4. What is new in RavenDB 3.5 (3):
    15 Jul 2015 - Exploring data in the dark
  5. The RavenDB Comic Strip (3):
    28 May 2015 - Part III – High availability & sleeping soundly
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats