Ayende @ Rahien

It's a girl

Raven’s Storage: Reading a Sorted String Table

When reading a SST, we have to deal with values of potentially large sizes. I want to avoid loading anything into managed memory if I can possible avoid it. That, along with other considerations has led me to use memory mapped files as the underlying abstractions for reading from the table.

Along the way, I think that I made the following assumptions:

  • Work in little endian only.
  • Can work in 32 bits, but 64 bits are preferred.
  • Doesn’t put pressure on the .NET memory manager.

In particular, I am worried about users wanting to do store large values, or large number of keys.

As it turned out, .NET’s memory mapped files are a pretty good answer for what I wanted to do. Sure, it is a bit of a pain with regards to how to handle things like WinRT / Silverlight, etc. But I am mostly focused on server side for now. And I got some ideas on how to provide the mmap illusion on top regular streams for platforms that don’t support it.

The fact that SST is written by a single thread, and once it is written it is immutable has drastically simplified the codebase. Although I have to admit that looking at hex dump to figure out that I wrote to the wrong position is a bit of a bother, but more on that later. A lot of the code is basically the leveldb code, tweaked for .NET uses. One important difference that I made was with regards to the actual API.

  • Keys are assumed to be small. Most of the time, less than 2KB in size, and there are optimizations in place to take advantage of that. (It will still work with keys bigger than that, but will consume more memory).
  • In general, through the codebase I tried to put major emphasis on performance and memory use even at this early stage.
  • Values are not assumed to be small.

What does the last one mean?

Well, to start with, we aren’t going to map the entire file into our memory space, to start with, it might be big enough to start fragmenting our virtual address space, but mostly because there is no need. We always map just a single blokc at at time, and usually we never bother to read the values into managed memory, instead just accessing the data directly from the memory mapped file.

Here is an example of reading the key for an entry from the SST:

image

As you can see, we need to read just the key itself to memory, the value itself is not even touched. Also, there is some buffer management going on to make sure that we don’t need to re-allocate buffers as we are scanning through the table.

When you want to get a value, you call CreateValueStream, which gives you a Stream that you can work with. Here is how you use the API:

image

This is actually part of the internal codebase, we are actually storing data inside the SST that will later help us optimize things, but that is something I’ll touch on a later point in time.

Except for the slight worry that I am going to have to change the underlying approach from memory mapped files to streams if I need to run it outside the server/client, this is very cool.

Next on my list is to think on how to implement the memtable, again, without impacting too much on the managed memory.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Frank Quednau
05/02/2013 09:31 AM by
Frank Quednau

What is the underlying driver to analyzing leveldb and starting to implement it in RavenDB? Do you want to get rid of Esent so you can run on more platforms, or...?

Mircea Chirea
05/02/2013 10:13 AM by
Mircea Chirea

Frank, I doubt Esent is going anywhere, but IIRC Oren is looking at several other storage systems for compatibility with Mono.

Sergey Shumov
05/02/2013 10:51 AM by
Sergey Shumov

I probably missed something but I don't understand why are you using 7BitEncodedInt instead of a regular int?

Thomas Krause
05/02/2013 11:32 AM by
Thomas Krause

What are the advantages of using memory mapped files instead of a regular FileStream?

Your decode function looks like it's working with an incrementing offset, which is what a Stream already does for free and iterating entries also seems to match that perfectly.

Is there some performance benefit that I'm not aware of?

In case you really need random access to the file, it seems that a simple Seek would work.

And you already know you need to support Streams for Silverlight...

Matt Warren
05/02/2013 01:03 PM by
Matt Warren

@Sergey Shumov

I guess it's mostly because that how LevelDB does it, so the SST files will remain compatible. See http://code.google.com/p/leveldb/source/browse/table/block_builder.cc?r=da7990950787257cb312ca562ce5977749afc3e9#16.

And LevelDB does it that way because it can save space depending on the size of the number being stored. See https://developers.google.com/protocol-buffers/docs/encoding#varints for more info.

Ryan Heath
05/03/2013 07:59 AM by
Ryan Heath

Is it not possible to move the filtername check into iterator.isvalid? Or maybe perform the check when seeking?

Anyway, good job to get leveldb-ish behavior into ravendb.

// Ryan

Ayende Rahien
05/03/2013 08:01 AM by
Ayende Rahien

Thomas, The major thing that you get from mmap files is that the OS is the one that is going to do the caching for you. Note that I am reading a single entry, sure. But I may be reading multiple entries at the same time, concurrently. You can't really do that with Streams. And you won't get as good a benefit from the page cache for the OS.

Ayende Rahien
05/03/2013 08:01 AM by
Ayende Rahien

Ryan, I don't follow your issue here.

Ryan Heath
05/03/2013 11:44 AM by
Ryan Heath

It's not big but to me it seems the check should go into iterator.isvalid (or an isvalidkey). You have given the iterator all the information already: casinsensitive compare in the constructor, the filtername or key in the Seek method. The code here is very small so probably it isn't a big deal but I have seen mismatched comparison and/or key comparison too often.

// Ryan

Alexei K
05/05/2013 04:02 PM by
Alexei K

Ayende, what tech is underneath this? Last you mentioned leveldb, you still couldn't compile it on Windows. Is this a managed re-implementation?

Ayende Rahien
05/05/2013 07:30 PM by
Ayende Rahien

Alexei, Yes, this is managed re-impl

Comments have been closed on this topic.