Ayende @ Rahien

It's a girl

Reviewing LevelDB: Part XVIII–Summary

Well, I am very happy at the conclusion of this blog post series. Beside being one of the longest that I have done, this actually stretched my ability to count using roman numerals.

In summary, I am quite happy that I spent the time reading all of this code. The LevelDB codebase is really simple, when you grok what it actually does. There is nothing there that would totally stun a person. What there is there, however, is a lot of accumulated experience in building those sort of things.

You see this all over the place, in the format of the SST, in the way compaction is working, in the ability to add filters, write merging, etc. The leveldb codebase is a really good codebase to read, and I am very happy to have done so. Especially since doing this in C++ is way out of m comfort zone. It was also interesting to read what I believe is idiomatic C++ code.

Another observation about leveldb is that it is a hard core C++ product. You can’t really take that and use the same approach in a managed environment. In particular, efforts to port leveldb to java (https://github.com/dain/leveldb/) are going to run into hard issues with problems like managing the memory. Java, like .NET, has issues with allocating large byte arrays, and even from the brief look I took, working with leveldb on java using the codebase as it is would likely put a lot of pressure there.

Initially, I wanted to use leveldb as a storage engine for RavenDB. Since I couldn’t get it compiling & working on Windows (and yes, that is a hard requirement. And yes, it has to be compiling on my machine to apply), I thought about just porting it. That isn’t going to be possible. At least not in the trivial sense. Too much work is require to make it work properly.

Yes, I have an idea, no, you don’t get to hear it at this time Smile.

Reviewing LevelDB: Part XVII– Filters? What filters? Oh, those filters…

I was all set to finish this series, when I realized that I missed something important, I didn’t covered, and indeed, I don’t currently understand, what filters are, and how are they used.

As usual, once I actually got to the part of the code where this is actually handled (instead of ignoring it), it made a lot more sense:


And that, following with the rest of the code that I read makes a lot more sense.

A SST is a sorted table, essentially. We have the keys and the blocks, etc. And we can find a value very quickly. But what if you could do this even more cheaply?

Using a bloom filter is a good way to never get false negative, and it will reduce the amount of work we have to do if we can’t find the key in the SST drastically (only need to read the filter value, don’t even need to do any additional checks). Quite elegant, even if I say so myself.

There is one thing to pay attention to and that is that you can define your own comparator, in which case you must also define you own filter policy. If you use a comparator that ignore casing, you also need to provider a filter that ignore casing.

Reviewing LevelDB: Part XV–MemTables gets compacted too

As mentioned before, every 4 MB or so, we will move the current memtable and switch to a new one, turning the current one into an immutable memtable, which will then be compacted to disk. This post is about that process. The work there is done by the surprisingly name: DBImpl::CompactMemTable, which then calls to WriteLevel0Table.

That method build the actual table, using the same approach I already covered. Honestly, the code is pretty obvious now that I spent enough time reading this. The only thing of interest is DeleteObsoleteFiles() method, which looks interesting if only because it is one of the few places leveldb actually does real work with files. Mostly, however, ParseFileName is interesting:


It give me a good way to look at the actual files being used. This will be important when we will investigate how leveldb handles recovery.

Leveldb will consider files obsolete if:

  • They are log files and aren't of the current/previous log number.
  • If it is a manifest file, keep the current one (or later if there are any).
  • If it is a table file, keep it if it is in use. A table file may have been compacted away.
  • If it is a lock file, info log or the current db file.

All of this is interesting, but on a minor level. I think that we are nearing the end of the codebase now. The only thing that I really want to go into is the recovery parts, and that would pretty much be it.

Published at

Originally posted at

Reviewing LevelDB: Part XVI–Recovery ain’t so tough?

This is how leveldb starts up a new db:


As you can imagine, this is quite useful to find out, since it means that everything we do on startup is recover. I do wonder why we have to take a lock immediately, though. I don't imagine that anything else can try to make use of the db yet, it hasn't been published to the outside world.

Recover is interesting (and yes, I know I wrote that a lot lately).

  • It starts by taking a lock on the db (File System lock, by holding to LOCK file).
  • If the db does not exists, we call NewDB(), which will create a default MANIFEST-00001 file and a CURRENT file pointing at that.
  • If the db exists, we call VersionSet::Recover(), this starts by reading the CURRENT file, which points to the appropriate MANIFEST file.
  • This then gives us the latest status of the db versions, in particular, it tells us what files belong to what levels and what they ranges are.
  • We check that the comparators we use is identical to the one used when creating the db.
  • The code make it looks like there might be multiple version records in the MANIFEST file, but I don't recall that. It might be just the way the API is structured, though. I just checked with the part that write it, and yes, it should have just one entry.
  • Once we have our versions, what next? We check that all of the expected files are actually there, because otherwise we might have a corrupted install.
  • The final thing we do is look for log files that are later than the latest we have in the MANIFEST. I am not sure if this indicates a recover from a crash or a way to be compatible with an older version. I guess that one way this can happen is if you crashed midway while committing a transaction.

When recovering, we are forcing checksum checks, to make sure that we don't get corrupt data (which might very well be the case, since the data can just stop at any point here. The relevant code here is leveldb::log:Reader, which takes care of reading potentially corrupt log file and reporting on its finding. I already went over how the log file is structured, the log reader just does the same thing in reverse, with a lot of paranoia. While reading from the log file, we build a memtable with the committed & safe changes. Here, too, we need to handle with memtable sizes, so we might generate multiple level 0 files during this process.

And... that is pretty much it.

I'll have another post summarizing this, and maybe another on what I intend to do with this information, but I'll keep that close to the chest for now.

Reviewing LevelDB: Part XIV– there is the mem table and then there is the immutable memtable

In leveldb we have the memtable, into which all of the current writes are going and the immutable memtable. So far, I haven't really checked what it actually means.

It looks like this happens on the write, by default, a mem table is limited to about 4 MB of space. We write to the memtable (backed by the tx log) until we get to the point where we go beyond the 4MB limit (note that again, if you have large values, you might go much higher than 4MB and then we switch to another memtable, change the existing memtable to be the immutable one, and move on.

Something that might trip you is that if you write 2 big values one after the other, in separate batches, the second one might need to wait for the compaction to complete.

Here is the test code:

   std::string big(1024 * 1024 * 5, 'a');
     for(int i=0; i < 10; i++ ){
         std::clock_t start = std::clock();
         std::ostringstream o;
         o << "test" << i;
         std::string key =  o.str();

         db->Put(writeOptions, key, big);
         std::clock_t end = std::clock();
         std::cout << i << ": " << 1000.0 *(end - start) / CLOCKS_PER_SEC << "ms " << std::endl;

And the output is:

0: 20ms

1: 30ms

2: 50ms

3: 50ms

4: 50ms

5: 50ms

6: 40ms

7: 60ms

8: 50ms

9: 50ms

Not really that interesting, I'll admit. But when I tried it with 50 mb for each value, it felt like the entire machine grind to a halt. Considering the amount of times memory is copied around, and that it needs to be saved to at least two locations (log file & sst), that makes a lot of sense.

For references ,those were my timing, but I am not sure that I trust them.

0: 170ms

1: 310ms

2: 350ms

3: 460ms

4: 340ms

5: 340ms

6: 280ms

7: 530ms

8: 400ms

9: 1200ms

Published at

Originally posted at

Reviewing LevelDB: Part XIII–Smile, and here is your snapshot

Okay, now that I know how data actually gets to the disk and from it, it is time to read how leveldb handles snapshots. Snapshots seems to be very simple on the surface. On every write, we generate a sequence number. We store the number locally and use the oldest still living snapshot as the oldest version that we have to retain when doing compaction work.

I had hard time figuring out how it worked out with regards to using this in memory. Consider the following code:

     leveldb::WriteOptions writeOptions;
     writeOptions.sync = true;
     db->Put(writeOptions, "test", "one");
     const leveldb::Snapshot* snapshot = db->GetSnapshot();
     db->Put(writeOptions, "test", "two");
    leveldb::ReadOptions readOptions;
    readOptions.snapshot = snapshot;
    std::string val;
    status = db->Get(readOptions, "test", &val);

This will properly give val == “one” when we execute it.

As it turned out, I missed something when I read the code earlier for MemTables. The value that is actually stored as the key is [key][tag]. And the tag is the sequence number + write type. And because of the way it is sorted (little endian, always), it means that higher values are sorted first. And what that means in turn is that unless you specify a specific snapshot number (which is what this tag contains, most of the time), you are going to get the latest version. But if you specify a snapshot number, you’ll get the value that was there as of that snapshot.

And that, in turn, means that we can write code like this:


Where key.memtable_key() contains the required snapshot value. So we can just skip all the ones larger than what we want.

That is really cool, but what about when we go to disk? Pretty much in the same way. The actual key value include the sequence & tag value. But the comparator knows to filter it out when needed. This is quite nice, and an elegant way to handle this situation.

Reviewing RavenDB: Part XII–Reading an SST

After figuring out how the TableCache works, I now want to dig a bit deeper into what the Tables are. That means that we are starting out in Table::Open. And this starts with:


So we start by reading the footer. Then we read the index block:


Construct a new table, and we are pretty much set. I went back to look at the TableBuilder, and I think that I am getting things better. There is the BlockHandle, which is basically just the offset / size in the file.

Then we have the actual index, which is located at the end of the file. This one has the format of:

key, offset, size

key, offset, size

key, offset, size

The only problem is that I don't see yet where this is actually used. And here it is, inside Table::InternalGet.


So we seek into the block data. And that matches pretty closely what I would expect here. And then we have this:



Some notes about this code, it looks like we are going to be loading the entire block into memory. But what happens if we have a very large value? For that matter, what happens if we have a very large value next to a very small value on the same block, but we wanted the small value?

I think that I am wrong here. Looking at the code ,I found this comment, which I previously missed:


And that explains it really nicely with regards to how blocks works in general. The cool part happens inside Block::Iter::Seek, we first do a binary search for the prefixes that we have inside the block, and then a linear search inside the restart section. By default, there are 16 items per restart, so that is going to generate a really fast turnaround in general.

One key point I think that bear repeating is with regards to my previous comment about sizes. We don't actually ever copy the entire block to memory, instead, we rely on the OS to do so for us, because the files are actually memory mapped. That means that we can easily access them with very little cost, even if we have mixed big & small values on the same block. That is because we can just skip the large value and not touch it,and rely on the OS to page the right pages for us. That said, when reading a block, not just iterating it, we are going to allocate it in memory:


So I am not sure about that. I think that I was mistaken previously. The only problem is that the buf is actually used for scratch purposes.  I think that this is right on both ends, looking at the code, we have PosixRandomAccessFile:


This is the standard approach, actually. Read it from the disk into the buffer, and go on with your life.  But we also have PosixMMapReadableFile implementation:


And this is a lot more reasonable and in line with what I was thinking. We don't use the scratch at all.

However, we still need to allocate this scratch buffer on every single write. Looking at the code, the decision on what to allocate seems to be done here:


As you can see, it is mmap_limit_ that will make that determination. That is basically limiting us to no memory maps on 32 bits (make sense, the 2 GB virtual address space is really small) and to a max of 1,000 mapped files for 64 bits. Given that I am assuming that you are going to run this on 64 bits a lot more often than on 32 bits (at least for server apps), it would make more sense...

Stopping right here. Leveldb is being used in the browser as well, and presumably on mobile devices, etc. That means that you can't really assume/require 64 bits. And given that most of the time, you are going to have blocks of up to 4 KB in size (except if you have very large keys), I think that this is reasonable. I would probably have done away with allocating the buffer in the happy case, but that is beside the point, probably, since most of the time I assume that the value are small enough.

I am looking at this through the eyes of someone who deals with larger values all the time, so it triggers a lot of introspection for me. And this is how we actually read a value from disk, I actually managed to also figure out how we write, and in what format. All together good day's (actually, it was mostly at night) fun.

Reviewing LevelDB: Part XI–Reading from Sort String Tables via the TableCache

In the previous post, I focused mostly on reading the code for writing a SST to disk. But I have to admit that I am not really following how you read them back in a way that would be easy to read.

In order to understand that, I think that the right place in the code would be the TableCache. The API for that is pretty slick, here are just the header (comments were stripped).

class TableCache {
  TableCache(const std::string& dbname, const Options* options, int entries);

  Iterator* NewIterator(const ReadOptions& options,
                        uint64_t file_number,
                        uint64_t file_size,
                        Table** tableptr = NULL);

  Status Get(const ReadOptions& options,
             uint64_t file_number,
             uint64_t file_size,
             const Slice& k,
             void* arg,
             void (*handle_result)(void*, const Slice&, const Slice&));

  void Evict(uint64_t file_number);


  Status FindTable(uint64_t file_number, uint64_t file_size, Cache::Handle**);

There are a couple of interesting things here that show up right away. How do you know what is the number of entries. I guess that this is stored externally, but I am not sure where. I am going to figure that question out first.

And the answer is strange:


It isn't number of entries in the table, it is the number of files? That actually say something very important, since this means that the table cache is reading multiple SST files, rather than just one per cache. Looking at the rest of the API, it makes even more sense. We need to pass the file number that we are going to be reading. That one is going to be got from the Version we are using.

Side note: I just spend an hour or so setting up a tryout project for leveldb, so I can actually execute the code and see what it does. This had me learning cmake (I am using KDevelop to read the code) and I finally got it. Still haven't figure out how to step into leveldb's code. But that is a good step.

Also, the debug experience is about 2/10 compared to VS.

And damn, I am so not a C++ developer any longer. Then again, never was a real dev on linux, anyway.

The first thing the TableCache does is to setup a cache. This is interesting, and I followed the code, it create 16 internal caches and has between them. I think that this is done to get concurrency because all the internal cache methods looks like this:


Note the mutex lock. That is pretty much how it works for all of the cache work. Looking deeper into the code, I can see another reason why I should be grateful for staying away from C++. leveldb comes with its own hash table functionality. At any rate, the cache it using LRU to evict items, and it get the number of entries from the value that was passed in the ctor. That make me think that it hold the opened files.

Speaking of the cache, here is an example of the code using it:


The cache is also used to do block caching, this is why it takes a slice as an argument. I'm going to go over that later, because this looks quite interesting. The rest of this method looks like this:


So the only very interesting bit is the Table::Open. The rest is just opening the mmap file and storing it in the cache. Note that the actual file size is passed externally. I am not really sure what this means yet. I'll go over the table code later, right now I want to focus on the table cache.

Looking over the TableCache interface, we can see that we always get the file size from outside. That got me curious enough to figure out why. And the answer is that we always have it in the FileMetaData that we previously explored. I am not sure why that is so important, but I'll ignore it for now.

The rest of TableCache is pretty simple, although this made me grin:


More specifically, look at the RegisterCleanup, this is basically registering for the delete event, so they can unref the cache. Pretty nice, all around.

The rest of the code is just forwarding calls to the Table, so that is what I'll be reading next...

Reviewing LevelDB: Part X–table building is all fun and games until…

It seems like I have been trying to get to the piece that actually write to disk for a very long while.


The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options,writableFile);
builder.Add(“one”, “bar”);
builder.Add(“two”, “foo”);

There are some things that looks funky. Like Flush(), which has a comment that I don't follow: Can be used to ensure that two adjacent entries never live in the same data block.

Oh well, maybe I need to actually read the code first?

And immediately we have this:


I think that this explain why you have Flush(), this forces different blocks. Most of the time, you let leveldb handle that, but you can also control that yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar postfixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}

/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so we would output:




15, 9, 3



Note that for the second key, we can reuse the first 15 characters of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. Still not sure how you search in an SST. The data is sorted, but that looks like it would require multiple seeks .There is metadata & index blocks in there, but I am not quite sure how they would work. Need to read the reader side of things to really get things. Well, I'll do that next time...

Reviewing RaveDB: Part IX- Compaction is the new black

After going over the VersionSet, I understand how leveldb decide when to compact and what it decide to compact. More or less, at least.

This means that I now mostly can figure out what this does:

Status DBImpl::BackgroundCompaction()

A user may force manual compaction of a range, or we may have reasons of our own to decide to compact, based on leveldb heuristics. Either way, we get the Compaction object, which tells us what files we need to merge.

There is a check there whatever we can do a trivial compaction, that is, just move the file from the current level to level+1. The interesting thing is that we avoid doing that if this is going to cause issues in level+2 (require more expensive compaction later on).

But the interesting work is done in DoCompactionWork, where we actually do compaction of complex data.

The code refers to snapshots for the first time that I see. We only merge values that are in a valid snapshot. So data doesn't “go away” for users. While holding a snapshot active.

The actual work starts in here:


This give us the ability to iterate over all of the entries in all of the files that need compaction.

And then we go over it:


But note that on each iteration we do a check if we need to CompactMemTable(); I followed the code there, and we finally write stuff to disk! I am quite excited about this, but I'll handle that in a future blog post. I want to see how actual compaction works.

We then have this:


This is there to stop a compaction that would force a very expensive compaction next time, I think.

As a side note, this really drive me crazy:


Note that we have current_output() and FileSize() in here. I don't really care what naming convention you use, but I would really rather that you had one. If there is one for the leveldb code base, I can't say that I figured it out. It seems like mostly it is PascalCase, but every now & then we have this_style methods.

Back to the code, it took me a while to figure it out.


Will return values in sorted key order, that means that if you have the same key in multiple levels, we need to ignore the older values. After this is happening, we now have this:


This is where we are actually writing stuff out to the SST file! This is quite exciting :-). I have been trying to figure that out for a while now.

The rest of the code in the function is mostly stuff related to stats book keeping, but this looks important:


This generate the actual VersionEdit, which will remove all of the files that were compacted and add the new file that was created as a new Version to the VersionSet.

Good work so far, even if I say so myself. We actually go to where we are building the SST files. Now it is time to look at the code that build those table. Next post, Table Builder...

Reviewing LevelDB: Part VIII–What are the levels all about?

When I last left things, we were still digging into the code for Version and figure out that this is a way to handle the list of files, and that it is the table cache that is actually doing IO. For this post, I want to see where continuing to read the Version code will take us.

There are some interesting things going on in here. We use the version to do a lot more. For example, because it holds all of the files, we can use it to check if they are overlaps in key spaces, which leveldb can use to optimize things later on.

I am not sure that I am following everything that is going on there, but I start to get the shape of things.

As I understand things right now. Data is stored in the following levels:

  • MemTable – where all the current updates are going
  • Immutable MemTable – which is frozen, and probably in the process of being written out
  • Level 0 – probably just a dump of the mem table to disk, key ranges may overlap, so you may need to read multiple files to know where a particular value is.
  • Level 1 to 6 – files where keys cannot overlap, so you probably only need one seek to find things. But seeks are still charged if you need to go through multiple levels to find a value.

That make sense, and it explains things like charging seeks, etc.

Now, I am currently going over VersionSet::Builder, which appears to be applying something called VersionEdit efficiently. I don't know what any of those things are yet...

VersionEdit appears to be a way to hold (and maybe store on disk?) data about pending changes to a version. This is currently not really useful because I still don't know how that is used. This means that I need to go deeper into the VersionSet code now.

VersionSet::LogAndApply looks like it is an interesting candidate for behavior.

I also found why we need VersionEdit to be persistable, we write it to the manifest file. Probably as a way to figure out what is going on when we recover. LogAndApply write the VersionEdit to a manifest file, then set the current file to point to that manifest file.

I am currently reading the Recover function, and it seems to be applying things in reverse. It reads the CURRENT file which points to the current manifest, which contains the serialized VersionEdit contents. Those are read in turn, after which we apply the edit to the in memory state.

Toward the end of VersionSet.cc file, we start seeing things regards compaction. Like:

Iterator* VersionSet::MakeInputIterator(Compaction* c)

Compaction* VersionSet::PickCompaction()

I sort of follow and sort of doesn't follow what the code is doing there. It selects the appropriate set of files that need to be compacted. I don't really get the actual logic, I'll admit, but hopefully that will become better when I read the rest of the code.

Compaction appears to work on level and level+1 files. I assume that it is going to read all of those files, merge them into level+1 and then delete (or mark for deletion) the existing files.

This is now close to midnight, and my eyes started building and the code started to do gymnastics on the screen, so I'll call it a post for now.

Published at

Originally posted at

Reviewing LevelDB: Part VII–The version is where the levels are

Okay, so far I have written 6 parts, and the only thing that happened is that we wrote some stuff to the log file. That is cool, but I am assuming that there has got to be more. I started tracking the code, and I think that what happens is that we have compactions of the MemTable, at which point we flush it to disk.

I think that what happens is this, we have a method called MaybeScheduleCompaction, in db_impl.cc, which is kicking of the actual process for the MemTable compaction. This is getting called from a few places, but most importantly, it is called during the Write() call. Reading the code, it seems that before we can go to the actual compaction work, we need to look at something called a VersionSet. This looks like it holds all of the versions of the database at a particular point in time. Including all the files that it is using, etc.

A lot of what it (and its associate, the Version class) is about managing lists of this structure:


I am not sure what allowed_seeks mean, I assume it is there to force compaction for the next level.

Okay, moving on to version, it looks this is where all the actual lookups are done. We have a list of file metadata, including smallest & largest keys in each file. That allows us to find the appropriate files to look at quite easily. There seems to be some interaction between Version and TableCache, but I’m not going into that now.

A version is holding an array of 7 levels, and at each level we have the associated files. I am going to continue digging into Version & VersionSet for the moment.

Side Note: In fact, I got frustrated enough with trying to figure out leveldb on Windows that I setup a Ubunto machine with KDevelop just to get things done. This blog post is actually written on the Ubunto machine (later to be copy into live writer :-)).

I am still in the process of going through the code. It is a really much easier to do this in an IDE that can actually build & understand the code.

Once thing that I can tell you right now is that C++ programmers are strange. I mean, take a look at the following code, from Version::LevelFileNumIterator :


This returns a byte array containing encoded file num & size in a buffer. Would it be so hard to create a struct for that or use std::pair ? Seems like this would complicate the client code. Then again, maybe there is a perf reason that I am not seeing?

Then again, here is the client code:


And that seems pretty clear.

So far, it appears as if the Version is the current state of all of the files in a particular point in time. I think that this is how leveldb implements snapshots. The files are SSTables, which are pretty much write once only. A version belong to a set (not sure exactly what that means yet) and is part of a linked list. Again, I am not sure what is the purpose of that yet.

I'll need to do a deeper dive into snapshots in general, later on, because it is interesting to see how that is implemented with regards to the memtable.

Moving back to the actual code, we have this code:


This seems to me to indicate that the table_cache is the part of the code that is actually manages the SSTables, probably using some variant of the page pool.

Now, let us get to the good parts, Version::Get:


This looks like this is actually doing something useful. In fact, it find the relevant files to look for that particular key, once it did that, it calls:


So the data is actually retrieved from the cache, as expected. But there was an interesting comment there about “charging” seeks for files, so I am going to be looking at who is calling Version::Get right now, then come back to the cache in another post.

What is interesting is that we have this guy:


And that in turn all make sense now. allowed_seeks is something that is set when we apply a VersionEdit, it seems. No idea what this is now, but there is a comment there that explains that we use this as a way to trigger compaction when it is cheaper to do do compaction than continue doing those seeks. Interestingly enough, seeks are only counted if we have to go through more than one file to find a value, which makes sense, I guess.

Okay, now let us back up a bit and see who is calling Version::Get. And as it turned out, it is our dear friend, DBImpl::Get().

There, we first look in the current memtable, then in the immutable memtable (which is probably on its way to become a SSTable now. And then we are looking at the current Version, calling Version::Get. If we actually hit the version, we also call Version::UpdateStats, and if we need to, we then call MaybeScheduleCompaction(), which is where we started this post.

And... that is it for this post, we still have managed to find where we actually save to disk (they hid it really deep), but I think I'll probably be able to figure this out in this sitting, watch out for the next post.

Reviewing LevelDB: Part VI, the Log is base for Atomicity

Here we are starting to get into the interesting bits. How do we actually write to disk. There are two parts of that. The first part is the log file. This is were all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes which are involved in this manner. leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

   1: // A file abstraction for sequential writing.  The implementation
   2: // must provide buffering since callers may append small fragments
   3: // at a time to the file.
   4: class WritableFile {
   5:  public:
   6:   WritableFile() { }
   7:   virtual ~WritableFile();
   9:   virtual Status Append(const Slice& data) = 0;
  10:   virtual Status Close() = 0;
  11:   virtual Status Flush() = 0;
  12:   virtual Status Sync() = 0;
  14:  private:
  15:   // No copying allowed
  16:   WritableFile(const WritableFile&);
  17:   void operator=(const WritableFile&);
  18: };

Pretty simple, overall. There is the buffering requirement, but that is pretty easy overall. Note that this is a C++ interface. There is a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is Posix code that I am reading, and I have to do a lot of lookup into the man pages. The implementation isn’t that interesting, to be fair, and full of mmap files on posix minutia. So I am going to skip it.

I wonder why the choice was map to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Apend() just writes the values to the mmap file, flush is a no op, and Sync() actually ask the file system to write the values to disk and wait on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads) and that gives leveldb the benefit of using the OS memory manager as the buffer.

Now that we got what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick, it has a single public method:

   1: Status AddRecord(const Slice& slice);

As a remind, those two are used together in the DBImpl::Write() method, like so:

   1: status = log_->AddRecord(WriteBatchInternal::Contents(updates));
   2: if (status.ok() && options.sync) {
   3:  status = logfile_->Sync();
   4: }

From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

   1: Status Writer::AddRecord(const Slice& slice) {
   2:   const char* ptr = slice.data();
   3:   size_t left = slice.size();
   5:   // Fragment the record if necessary and emit it.  Note that if slice
   6:   // is empty, we still want to iterate once to emit a single
   7:   // zero-length record
   8:   Status s;
   9:   bool begin = true;
  10:   do {
  11:     const int leftover = kBlockSize - block_offset_;
  12:     assert(leftover >= 0);
  13:     if (leftover < kHeaderSize) {
  14:       // Switch to a new block
  15:       if (leftover > 0) {
  16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
  17:         assert(kHeaderSize == 7);
  18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
  19:       }
  20:       block_offset_ = 0;
  21:     }
  23:     // Invariant: we never leave < kHeaderSize bytes in a block.
  24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
  26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
  27:     const size_t fragment_length = (left < avail) ? left : avail;
  29:     RecordType type;
  30:     const bool end = (left == fragment_length);
  31:     if (begin && end) {
  32:       type = kFullType;
  33:     } else if (begin) {
  34:       type = kFirstType;
  35:     } else if (end) {
  36:       type = kLastType;
  37:     } else {
  38:       type = kMiddleType;
  39:     }
  41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
  42:     ptr += fragment_length;
  43:     left -= fragment_length;
  44:     begin = false;
  45:   } while (s.ok() && left > 0);
  46:   return s;
  47: }

Let us see if we do a lot here. But I don’t know yet what is going on. From the first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to enter zero length records (no idea what that is for?maybe compactions?).

It appears that we write in blocks of 32Kb at a time. Line 12 – 21 are dealing with how to finalize the block when you have no more space. (Basically fill in with nulls).

Lines 26 – 40 just set the figure out what the type of the record that we are going to work (a full record, all of which can sit in a single buffer, a first record, which is the start in a sequence of items or middle / end, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?

I looked at EmitPhysicalRecord to see what we have there and it is nothing much, it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I’ll start looking at the reading portion of the code. But that will be in another post.

Reviewing LevelDB: Part V, into the MemTables we go

You can read about the theory of Sorted Strings Tables and Memtables here. In this case, what I am interested in is going a bit deeper into the leveldb codebase, and understanding how the data is actually kept in memory and what is it doing there.

In order to do that, we are going to investigate MemTable. As it turned out, this is actually a very simple data structure. A MemTable just hold a SkipList, whish is a sorted data structure that allows O(log N) access and modifications. The interesting thing about Skip List in contrast to Binary Trees, is that it is much easier to create a performant solution of concurrent skip list (either with or without locks) over a concurrently binary tree.

The data in the table is just a list of key & value (or delete marker). And that means that searches through this can give you three results:

  • Here is the value for the key (exists)
  • The value for the key was remove (deleted)
  • The value is not in the memory table (missing)

It is the last part where we get involved with the more interesting aspect of LevelDB (and the reason it is called leveldb in the first place). The notion that you have multiple levels. The mem table is the first one, and then you spill the output out to disk (the Sorted Strings Table). Now that I figure out how simple MemTable is really is, I am going to take a look at the leveldb log, and then dive into Sorted Strings Table.

Reviewing LevelDB: Part IV: On std::string, buffers and memory management in C++

his is a bit of a side track. One of the things that is quite clear to me when I am reading the leveldb code is that I was never really any good at C++. I was a C/C++ developer. And that is a pretty derogatory term. C & C++ share a lot of the same syntax and underlying assumption, but the moment you want to start writing non trivial stuff, they are quite different. And no, I am not talking about OO or templates.

I am talking about things that came out of that. In particular, throughout the leveldb codebase, they are very rarely, if at all, allocate memory directly. Pretty much the whole codebase rely on std::string to handle buffer allocations and management. This make sense, since RAII is still the watch ward for good C++ code. Being able to utilize std::string for memory management also means that the memory will be properly released without having to deal with it explicitly.

More interestingly, the leveldb codebase is also using std::string as a general buffer. I wonder why it is std::string vs. std::vector<char>,  which would bet more reasonable, but I guess that this is because most of the time, users will want to pass strings as keys, and likely this is easier to manage, given the type of operations available on std::string (such as append).

It is actually quite fun to go over the codebase and discover those sort of things. Especially if I can figure them out on my own Smile.

This is quite interesting because from my point of view, buffers are a whole different set of problems. We don’t have to worry about the memory just going away in .NET (although we do have to worry about someone changing the buffer behind our backs), but we have to worry a lot about buffer size. This is because at some point (80Kb), buffers graduate to the large object heap, and stay there. Which means, in turn, that every time that you want to deal with buffers you have to take that into account, usually with a buffer pool.

Another aspect that is interesting with regards to memory usage is the explicit handling of copying. There are various places in the code where the copy constructor was made private, to avoid this. Or a comment is left about making a type copy-able intentionally. I get the reason why, because it is a common failing point in C++, but I forgot (although I am pretty sure that I used to know) the actual semantics of when/ how you want to do that in all cases.

Reviewing LevelDB: Part III, WriteBatch isn’t what you think it is

One of the key external components of leveldb is the idea of WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

   1: leveldb::WriteBatch batch;
   2: batch.Delete(key1);
   3: batch.Put(key2, value);
   4: s = db->Write(leveldb::WriteOptions(), &batch);

As we have learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into a single WriteBatch, then there is some batching involved across multiple batches, but that is beside the point right now.

I dove into the code for WriteBatch, and immediately I realized that this isn’t really what I bargained for. In my mind, WriteBatch was supposed to be something like this:

   1: public class WriteBatch
   2: {
   3:    List<Operation> Operations;
   4: }

Which would hold the in memory operations until they get written down to disk, or something.

Instead, it appears that leveldb took quite a different route. The entire data is stored in the following format:

   1: // WriteBatch::rep_ :=
   2: //    sequence: fixed64
   3: //    count: fixed32
   4: //    data: record[count]
   5: // record :=
   6: //    kTypeValue varstring varstring         |
   7: //    kTypeDeletion varstring
   8: // varstring :=
   9: //    len: varint32
  10: //    data: uint8[len]

This is the in memory value, mind. So we are already storing this in a single buffer. I am not really sure why this is the case, to be honest.

WriteBatch is pretty much a write only data structure, with one major exception:

   1: // Support for iterating over the contents of a batch.
   2: class Handler {
   3:  public:
   4:   virtual ~Handler();
   5:   virtual void Put(const Slice& key, const Slice& value) = 0;
   6:   virtual void Delete(const Slice& key) = 0;
   7: };
   8: Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

   1: Status WriteBatch::Iterate(Handler* handler) const {
   2:   Slice input(rep_);
   3:   if (input.size() < kHeader) {
   4:     return Status::Corruption("malformed WriteBatch (too small)");
   5:   }
   7:   input.remove_prefix(kHeader);
   8:   Slice key, value;
   9:   int found = 0;
  10:   while (!input.empty()) {
  11:     found++;
  12:     char tag = input[0];
  13:     input.remove_prefix(1);
  14:     switch (tag) {
  15:       case kTypeValue:
  16:         if (GetLengthPrefixedSlice(&input, &key) &&
  17:             GetLengthPrefixedSlice(&input, &value)) {
  18:           handler->Put(key, value);
  19:         } else {
  20:           return Status::Corruption("bad WriteBatch Put");
  21:         }
  22:         break;
  23:       case kTypeDeletion:
  24:         if (GetLengthPrefixedSlice(&input, &key)) {
  25:           handler->Delete(key);
  26:         } else {
  27:           return Status::Corruption("bad WriteBatch Delete");
  28:         }
  29:         break;
  30:       default:
  31:         return Status::Corruption("unknown WriteBatch tag");
  32:     }
  33:   }
  34:   if (found != WriteBatchInternal::Count(this)) {
  35:     return Status::Corruption("WriteBatch has wrong count");
  36:   } else {
  37:     return Status::OK();
  38:   }
  39: }

So we write it directly to a buffer, then read from that buffer. The interesting bit is that the actual writing to leveldb itself is done in a similar way, see:

   1: class MemTableInserter : public WriteBatch::Handler {
   2:  public:
   3:   SequenceNumber sequence_;
   4:   MemTable* mem_;
   6:   virtual void Put(const Slice& key, const Slice& value) {
   7:     mem_->Add(sequence_, kTypeValue, key, value);
   8:     sequence_++;
   9:   }
  10:   virtual void Delete(const Slice& key) {
  11:     mem_->Add(sequence_, kTypeDeletion, key, Slice());
  12:     sequence_++;
  13:   }
  14: };
  16: Status WriteBatchInternal::InsertInto(const WriteBatch* b,
  17:                                       MemTable* memtable) {
  18:   MemTableInserter inserter;
  19:   inserter.sequence_ = WriteBatchInternal::Sequence(b);
  20:   inserter.mem_ = memtable;
  21:   return b->Iterate(&inserter);
  22: }

As I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.

It took me a while to figure it out, but I think that I finally got it. The reason this is the case is that leveldb is a C++ application. As such, memory management is something that it needs to worry about explicitly.

In particular, you can’t just rely on the memory you were passed to be held, the user may release that memory after they called to Put. This means, in turn, that you must copy the memory to memory that leveldb allocated, so leveldn can manage its own lifetime. This is a foreign concept to me because it is such a strange thing to do in the .NET land, where memory cannot just disappear underneath you.

On my next post, I’ll deal a bit more with this aspect, buffers management and memory handling in general.

Reviewing LevelDB: Part II, Put some data on the disk, dude

I think that the very first thing that we want to do is to actually discover how exactly is leveldb saving the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

   1: leveldb::DB* db;
   2: leveldb::DB::Open(options, "play/testdb", &db);
   3: status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

   1: // Default implementations of convenience methods that subclasses of DB
   2: // can call if they wish
   3: Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
   4:   WriteBatch batch;
   5:   batch.Put(key, value);
   6:   return Write(opt, &batch);
   7: }
   9: Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  10:   WriteBatch batch;
  11:   batch.Delete(key);
  12:   return Write(opt, &batch);
  13: }

I included the Delete method as well, because this code teaches us something important, all the modifications calls are always going through the same WriteBatch call. Let us look at that now.

   1: Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
   2:   Writer w(&mutex_);
   3:   w.batch = my_batch;
   4:   w.sync = options.sync;
   5:   w.done = false;
   7:   MutexLock l(&mutex_);
   8:   writers_.push_back(&w);
   9:   while (!w.done && &w != writers_.front()) {
  10:     w.cv.Wait();
  11:   }
  12:   if (w.done) {
  13:     return w.status;
  14:   }
  16:   // May temporarily unlock and wait.
  17:   Status status = MakeRoomForWrite(my_batch == NULL);
  18:   uint64_t last_sequence = versions_->LastSequence();
  19:   Writer* last_writer = &w;
  20:   if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
  21:     WriteBatch* updates = BuildBatchGroup(&last_writer);
  22:     WriteBatchInternal::SetSequence(updates, last_sequence + 1);
  23:     last_sequence += WriteBatchInternal::Count(updates);
  25:     // Add to log and apply to memtable.  We can release the lock
  26:     // during this phase since &w is currently responsible for logging
  27:     // and protects against concurrent loggers and concurrent writes
  28:     // into mem_.
  29:     {
  30:       mutex_.Unlock();
  31:       status = log_->AddRecord(WriteBatchInternal::Contents(updates));
  32:       if (status.ok() && options.sync) {
  33:         status = logfile_->Sync();
  34:       }
  35:       if (status.ok()) {
  36:         status = WriteBatchInternal::InsertInto(updates, mem_);
  37:       }
  38:       mutex_.Lock();
  39:     }
  40:     if (updates == tmp_batch_) tmp_batch_->Clear();
  42:     versions_->SetLastSequence(last_sequence);
  43:   }
  45:   while (true) {
  46:     Writer* ready = writers_.front();
  47:     writers_.pop_front();
  48:     if (ready != &w) {
  49:       ready->status = status;
  50:       ready->done = true;
  51:       ready->cv.Signal();
  52:     }
  53:     if (ready == last_writer) break;
  54:   }
  56:   // Notify new head of write queue
  57:   if (!writers_.empty()) {
  58:     writers_.front()->cv.Signal();
  59:   }
  61:   return status;
  62: }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seems to create a new Writer, not sure what that is yet, and register that in a class variable. Maybe it is actually being written on a separate thread?

I am going to switch over and look at that line of thinking .First thing to do is to look at the Writer implementation. This writer looks like this:

   1: struct DBImpl::Writer {
   2:   Status status;
   3:   WriteBatch* batch;
   4:   bool sync;
   5:   bool done;
   6:   port::CondVar cv;
   8:   explicit Writer(port::Mutex* mu) : cv(mu) { }
   9: };

So this is just a data structure with no behavior. Note that we have CondVar, whatever that is. Which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deep into this, but it appears like it is similar to the .NET lock variable. Except that there seems to be the ability to associate multiple variables with a single mutex. Which could be a useful way to signal on specific conditions. The basic idea is that you can wait for a specific operation, not just a single variable.

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, (a data structure merging properties of list & queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look where else this value is being used now. Back now, and it appears that the only place where writers_ are used in in this method or methods that it calls.

What this means in turn is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access. As I understand it. Calls to the Write() method would block on the mutex access, then it waits until its write is the current one (that is what the &w != writers_.front() means. Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.

Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room to the next write. I don’t really follow the code there yet, I’ll ignore that for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why that is, I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup, this method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.

The rest of the code is dealing with the actual write to the log  / mem table 20 – 45, then updating the status of the other writers we might have modified, as well as starting the writes for existing writers that may have not got into the current batch.

And I think that this is enough for now. We haven’t got to disk yet, I admit, but we did get a lot of stuff done. On my next post, I’ll dig even deeper, and try to see how the data is actually structured, I think that this would be interesting…

Reviewing LevelDB, Part I: What is this all about?

LevelDB is…

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

That is the project’s own definition. Basically, it is a way for users to store data in an efficient manner. It isn’t a SQL database. It isn’t even a real database in any sense of the word. What it is is a building block for building databases. It handles writing and reading to disk, and it supports atomicity. But anything else is on you (from transaction management to more complex items).

As such, it appears perfect for the kind of things that we need to do. I decided that I wanted to get to know the codebase, especially since at this time, I can’t even get it to compile Sad smile. The fact that this is a C++ codebase, written by people who eat & breath C++ for a living is another reason why. I expect that this would be a good codebase, so I might as well sharpen my C++-foo at the same time that I grok what this is doing.

The first thing to do is to look at the interface that the database provides us with:


That is a very small surface area, and as you can imagine, this is something that I highly approve of. It make it much easier to understand and reason about. And there is some pretty complex behavior behind this, which I’ll be exploring soon.

LevelDB & Windows: It ain’t a love story

I have been investigating the LevelDB project for the purpose of adding another storage engine to RavenDB. The good news is that there is a very strong likelihood that we can actually use that as a basis for what we want.

The bad news is that it is insanely easy to get LevelDB to compile and work on Linux, and appears to be an insurmountable barrier to do the same on Windows.

Yes, I know that I can get it working by just using a precompiled binary, but that won’t work. I actually want to make some changes there (mostly in the C API, right now).

This instructions appears to be no longer current. And this thread was promising, but didn’t lead anywhere.

I am going to go over the codebase with a fine tooth comb, but I am no longer a C++ programmer, and the intricacies of the build system is putting a very high roadblock of frustration.


Published at

Originally posted at

Comments (48)

Rob’s RavenDB Sprint

Rob Ashton is a great developer.   We invited him to Hibernating Rhinos as part of his Big World Tour.  I had the chance to work with him in the past on RavenDB, and I really liked working with him, and I liked the output even better. So we prepared some stuff for him to do.

This is the status of those issues midway through the second day.


And yes, I am giddy.

Open Source Application Review: BitShuva Radio

As part of my ongoing reviews efforts, I am going to review the BitShuva Radio application.

BitShuva Radio is a framework for building internet radio stations with intelligent social features like community rank, thumb-up/down songs, community song requests, and machine learning that responds to the user's likes and dislikes and plays more of the good stuff.

I just cloned the repository and opened it in VS, without reading anything beyond the first line. As usual, I am going to start from the top and move on down:


We already have some really good indications:

  • There is just one project, not a gazillion of them.
  • The folders seems to be pretty much the standard ASP.NET MVC ones, so that should be easy to work with.

Some bad indications:

  • Data & Common folders are likely to be troublesome spots.

Hit Ctrl+F5, and I got this screen, which is a really good indication. There wasn’t a lot of setup required.


Okay, enough with the UI, I can’t really tell if this is good or bad anyway. Let us dive into the code. App_Start, here I come.


I get the feeling that WebAPI and Ninject are used here. I looked in the NinjectWebCommon file, and found:


Okay, I am biased, I’ll admit, but this is good.

Other than the RavenDB stuff, it is pretty boring, standard and normal codebase. No comments so far. Let us see what is this RavenStore all about, which leads us to the Data directory:


So it looks like we have the RavenStore and a couple of indexes. And the code itself:

   1: public class RavenStore
   2: {
   3:     public IDocumentStore CreateDocumentStore()
   4:     {
   5:         var hasRavenConnectionString = ConfigurationManager.ConnectionStrings["RavenDB"] != null;
   6:         var store = default(IDocumentStore);            
   7:         if (hasRavenConnectionString)
   8:         {
   9:             store = new DocumentStore { ConnectionStringName = "RavenDB" };
  10:         }
  11:         else
  12:         {
  13:             store = new EmbeddableDocumentStore { DataDirectory = "~/App_Data/Raven" };
  14:         }
  16:         store.Initialize();
  17:         IndexCreation.CreateIndexes(typeof(RavenStore).Assembly, store);
  18:         return store;
  19:     }
  20: }

I think that this code need to be improved, to start with, there is no need for this to be an instance. And there is no reason why you can’t use EmbeddableDocumentStore to use remote stuff.

I would probably write it like this, but yes, this is stretching things:

   1: public static class RavenStore
   2: {
   3:     public static IDocumentStore CreateDocumentStore()
   4:     {
   5:         var store = new EmbeddableDocumentStore
   6:             {
   7:                 DataDirectory = "~/App_Data/Raven"
   8:             };
  10:         if (ConfigurationManager.ConnectionStrings["RavenDB"] != null)
  11:         {
  12:             store.ConnectionStringName = "RavenDB";
  13:         }
  14:         store.Initialize();
  15:         IndexCreation.CreateIndexes(typeof(RavenStore).Assembly, store);
  16:         return store;
  17:     }
  18: }

I intended to just glance at the indexes, but this one caught my eye:


This index effectively gives you random output. It will group by the count of documents, and since we reduce things multiple times, the output is going to be… strange.

I am not really sure what this is meant to do, but it is strange and probably not what the author intended.

The Common directory contains nothing of interest beyond some util stuff. Moving on to the Controllers part of the application:


So this is a relatively small application, but an interesting one. We will start with what I expect o be a very simple part of the code .The HomeController:

   1: public class HomeController : Controller
   2: {
   3:     public ActionResult Index()
   4:     {
   5:         var userCookie = HttpContext.Request.Cookies["userId"];
   6:         if (userCookie == null)
   7:         {
   8:             var raven = Get.A<IDocumentStore>();
   9:             using (var session = raven.OpenSession())
  10:             {
  11:                 var user = new User();
  12:                 session.Store(user);
  13:                 session.SaveChanges();
  15:                 HttpContext.Response.SetCookie(new HttpCookie("userId", user.Id));
  16:             }
  17:         }
  19:         // If we don't have any songs, redirect to admin.
  20:         using (var session = Get.A<IDocumentStore>().OpenSession())
  21:         {
  22:             if (!session.Query<Song>().Any())
  23:             {
  24:                 return Redirect("/admin");
  25:             }
  26:         }
  28:         ViewBag.Title = "BitShuva Radio";
  29:         return View();
  30:     }
  31: }

There are a number of things in here that I don’t like. First of all, let us look at the user creation part. You look at the cookies and create a user if it isn’t there, setting the cookie afterward.

This has the smell of something that you want to do in the infrastructure. I did  a search for “userId” in the code and found the following in the SongsController:

   1: private User GetOrCreateUser(IDocumentSession session)
   2: {
   3:     var userCookie = HttpContext.Current.Request.Cookies["userId"];
   4:     var user = userCookie != null ? session.Load<User>(userCookie.Value) : CreateNewUser(session);
   5:     if (user == null)
   6:     {
   7:         user = CreateNewUser(session);
   8:     }
  10:     return user;
  11: }
  13: private static User CreateNewUser(IDocumentSession session)
  14: {
  15:     var user = new User();
  16:     session.Store(user);
  18:     HttpContext.Current.Response.SetCookie(new HttpCookie("userId", user.Id));
  19:     return user;
  20: }

That is code duplication with slightly different semantics, yeah!

Another issue with the HomeController.Index method is that we have direct IoC calls (Get.As<T>) and multiple sessions per request. I would much rather do this in the infrastructure, which would also give us a place for the GetOrCreateUser method to hang from.

SongsController is actually an Api Controller, so I assume that it is called from JS on the page. Most of the code there looks like this:

   1: public Song GetSongForSongRequest(string songId)
   2: {
   3:     using (var session = raven.OpenSession())
   4:     {
   5:         var user = GetOrCreateUser(session);
   6:         var songRequest = new SongRequest
   7:         {
   8:             DateTime = DateTime.UtcNow,
   9:             SongId = songId,
  10:             UserId = user.Id
  11:         };
  12:         session.Store(songRequest);
  13:         session.SaveChanges();
  14:     }
  16:     return GetSongById(songId);
  17: }

GetSongById will use its own session, and I think it would be better to have just one session per request, but that is about the sum of my comments.

One thing that did bug me was the song search:

   1: public IEnumerable<Song> GetSongMatches(string searchText)
   2: {
   3:     using (var session = raven.OpenSession())
   4:     {
   5:         return session
   6:             .Query<Song>()
   7:             .Where(s =>
   8:                 s.Name.StartsWith(searchText) ||
   9:                 s.Artist.StartsWith(searchText) ||
  10:                 s.Album.StartsWith(searchText))
  11:             .Take(50)
  12:             .AsEnumerable()
  13:             .Select(s => s.ToDto());
  14:     }
  15: }

RavenDB has a really good full text support. And we could be using that, instead. It would give you better results and be easier to work with, to boot.

Overall, this is a pretty neat little app.

This code ain’t production ready

Greg Young has a comment on my Rhino Events post that deserves to be read in full. Go ahead, read it, I’ll wait.

Since you didn’t, I’ll summarize. Greg points out numerous faults and issues that aren’t handled or could be handled better in the code.

That is excellent, from my point of view, if only because it gives me more stuff to think about for the next time.

But the most important thing to note here is that Greg is absolutely correct about something:

I have always said an event store is a fun project because you can go anywhere from an afternoon to years on an implementation.

Rhino Events is a fun project, and I’ve learned some stuff there that I’ll likely use again letter on. But above everything else, this is not production worthy code .It is just some fun code that I liked. You may take and do whatever you like with it, but mostly I was concerned with finding the right ways to actually get things done, and not considering all of the issues that might arise in a real production environment.

Introducing Rhino.Events

After talking so often about how much I consider OSS work to be indicative of passion, I got bummed when I realized that I didn’t actually did any OSS work for a while, if you exclude RavenDB.

I was recently at lunch at a client, when something he said triggered a bunch of ideas in my head. I am afraid that I made for poor lunch conversation, because all I could see in my head was code and IO blocks moving around in interesting ways.

At any rate, I sat down at the end of the day and wrote a few spikes, then I decided to write the actual thing in a way that would actually be useful.

What is Rhino Events?

It is a small .NET library that gives you embeddable event store. Also, it is freakishly fast.

How fast is that?


Well, this code writes a fairly non trivial events 10,000,000 (that is ten million times) to disk.

It does this at a rate of about 60,000 events per second. And that include the full life cycle (serializing the data, flushing to disk, etc).

Rhino.Events has the following external API:


As you can see, we have several ways of writing events to disk, always associating to a stream, or just writing the latest snapshot.

Note that the write methods actually return a Task. You can ignore that Task, if you wish, but this is part of how Rhino Events gets to be so fast.

When you call EnqueueEventAsync, we register the value in a queue and have a background process write all of the pending events to disk. This means that we actually have only one thread that is actually doing writes, which means that we can batch all of those writes to get really nice performance from being able to handle all of that.

We can also reduce on the number of times that we have to actually Flush to disk (fsync), so we only do that when we run out of things to write or at a predefined times (usually after a full 200 ms of non stop writes. Only after the information was fully flushed to disk will we set the task status to completed.

This is actually a really interesting approach from my point of view, and it makes the entire thing transactional, in the sense that you can wait to be sure that the event has been persisted to disk (and yes, Rhino Events is fully ACID) or you can fire & forget it, and move on with your life.

A few words before I let you go off and play with the bits.

This is a Rhino project, which means that it is a fully OSS one. You can take the code and do pretty much whatever you want with it. But I , or Hibernating Rhinos, will not be providing support for that.

You can get the bits here: https://github.com/ayende/Rhino.Events


Published at

Originally posted at

Comments (7)

Thou shall not do threading unless you know what you are doing

I had a really bad couple of days. I am pissed, annoyed and angry, for totally not technical reasons.

And then I run into this issue, and I just want to throw something really hard at someone, repeatedly.

The issue started from this bug report:

   1: NetTopologySuite.Geometries.TopologyException was unhandled
   2:   HResult=-2146232832
   3:   Message= ... trimmed ...
   4:   Source=NetTopologySuite
   5:   StackTrace:
   6:        at NetTopologySuite.Operation.Overlay.Snap.SnapIfNeededOverlayOp.GetResultGeometry(SpatialFunction opCode)
   7:        at NetTopologySuite.Operation.Union.CascadedPolygonUnion.UnionActual(IGeometry g0, IGeometry g1)
   8:        at NetTopologySuite.Operation.Union.CascadedPolygonUnion.Worker.Execute()
   9:        at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
  10:        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
  11:        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
  12:        at System.Threading.ThreadHelper.ThreadStart()

At first, I didn’t really realized why it was my problem. I mean, it is NTS problem, isn’t it?

Except that this particular issue actually crashed ravendb (don’t worry, it is unstable builds only). The reason it crashed RavenDB? An unhandled thread exception.

What I can’t figure out is what on earth is going on. So I took a look at the code, have a look:


I checked, and this isn’t code that has been ported from the Java code. You can see the commented code there? That is from the Java version.

And let us look at what the execute method does?


So let me see if I understand. We have a list of stuff to do, so we spin out threads, reclusively, then we wait on them. I think that the point was to optimize things in some manner by parallelizing the work between the two halves.

Do you know what the real killer is? If we assume that we have a geometry with just 20 items on it, this will generate twenty two threads.

Leaving aside the issue of not handling errors properly (and killing the entire process because of this), the sheer cost of creating the threads is going to kill this program.

Libraries should be made to be thread safe (I already had to fix a thread safety bug there),  but they should not be creating their own threads unless it is quite clear that they need to do so.

I believe that this is a case of a local optimization for a specific scenario, it also carry all of the issues associated with local optimizations. It solves one problem and opens up seven other ones.

Lucene is beautiful

So, after I finished telling you how much I don’t like the lucene.net codebase, what is this post about?

Well, I don’t like the code, but then again, I generally don’t like to read low level code. The ideas behind Lucene are actually quite amazingly powerful in their simplicity.

At its core, Lucene is just a set of sorted dictionaries on disk (greatly simplified, I know). Everything else is build on top of that, and if you grok what is going on there, you would be quite amazed at the number of things that this has made possible.

Indexing in Lucene is done by a pipeline of documents and fields and analyzers, which all participate together to generate those dictionaries. Searching in lucene is done by traversing those dictionaries in various ways, and combining the results in interesting ways.

I am not going to go into details about how it works, you can read all about that here. The important thing is that once you have grasped the essential structure inside lucene, the rest are just details.

The concept and the way the implementation fell out are quite beautiful.


Published at

Originally posted at

Comments (4)