Ayende @ Rahien

It's a girl

Raven.Storage just passed its first “test”

Take a look at the code below. This actually completed as expected, and was working beautifully. As I probably mentioned, the architecture of this is really nice, and I think I was able to translate this into .NET code in a way that is both idiomatic and useful. It is 4:30 AM now, and I think that it is bed time for me. But I just couldn’t leave this alone.



Published at

Originally posted at

Comments (2)

RavenDB Course Updates

I am updating the RavenDB Course for 2.5. It is now a 3 day affair that includes quite a lot of stuff. It was quite cramped at 2 days, so moving to 3 days will allow us to do a lot more and do more interesting stuff.

I’ll be giving the first 3 day RavenDB Course in about two weeks in London. Currently this is what I have:


Can you think of additional stuff that you’d like me to cover?



Raven’s Storage: Building a Sorted String Table

Applying the lessons from the leveldb codebase isn’t going to be a straightforward task. I already discussed why we can’t just use leveldb directly, so I won’t go into that.

I decided to see if I can implement a leveldb inspired storage for RavenDB. And the first thing that I wanted to try is to build an SST. That seems like an easy first step. SSTs are just sorted data in a file, and they require nothing much to write to them. As mentioned, we can’t just do a line by line port of the code, as easy as that would be. The way leveldb manages memory is… not really suitable for what we can / should do in .NET.

  • I want this to be an efficient .NET implementation.
  • It is inspired, but not intended to be compatible with leveldb.

The leveldb API deals with byte arrays all over the place. This makes sense for a C++ codebase, but it is horrible for a .NET application, especially when we expect a lot of our values to be large enough to hit the LOH. That means fragmentation, and pain down the road. Instead, our API uses ArraySegment<byte> for keys, and Stream for values. Since keys are expected to be relatively small, I don’t foresee this being a problem. And the values are streams, so they are easily handled without introducing any cost from the API.

Another thing that leveldb does quite a lot is batch things in memory for a while. It may be the current block, it may be the current data block, it may be the index block, but it does so quite often. That works nicely for C++ apps with expected small values, but not so much for our expected use case. So I want to avoid as much as possible holding items in managed memory. Here is the API for creating an SST:

   var options = new StorageOptions();
   using (var file = File.Create("test.sst"))
   using (var temp = new FileStream(Path.GetTempFileName(), FileMode.CreateNew, FileAccess.ReadWrite,
                       FileShare.None, 4096, FileOptions.DeleteOnClose | FileOptions.SequentialScan))
   {
       var tblBuilder = new TableBuilder(options, file, temp);

       for (int i = 0; i < 100; i++)
       {
           var key = "tests/" + i.ToString("0000");
           tblBuilder.Add(key, new MemoryStream(Encoding.UTF8.GetBytes(key)));
       }

       tblBuilder.Finish();
   }

As you can see, we use two streams here. One is used to actually write to the table; the second is a temporary stream that we use to write the index block while we are working, and then we merge it back into the actual sst file. Note that after building the table, the temp file can be deleted (indeed, we marked it as delete on close, so that would happen automatically).

That part was easy, all it required was simple I/O for generating the file. The more interesting part is going to be reading the values out.



Reviewing LevelDB: Part XVIII–Summary

Well, I am very happy at the conclusion of this blog post series. Besides being one of the longest that I have done, this actually stretched my ability to count using roman numerals.

In summary, I am quite happy that I spent the time reading all of this code. The LevelDB codebase is really simple, once you grok what it actually does. There is nothing there that would totally stun a person. What is there, however, is a lot of accumulated experience in building those sorts of things.

You see this all over the place, in the format of the SST, in the way compaction is working, in the ability to add filters, write merging, etc. The leveldb codebase is a really good codebase to read, and I am very happy to have done so. Especially since doing this in C++ is way out of my comfort zone. It was also interesting to read what I believe is idiomatic C++ code.

Another observation about leveldb is that it is a hard core C++ product. You can’t really take that and use the same approach in a managed environment. In particular, efforts to port leveldb to java (https://github.com/dain/leveldb/) are going to run into hard issues with problems like managing the memory. Java, like .NET, has issues with allocating large byte arrays, and even from the brief look I took, working with leveldb on java using the codebase as it is would likely put a lot of pressure there.

Initially, I wanted to use leveldb as a storage engine for RavenDB. Since I couldn’t get it compiling & working on Windows (and yes, that is a hard requirement. And yes, it has to be compiling on my machine to apply), I thought about just porting it. That isn’t going to be possible. At least not in the trivial sense. Too much work is required to make it work properly.

Yes, I have an idea, no, you don’t get to hear it at this time Smile.

RavenDB’s transaction merging

I like reading interesting code bases. And in this case, I have been reading through the leveldb codebase for a while. Most of the time, I am concentrating on grokking how other people are doing things. And the knowledge that I glean from those is only relevant a while later.

In this case, I was able to apply leveldb’s knowledge in very short order. We got a complaint about the speed of RavenDB under high transactional writes. The customer complained that they were seeing about 100 – 200 writes/sec with multi threaded inserts, with a separate session per document. This simulates pretty well the behavior of a web application that does a write per request.

The problem was that we basically had many write requests all competing on the disk. Since all transactions need to call fsync, that meant that we were limited by the number of fsync calls that we could make on the physical medium (more or less; it is a bit more complex than that, but that works as an approximation).

There really isn’t much that I can do about it when the transactions are sequential. But there might be when we have parallel transactions. Instead of making them wait for one another, I took a leaf from the leveldb codebase and decided to merge them. I re-wrote the code so we would use the following approach:


pendingBatches.Enqueue(pending);

// spin until we are either done or at the head of the queue
while (currentTask.Completed() == false && pendingBatches.Peek() != pending)
    Thread.Sleep(0);

if (currentTask.Completed())
    return pending.Result; // another thread already merged & committed our batch

lock (putSerialLock)
{
    var sp = Stopwatch.StartNew();
    TransactionalStorage.Batch(actions =>
    {
        int totalCommands = 0;
        PendingBatch currentBatch;
        while (totalCommands < 1024 && // let us not overload the transaction buffer size too much
               pendingBatches.TryDequeue(out currentBatch))
        {
            if (ProcessBatch(currentBatch) == false)
                // we only go on committing transactions as long as we are successful, if there is an error,
                // we abort and do the rest in another cycle
                break;
            totalCommands += currentBatch.Commands.Count;
        }
    });
}

As you can see, this code is very similar to the leveldb one. We queue the transaction to be executed and then we check if we are the first in the queue. If so, we will execute that transaction and continue executing all available transactions.

The key here is that because we merge those transactions, we can benefit from only calling fsync once, at the end of the global operation.

This code is nice because it allows us to take advantage of load on our system. The more load we have, the more efficiently we can batch things. But at the same time, if there isn’t any load, we don’t care.

Note that I limited the amount of work that can be done in merged transactions, because we don’t want the first transaction, the one that is actually doing the work for all of the others, to wait for too long. This is a really pretty way of doing this, especially since it doesn’t even require a background thread, which is how I usually solved this issue.
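The merging itself, leveldb style, can be sketched roughly like this (a C++ illustration with my own names, not the actual RavenDB or leveldb code — the point is that the leader drains the queue under a command budget, so fsync is paid once for all merged batches):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Each pending batch carries some commands; the first waiter ("leader")
// drains the queue and commits every batch it can in a single transaction.
struct PendingBatch {
    std::vector<std::string> commands;
    bool completed = false;
};

// Merge queued batches into one transaction, up to a command budget.
// Returns the number of batches that were folded into this commit.
size_t MergePending(std::deque<PendingBatch*>& pending, size_t maxCommands) {
    size_t total = 0, merged = 0;
    while (!pending.empty() &&
           total + pending.front()->commands.size() <= maxCommands) {
        PendingBatch* batch = pending.front();
        pending.pop_front();
        total += batch->commands.size();
        batch->completed = true;  // its result is now available to its caller
        ++merged;
    }
    // ... a single fsync would happen here, covering all merged batches ...
    return merged;
}
```

A batch that would push us over the budget simply stays queued, and its own thread (or the next leader) picks it up on the following cycle.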

Oh, and the results?

On my machine, without this change, we get about 400 writes / second. Afterward, with 25 threads, we get over 1,100 writes / second.



Searching for a lease in time & space

For some reason, there are a lot of real estate / rental people using RavenDB. I can tell you that I did not see that coming. However, that does bring us some interesting decisions.

In particular, at one client, we had the need to search for a lease. Searching for a lease can be done on one of many interesting properties. For example, the unit number, or internal code, or by the leaser name.

And here we got an interesting bug report.

Jane Smith leased an apartment from us at Dec 2012. At Feb 2013, she got married and changed her name to Jane Smith-Smyth. We need to allow searches on both names to find the appropriate lease.

Now, remember, you can’t go and change the lease document. That is a legal document that is frozen. Any change to that would invalidate it. (To be rather more accurate, you can modify the document, but there are certain fields that are frozen after the lease is signed.)

Luckily, this was really easy to do, using RavenDB’s referenced document feature:

   from lease in docs.Leases
   select new
   {
      Leasers = lease.Leasers.Select(x=>x.Name)
                 .Union(lease.Leasers.Select(x=>LoadDocument(x.Id).Name))
                 .Distinct()
   }

And now we can support changes in the names, while maintaining the sanctity of the frozen fields.

Sadly, this is still not enough. And we actually need to keep track of all of the names that the leaser had during the leasing period.

Jane Smith-Smyth decided that it is a stupid name and changed her name to Jane Smite.

Now we need to support multiple names per leaser, while at the same time we have the current name for the lease. It looks like this:

   from lease in docs.Leases
   select new 
   {
       Leasers = lease.Leasers.Select(x=>x.Name)
       .Union(lease.Leasers.SelectMany(x=>LoadDocument(x.Id).Names))
       .Distinct()  
   }

I highlighted the required changes Smile.



Exporting Data from RavenDB, the new way

In RavenDB 2.5, we provide an easy way to grab all the data from the database, regardless of the usual paging limit.

I had the chance to experiment with that recently during a course. Here is the code we wrote. For fun, we made it use the async API:


I am pretty happy about that. We have stuff that streams all the way from RavenDB to the end client.



Querying your way to madness: the Facebook timeline

Facebook certainly changed the way we are doing things. Sometimes, that isn’t for the best, as can be seen from the way too much time humanity as a whole spends watching cat videos.

One of the ways that Facebook impacted our professional lives is that a lot of people look at it as a blueprint of how they want their application to be. I am not going to touch on whether that is a good thing or not; suffice to say that this is a well known model that is very easy for a lot of users to grasp.

It is also a pretty hard model to actually design and build. I recently had a call from a friend who was tasked with building a Facebook like timeline. Like most such tasks, we have the notion of a social network, with people following other people. I assume that this isn’t actually YASNP (yet another social network project), but I didn’t check too deeply.

The question was how to actually build the timeline. The first thing that most people would try is something like this:

   var user = session.Load<User>(userId);
   var timelineItems = 
      session.Query<Items>()
         .Where(x=>x.Source.In(user.Following))
         .OrderByDescending(x=>x.PublishedAt)
         .Take(30)
         .ToList();

Now, this looks good, and it would work, as long as you have a small number of users and no one follows a lot of people. And as long as you don’t have a lot of items. And as long as you don’t have to do any additional work. When any of those assumptions is broken… well, welcome to unpleasantville, population: you.

It can’t work. And I don’t care what technology you are using for storage. You can’t create a good solution using queries for something like the timeline.

Nitpicker corner:

  • If you have users that are following a LOT of people (and you will have those), you are likely to get into problems with the query.
  • The more items you have, the slower this query becomes. Since you need to sort them all before you can return results. And you are likely to have a LOT of them.
  • You can’t really shard this query nicely or easily.
  • You can’t apply additional filtering in any meaningful way.

Let us consider the following scenario. Let us assume that I care for this Rohit person. But I really don’t care for Farmville.

hide farmville ribbon

And then:

hide farmville

Now, try to imagine doing this sort of thing in a query. For fun, assume that I do care for Farmville updates from some people, but not from all. That is what I mean when I said that you want to do meaningful filtering.

You can’t do this using queries. Not in any real way.

Instead, you have to turn it around. You would do something like this:

   1: var user = session.Load<User>(userId);
   2: var timelineItmes = session.Query<TimeLineItems>() 
   3:       .Where(x=>x.ForUser == userId)
   4:       .OrderBy(x=>x.Date)
   5:       .ToList();

Note how we structure this. There is a set of TimeLineItems objects, which store a bit of information about a set of items. Usually we would have one per user per day. Something like:

  • users/123/timeline/2013-03-12
  • users/123/timeline/2013-03-13
  • users/123/timeline/2013-03-14

That means that we get well scoped values, we only need to search on a single set of items (easily sharded, with a well known id, which means that we can also just load them by id, instead of querying for them).

Of course, that means that you have to have something that builds those timeline documents. That is usually an async process that runs whenever you have a user that updates something. It goes something like this:

   public void UpdateFollowers(string itemId)
   {
       var item = session.Include<Item>(x=>x.UserId)
           .Load(itemId);

       var user = session.Load<User>(item.UserId);

       // user.Followers is a list of documents with batches of followers
       // we assume that we might have a LOT, so we use this technique
       // to avoid loading them all into memory at once
       // http://ayende.com/blog/96257/document-based-modeling-auctions-bids
       foreach(var followersDocId in user.Followers)
       {
           NotifyFollowers(followersDocId, item);
       }
   }

   public void NotifyFollowers(string followersDocId, Item item)
   {
       var followers = session.Include<FollowersCollection>(x=>x.Followers)
           .Load(followersDocId);

       foreach(var follower in followers.Followers)
       {
           var user = session.Load<User>(follower);
           if(user.IsMatch(item) == false)
               continue;
           AddToTimeLine(follower, item);
       }
   }

As you can see, we are batching the operation, loading the followers in batches and, based on their settings, deciding whether to let that item be added to their timeline or not.

Note that this has a lot of implications. Different people will see this show up in their timeline in different times (but usually very close to one another). Your memory usage is kept low, because you are only processing some of it at any given point in time. For users with a LOT of followers, and there will be some like those, you might want to build special code paths, but this should be fine even at its current stage.

What about post factum operations? Let us say that I want to start following a new user. This requires special treatment; you would have to read the latest timeline items from the new user and start merging them with the existing timeline. Likewise when you need to delete someone. Or when adding a new filter.

It is a lot more work than just changing the query, sure. But you can get things done this way. And you cannot get anywhere with the query only approach.

RavenDB Indexing: An off the cuff stat

I am currently teaching a RavenDB Course, and we were just talking about indexing. In particular, Search Indexes, like the one below:


After we defined this guy, we took a look at the stats.

As you can see, indexing 1 million documents took just over 2 minutes (full text support enabled). More interestingly, you can see how we rapidly increased the number of items that we indexed, to finish indexing everything faster.


Quite nice.



What is making us slow (for the first time, after an idle period)?

We recently covered this question in several iterations on the RavenDB mailing list.

The actual content of the discussion wasn’t as interesting as the number of ways idle time can make your life… interesting. In order to avoid having issues with idle time, you need to:

  • Disable IIS unloading for inactive websites.
  • Disable RavenDB unloading for inactive databases.
  • Make sure that the HD doesn’t spin down during inactivity.
  • Make sure that the system doesn’t go to idle / hibernation.
  • Check that the server hasn’t been paged.
  • Check that the CPU hasn’t moved to low power mode.
  • Check authentication timeouts.

In the end, it was actually the last one that caused the problem. By default, Windows Auth tokens expire after 15 minutes, so you have to re-authenticate again, and that may make the first query after a while a little slower.

Just for fun, by default, all of the above happen. And that is just when running on a physical machine. When running on VMs (or in the cloud), you need to do all of those checks for the VM and the host machines.

RavenDB Books Update

Yes, you read the title correctly. Books, as in plural. Currently I am aware of at least 4 different books about RavenDB that are in various working stages.

Here is what happens when I reviewed one of them Smile.


And no, I don’t think that I am going to tell you more at this time.



Reviewing LevelDB: Part XVII– Filters? What filters? Oh, those filters…

I was all set to finish this series when I realized that I missed something important: I didn’t cover, and indeed, didn’t understand, what filters are and how they are used.

As usual, once I actually got to the part of the code where this is actually handled (instead of ignoring it), it made a lot more sense:


And that, together with the rest of the code that I read, makes a lot more sense.

An SST is a sorted table, essentially. We have the keys and the blocks, etc. And we can find a value very quickly. But what if you could do this even more cheaply?

Using a bloom filter is a good way to never get false negatives, and it will drastically reduce the amount of work we have to do when the key isn’t in the SST (we only need to read the filter value, and don’t even need to do any additional checks). Quite elegant, even if I say so myself.

There is one thing to pay attention to: you can define your own comparator, in which case you must also define your own filter policy. If you use a comparator that ignores casing, you also need to provide a filter that ignores casing.
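To make the “never get false negatives” property concrete, here is a toy bloom filter sketch. This is not leveldb’s actual BloomFilterPolicy (the probing scheme below is my own simplification); it just shows why a key that was added can never be reported as absent:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// A toy bloom filter: k probes per key; a key is "maybe present" only if
// all k bits are set, so a key that was added can never be reported absent.
class ToyBloom {
  public:
    explicit ToyBloom(size_t bits) : bits_(bits, false) {}

    void Add(const std::string& key) {
        for (size_t i = 0; i < kProbes; i++)
            bits_[Probe(key, i)] = true;
    }

    // false => definitely absent; true => possibly present (false positives
    // are allowed, false negatives are not).
    bool MayContain(const std::string& key) const {
        for (size_t i = 0; i < kProbes; i++)
            if (!bits_[Probe(key, i)]) return false;
        return true;
    }

  private:
    static constexpr size_t kProbes = 4;
    size_t Probe(const std::string& key, size_t i) const {
        // double hashing: h1 + i * h2, the usual way to derive k probes
        size_t h1 = std::hash<std::string>()(key);
        size_t h2 = h1 * 0x9e3779b9 + 1;  // cheap derived second hash
        return (h1 + i * h2) % bits_.size();
    }
    std::vector<bool> bits_;
};
```

The comparator / filter policy pairing above follows directly: the filter is keyed on whatever the comparator considers equal, so both have to agree on things like casing.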

Reviewing LevelDB: Part XV–MemTables gets compacted too

As mentioned before, every 4 MB or so, we will retire the current memtable and switch to a new one, turning the current one into an immutable memtable, which will then be compacted to disk. This post is about that process. The work there is done by the surprisingly named DBImpl::CompactMemTable, which then calls WriteLevel0Table.

That method builds the actual table, using the same approach I already covered. Honestly, the code is pretty obvious now that I have spent enough time reading this. The only thing of interest is the DeleteObsoleteFiles() method, which looks interesting if only because it is one of the few places leveldb actually does real work with files. Mostly, however, ParseFileName is interesting:


It gives me a good way to look at the actual files being used. This will be important when we investigate how leveldb handles recovery.

Leveldb will consider a file obsolete if:

  • It is a log file, and isn’t of the current / previous log number.
  • It is a manifest file, and isn’t the current one (or later, if there are any).
  • It is a table file that is no longer in use. A table file may have been compacted away.
  • The lock file, info log and the current db file are never considered obsolete.
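My own rough reconstruction of what ParseFileName has to recognize looks something like this (simplified; the real function also validates the digits and handles a few more cases):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// The file names leveldb uses: "000005.log", "000123.sst" / ".ldb",
// "MANIFEST-000002", "CURRENT", "LOCK", "LOG".
enum FileType {
    kLogFile, kTableFile, kDescriptorFile,
    kCurrentFile, kLockFile, kInfoLogFile, kUnknown
};

FileType ParseFileName(const std::string& name, uint64_t* number) {
    *number = 0;
    if (name == "CURRENT") return kCurrentFile;
    if (name == "LOCK") return kLockFile;
    if (name == "LOG") return kInfoLogFile;
    if (name.rfind("MANIFEST-", 0) == 0) {  // starts with "MANIFEST-"
        *number = std::stoull(name.substr(9));
        return kDescriptorFile;
    }
    size_t dot = name.find('.');
    if (dot == std::string::npos) return kUnknown;
    *number = std::stoull(name.substr(0, dot));
    std::string ext = name.substr(dot);
    if (ext == ".log") return kLogFile;
    if (ext == ".sst" || ext == ".ldb") return kTableFile;
    return kUnknown;
}
```

Once you can map a directory listing to (type, number) pairs, the keep / delete rules above are just a filter over that list.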

All of this is interesting, but on a minor level. I think that we are nearing the end of the codebase now. The only thing that I really want to go into is the recovery parts, and that would pretty much be it.


Reviewing LevelDB: Part XVI–Recovery ain’t so tough?

This is how leveldb starts up a new db:


As you can imagine, this is quite useful to find out, since it means that everything we do on startup is recovery. I do wonder why we have to take a lock immediately, though. I don't imagine that anything else can try to make use of the db yet, since it hasn't been published to the outside world.

Recover is interesting (and yes, I know I wrote that a lot lately).

  • It starts by taking a lock on the db (a file system lock, by holding on to the LOCK file).
  • If the db does not exist, we call NewDB(), which will create a default MANIFEST-00001 file and a CURRENT file pointing at it.
  • If the db exists, we call VersionSet::Recover(); this starts by reading the CURRENT file, which points to the appropriate MANIFEST file.
  • This then gives us the latest status of the db versions; in particular, it tells us what files belong to what levels and what their ranges are.
  • We check that the comparator we use is identical to the one used when creating the db.
  • The code makes it look like there might be multiple version records in the MANIFEST file, but I don't recall that. It might be just the way the API is structured, though. I just checked with the part that writes it, and yes, it should have just one entry.
  • Once we have our versions, what next? We check that all of the expected files are actually there, because otherwise we might have a corrupted install.
  • The final thing we do is look for log files that are later than the latest we have in the MANIFEST. I am not sure if this indicates recovery from a crash or a way to be compatible with an older version. I guess that one way this can happen is if you crashed midway while committing a transaction.
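That last step can be sketched as a tiny helper (my simplification of what DBImpl::Recover does with the directory listing):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Any log file with a number >= the one recorded in the MANIFEST was not
// yet folded into a table, so it must be replayed, oldest first.
std::vector<uint64_t> LogsToReplay(std::vector<uint64_t> logNumbers,
                                   uint64_t minLogNumber) {
    std::vector<uint64_t> out;
    for (uint64_t n : logNumbers)
        if (n >= minLogNumber) out.push_back(n);
    std::sort(out.begin(), out.end());  // replay in order, oldest first
    return out;
}
```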

When recovering, we force checksum checks, to make sure that we don't get corrupt data (which might very well be the case, since the data can just stop at any point here). The relevant code here is leveldb::log::Reader, which takes care of reading a potentially corrupt log file and reporting on its findings. I already went over how the log file is structured; the log reader just does the same thing in reverse, with a lot of paranoia. While reading from the log file, we build a memtable with the committed & safe changes. Here, too, we need to deal with memtable sizes, so we might generate multiple level 0 files during this process.

And... that is pretty much it.

I'll have another post summarizing this, and maybe another on what I intend to do with this information, but I'll keep that close to the chest for now.

Reviewing LevelDB: Part XIV– there is the mem table and then there is the immutable memtable

In leveldb we have the memtable, into which all of the current writes are going and the immutable memtable. So far, I haven't really checked what it actually means.

It looks like this happens on the write. By default, a memtable is limited to about 4 MB of space. We write to the memtable (backed by the tx log) until we get to the point where we go beyond the 4 MB limit (note that, again, if you have large values, you might go much higher than 4 MB). Then we switch to another memtable, change the existing memtable to be the immutable one, and move on.

Something that might trip you up is that if you write 2 big values one after the other, in separate batches, the second one might need to wait for the compaction to complete.
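The decision being made on each write can be sketched like so (my simplified reading of DBImpl::MakeRoomForWrite; the real code also slows down or stops writes based on level-0 file counts):

```cpp
#include <cassert>
#include <cstddef>

enum WriteAction { kProceed, kSwitchMemtable, kWaitForCompaction };

// budget is ~4 MB by default; immutableStillCompacting means the previous
// memtable switch hasn't finished flushing to a level 0 table yet.
WriteAction RoomForWrite(size_t memUsage, size_t budget,
                         bool immutableStillCompacting) {
    if (memUsage <= budget)
        return kProceed;  // current memtable still has room
    if (immutableStillCompacting)
        return kWaitForCompaction;  // this is the stall described above
    return kSwitchMemtable;  // freeze current memtable, start a fresh one
}
```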

Here is the test code:

   std::string big(1024 * 1024 * 5, 'a');
   for(int i = 0; i < 10; i++ ){
       std::clock_t start = std::clock();
       std::ostringstream o;
       o << "test" << i;
       std::string key = o.str();

       db->Put(writeOptions, key, big);
       std::clock_t end = std::clock();
       std::cout << i << ": " << 1000.0 * (end - start) / CLOCKS_PER_SEC << "ms " << std::endl;
   }

And the output is:

0: 20ms
1: 30ms
2: 50ms
3: 50ms
4: 50ms
5: 50ms
6: 40ms
7: 60ms
8: 50ms
9: 50ms

Not really that interesting, I'll admit. But when I tried it with 50 MB for each value, it felt like the entire machine ground to a halt. Considering the amount of times memory is copied around, and that it needs to be saved to at least two locations (log file & sst), that makes a lot of sense.

For reference, those were my timings, but I am not sure that I trust them.

0: 170ms
1: 310ms
2: 350ms
3: 460ms
4: 340ms
5: 340ms
6: 280ms
7: 530ms
8: 400ms
9: 1200ms


Reviewing LevelDB: Part XIII–Smile, and here is your snapshot

Okay, now that I know how data actually gets to the disk and back from it, it is time to read how leveldb handles snapshots. Snapshots seem to be very simple on the surface. On every write, we generate a sequence number. We store the number locally and use the oldest still living snapshot as the oldest version that we have to retain when doing compaction work.

I had a hard time figuring out how it worked with regards to using this in memory. Consider the following code:

     leveldb::WriteOptions writeOptions;
     writeOptions.sync = true;
     db->Put(writeOptions, "test", "one");
     const leveldb::Snapshot* snapshot = db->GetSnapshot();
     db->Put(writeOptions, "test", "two");

     leveldb::ReadOptions readOptions;
     readOptions.snapshot = snapshot;
     std::string val;
     leveldb::Status status = db->Get(readOptions, "test", &val);

This will properly give val == “one” when we execute it.

As it turned out, I missed something when I read the code earlier for MemTables. The value that is actually stored as the key is [key][tag]. And the tag is the sequence number + write type. And because of the way tags are compared (descending, always), higher sequence numbers are sorted first. What that means in turn is that unless you specify a specific snapshot number (which is what this tag contains, most of the time), you are going to get the latest version. But if you specify a snapshot number, you’ll get the value that was there as of that snapshot.
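The tag layout, as I read it, can be sketched like so (sequence number in the high 56 bits, value type in the low 8):

```cpp
#include <cassert>
#include <cstdint>

// Pack a sequence number and a write type (e.g. 1 = value, 0 = deletion)
// into the 64 bit tag that gets appended to the user key in the memtable.
uint64_t PackTag(uint64_t sequence, uint8_t type) {
    return (sequence << 8) | type;
}

uint64_t SequenceOf(uint64_t tag) { return tag >> 8; }

// For the same user key, the internal comparator orders larger sequence
// numbers first, which is why a read without a snapshot sees the latest
// version, while a read with snapshot N skips everything newer than N.
```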

And that, in turn, means that we can write code like this:


Where key.memtable_key() contains the required snapshot value. So we can just skip all the ones larger than what we want.

That is really cool, but what about when we go to disk? Pretty much in the same way. The actual key value include the sequence & tag value. But the comparator knows to filter it out when needed. This is quite nice, and an elegant way to handle this situation.

Reviewing LevelDB: Part XII–Reading an SST

After figuring out how the TableCache works, I now want to dig a bit deeper into what the Tables are. That means that we are starting out in Table::Open. And this starts with:


So we start by reading the footer. Then we read the index block:


Construct a new table, and we are pretty much set. I went back to look at the TableBuilder, and I think that I am getting things better. There is the BlockHandle, which is basically just the offset / size in the file.

Then we have the actual index, which is located at the end of the file. This one has the format of:

key, offset, size
key, offset, size
key, offset, size
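The offset / size pair is the BlockHandle mentioned above; it is stored as two varint64 values. A sketch of that encoding (the standard 7-bits-per-byte varint, which is also what leveldb uses):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v as a varint64: low 7 bits per byte, high bit set on all bytes
// except the last.
void PutVarint64(std::string* dst, uint64_t v) {
    while (v >= 128) {
        dst->push_back(static_cast<char>((v & 127) | 128));
        v >>= 7;
    }
    dst->push_back(static_cast<char>(v));
}

// A BlockHandle is just (offset, size) in the file, varint encoded.
std::string EncodeBlockHandle(uint64_t offset, uint64_t size) {
    std::string dst;
    PutVarint64(&dst, offset);
    PutVarint64(&dst, size);
    return dst;
}
```

Small blocks near the start of the file therefore cost only a couple of bytes per index entry.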

The only problem is that I don't see yet where this is actually used. And here it is, inside Table::InternalGet.


So we seek into the block data. And that matches pretty closely what I would expect here. And then we have this:



Some notes about this code: it looks like we are going to be loading the entire block into memory. But what happens if we have a very large value? For that matter, what happens if we have a very large value next to a very small value in the same block, but we wanted the small value?

I think that I am wrong here. Looking at the code, I found this comment, which I previously missed:


And that explains it really nicely with regards to how blocks works in general. The cool part happens inside Block::Iter::Seek, we first do a binary search for the prefixes that we have inside the block, and then a linear search inside the restart section. By default, there are 16 items per restart, so that is going to generate a really fast turnaround in general.
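That two-phase lookup can be sketched like this. It is a simplification in which every entry stores its full key; real blocks prefix-compress the entries between restart points, which is exactly why the restart points (which hold full keys) are the only places a binary search can land:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified Block::Iter::Seek: binary search over the restart points
// (every restartInterval-th entry), then a linear scan from the chosen
// restart point. Assumes keys is sorted and non-empty.
size_t SeekInBlock(const std::vector<std::string>& keys,
                   size_t restartInterval, const std::string& target) {
    size_t lo = 0, hi = (keys.size() - 1) / restartInterval;
    while (lo < hi) {
        size_t mid = (lo + hi + 1) / 2;  // biased up: last restart <= target
        if (keys[mid * restartInterval] <= target)
            lo = mid;
        else
            hi = mid - 1;
    }
    size_t i = lo * restartInterval;
    while (i < keys.size() && keys[i] < target)
        i++;
    return i;  // index of the first key >= target (keys.size() if none)
}
```

With the default of 16 entries per restart, the linear part is bounded at 16 comparisons, which is where the fast turnaround comes from.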

One key point I think bears repeating is with regards to my previous comment about sizes. We don't actually ever copy the entire block to memory; instead, we rely on the OS to do so for us, because the files are actually memory mapped. That means that we can easily access them with very little cost, even if we have mixed big & small values on the same block. That is because we can just skip the large value and not touch it, and rely on the OS to page the right pages in for us. That said, when reading a block, not just iterating over it, we are going to allocate it in memory:


So I am not sure about that. I think that I was mistaken previously. The only problem is that the buf is actually used for scratch purposes. I think that this is right on both ends; looking at the code, we have PosixRandomAccessFile:


This is the standard approach, actually. Read it from the disk into the buffer, and go on with your life. But we also have the PosixMMapReadableFile implementation:


And this is a lot more reasonable and in line with what I was thinking. We don't use the scratch at all.

However, we still need to allocate this scratch buffer on every single read. Looking at the code, the decision on what to allocate seems to be done here:


As you can see, it is mmap_limit_ that will make that determination. That basically limits us to no memory maps on 32 bits (makes sense, the 2 GB virtual address space is really small) and to a max of 1,000 mapped files on 64 bits. Given that I am assuming that you are going to run this on 64 bits a lot more often than on 32 bits (at least for server apps), it would make more sense...

Stopping right here. Leveldb is being used in the browser as well, and presumably on mobile devices, etc. That means that you can't really assume/require 64 bits. And given that most of the time, you are going to have blocks of up to 4 KB in size (except if you have very large keys), I think that this is reasonable. I would probably have done away with allocating the buffer in the happy case, but that is beside the point, probably, since most of the time I assume that the value are small enough.

I am looking at this through the eyes of someone who deals with larger values all the time, so it triggers a lot of introspection for me. This is how we actually read a value from disk; I also managed to figure out how we write, and in what format. All together, a good day's (actually, it was mostly at night) fun.

Reviewing LevelDB: Part XI–Reading from Sort String Tables via the TableCache

In the previous post, I focused mostly on reading the code for writing an SST to disk. But I have to admit that I am not really following how you read them back in an efficient manner.

In order to understand that, I think that the right place in the code is the TableCache. The API for that is pretty slick; here is just the header (comments stripped):

class TableCache {
 public:
  TableCache(const std::string& dbname, const Options* options, int entries);

  Iterator* NewIterator(const ReadOptions& options,
                        uint64_t file_number,
                        uint64_t file_size,
                        Table** tableptr = NULL);

  Status Get(const ReadOptions& options,
             uint64_t file_number,
             uint64_t file_size,
             const Slice& k,
             void* arg,
             void (*handle_result)(void*, const Slice&, const Slice&));

  void Evict(uint64_t file_number);

 private:
  Status FindTable(uint64_t file_number, uint64_t file_size, Cache::Handle**);
};

There are a couple of interesting things here that show up right away. How do you know what the number of entries is? I guess that this is stored externally, but I am not sure where. I am going to figure that question out first.

And the answer is strange:


It isn't the number of entries in the table, it is the number of files? That actually says something very important, since it means that the table cache is reading multiple SST files, rather than one per cache. Looking at the rest of the API, it makes even more sense. We need to pass the number of the file that we are going to be reading. That one comes from the Version we are using.

Side note: I just spent an hour or so setting up a tryout project for leveldb, so I can actually execute the code and see what it does. This had me learning cmake (I am using KDevelop to read the code), and I finally got it working. I still haven't figured out how to step into leveldb's code, but that is a good step.

Also, the debug experience is about 2/10 compared to VS.

And damn, I am so not a C++ developer any longer. Then again, never was a real dev on linux, anyway.

The first thing the TableCache does is set up a cache. This is interesting; I followed the code, and it creates 16 internal caches and hashes keys between them. I think that this is done to get concurrency, because all the internal cache methods look like this:


Note the mutex lock. That is pretty much how it works for all of the cache code. Looking deeper into the code, I can see another reason why I should be grateful for staying away from C++: leveldb comes with its own hash table functionality. At any rate, the cache uses LRU to evict items, and it gets the number of entries from the value that was passed in the ctor. That makes me think that it holds the opened files.

Speaking of the cache, here is an example of the code using it:


The cache is also used to do block caching, which is why it takes a slice as an argument. I'm going to go over that later, because it looks quite interesting. The rest of this method looks like this:


So the only really interesting bit is Table::Open. The rest is just opening the mmap file and storing it in the cache. Note that the actual file size is passed in externally; I am not really sure what that means yet. I'll go over the table code later, right now I want to focus on the table cache.

Looking over the TableCache interface, we can see that we always get the file size from outside. That got me curious enough to figure out why. And the answer is that we always have it in the FileMetaData that we previously explored. I am not sure why that is so important, but I'll ignore it for now.

The rest of TableCache is pretty simple, although this made me grin:


More specifically, look at the RegisterCleanup call; this is basically registering for the delete event, so they can unref the cache. Pretty nice, all around.

The rest of the code is just forwarding calls to the Table, so that is what I'll be reading next...

Reviewing LevelDB: Part X–table building is all fun and games until…

It seems like I have been trying to get to the piece that actually writes to disk for a very long while.


The API from the header file looks to be pretty nice. I would assume that you would call it something like this:

TableBuilder builder(options, writableFile);
builder.Add("one", "bar");
builder.Add("two", "foo");

There are some things that look funky. Like Flush(), which has a comment that I don't follow: "Can be used to ensure that two adjacent entries never live in the same data block."

Oh well, maybe I need to actually read the code first?

And immediately we have this:


I think that this explains why you have Flush(): it forces a new block. Most of the time you let leveldb handle that, but you can also control it yourself if you really know what you are doing.

There is something called FilterBlockPolicy, but I don't get what it is for, ignoring for now.

Calling Add() does some really nice things with regards to the implementation. Let us say that we are storing keys with similar prefixes. That is very common.

When considering leveldb for RavenDB, we decided that we would store the data in this manner:

/users/ayende/data = {}

/users/ayende/metadata = {}

Now, when we want to add them, we add the first one and then the second one, but we have a chance to optimize things.

We can arrange things so that we would output the second key prefix-compressed: the length of the prefix shared with the previous key, the length of the non-shared remainder, and the value length, followed by just the non-shared bytes. (The original post showed the encoded entry here.)
Note that for the second key, we can reuse the first 14 characters ("/users/ayende/") of the previous entry. And we just saved some bytes on the disk.

Now, blocks are held in memory, and by default they are flushed every 4 KB or so.

Side note: looking at the code, it is really scary just how many times data is copied from one memory location to another in the leveldb codebase. Client code to WriteBatch, WriteBatch to MemTable, MemTable to TableBuilder, etc.

Okay, so now I am pretty sure that I am following what is going on when writing to disk. I am still not sure how you search in an SST. The data is sorted, but it looks like that would require multiple seeks. There are metadata & index blocks in there, but I am not quite sure how they work. I need to read the reader side of things to really get it. Well, I'll do that next time...

RavenDB Course Snippet

We got several requests for this, so here you go. A small snippet (10 minutes out of more than 12 hours!) of the RavenDB Course that is available for purchase.




Published at

Originally posted at

Comments (14)

Reviewing LevelDB: Part IX–Compaction is the new black

After going over the VersionSet, I understand how leveldb decides when to compact and what it decides to compact. More or less, at least.

This means that I now mostly can figure out what this does:

Status DBImpl::BackgroundCompaction()

A user may force manual compaction of a range, or we may have reasons of our own to decide to compact, based on leveldb heuristics. Either way, we get the Compaction object, which tells us what files we need to merge.

There is a check there for whether we can do a trivial compaction, that is, just move the file from the current level to level+1. The interesting thing is that we avoid doing that if it is going to cause issues in level+2 (requiring a more expensive compaction later on).

But the interesting work is done in DoCompactionWork, where we actually do compaction of complex data.

The code refers to snapshots for the first time that I have seen. We only merge values while respecting valid snapshots, so data doesn't "go away" for users who are holding a snapshot active.

The actual work starts in here:


This gives us the ability to iterate over all of the entries in all of the files that need compaction.

And then we go over it:


But note that on each iteration we check whether we need to CompactMemTable(); I followed the code there, and we finally write stuff to disk! I am quite excited about this, but I'll handle that in a future blog post. I want to see how actual compaction works first.

We then have this:


This is there to stop a compaction that would force a very expensive compaction next time, I think.

As a side note, this really drives me crazy:


Note that we have current_output() and FileSize() in here. I don't really care what naming convention you use, but I would really rather that you had one. If there is one for the leveldb code base, I can't say that I figured it out. It seems like mostly it is PascalCase, but every now & then we have this_style methods.

Back to the code, it took me a while to figure it out.


The input iterator will return values in sorted key order, which means that if you have the same key in multiple levels, we need to ignore the older values. After that is handled, we have this:


This is where we are actually writing stuff out to the SST file! This is quite exciting :-). I have been trying to figure that out for a while now.

The rest of the code in the function is mostly stats bookkeeping, but this looks important:


This generates the actual VersionEdit, which will remove all of the files that were compacted and add the new file that was created, producing a new Version in the VersionSet.

Good work so far, even if I say so myself. We actually got to where the SST files are being built. Now it is time to look at the code that builds those tables. Next post, Table Builder...

Reviewing LevelDB: Part VIII–What are the levels all about?

When I last left things, we were still digging into the code for Version. We figured out that it is a way to handle the list of files, and that it is the table cache that actually does the IO. For this post, I want to see where continuing to read the Version code will take us.

There are some interesting things going on in here. We use the version to do a lot more. For example, because it holds all of the files, we can use it to check whether there are overlaps in key ranges, which leveldb can use to optimize things later on.

I am not sure that I am following everything that is going on there, but I start to get the shape of things.

As I understand things right now, data is stored in the following levels:

  • MemTable – where all the current updates are going
  • Immutable MemTable – which is frozen, and probably in the process of being written out
  • Level 0 – probably just a dump of the mem table to disk, key ranges may overlap, so you may need to read multiple files to know where a particular value is.
  • Level 1 to 6 – files where keys cannot overlap, so you probably only need one seek to find things. But seeks are still charged if you need to go through multiple levels to find a value.

That makes sense, and it explains things like charging seeks, etc.

Now, I am currently going over VersionSet::Builder, which appears to be applying something called VersionEdit efficiently. I don't know what any of those things are yet...

VersionEdit appears to be a way to hold (and maybe store on disk?) data about pending changes to a version. This is currently not really useful because I still don't know how that is used. This means that I need to go deeper into the VersionSet code now.

VersionSet::LogAndApply looks like it is an interesting candidate for behavior.

I also found why we need VersionEdit to be persistable: we write it to the manifest file, probably as a way to figure out what is going on when we recover. LogAndApply writes the VersionEdit to a manifest file, then sets the current file to point to that manifest file.

I am currently reading the Recover function, and it seems to be applying things in reverse. It reads the CURRENT file which points to the current manifest, which contains the serialized VersionEdit contents. Those are read in turn, after which we apply the edit to the in memory state.

Toward the end of VersionSet.cc file, we start seeing things regards compaction. Like:

Iterator* VersionSet::MakeInputIterator(Compaction* c)

Compaction* VersionSet::PickCompaction()

I sort of follow and sort of don't follow what the code is doing there. It selects the appropriate set of files that need to be compacted. I'll admit I don't really get the actual logic, but hopefully that will become clearer when I read the rest of the code.

Compaction appears to work on level and level+1 files. I assume that it is going to read all of those files, merge them into level+1 and then delete (or mark for deletion) the existing files.

It is now close to midnight, my eyes are starting to blur, and the code has started doing gymnastics on the screen, so I'll call it a post for now.

Published at

Originally posted at