Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:


+972 52-548-6969

, @ Q c

Posts: 6,370 | Comments: 47,287

filter by tags archive

Reviewing LevelDBPart IX- Compaction is the new black

time to read 4 min | 726 words

After going over the VersionSet, I understand how leveldb decide when to compact and what it decide to compact. More or less, at least.

This means that I now mostly can figure out what this does:

Status DBImpl::BackgroundCompaction()

A user may force manual compaction of a range, or we may have reasons of our own to decide to compact, based on leveldb heuristics. Either way, we get the Compaction object, which tells us what files we need to merge.

There is a check there whatever we can do a trivial compaction, that is, just move the file from the current level to level+1. The interesting thing is that we avoid doing that if this is going to cause issues in level+2 (require more expensive compaction later on).

But the interesting work is done in DoCompactionWork, where we actually do compaction of complex data.

The code refers to snapshots for the first time that I see. We only merge values that are in a valid snapshot. So data doesn't “go away” for users. While holding a snapshot active.

The actual work starts in here:


This give us the ability to iterate over all of the entries in all of the files that need compaction.

And then we go over it:


But note that on each iteration we do a check if we need to CompactMemTable(); I followed the code there, and we finally write stuff to disk! I am quite excited about this, but I'll handle that in a future blog post. I want to see how actual compaction works.

We then have this:


This is there to stop a compaction that would force a very expensive compaction next time, I think.

As a side note, this really drive me crazy:


Note that we have current_output() and FileSize() in here. I don't really care what naming convention you use, but I would really rather that you had one. If there is one for the leveldb code base, I can't say that I figured it out. It seems like mostly it is PascalCase, but every now & then we have this_style methods.

Back to the code, it took me a while to figure it out.


Will return values in sorted key order, that means that if you have the same key in multiple levels, we need to ignore the older values. After this is happening, we now have this:


This is where we are actually writing stuff out to the SST file! This is quite exciting :-). I have been trying to figure that out for a while now.

The rest of the code in the function is mostly stuff related to stats book keeping, but this looks important:


This generate the actual VersionEdit, which will remove all of the files that were compacted and add the new file that was created as a new Version to the VersionSet.

Good work so far, even if I say so myself. We actually go to where we are building the SST files. Now it is time to look at the code that build those table. Next post, Table Builder...

Reviewing LevelDBPart VIII–What are the levels all about?

time to read 3 min | 567 words

When I last left things, we were still digging into the code for Version and figure out that this is a way to handle the list of files, and that it is the table cache that is actually doing IO. For this post, I want to see where continuing to read the Version code will take us.

There are some interesting things going on in here. We use the version to do a lot more. For example, because it holds all of the files, we can use it to check if they are overlaps in key spaces, which leveldb can use to optimize things later on.

I am not sure that I am following everything that is going on there, but I start to get the shape of things.

As I understand things right now. Data is stored in the following levels:

  • MemTable – where all the current updates are going
  • Immutable MemTable – which is frozen, and probably in the process of being written out
  • Level 0 – probably just a dump of the mem table to disk, key ranges may overlap, so you may need to read multiple files to know where a particular value is.
  • Level 1 to 6 – files where keys cannot overlap, so you probably only need one seek to find things. But seeks are still charged if you need to go through multiple levels to find a value.

That make sense, and it explains things like charging seeks, etc.

Now, I am currently going over VersionSet::Builder, which appears to be applying something called VersionEdit efficiently. I don't know what any of those things are yet...

VersionEdit appears to be a way to hold (and maybe store on disk?) data about pending changes to a version. This is currently not really useful because I still don't know how that is used. This means that I need to go deeper into the VersionSet code now.

VersionSet::LogAndApply looks like it is an interesting candidate for behavior.

I also found why we need VersionEdit to be persistable, we write it to the manifest file. Probably as a way to figure out what is going on when we recover. LogAndApply write the VersionEdit to a manifest file, then set the current file to point to that manifest file.

I am currently reading the Recover function, and it seems to be applying things in reverse. It reads the CURRENT file which points to the current manifest, which contains the serialized VersionEdit contents. Those are read in turn, after which we apply the edit to the in memory state.

Toward the end of VersionSet.cc file, we start seeing things regards compaction. Like:

Iterator* VersionSet::MakeInputIterator(Compaction* c)

Compaction* VersionSet::PickCompaction()

I sort of follow and sort of doesn't follow what the code is doing there. It selects the appropriate set of files that need to be compacted. I don't really get the actual logic, I'll admit, but hopefully that will become better when I read the rest of the code.

Compaction appears to work on level and level+1 files. I assume that it is going to read all of those files, merge them into level+1 and then delete (or mark for deletion) the existing files.

This is now close to midnight, and my eyes started building and the code started to do gymnastics on the screen, so I'll call it a post for now.

Reviewing LevelDBPart VII–The version is where the levels are

time to read 7 min | 1272 words

Okay, so far I have written 6 parts, and the only thing that happened is that we wrote some stuff to the log file. That is cool, but I am assuming that there has got to be more. I started tracking the code, and I think that what happens is that we have compactions of the MemTable, at which point we flush it to disk.

I think that what happens is this, we have a method called MaybeScheduleCompaction, in db_impl.cc, which is kicking of the actual process for the MemTable compaction. This is getting called from a few places, but most importantly, it is called during the Write() call. Reading the code, it seems that before we can go to the actual compaction work, we need to look at something called a VersionSet. This looks like it holds all of the versions of the database at a particular point in time. Including all the files that it is using, etc.

A lot of what it (and its associate, the Version class) is about managing lists of this structure:


I am not sure what allowed_seeks mean, I assume it is there to force compaction for the next level.

Okay, moving on to version, it looks this is where all the actual lookups are done. We have a list of file metadata, including smallest & largest keys in each file. That allows us to find the appropriate files to look at quite easily. There seems to be some interaction between Version and TableCache, but I’m not going into that now.

A version is holding an array of 7 levels, and at each level we have the associated files. I am going to continue digging into Version & VersionSet for the moment.

Side Note: In fact, I got frustrated enough with trying to figure out leveldb on Windows that I setup a Ubunto machine with KDevelop just to get things done. This blog post is actually written on the Ubunto machine (later to be copy into live writer :-)).

I am still in the process of going through the code. It is a really much easier to do this in an IDE that can actually build & understand the code.

Once thing that I can tell you right now is that C++ programmers are strange. I mean, take a look at the following code, from Version::LevelFileNumIterator :


This returns a byte array containing encoded file num & size in a buffer. Would it be so hard to create a struct for that or use std::pair ? Seems like this would complicate the client code. Then again, maybe there is a perf reason that I am not seeing?

Then again, here is the client code:


And that seems pretty clear.

So far, it appears as if the Version is the current state of all of the files in a particular point in time. I think that this is how leveldb implements snapshots. The files are SSTables, which are pretty much write once only. A version belong to a set (not sure exactly what that means yet) and is part of a linked list. Again, I am not sure what is the purpose of that yet.

I'll need to do a deeper dive into snapshots in general, later on, because it is interesting to see how that is implemented with regards to the memtable.

Moving back to the actual code, we have this code:


This seems to me to indicate that the table_cache is the part of the code that is actually manages the SSTables, probably using some variant of the page pool.

Now, let us get to the good parts, Version::Get:


This looks like this is actually doing something useful. In fact, it find the relevant files to look for that particular key, once it did that, it calls:


So the data is actually retrieved from the cache, as expected. But there was an interesting comment there about “charging” seeks for files, so I am going to be looking at who is calling Version::Get right now, then come back to the cache in another post.

What is interesting is that we have this guy:


And that in turn all make sense now. allowed_seeks is something that is set when we apply a VersionEdit, it seems. No idea what this is now, but there is a comment there that explains that we use this as a way to trigger compaction when it is cheaper to do do compaction than continue doing those seeks. Interestingly enough, seeks are only counted if we have to go through more than one file to find a value, which makes sense, I guess.

Okay, now let us back up a bit and see who is calling Version::Get. And as it turned out, it is our dear friend, DBImpl::Get().

There, we first look in the current memtable, then in the immutable memtable (which is probably on its way to become a SSTable now. And then we are looking at the current Version, calling Version::Get. If we actually hit the version, we also call Version::UpdateStats, and if we need to, we then call MaybeScheduleCompaction(), which is where we started this post.

And... that is it for this post, we still have managed to find where we actually save to disk (they hid it really deep), but I think I'll probably be able to figure this out in this sitting, watch out for the next post.

Reviewing LevelDBPart VI, the Log is base for Atomicity

time to read 23 min | 4442 words

Here we are starting to get into the interesting bits. How do we actually write to disk. There are two parts of that. The first part is the log file. This is were all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes which are involved in this manner. leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

   1: // A file abstraction for sequential writing.  The implementation
   2: // must provide buffering since callers may append small fragments
   3: // at a time to the file.
   4: class WritableFile {
   5:  public:
   6:   WritableFile() { }
   7:   virtual ~WritableFile();
   9:   virtual Status Append(const Slice& data) = 0;
  10:   virtual Status Close() = 0;
  11:   virtual Status Flush() = 0;
  12:   virtual Status Sync() = 0;
  14:  private:
  15:   // No copying allowed
  16:   WritableFile(const WritableFile&);
  17:   void operator=(const WritableFile&);
  18: };

Pretty simple, overall. There is the buffering requirement, but that is pretty easy overall. Note that this is a C++ interface. There is a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is Posix code that I am reading, and I have to do a lot of lookup into the man pages. The implementation isn’t that interesting, to be fair, and full of mmap files on posix minutia. So I am going to skip it.

I wonder why the choice was map to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Apend() just writes the values to the mmap file, flush is a no op, and Sync() actually ask the file system to write the values to disk and wait on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads) and that gives leveldb the benefit of using the OS memory manager as the buffer.

Now that we got what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick, it has a single public method:

   1: Status AddRecord(const Slice& slice);

As a remind, those two are used together in the DBImpl::Write() method, like so:

   1: status = log_->AddRecord(WriteBatchInternal::Contents(updates));
   2: if (status.ok() && options.sync) {
   3:  status = logfile_->Sync();
   4: }

From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

   1: Status Writer::AddRecord(const Slice& slice) {
   2:   const char* ptr = slice.data();
   3:   size_t left = slice.size();
   5:   // Fragment the record if necessary and emit it.  Note that if slice
   6:   // is empty, we still want to iterate once to emit a single
   7:   // zero-length record
   8:   Status s;
   9:   bool begin = true;
  10:   do {
  11:     const int leftover = kBlockSize - block_offset_;
  12:     assert(leftover >= 0);
  13:     if (leftover < kHeaderSize) {
  14:       // Switch to a new block
  15:       if (leftover > 0) {
  16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
  17:         assert(kHeaderSize == 7);
  18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
  19:       }
  20:       block_offset_ = 0;
  21:     }
  23:     // Invariant: we never leave < kHeaderSize bytes in a block.
  24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
  26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
  27:     const size_t fragment_length = (left < avail) ? left : avail;
  29:     RecordType type;
  30:     const bool end = (left == fragment_length);
  31:     if (begin && end) {
  32:       type = kFullType;
  33:     } else if (begin) {
  34:       type = kFirstType;
  35:     } else if (end) {
  36:       type = kLastType;
  37:     } else {
  38:       type = kMiddleType;
  39:     }
  41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
  42:     ptr += fragment_length;
  43:     left -= fragment_length;
  44:     begin = false;
  45:   } while (s.ok() && left > 0);
  46:   return s;
  47: }

Let us see if we do a lot here. But I don’t know yet what is going on. From the first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to enter zero length records (no idea what that is for?maybe compactions?).

It appears that we write in blocks of 32Kb at a time. Line 12 – 21 are dealing with how to finalize the block when you have no more space. (Basically fill in with nulls).

Lines 26 – 40 just set the figure out what the type of the record that we are going to work (a full record, all of which can sit in a single buffer, a first record, which is the start in a sequence of items or middle / end, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?

I looked at EmitPhysicalRecord to see what we have there and it is nothing much, it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I’ll start looking at the reading portion of the code. But that will be in another post.

Reviewing LevelDBPart V, into the MemTables we go

time to read 2 min | 263 words

You can read about the theory of Sorted Strings Tables and Memtables here. In this case, what I am interested in is going a bit deeper into the leveldb codebase, and understanding how the data is actually kept in memory and what is it doing there.

In order to do that, we are going to investigate MemTable. As it turned out, this is actually a very simple data structure. A MemTable just hold a SkipList, whish is a sorted data structure that allows O(log N) access and modifications. The interesting thing about Skip List in contrast to Binary Trees, is that it is much easier to create a performant solution of concurrent skip list (either with or without locks) over a concurrently binary tree.

The data in the table is just a list of key & value (or delete marker). And that means that searches through this can give you three results:

  • Here is the value for the key (exists)
  • The value for the key was remove (deleted)
  • The value is not in the memory table (missing)

It is the last part where we get involved with the more interesting aspect of LevelDB (and the reason it is called leveldb in the first place). The notion that you have multiple levels. The mem table is the first one, and then you spill the output out to disk (the Sorted Strings Table). Now that I figure out how simple MemTable is really is, I am going to take a look at the leveldb log, and then dive into Sorted Strings Table.

Reviewing LevelDBPart IV

time to read 3 min | 449 words

his is a bit of a side track. One of the things that is quite clear to me when I am reading the leveldb code is that I was never really any good at C++. I was a C/C++ developer. And that is a pretty derogatory term. C & C++ share a lot of the same syntax and underlying assumption, but the moment you want to start writing non trivial stuff, they are quite different. And no, I am not talking about OO or templates.

I am talking about things that came out of that. In particular, throughout the leveldb codebase, they are very rarely, if at all, allocate memory directly. Pretty much the whole codebase rely on std::string to handle buffer allocations and management. This make sense, since RAII is still the watch ward for good C++ code. Being able to utilize std::string for memory management also means that the memory will be properly released without having to deal with it explicitly.

More interestingly, the leveldb codebase is also using std::string as a general buffer. I wonder why it is std::string vs. std::vector<char>,  which would bet more reasonable, but I guess that this is because most of the time, users will want to pass strings as keys, and likely this is easier to manage, given the type of operations available on std::string (such as append).

It is actually quite fun to go over the codebase and discover those sort of things. Especially if I can figure them out on my own Smile.

This is quite interesting because from my point of view, buffers are a whole different set of problems. We don’t have to worry about the memory just going away in .NET (although we do have to worry about someone changing the buffer behind our backs), but we have to worry a lot about buffer size. This is because at some point (80Kb), buffers graduate to the large object heap, and stay there. Which means, in turn, that every time that you want to deal with buffers you have to take that into account, usually with a buffer pool.

Another aspect that is interesting with regards to memory usage is the explicit handling of copying. There are various places in the code where the copy constructor was made private, to avoid this. Or a comment is left about making a type copy-able intentionally. I get the reason why, because it is a common failing point in C++, but I forgot (although I am pretty sure that I used to know) the actual semantics of when/ how you want to do that in all cases.

Reviewing LevelDBPart III, WriteBatch isn’t what you think it is

time to read 26 min | 5198 words

One of the key external components of leveldb is the idea of WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

   1: leveldb::WriteBatch batch;
   2: batch.Delete(key1);
   3: batch.Put(key2, value);
   4: s = db->Write(leveldb::WriteOptions(), &batch);

As we have learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into a single WriteBatch, then there is some batching involved across multiple batches, but that is beside the point right now.

I dove into the code for WriteBatch, and immediately I realized that this isn’t really what I bargained for. In my mind, WriteBatch was supposed to be something like this:

   1: public class WriteBatch
   2: {
   3:    List<Operation> Operations;
   4: }

Which would hold the in memory operations until they get written down to disk, or something.

Instead, it appears that leveldb took quite a different route. The entire data is stored in the following format:

   1: // WriteBatch::rep_ :=
   2: //    sequence: fixed64
   3: //    count: fixed32
   4: //    data: record[count]
   5: // record :=
   6: //    kTypeValue varstring varstring         |
   7: //    kTypeDeletion varstring
   8: // varstring :=
   9: //    len: varint32
  10: //    data: uint8[len]

This is the in memory value, mind. So we are already storing this in a single buffer. I am not really sure why this is the case, to be honest.

WriteBatch is pretty much a write only data structure, with one major exception:

   1: // Support for iterating over the contents of a batch.
   2: class Handler {
   3:  public:
   4:   virtual ~Handler();
   5:   virtual void Put(const Slice& key, const Slice& value) = 0;
   6:   virtual void Delete(const Slice& key) = 0;
   7: };
   8: Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

   1: Status WriteBatch::Iterate(Handler* handler) const {
   2:   Slice input(rep_);
   3:   if (input.size() < kHeader) {
   4:     return Status::Corruption("malformed WriteBatch (too small)");
   5:   }
   7:   input.remove_prefix(kHeader);
   8:   Slice key, value;
   9:   int found = 0;
  10:   while (!input.empty()) {
  11:     found++;
  12:     char tag = input[0];
  13:     input.remove_prefix(1);
  14:     switch (tag) {
  15:       case kTypeValue:
  16:         if (GetLengthPrefixedSlice(&input, &key) &&
  17:             GetLengthPrefixedSlice(&input, &value)) {
  18:           handler->Put(key, value);
  19:         } else {
  20:           return Status::Corruption("bad WriteBatch Put");
  21:         }
  22:         break;
  23:       case kTypeDeletion:
  24:         if (GetLengthPrefixedSlice(&input, &key)) {
  25:           handler->Delete(key);
  26:         } else {
  27:           return Status::Corruption("bad WriteBatch Delete");
  28:         }
  29:         break;
  30:       default:
  31:         return Status::Corruption("unknown WriteBatch tag");
  32:     }
  33:   }
  34:   if (found != WriteBatchInternal::Count(this)) {
  35:     return Status::Corruption("WriteBatch has wrong count");
  36:   } else {
  37:     return Status::OK();
  38:   }
  39: }

So we write it directly to a buffer, then read from that buffer. The interesting bit is that the actual writing to leveldb itself is done in a similar way, see:

   1: class MemTableInserter : public WriteBatch::Handler {
   2:  public:
   3:   SequenceNumber sequence_;
   4:   MemTable* mem_;
   6:   virtual void Put(const Slice& key, const Slice& value) {
   7:     mem_->Add(sequence_, kTypeValue, key, value);
   8:     sequence_++;
   9:   }
  10:   virtual void Delete(const Slice& key) {
  11:     mem_->Add(sequence_, kTypeDeletion, key, Slice());
  12:     sequence_++;
  13:   }
  14: };
  16: Status WriteBatchInternal::InsertInto(const WriteBatch* b,
  17:                                       MemTable* memtable) {
  18:   MemTableInserter inserter;
  19:   inserter.sequence_ = WriteBatchInternal::Sequence(b);
  20:   inserter.mem_ = memtable;
  21:   return b->Iterate(&inserter);
  22: }

As I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.

It took me a while to figure it out, but I think that I finally got it. The reason this is the case is that leveldb is a C++ application. As such, memory management is something that it needs to worry about explicitly.

In particular, you can’t just rely on the memory you were passed to be held, the user may release that memory after they called to Put. This means, in turn, that you must copy the memory to memory that leveldb allocated, so leveldn can manage its own lifetime. This is a foreign concept to me because it is such a strange thing to do in the .NET land, where memory cannot just disappear underneath you.

On my next post, I’ll deal a bit more with this aspect, buffers management and memory handling in general.

Reviewing LevelDBPart II, Put some data on the disk, dude

time to read 27 min | 5257 words

I think that the very first thing that we want to do is to actually discover how exactly is leveldb saving the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

   1: leveldb::DB* db;
   2: leveldb::DB::Open(options, "play/testdb", &db);
   3: status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

   1: // Default implementations of convenience methods that subclasses of DB
   2: // can call if they wish
   3: Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
   4:   WriteBatch batch;
   5:   batch.Put(key, value);
   6:   return Write(opt, &batch);
   7: }
   9: Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  10:   WriteBatch batch;
  11:   batch.Delete(key);
  12:   return Write(opt, &batch);
  13: }

I included the Delete method as well, because this code teaches us something important, all the modifications calls are always going through the same WriteBatch call. Let us look at that now.

   1: Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
   2:   Writer w(&mutex_);
   3:   w.batch = my_batch;
   4:   w.sync = options.sync;
   5:   w.done = false;
   7:   MutexLock l(&mutex_);
   8:   writers_.push_back(&w);
   9:   while (!w.done && &w != writers_.front()) {
  10:     w.cv.Wait();
  11:   }
  12:   if (w.done) {
  13:     return w.status;
  14:   }
  16:   // May temporarily unlock and wait.
  17:   Status status = MakeRoomForWrite(my_batch == NULL);
  18:   uint64_t last_sequence = versions_->LastSequence();
  19:   Writer* last_writer = &w;
  20:   if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
  21:     WriteBatch* updates = BuildBatchGroup(&last_writer);
  22:     WriteBatchInternal::SetSequence(updates, last_sequence + 1);
  23:     last_sequence += WriteBatchInternal::Count(updates);
  25:     // Add to log and apply to memtable.  We can release the lock
  26:     // during this phase since &w is currently responsible for logging
  27:     // and protects against concurrent loggers and concurrent writes
  28:     // into mem_.
  29:     {
  30:       mutex_.Unlock();
  31:       status = log_->AddRecord(WriteBatchInternal::Contents(updates));
  32:       if (status.ok() && options.sync) {
  33:         status = logfile_->Sync();
  34:       }
  35:       if (status.ok()) {
  36:         status = WriteBatchInternal::InsertInto(updates, mem_);
  37:       }
  38:       mutex_.Lock();
  39:     }
  40:     if (updates == tmp_batch_) tmp_batch_->Clear();
  42:     versions_->SetLastSequence(last_sequence);
  43:   }
  45:   while (true) {
  46:     Writer* ready = writers_.front();
  47:     writers_.pop_front();
  48:     if (ready != &w) {
  49:       ready->status = status;
  50:       ready->done = true;
  51:       ready->cv.Signal();
  52:     }
  53:     if (ready == last_writer) break;
  54:   }
  56:   // Notify new head of write queue
  57:   if (!writers_.empty()) {
  58:     writers_.front()->cv.Signal();
  59:   }
  61:   return status;
  62: }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seems to create a new Writer, not sure what that is yet, and register that in a class variable. Maybe it is actually being written on a separate thread?

I am going to switch over and look at that line of thinking .First thing to do is to look at the Writer implementation. This writer looks like this:

   1: struct DBImpl::Writer {
   2:   Status status;
   3:   WriteBatch* batch;
   4:   bool sync;
   5:   bool done;
   6:   port::CondVar cv;
   8:   explicit Writer(port::Mutex* mu) : cv(mu) { }
   9: };

So this is just a data structure with no behavior. Note that we have CondVar, whatever that is. Which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deep into this, but it appears like it is similar to the .NET lock variable. Except that there seems to be the ability to associate multiple variables with a single mutex. Which could be a useful way to signal on specific conditions. The basic idea is that you can wait for a specific operation, not just a single variable.

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, (a data structure merging properties of list & queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look where else this value is being used now. Back now, and it appears that the only place where writers_ are used in in this method or methods that it calls.

What this means in turn is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access. As I understand it. Calls to the Write() method would block on the mutex access, then it waits until its write is the current one (that is what the &w != writers_.front() means. Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.

Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room to the next write. I don’t really follow the code there yet, I’ll ignore that for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why that is, I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup, this method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.

The rest of the code is dealing with the actual write to the log  / mem table 20 – 45, then updating the status of the other writers we might have modified, as well as starting the writes for existing writers that may have not got into the current batch.

And I think that this is enough for now. We haven’t got to disk yet, I admit, but we did get a lot of stuff done. On my next post, I’ll dig even deeper, and try to see how the data is actually structured, I think that this would be interesting…

Reviewing LevelDBPart I, What is this all about?

time to read 2 min | 285 words

LevelDB is…

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

That is the project’s own definition. Basically, it is a way for users to store data in an efficient manner. It isn’t a SQL database. It isn’t even a real database in any sense of the word. What it is is a building block for building databases. It handles writing and reading to disk, and it supports atomicity. But anything else is on you (from transaction management to more complex items).

As such, it appears perfect for the kind of things that we need to do. I decided that I wanted to get to know the codebase, especially since at this time, I can’t even get it to compile Sad smile. The fact that this is a C++ codebase, written by people who eat & breath C++ for a living is another reason why. I expect that this would be a good codebase, so I might as well sharpen my C++-foo at the same time that I grok what this is doing.

The first thing to do is to look at the interface that the database provides us with:


That is a very small surface area, and as you can imagine, this is something that I highly approve of. It make it much easier to understand and reason about. And there is some pretty complex behavior behind this, which I’ll be exploring soon.

LevelDB & Windows: It ain’t a love story

time to read 1 min | 162 words

I have been investigating the LevelDB project for the purpose of adding another storage engine to RavenDB. The good news is that there is a very strong likelihood that we can actually use that as a basis for what we want.

The bad news is that it is insanely easy to get LevelDB to compile and work on Linux, and appears to be an insurmountable barrier to do the same on Windows.

Yes, I know that I can get it working by just using a precompiled binary, but that won’t work. I actually want to make some changes there (mostly in the C API, right now).

This instructions appears to be no longer current. And this thread was promising, but didn’t lead anywhere.

I am going to go over the codebase with a fine tooth comb, but I am no longer a C++ programmer, and the intricacies of the build system is putting a very high roadblock of frustration.


No future posts left, oh my!


  1. RavenDB 4.0 (4):
    19 May 2017 - Managing encrypted databases
  2. De-virtualization in CoreCLR (2):
    01 May 2017 - Part II
  3. re (17):
    26 Apr 2017 - Writing a Time Series Database from Scratch
  4. Performance optimizations (2):
    11 Apr 2017 - One step forward, ten steps back
  5. RavenDB Conference videos (12):
    03 Mar 2017 - Replication changes in 3.5
View all series



Main feed Feed Stats
Comments feed   Comments Feed Stats