Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,125 | Comments: 45,490

filter by tags archive

Reviewing LevelDBPart VI, the Log is base for Atomicity

time to read 23 min | 4442 words

Here we are starting to get into the interesting bits. How do we actually write to disk. There are two parts of that. The first part is the log file. This is were all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes which are involved in this manner. leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

   1: // A file abstraction for sequential writing.  The implementation
   2: // must provide buffering since callers may append small fragments
   3: // at a time to the file.
   4: class WritableFile {
   5:  public:
   6:   WritableFile() { }
   7:   virtual ~WritableFile();
   9:   virtual Status Append(const Slice& data) = 0;
  10:   virtual Status Close() = 0;
  11:   virtual Status Flush() = 0;
  12:   virtual Status Sync() = 0;
  14:  private:
  15:   // No copying allowed
  16:   WritableFile(const WritableFile&);
  17:   void operator=(const WritableFile&);
  18: };

Pretty simple, overall. There is the buffering requirement, but that is pretty easy overall. Note that this is a C++ interface. There is a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is Posix code that I am reading, and I have to do a lot of lookup into the man pages. The implementation isn’t that interesting, to be fair, and full of mmap files on posix minutia. So I am going to skip it.

I wonder why the choice was map to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Apend() just writes the values to the mmap file, flush is a no op, and Sync() actually ask the file system to write the values to disk and wait on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads) and that gives leveldb the benefit of using the OS memory manager as the buffer.

Now that we got what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick, it has a single public method:

   1: Status AddRecord(const Slice& slice);

As a remind, those two are used together in the DBImpl::Write() method, like so:

   1: status = log_->AddRecord(WriteBatchInternal::Contents(updates));
   2: if (status.ok() && options.sync) {
   3:  status = logfile_->Sync();
   4: }

From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

   1: Status Writer::AddRecord(const Slice& slice) {
   2:   const char* ptr = slice.data();
   3:   size_t left = slice.size();
   5:   // Fragment the record if necessary and emit it.  Note that if slice
   6:   // is empty, we still want to iterate once to emit a single
   7:   // zero-length record
   8:   Status s;
   9:   bool begin = true;
  10:   do {
  11:     const int leftover = kBlockSize - block_offset_;
  12:     assert(leftover >= 0);
  13:     if (leftover < kHeaderSize) {
  14:       // Switch to a new block
  15:       if (leftover > 0) {
  16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
  17:         assert(kHeaderSize == 7);
  18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
  19:       }
  20:       block_offset_ = 0;
  21:     }
  23:     // Invariant: we never leave < kHeaderSize bytes in a block.
  24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
  26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
  27:     const size_t fragment_length = (left < avail) ? left : avail;
  29:     RecordType type;
  30:     const bool end = (left == fragment_length);
  31:     if (begin && end) {
  32:       type = kFullType;
  33:     } else if (begin) {
  34:       type = kFirstType;
  35:     } else if (end) {
  36:       type = kLastType;
  37:     } else {
  38:       type = kMiddleType;
  39:     }
  41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
  42:     ptr += fragment_length;
  43:     left -= fragment_length;
  44:     begin = false;
  45:   } while (s.ok() && left > 0);
  46:   return s;
  47: }

Let us see if we do a lot here. But I don’t know yet what is going on. From the first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to enter zero length records (no idea what that is for?maybe compactions?).

It appears that we write in blocks of 32Kb at a time. Line 12 – 21 are dealing with how to finalize the block when you have no more space. (Basically fill in with nulls).

Lines 26 – 40 just set the figure out what the type of the record that we are going to work (a full record, all of which can sit in a single buffer, a first record, which is the start in a sequence of items or middle / end, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?

I looked at EmitPhysicalRecord to see what we have there and it is nothing much, it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I’ll start looking at the reading portion of the code. But that will be in another post.

More posts in "Reviewing LevelDB" series:

  1. (26 Apr 2013) Part XVIII–Summary
  2. (15 Apr 2013) Part XVII– Filters? What filters? Oh, those filters…
  3. (12 Apr 2013) Part XV–MemTables gets compacted too
  4. (11 Apr 2013) Part XVI–Recovery ain’t so tough?
  5. (10 Apr 2013) Part XIV– there is the mem table and then there is the immutable memtable
  6. (09 Apr 2013) Part XIII–Smile, and here is your snapshot
  7. (08 Apr 2013) Part XII–Reading an SST
  8. (05 Apr 2013) Part XI–Reading from Sort String Tables via the TableCache
  9. (04 Apr 2013) Part X–table building is all fun and games until…
  10. (03 Apr 2013) Part IX- Compaction is the new black
  11. (02 Apr 2013) Part VIII–What are the levels all about?
  12. (29 Mar 2013) Part VII–The version is where the levels are
  13. (28 Mar 2013) Part VI, the Log is base for Atomicity
  14. (27 Mar 2013) Part V, into the MemTables we go
  15. (26 Mar 2013) Part IV
  16. (22 Mar 2013) Part III, WriteBatch isn’t what you think it is
  17. (21 Mar 2013) Part II, Put some data on the disk, dude
  18. (20 Mar 2013) Part I, What is this all about?


Matt Warren

In case anyone else is interested, some of the reasons behind the log format are explained here http://leveldb.googlecode.com/git/doc/log_format.txt:

Some benefits over the recordio format:

(1) We do not need any heuristics for resyncing - just go to next block boundary and scan. If there is a corruption, skip to the next block. As a side-benefit, we do not get confused when part of the contents of one log file are embedded as a record inside another log file.

(2) Splitting at approximate boundaries (e.g., for mapreduce) is simple: find the next block boundary and skip records until we hit a FULL or FIRST record.

(3) We do not need extra buffering for large records.

Comment preview

Comments have been closed on this topic.


  1. RavenDB 3.5 Whirlwind tour: I need to be free to explore my data - one day from now
  2. RavenDB 3.5 whirl wind tour: I'll have the 3+1 goodies to go, please - 4 days from now
  3. The design of RavenDB 4.0: Voron has a one track mind - 5 days from now
  4. RavenDB 3.5 whirl wind tour: Digging deep into the internals - 6 days from now
  5. The design of RavenDB 4.0: Separation of indexes and documents - 7 days from now

And 11 more posts are pending...

There are posts all the way to May 30, 2016


  1. The design of RavenDB 4.0 (14):
    05 May 2016 - Physically segregating collections
  2. RavenDB 3.5 whirl wind tour (14):
    04 May 2016 - I’ll find who is taking my I/O bandwidth and they SHALL pay
  3. Tasks for the new comer (2):
    15 Apr 2016 - Quartz.NET with RavenDB
  4. Code through the looking glass (5):
    18 Mar 2016 - And a linear search to rule them
  5. Find the bug (8):
    29 Feb 2016 - When you can't rely on your own identity
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats