Ayende @ Rahien


Raven.Storage – that ain’t proper logging for our kind of environment

time to read 9 min | 1650 words

Continuing with my work on porting leveldb to .NET, we ran into another problem: the log file. The log file is pretty important, since this is how you ensure durability, so any problem there is a big cause for concern.

You can read a bit about the format used by leveldb here, but basically, it uses the following:

   block := record* trailer?
   record :=
     checksum: uint32    // crc32c of type and data[] ; little-endian
     length: uint16      // little-endian
     type: uint8         // One of FULL, FIRST, MIDDLE, LAST
     data: uint8[length]

Each block is 32KB in size.

The type can be Full, First, Middle or Last, since it is legitimate to split a record across multiple blocks. The reasoning behind this format is outlined in the link above.

It is also a format that assumes you know, upfront, the entire size of your record, so you can split it accordingly. That makes a lot of sense when working in C++ and passing buffers around.


This is straightforward in C++, where the API is basically:

Status Writer::AddRecord(const Slice& slice)

(Slice is basically just a byte array).
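To make the splitting concrete, here is a minimal C# sketch of how a record of known size maps onto Full / First / Middle / Last fragments across 32KB blocks. The names (LogFormat, SplitIntoFragments, RecordType) are made up for this example; this is not the leveldb or Raven.Storage code, just the idea behind it.

    // Illustrative only: how a record whose size is known upfront gets split
    // into fragments, one per contiguous run inside a 32KB block.
    using System;
    using System.Collections.Generic;

    public enum RecordType : byte { Full = 1, First = 2, Middle = 3, Last = 4 }

    public static class LogFormat
    {
        public const int BlockSize = 32 * 1024;   // 32KB blocks
        public const int HeaderSize = 4 + 2 + 1;  // checksum (uint32) + length (uint16) + type (uint8)

        // Yields (type, offset, length) fragments for a record of recordLength bytes,
        // given the writer's current position inside the current block.
        public static IEnumerable<Tuple<RecordType, int, int>> SplitIntoFragments(int recordLength, int positionInBlock)
        {
            var offset = 0;
            var first = true;
            do
            {
                var leftInBlock = BlockSize - positionInBlock;
                if (leftInBlock < HeaderSize)
                {
                    // Not even room for a header: the rest of the block becomes the trailer,
                    // and the next fragment starts on a fresh block.
                    positionInBlock = 0;
                    leftInBlock = BlockSize;
                }
                var available = leftInBlock - HeaderSize;
                var fragmentLength = Math.Min(recordLength - offset, available);
                var last = offset + fragmentLength == recordLength;

                var type = first && last ? RecordType.Full
                         : first         ? RecordType.First
                         : last          ? RecordType.Last
                         :                 RecordType.Middle;

                yield return Tuple.Create(type, offset, fragmentLength);

                offset += fragmentLength;
                positionInBlock += HeaderSize + fragmentLength;
                first = false;
            } while (offset < recordLength);
        }
    }

Because the total size is known before the first byte is written, the very first fragment can already be stamped Full or First; that is exactly the assumption that breaks down for us below.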

In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH (Large Object Heap). So we had to be a bit smarter about things. In particular, we had an interesting issue with streaming the results: if I want to write a document with a size of 100KB, how do I handle that?

Instead, I wanted it to look like this:

    var buffer = BitConverter.GetBytes(seq);
    await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
    buffer = BitConverter.GetBytes(opCount);
    await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);

    foreach (var operation in writes.SelectMany(writeBatch => writeBatch._operations))
    {
        buffer[0] = (byte) operation.Op;
        await state.LogWriter.WriteAsync(buffer, 0, 1);
        await state.LogWriter.Write7BitEncodedIntAsync(operation.Key.Count);
        await state.LogWriter.WriteAsync(operation.Key.Array, operation.Key.Offset, operation.Key.Count);
        if (operation.Op != Operations.Put)
            continue;
        // copy the stored value from the memtable into the log writer
        using (var stream = state.MemTable.Read(operation.Handle))
            await stream.CopyToAsync(state.LogWriter);
    }

The problem with this approach is that we don't know, upfront, what size we are going to end up with. This means that we don't know how to split the record, because we don't have the full record until it is over. And we don't want to (actually, can't) go back in the log and change things to set the record straight (pun intended).

What we ended up doing is this:


Note that we explicitly mark the start / end of the record, and in the meantime we can push however many bytes we want. Internally, we buffer up to 32KB in size (a bit less, actually, but good enough for now), and based on the next call, we decide whether the current block should be marked as good or bad.
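To make that concrete, here is a minimal sketch of what such a writer might look like. Everything here is hypothetical (LogWriterSketch, StartRecord, EndRecordAsync and the way fragments are flushed are invented names, not the actual Raven.Storage code), and a real writer in this format would also compute the crc32c checksum, write the fragment length and fill the block trailer, all of which is elided below.

    // Sketch of the "mark start / end, buffer up to a block" idea.
    using System;
    using System.IO;
    using System.Threading.Tasks;

    public class LogWriterSketch
    {
        private const int BlockSize = 32 * 1024;
        private const int HeaderSize = 7; // checksum + length + type

        private readonly Stream _output;
        private readonly byte[] _buffer = new byte[BlockSize - HeaderSize];
        private int _bufferPos;
        private bool _recordStarted; // have we already emitted a fragment for this record?

        public LogWriterSketch(Stream output)
        {
            _output = output;
        }

        public void StartRecord()
        {
            _recordStarted = false;
            _bufferPos = 0;
        }

        public async Task WriteAsync(byte[] bytes, int offset, int count)
        {
            while (count > 0)
            {
                if (_bufferPos == _buffer.Length)
                {
                    // The buffer is full and more data is coming, so the buffered
                    // chunk is either the First fragment (nothing emitted yet) or a Middle one.
                    await FlushFragmentAsync(_recordStarted ? RecordType.Middle : RecordType.First);
                }
                var toCopy = Math.Min(_buffer.Length - _bufferPos, count);
                Buffer.BlockCopy(bytes, offset, _buffer, _bufferPos, toCopy);
                _bufferPos += toCopy;
                offset += toCopy;
                count -= toCopy;
            }
        }

        public Task EndRecordAsync()
        {
            // Only when the record is closed do we know whether the buffered bytes
            // are a Full record or the Last fragment of a longer one.
            return FlushFragmentAsync(_recordStarted ? RecordType.Last : RecordType.Full);
        }

        private async Task FlushFragmentAsync(RecordType type)
        {
            // A real implementation writes the checksum and length and pads the
            // block trailer; here we only emit the type byte and the payload.
            await _output.WriteAsync(new[] { (byte)type }, 0, 1);
            await _output.WriteAsync(_buffer, 0, _bufferPos);
            _recordStarted = true;
            _bufferPos = 0;
        }

        private enum RecordType : byte { Full = 1, First = 2, Middle = 3, Last = 4 }
    }

The trick is that the type of the buffered chunk is only decided lazily: either the next write overflows the buffer (so it must be First or Middle), or the record ends (so it must be Full or Last). That is why the buffer is a full block's worth of data.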

The reason this is important is that it allows us to keep the same format as leveldb, with all of the benefits it provides for dealing with corrupted data, if we need to. I also really like the idea of being able to have parallel readers on the log file, because we know that we can just skip to block boundaries.
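As a rough illustration of that last point, here is a tiny sketch (again, hypothetical names, not the actual leveldb or Raven.Storage reader) of how a reader can resynchronize by seeking to the next block boundary:

    using System.IO;

    public static class LogReaderSketch
    {
        private const int BlockSize = 32 * 1024;

        // After detecting a bad checksum, or when starting a parallel reader at an
        // arbitrary offset, jump to the next 32KB boundary. Every block starts with
        // a fresh record header, so no state from earlier blocks is needed; the
        // reader then skips Middle / Last fragments until it sees a First or Full one.
        public static void SkipToNextBlock(Stream log)
        {
            var withinBlock = log.Position % BlockSize;
            if (withinBlock != 0)
                log.Seek(BlockSize - withinBlock, SeekOrigin.Current);
        }
    }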



Tobi

Might be worth doing CPU profiling here, because all those awaits for single or very few bytes are likely to burn lots of CPU. You probably can't write out more than 10MB/s with this code. Or is the bulk of the work for big buffers and mem tables?

Ayende Rahien

Tobi, This is actually buffered, so by default, you won't be doing any async work for the most part. The work is done explicitly to handle large buffers, yes.

Greg Young

"In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH."

See BufferManager. I have blogged about it years ago and it's in the EventStore source base. It handles this issue. Source is BSD 3-clause.

Ayende Rahien

Greg, I am aware of buffer manager, and I am actually using it quite a lot. That doesn't really solve the problem. The way the leveldb codebase behaves, it keeps allocating more & more memory as needed, and that requires a lot of copying around.

Comment preview

Comments have been closed on this topic.

