
Raven.Storage – that ain’t proper logging for our kind of environment

Continuing with my work on porting leveldb to .NET, we ran into another problem: the log file. The log file is pretty important, since it is how you ensure durability, so any problems there are a big cause for concern.

You can read a bit about the format used by leveldb here, but basically, it uses the following:

    block := record* trailer?
    record :=
      checksum: uint32    // crc32c of type and data[] ; little-endian
      length: uint16      // little-endian
      type: uint8         // One of FULL, FIRST, MIDDLE, LAST
      data: uint8[length]

Each block is 32KB in size.

The type can be Full, First, Middle or Last, since it is legitimate to split a record across multiple blocks. The reasoning behind this format is outlined in the link above.

It is also a format that assumes you know, upfront, the entire size of your record, so you can split it accordingly. That makes a lot of sense when working in C++ and passing buffers around.
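To make the splitting rules concrete, here is a minimal sketch of emitting one known-size record in this format. The enum values mirror leveldb's; Crc32C is a stand-in for a real CRC32-C implementation (leveldb also masks the checksum, which this sketch skips), and the method shape is my own, not actual leveldb or Raven.Storage code:

    using System;
    using System.IO;

    enum RecordType : byte { Full = 1, First = 2, Middle = 3, Last = 4 }

    static class Crc32C
    {
        // Stand-in: a real implementation computes CRC32-C over type + data.
        public static uint Compute(byte type, byte[] data, int offset, int count) => 0;
    }

    static class LogFormat
    {
        const int BlockSize = 32 * 1024;
        const int HeaderSize = 4 + 2 + 1; // checksum + length + type

        public static void EmitRecord(BinaryWriter writer, byte[] data, ref int posInBlock)
        {
            var offset = 0;
            var isFirst = true;
            do
            {
                var leftInBlock = BlockSize - posInBlock;
                if (leftInBlock < HeaderSize)
                {
                    // Not enough room for even a header: zero-fill the trailer
                    // and move to a fresh block.
                    writer.Write(new byte[leftInBlock]);
                    posInBlock = 0;
                    leftInBlock = BlockSize;
                }

                var fragment = Math.Min(leftInBlock - HeaderSize, data.Length - offset);
                var isLast = offset + fragment == data.Length;
                var type = isFirst && isLast ? RecordType.Full
                         : isFirst ? RecordType.First
                         : isLast ? RecordType.Last
                         : RecordType.Middle;

                writer.Write(Crc32C.Compute((byte)type, data, offset, fragment)); // uint32, little-endian
                writer.Write((ushort)fragment);                                   // length, little-endian
                writer.Write((byte)type);
                writer.Write(data, offset, fragment);

                offset += fragment;
                posInBlock += HeaderSize + fragment;
                isFirst = false;
            } while (offset < data.Length);
        }
    }

Note how the type of each fragment depends on knowing where the record ends, which is exactly the information a streaming API does not have.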

 

This is straightforward in C++, where the API is basically:

Status Writer::AddRecord(const Slice& slice)

(Slice is basically just a byte array).

In .NET, we do not want to be passing buffers around, mostly because of the impact on the Large Object Heap (LOH). So we had to be a bit smarter about things; in particular, we had an interesting issue with streaming the results. If I want to write a document that is 100KB in size, how do I handle that?

Instead, I wanted the code to look like this:

    var buffer = BitConverter.GetBytes(seq);
    await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
    buffer = BitConverter.GetBytes(opCount);
    await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);

    foreach (var operation in writes.SelectMany(writeBatch => writeBatch._operations))
    {
        // Reuse the buffer for the single op-type byte.
        buffer[0] = (byte) operation.Op;
        await state.LogWriter.WriteAsync(buffer, 0, 1);
        await state.LogWriter.Write7BitEncodedIntAsync(operation.Key.Count);
        await state.LogWriter.WriteAsync(operation.Key.Array, operation.Key.Offset, operation.Key.Count);
        if (operation.Op != Operations.Put)
            continue;
        // Stream the value bytes from the mem table into the log.
        using (var stream = state.MemTable.Read(operation.Handle))
            await stream.CopyToAsync(state.LogWriter);
    }

The problem with this approach is that we don't know, upfront, what size the record is going to be. That means we don't know how to split it, because we don't have the full record until it is over. And we don't want to (actually, can't) go back in the log and change things to set the record straight (pun intended).

What we ended up doing is this:

[Image: the log writer usage, bracketing an arbitrary number of writes between explicit record start / end calls]
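The usage looks roughly like this; the RecordStarted / RecordCompleted names are my guesses based on the description below, not necessarily the actual Raven.Storage API:

    // Hypothetical usage: bracket any number of writes with explicit
    // record markers; the writer decides how to fragment them later.
    state.LogWriter.RecordStarted();

    var buffer = BitConverter.GetBytes(seq);
    await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
    // ... any number of additional writes, total size unknown upfront ...

    state.LogWriter.RecordCompleted();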

Note that we explicitly mark the start / end of the record, and in the meantime we can push however many bytes we want. Internally, we buffer up to 32KB in size (a bit less, actually, but good enough for now) and, based on the next call, we decide whether the current block should be marked as good or bad.
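A sketch of how that deferred decision can work; the class shape, FlushAsync, and the field names are all invented for illustration (RecordType reuses the values from the earlier sketch):

    using System;
    using System.Threading.Tasks;

    class LogWriterSketch
    {
        const int BlockSize = 32 * 1024;
        const int HeaderSize = 4 + 2 + 1;

        // "A bit less" than 32KB: one block's worth of payload after the header.
        readonly byte[] _buffer = new byte[BlockSize - HeaderSize];
        int _bufferPos;
        bool _firstFragment = true;

        public async Task WriteAsync(byte[] bytes, int offset, int count)
        {
            while (count > 0)
            {
                var copied = Math.Min(count, _buffer.Length - _bufferPos);
                Buffer.BlockCopy(bytes, offset, _buffer, _bufferPos, copied);
                _bufferPos += copied;
                offset += copied;
                count -= copied;

                if (_bufferPos == _buffer.Length)
                {
                    // The buffer is full but the record is still open, so this
                    // fragment is First (nothing emitted yet) or Middle.
                    await FlushAsync(_firstFragment ? RecordType.First : RecordType.Middle);
                    _firstFragment = false;
                }
            }
        }

        public async Task RecordCompletedAsync()
        {
            // The final fragment: Full if it is also the first one, Last otherwise.
            await FlushAsync(_firstFragment ? RecordType.Full : RecordType.Last);
            _firstFragment = true;
        }

        Task FlushAsync(RecordType type)
        {
            // Write checksum, _bufferPos (as the length) and type, then the
            // buffered bytes, to the underlying file; elided in this sketch.
            _bufferPos = 0;
            return Task.CompletedTask;
        }
    }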

The reason this is important is that it allows us to keep the exact same format as leveldb, with all of its benefits for dealing with corrupted data if we need them. I also really like the idea of being able to have parallel readers on the log file, because we know we can just skip to block boundaries.
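For example, a second reader could position itself like this; logStream, blockIndex, and TryReadRecordHeader (parsing the seven-byte header) are all illustrative, not real Raven.Storage code:

    // Start at an arbitrary 32KB boundary and skip fragments until a
    // record actually begins there (a Full or First fragment).
    logStream.Position = (long)blockIndex * BlockSize;
    while (TryReadRecordHeader(logStream, out ushort length, out RecordType type))
    {
        if (type == RecordType.Full || type == RecordType.First)
            break; // a record starts here; this reader takes over from this point
        logStream.Seek(length, SeekOrigin.Current); // tail of an earlier record, skip it
    }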

Posted By: Ayende Rahien

Comments

05/31/2013 10:49 AM by tobi

Might be worth doing CPU profiling here because all those awaits for single or very few bytes are likely to burn lots of CPU. You probably can't write out more than 10MB/s with this code. Or is the bulk of the work for big buffers and mem tables?

05/31/2013 11:14 AM by Ayende Rahien

Tobi, this is actually buffered, so by default you won't be doing any async work for the most part. The work is done explicitly to handle large buffers, yes.

06/04/2013 03:33 PM by Greg young

"In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH."

See BufferManager. I have blogged about it years ago and it's in the EventStore source base. It handles this issue. The source is BSD 3-clause.

06/06/2013 10:14 AM by Ayende Rahien

Greg, I am aware of buffer manager, and I am actually using it quite a lot. That doesn't really solve the problem. The way the leveldb codebase behaves, it keeps allocating more & more memory as needed, and that requires a lot of copying around.
