Raven.Storage – that ain’t proper logging for our kind of environment


Continuing my work on porting leveldb to .NET, we ran into another problem: the log file. The log file is pretty important, since it is how you ensure durability, so any problem there is a big cause for concern.

You can read a bit about the format used by leveldb here, but basically, it uses the following:

   block := record* trailer?
   record :=
     checksum: uint32    // crc32c of type and data[] ; little-endian
     length: uint16      // little-endian
     type: uint8         // One of FULL, FIRST, MIDDLE, LAST
     data: uint8[length]

A block is 32KB in size.

The type can be Full, First, Middle, or Last, since it is legitimate to split a record across multiple blocks. The reasoning behind this format is outlined in the link above.

It is also a format that assumes that you know, upfront, the entire size of your record, so you can split it accordingly. That makes a lot of sense when working in C++ and passing buffers around.
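To make the header layout concrete, here is a minimal sketch (not the actual Raven.Storage code) of writing a single record fragment in this format. The Crc32C.Compute helper is hypothetical, the caller is assumed to have already split the payload into fragments that fit in a block, and a little-endian platform is assumed:

    using System;
    using System.IO;

    public enum RecordType : byte { Full = 1, First = 2, Middle = 3, Last = 4 }

    public static class LogRecordFormat
    {
        // Sketch only: write one record (or record fragment) per the format above.
        public static void WriteRecord(Stream output, RecordType type, byte[] data, int offset, int count)
        {
            // the crc32c covers the type byte and the payload
            var crcInput = new byte[count + 1];
            crcInput[0] = (byte)type;
            Buffer.BlockCopy(data, offset, crcInput, 1, count);
            var checksum = Crc32C.Compute(crcInput, 0, crcInput.Length); // hypothetical helper

            var header = new byte[7];
            BitConverter.GetBytes(checksum).CopyTo(header, 0);       // checksum: uint32
            BitConverter.GetBytes((ushort)count).CopyTo(header, 4);  // length:   uint16
            header[6] = (byte)type;                                  // type:     uint8

            output.Write(header, 0, header.Length);
            output.Write(data, offset, count);
        }
    }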


This is straightforward in C++, where the API is basically:

Status Writer::AddRecord(const Slice& slice)

(Slice is basically just a byte array).

In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH (Large Object Heap). So we had to be a bit smarter about things. In particular, we had an interesting issue with streaming the results: if I want to write a document that is 100K in size, how do I handle that?

Instead, I wanted the code to look like this:

   var buffer = BitConverter.GetBytes(seq);
   await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
   buffer = BitConverter.GetBytes(opCount);
   await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);

   foreach (var operation in writes.SelectMany(writeBatch => writeBatch._operations))
   {
       buffer[0] = (byte) operation.Op;
       await state.LogWriter.WriteAsync(buffer, 0, 1);
       await state.LogWriter.Write7BitEncodedIntAsync(operation.Key.Count);
       await state.LogWriter.WriteAsync(operation.Key.Array, operation.Key.Offset, operation.Key.Count);
       if (operation.Op != Operations.Put)
           continue;
       using (var stream = state.MemTable.Read(operation.Handle))
           await stream.CopyToAsync(state.LogWriter); // stream the value from the memtable into the log
   }

The problem with this approach is that we don't know, upfront, what size the record is going to be. This means that we don't know how to split the record, because we don't have the whole record until it is over. And we don't want to (actually, can't) go back in the log and change things to set the record straight (pun intended).

What we ended up doing is this:

[Image: the log writer API, which explicitly marks the start and end of each record]

Note that we explicitly mark the start / end of the record, and in between we can push however many bytes we want. Internally, we buffer up to 32KB (a bit less, actually, but good enough for now), and based on the next call we decide whether the current block should be marked as good or bad.
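Here is a rough sketch of how such a writer might work internally. The names (StreamingLogWriter, CompleteRecordAsync, and so on) are made up for illustration, it reuses the RecordType enum and hypothetical Crc32C helper from the sketch above, and block padding and error handling are omitted; the actual Raven.Storage implementation differs:

    using System;
    using System.IO;
    using System.Threading.Tasks;

    public class StreamingLogWriter
    {
        private const int BlockSize = 32 * 1024;
        private const int HeaderSize = 7; // checksum (4) + length (2) + type (1)

        private readonly Stream _output;
        private readonly byte[] _buffer = new byte[BlockSize - HeaderSize];
        private int _bufferPos;
        private bool _recordStarted; // did we already emit a fragment of the current record?

        public StreamingLogWriter(Stream output)
        {
            _output = output;
        }

        public async Task WriteAsync(byte[] data, int offset, int count)
        {
            while (count > 0)
            {
                if (_bufferPos == _buffer.Length)
                {
                    // the buffer filled up on an earlier call, and since more data
                    // has now arrived we know that fragment was not the end of the record
                    await FlushFragmentAsync(_recordStarted ? RecordType.Middle : RecordType.First);
                }

                var toCopy = Math.Min(count, _buffer.Length - _bufferPos);
                Buffer.BlockCopy(data, offset, _buffer, _bufferPos, toCopy);
                _bufferPos += toCopy;
                offset += toCopy;
                count -= toCopy;
            }
        }

        public async Task CompleteRecordAsync()
        {
            // only now do we know the record has ended, so whatever is buffered
            // is either the entire record (Full) or its final fragment (Last)
            await FlushFragmentAsync(_recordStarted ? RecordType.Last : RecordType.Full);
            _recordStarted = false;
        }

        private async Task FlushFragmentAsync(RecordType type)
        {
            var crcInput = new byte[_bufferPos + 1];
            crcInput[0] = (byte)type;
            Buffer.BlockCopy(_buffer, 0, crcInput, 1, _bufferPos);

            var header = new byte[HeaderSize];
            BitConverter.GetBytes(Crc32C.Compute(crcInput, 0, crcInput.Length)).CopyTo(header, 0);
            BitConverter.GetBytes((ushort)_bufferPos).CopyTo(header, 4);
            header[6] = (byte)type;

            await _output.WriteAsync(header, 0, header.Length);
            await _output.WriteAsync(_buffer, 0, _bufferPos);

            _bufferPos = 0;
            _recordStarted = true;
        }
    }

The key point is that the record type for a buffered fragment is only chosen once the next call arrives: more data means First / Middle, end of record means Full / Last.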

The reason this is important is that it allows us to keep the same format as leveldb, with all of its benefits for dealing with corrupted data, if we need them. I also really like the idea of being able to have parallel readers on the log file, because we know that we can simply skip to block boundaries.
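As a rough sketch of why block boundaries help: a reader handed an arbitrary 32KB block can scan record headers from the start of the block and only claim records that actually begin there (Full or First), leaving Middle / Last fragments to whoever owns the earlier block. Again, this is only an illustration under the format above, not the actual reader code:

    using System;
    using System.Collections.Generic;

    public static class LogBlockScanner
    {
        // Sketch only: return the offsets, within one 32KB block, of the records
        // that start in it. Fragments continuing an earlier block are skipped.
        public static IEnumerable<int> FindRecordStarts(byte[] block)
        {
            const int HeaderSize = 7;
            var pos = 0;
            while (pos + HeaderSize <= block.Length)
            {
                var length = BitConverter.ToUInt16(block, pos + 4);
                var type = (RecordType)block[pos + 6];
                if (length == 0 && type == 0)
                    break; // reached the zero-filled trailer at the end of the block
                if (type == RecordType.Full || type == RecordType.First)
                    yield return pos;
                pos += HeaderSize + length;
            }
        }
    }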