Raven.Storage – that ain’t proper logging for our kind of environment
Continuing with my work on porting leveldb to .NET, we ran into another problem: the log file. The log file is pretty important; it is how you ensure durability, so any problems there are a big cause for concern.
You can read a bit about the format used by leveldb here, but basically, it uses the following:
block := record* trailer?
record :=
    checksum: uint32    // crc32c of type and data[] ; little-endian
    length:   uint16    // little-endian
    type:     uint8     // One of FULL, FIRST, MIDDLE, LAST
    data:     uint8[length]
A block is 32KB in size. The type can be Full, First, Middle, or Last, since it is legitimate to split a record across multiple blocks. The reasoning behind this format is outlined in the link above.
It is also a format that assumes you know, upfront, the entire size of your record, so you can split it accordingly. That makes a lot of sense when working in C++ and passing buffers around.
This is straightforward in C++, where the API is basically:
Status Writer::AddRecord(const Slice& slice)
(Slice is basically just a byte array).
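To make the splitting concrete, here is a rough C# sketch of what such an AddRecord can do when it receives the whole record upfront. This is an illustration only, not leveldb's actual code (and not Raven.Storage's); the class, enum, and method names here are assumptions made for the example.

using System;
using System.IO;

// Sketch only (illustrative names): a writer that receives the whole record
// upfront and splits it into Full / First / Middle / Last fragments,
// one 32KB block at a time.
public class BlockSplittingLogWriter
{
    public const int BlockSize = 32 * 1024;
    public const int HeaderSize = 7; // crc32c (4 bytes) + length (2) + type (1)

    private enum RecordType : byte { Full = 1, First = 2, Middle = 3, Last = 4 }

    private readonly Stream _output;
    private int _positionInBlock; // bytes already used in the current block

    public BlockSplittingLogWriter(Stream output)
    {
        _output = output;
    }

    public void AddRecord(byte[] data)
    {
        var offset = 0;
        var isFirstFragment = true;
        do
        {
            var leftoverInBlock = BlockSize - _positionInBlock;
            if (leftoverInBlock < HeaderSize)
            {
                // Not enough room left for even a header: pad to the block boundary.
                _output.Write(new byte[leftoverInBlock], 0, leftoverInBlock);
                _positionInBlock = 0;
                leftoverInBlock = BlockSize;
            }

            var fragmentLength = Math.Min(leftoverInBlock - HeaderSize, data.Length - offset);
            var isLastFragment = offset + fragmentLength == data.Length;

            var type = isFirstFragment && isLastFragment ? RecordType.Full
                : isFirstFragment ? RecordType.First
                : isLastFragment ? RecordType.Last
                : RecordType.Middle;

            WriteFragment(type, data, offset, fragmentLength);
            _positionInBlock += HeaderSize + fragmentLength;

            offset += fragmentLength;
            isFirstFragment = false;
        } while (offset < data.Length);
    }

    private void WriteFragment(RecordType type, byte[] data, int offset, int length)
    {
        // A real implementation writes the crc32c of the type and payload here;
        // left as zeros to keep the sketch short.
        _output.Write(new byte[4], 0, 4);
        _output.WriteByte((byte)(length & 0xff));        // length, little-endian
        _output.WriteByte((byte)((length >> 8) & 0xff));
        _output.WriteByte((byte)type);
        _output.Write(data, offset, length);
    }
}

The key point is that the fragment type can be chosen immediately, because the writer already knows where the record ends.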
In .NET, we do not want to be passing buffers around, mostly because of the impact on the Large Object Heap (LOH). So we had to be a bit smarter about things; in particular, we had an interesting issue with streaming the results. If I want to write a document with a size of 100K, how do I handle that?
Instead, I wanted it to look like this:
var buffer = BitConverter.GetBytes(seq);
await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);
buffer = BitConverter.GetBytes(opCount);
await state.LogWriter.WriteAsync(buffer, 0, buffer.Length);

foreach (var operation in writes.SelectMany(writeBatch => writeBatch._operations))
{
    buffer[0] = (byte) operation.Op;
    await state.LogWriter.WriteAsync(buffer, 0, 1);
    await state.LogWriter.Write7BitEncodedIntAsync(operation.Key.Count);
    await state.LogWriter.WriteAsync(operation.Key.Array, operation.Key.Offset, operation.Key.Count);
    if (operation.Op != Operations.Put)
        continue;
    using (var stream = state.MemTable.Read(operation.Handle))
        await stream.CopyToAsync(state.LogWriter); // stream the value into the log, no big buffer needed
}
The problem with this approach is that we don't know, upfront, what size we are going to end up with. This means that we don't know how to split the record, because we don't have the whole record until it is over. And we don't want to (actually, can't) go back in the log and change things to set the record straight (pun intended).
What we ended up doing is this:
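Roughly, the API takes the shape sketched below. This is a sketch only, under assumptions: the names (RecordStarted, WriteAsync, RecordCompletedAsync) and the buffering details are illustrative, not the actual Raven.Storage surface.

using System;
using System.IO;
using System.Threading.Tasks;

// Sketch only: a log writer that is fed bytes incrementally. It buffers up to
// one 32KB block and only decides how to stamp the buffered fragment
// (Full / First / Middle / Last) once the next call tells it whether more data
// is coming (another WriteAsync) or the record is over (RecordCompletedAsync).
public class StreamingLogWriter
{
    public const int BlockSize = 32 * 1024;

    private readonly Stream _output;
    private readonly MemoryStream _currentBlock = new MemoryStream(BlockSize);
    private bool _wroteFirstFragment;

    public StreamingLogWriter(Stream output)
    {
        _output = output;
    }

    public void RecordStarted()
    {
        _wroteFirstFragment = false;
    }

    public async Task WriteAsync(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            var available = BlockSize - (int)_currentBlock.Length;
            if (available == 0)
            {
                // The buffer is full and more data is still coming, so this
                // fragment cannot be the end of the record.
                await FlushBufferedFragmentAsync(false);
                available = BlockSize;
            }

            var toCopy = Math.Min(available, count);
            _currentBlock.Write(buffer, offset, toCopy);
            offset += toCopy;
            count -= toCopy;
        }
    }

    public Task RecordCompletedAsync()
    {
        // Whatever is buffered now is the end of the record: Last, or Full if
        // the record never spilled over into an earlier fragment.
        return FlushBufferedFragmentAsync(true);
    }

    private async Task FlushBufferedFragmentAsync(bool lastFragment)
    {
        byte type;
        if (!_wroteFirstFragment)
            type = lastFragment ? (byte)1 /* Full */ : (byte)2 /* First */;
        else
            type = lastFragment ? (byte)4 /* Last */ : (byte)3 /* Middle */;

        // A real implementation also writes the crc32c and the length in front
        // of the payload; only the type byte is shown to keep the sketch short.
        _output.WriteByte(type);
        _currentBlock.Position = 0;
        await _currentBlock.CopyToAsync(_output);
        _currentBlock.SetLength(0);
        _wroteFirstFragment = true;
    }
}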
Note that we explicitly mark the start / end of the record, and in between, we can push however many bytes we want. Internally, we buffer up to 32KB (a bit less, actually, but good enough for now), and based on the next call, we decide whether the current block should be marked as good or bad.
The reason this is important is that it allows us to keep the same format as leveldb, with all of its benefits for dealing with corrupted data, if we ever need them. I also really like the idea of being able to have parallel readers on the log file, because we know that we can simply skip to block boundaries.
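To illustrate the parallel-reader point: because blocks are a fixed 32KB, a reader can seek directly to any block boundary and start scanning from there, skipping fragments that are the tail of a record started in an earlier block. A minimal sketch, again with made-up names rather than actual Raven.Storage code:

using System.IO;

// Sketch: because blocks are a fixed 32KB, a reader can seek straight to any
// block boundary and start scanning there, which is what makes parallel
// readers on the same log file cheap.
public static class LogBlocks
{
    public const int BlockSize = 32 * 1024;

    public static long BlockCount(Stream log)
    {
        return (log.Length + BlockSize - 1) / BlockSize;
    }

    public static void SeekToBlock(Stream log, long blockIndex)
    {
        log.Seek(blockIndex * BlockSize, SeekOrigin.Begin);
        // From here the reader parses record headers; a fragment typed Middle
        // or Last at the start of the block is the tail of a record that began
        // in an earlier block and can simply be skipped.
    }
}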
Comments
Might be worth doing CPU profiling here because all those awaits for single or very few bytes are likely to burn lots of CPU. You probably can't write out more than 10MB/s with this code. Or is the bulk of the work for big buffers and mem tables?
Tobi, This is actually buffered, so by default, you won't be doing any async work for the most part. The work is done explicitly to handle large buffers, yes.
"In .NET, we do not want to be passing buffers around, mostly because of the impact on the LOH."
See BufferManager. I blogged about it years ago and it's in the EventStore source base. It handles this issue. The source is BSD 3-clause.
Greg, I am aware of buffer manager, and I am actually using it quite a lot. That doesn't really solve the problem. The way the leveldb codebase behaves, it keeps allocating more & more memory as needed, and that requires a lot of copying around.