Ayende @ Rahien


Reviewing LevelDB: Part III, WriteBatch isn’t what you think it is

One of the key external concepts in leveldb is the WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

   leveldb::WriteBatch batch;
   batch.Delete(key1);
   batch.Put(key2, value);
   s = db->Write(leveldb::WriteOptions(), &batch);

As we learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into its own WriteBatch; some coalescing of multiple batches happens later, but that is beside the point right now.

I dove into the code for WriteBatch and immediately realized that it isn't really what I had bargained for. In my mind, WriteBatch was supposed to be something like this:

   public class WriteBatch
   {
      List<Operation> Operations;
   }

That is, something that would hold the operations in memory until they get written down to disk.

Instead, leveldb took quite a different route. The batch's entire contents are stored in the following format:

   // WriteBatch::rep_ :=
   //    sequence: fixed64
   //    count: fixed32
   //    data: record[count]
   // record :=
   //    kTypeValue varstring varstring         |
   //    kTypeDeletion varstring
   // varstring :=
   //    len: varint32
   //    data: uint8[len]

This is the in-memory representation, mind; the batch is serialized into this single flat buffer as it is built. I am not really sure why this is the case, to be honest.
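To make the record layout concrete, here is a minimal, self-contained sketch of how a single Put record could be appended to that flat buffer. The helper names (`PutVarint32`, `AppendPutRecord`) are my own for illustration; they mirror, but are not, leveldb's internal coding utilities, and I am ignoring the 12-byte sequence/count header for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Tag bytes as described in the format comment above.
enum Tag : char { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// Varint encoding: 7 bits per byte, high bit set means "more bytes follow".
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>(v | 128));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// varstring := len (varint32) followed by the raw bytes.
void PutLengthPrefixed(std::string* dst, const std::string& s) {
  PutVarint32(dst, static_cast<uint32_t>(s.size()));
  dst->append(s);
}

// Appends a kTypeValue record, the way WriteBatch::Put appends into rep_.
void AppendPutRecord(std::string* rep, const std::string& key,
                     const std::string& value) {
  rep->push_back(kTypeValue);
  PutLengthPrefixed(rep, key);
  PutLengthPrefixed(rep, value);
}
```

The point is that Put does no bookkeeping beyond appending bytes: the buffer itself is the data structure.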

WriteBatch is pretty much a write-only data structure, with one major exception:

   // Support for iterating over the contents of a batch.
   class Handler {
    public:
     virtual ~Handler();
     virtual void Put(const Slice& key, const Slice& value) = 0;
     virtual void Delete(const Slice& key) = 0;
   };
   Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

   Status WriteBatch::Iterate(Handler* handler) const {
     Slice input(rep_);
     if (input.size() < kHeader) {
       return Status::Corruption("malformed WriteBatch (too small)");
     }

     input.remove_prefix(kHeader);
     Slice key, value;
     int found = 0;
     while (!input.empty()) {
       found++;
       char tag = input[0];
       input.remove_prefix(1);
       switch (tag) {
         case kTypeValue:
           if (GetLengthPrefixedSlice(&input, &key) &&
               GetLengthPrefixedSlice(&input, &value)) {
             handler->Put(key, value);
           } else {
             return Status::Corruption("bad WriteBatch Put");
           }
           break;
         case kTypeDeletion:
           if (GetLengthPrefixedSlice(&input, &key)) {
             handler->Delete(key);
           } else {
             return Status::Corruption("bad WriteBatch Delete");
           }
           break;
         default:
           return Status::Corruption("unknown WriteBatch tag");
       }
     }
     if (found != WriteBatchInternal::Count(this)) {
       return Status::Corruption("WriteBatch has wrong count");
     } else {
       return Status::OK();
     }
   }

So we write directly to a buffer, then read back from that same buffer. The interesting bit is that the actual write into leveldb's memtable is done in a similar fashion, see:

   class MemTableInserter : public WriteBatch::Handler {
    public:
     SequenceNumber sequence_;
     MemTable* mem_;

     virtual void Put(const Slice& key, const Slice& value) {
       mem_->Add(sequence_, kTypeValue, key, value);
       sequence_++;
     }
     virtual void Delete(const Slice& key) {
       mem_->Add(sequence_, kTypeDeletion, key, Slice());
       sequence_++;
     }
   };

   Status WriteBatchInternal::InsertInto(const WriteBatch* b,
                                         MemTable* memtable) {
     MemTableInserter inserter;
     inserter.sequence_ = WriteBatchInternal::Sequence(b);
     inserter.mem_ = memtable;
     return b->Iterate(&inserter);
   }

As far as I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.
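The two steps above can be sketched with a tiny self-contained model. This is my own simplified code, not leveldb's: it uses one-byte lengths instead of varint32, skips the sequence/count header, and only handles Put records.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Step 1: Put() copies the caller's bytes into one flat buffer.
// Step 2: Replay() later walks that buffer and unpacks each record,
// the way MemTableInserter replays records into the memtable.
class MiniBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    rep_.push_back('\x01');  // kTypeValue tag
    rep_.push_back(static_cast<char>(key.size()));    // one-byte length
    rep_.append(key);
    rep_.push_back(static_cast<char>(value.size()));
    rep_.append(value);
  }

  std::vector<std::pair<std::string, std::string>> Replay() const {
    std::vector<std::pair<std::string, std::string>> out;
    size_t pos = 0;
    while (pos < rep_.size()) {
      pos++;  // skip the tag byte; this sketch only stores kTypeValue
      size_t klen = static_cast<unsigned char>(rep_[pos++]);
      std::string key = rep_.substr(pos, klen);
      pos += klen;
      size_t vlen = static_cast<unsigned char>(rep_[pos++]);
      std::string value = rep_.substr(pos, vlen);
      pos += vlen;
      out.emplace_back(key, value);
    }
    return out;
  }

 private:
  std::string rep_;  // every operation lives in this single buffer
};
```

Note that the values round-trip through the buffer: they are encoded on Put and decoded again at save time, which is exactly the double work that puzzled me above.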

It took me a while to figure it out, but I think I finally got it. The reason is that leveldb is a C++ application, so memory management is something it needs to worry about explicitly.

In particular, you can't rely on the memory you were passed staying valid; the user may release it right after calling Put. That means, in turn, that you must copy the data into memory that leveldb allocated, so leveldb can manage its lifetime itself. This was a foreign concept to me, because it is such a strange thing to worry about in .NET land, where memory cannot just disappear out from underneath you.
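Here is a small sketch of that ownership issue, using a hypothetical `Store` class rather than leveldb's API: the caller is free to scribble over (or free) its buffer the moment Put returns, so the store must copy the bytes into memory it owns.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical store, for illustration only. The key lesson from leveldb:
// copy the caller's bytes eagerly, because their buffer is not ours to keep.
class Store {
 public:
  void Put(const char* key, size_t len) {
    owned_.assign(key, len);  // copy now; caller may reuse its buffer next
  }
  const std::string& LastKey() const { return owned_; }

 private:
  std::string owned_;  // memory whose lifetime the store controls
};
```

If Put merely stashed the `const char*` pointer instead of copying, the store would read garbage (or freed memory) as soon as the caller reused the buffer.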

In my next post, I'll deal a bit more with this aspect: buffer management and memory handling in general.

Comments

Giorgi
03/22/2013 10:31 AM

Slightly off-topic but did you consider RaptorDB as an alternative?

http://www.codeproject.com/Articles/316816/RaptorDB-The-Key-Value-Store-V2

Ayende Rahien
03/22/2013 10:54 AM

Giorgi, Any database whose main site is a CodeProject article isn't something that I want to take into account. Limiting key size to a max of 255 chars is really restrictive.

tobi
03/22/2013 01:06 PM

I initially thought this was so that appending to the log was a simple memcpy or WriteFile(buffer, len) so that no significant CPU time is spent there. Instead the serialization would happen on the infinitely scalable free threads.

But then you said that the buffer is always deserialized later. Your explanation is probably correct.

Tim Gebhardt
03/22/2013 01:31 PM

ZeroMQ has a similar problem: how to buffer the writes in a way that increases throughput (in this case over the network).

ZeroMQ is also a high-performance networking library so it strives to do as little memcpy as possible because that's a big waste of time and resources.

So one thing it offers as part of the C++ API is a callback for when a message has been flushed and you can safely deallocate it.

andy
03/22/2013 05:52 PM

Or is http://dbreeze.codeplex.com/ worth a look?

Ayende Rahien
03/22/2013 06:09 PM

Andy, Haven't heard about that. The docs are slightly creepy, with recommended error handling like this:

 catch (Exception ex)
 {
     Console.WriteLine(ex.ToString());
 }

There is no discussion on how it works, just on how to use it, and the docs are full of warnings.

Mike Scott
03/25/2013 05:42 PM

Even in .NET land, although data submitted for batch writing can't be released underneath you, it can be modified underneath you, which may screw up what was meant to be written. So it's always a good idea to copy the data at the point of the request, in case the caller does something stupid with it later, before it's been written to disk.

Ayende Rahien
03/25/2013 10:09 PM

Mike, There are ways to mitigate that. For example, you can accept immutable data (strings are a good example). Although if you are using byte arrays, that is a major concern, sure.

Mike Scott
03/26/2013 12:32 AM

Agreed that passing immutables is a good idea, but not enforceable in a general batching write API. Of course, I also ignored the fact that copying would have to be deep, which is of course non-trivial so there's no good general "safe" solution.

David Schwartz
05/13/2013 04:02 AM

There really should be a batch write API that doesn't deep copy. You can, of course, only call it if you promise to keep the data stable until the batch is completely written. That would allow people who happen to have the whole batch in memory to do a maximally efficient write.

Ayende Rahien
05/13/2013 08:02 AM

David, That would make it pretty hard to determine when you can release the data. Moreover, we are using Stream as our API, and that means that you can't assume that you can even manage to read the data twice. It might also be very expensive to do so.

David Schwartz
05/13/2013 08:07 AM

Then perhaps the better solution is for you to get a buffer from LevelDB that you put the data in. This at least saves a second copy if you have to assemble the data in a buffer just to have LevelDB allocate a new buffer to copy it into.

Ayende Rahien
05/13/2013 08:08 AM

David, That is what I ended up doing.

Comments have been closed on this topic.