Ayende @ Rahien


Reviewing LevelDB: Part II, Put some data on the disk, dude

I think that the very first thing we want to do is to discover how exactly leveldb saves information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

   1: leveldb::DB* db;
   2: leveldb::DB::Open(options, "play/testdb", &db);
   3: status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

   1: // Default implementations of convenience methods that subclasses of DB
   2: // can call if they wish
   3: Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
   4:   WriteBatch batch;
   5:   batch.Put(key, value);
   6:   return Write(opt, &batch);
   7: }
   8:  
   9: Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  10:   WriteBatch batch;
  11:   batch.Delete(key);
  12:   return Write(opt, &batch);
  13: }

I included the Delete method as well, because this code teaches us something important: all modification calls go through the same WriteBatch mechanism. Let us look at that now.
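To make that concrete, here is a minimal sketch of the idea. These are my own illustrative types, not leveldb's actual classes: both Put and Delete reduce to appending an operation to a single batch, so the write path downstream only ever sees batches.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch (not leveldb's real code): Put and Delete both
// funnel through one batch type, giving the write path a single entry point.
enum class OpType { kPut, kDelete };

struct Op {
  OpType type;
  std::string key;
  std::string value;  // empty for deletes
};

class MiniBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    ops_.push_back({OpType::kPut, key, value});
  }
  void Delete(const std::string& key) {
    ops_.push_back({OpType::kDelete, key, ""});
  }
  const std::vector<Op>& ops() const { return ops_; }

 private:
  std::vector<Op> ops_;
};

// Mirrors the DB::Put / DB::Delete defaults above: a one-op batch is just
// a batch, so Write() needs no special cases.
size_t WriteCount(const MiniBatch& batch) { return batch.ops().size(); }
```

The payoff of this design is that batching, logging, and memtable insertion only need to be implemented once, for batches.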

   1: Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
   2:   Writer w(&mutex_);
   3:   w.batch = my_batch;
   4:   w.sync = options.sync;
   5:   w.done = false;
   6:  
   7:   MutexLock l(&mutex_);
   8:   writers_.push_back(&w);
   9:   while (!w.done && &w != writers_.front()) {
  10:     w.cv.Wait();
  11:   }
  12:   if (w.done) {
  13:     return w.status;
  14:   }
  15:  
  16:   // May temporarily unlock and wait.
  17:   Status status = MakeRoomForWrite(my_batch == NULL);
  18:   uint64_t last_sequence = versions_->LastSequence();
  19:   Writer* last_writer = &w;
  20:   if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
  21:     WriteBatch* updates = BuildBatchGroup(&last_writer);
  22:     WriteBatchInternal::SetSequence(updates, last_sequence + 1);
  23:     last_sequence += WriteBatchInternal::Count(updates);
  24:  
  25:     // Add to log and apply to memtable.  We can release the lock
  26:     // during this phase since &w is currently responsible for logging
  27:     // and protects against concurrent loggers and concurrent writes
  28:     // into mem_.
  29:     {
  30:       mutex_.Unlock();
  31:       status = log_->AddRecord(WriteBatchInternal::Contents(updates));
  32:       if (status.ok() && options.sync) {
  33:         status = logfile_->Sync();
  34:       }
  35:       if (status.ok()) {
  36:         status = WriteBatchInternal::InsertInto(updates, mem_);
  37:       }
  38:       mutex_.Lock();
  39:     }
  40:     if (updates == tmp_batch_) tmp_batch_->Clear();
  41:  
  42:     versions_->SetLastSequence(last_sequence);
  43:   }
  44:  
  45:   while (true) {
  46:     Writer* ready = writers_.front();
  47:     writers_.pop_front();
  48:     if (ready != &w) {
  49:       ready->status = status;
  50:       ready->done = true;
  51:       ready->cv.Signal();
  52:     }
  53:     if (ready == last_writer) break;
  54:   }
  55:  
  56:   // Notify new head of write queue
  57:   if (!writers_.empty()) {
  58:     writers_.front()->cv.Signal();
  59:   }
  60:  
  61:   return status;
  62: }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seem to create a new Writer, not sure what that is yet, and register it in a member variable. Maybe the actual write happens on a separate thread?

I am going to follow that line of thinking. The first thing to do is to look at the Writer implementation, which looks like this:

   1: struct DBImpl::Writer {
   2:   Status status;
   3:   WriteBatch* batch;
   4:   bool sync;
   5:   bool done;
   6:   port::CondVar cv;
   7:  
   8:   explicit Writer(port::Mutex* mu) : cv(mu) { }
   9: };

So this is just a data structure with no behavior. Note the CondVar member, whatever that is, which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven't dug too deeply into this, but it appears to be similar to the .NET lock (Monitor), except that you can associate multiple condition variables with a single mutex. That can be a useful way to signal on specific conditions: you can wait for a specific operation to happen, not just for a single lock to be released.
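To illustrate that "multiple conditions, one mutex" idea, here is a small sketch using std::condition_variable as a stand-in for port::CondVar (the names and the producer/consumer framing are mine, not leveldb's):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Sketch: two condition variables sharing one mutex, so a thread can wait
// for a *specific* condition instead of contending on a single signal.
std::mutex mu;
std::condition_variable data_ready;  // signaled when data is available
std::condition_variable space_free;  // signaled when the slot is empty
bool has_data = false;

void Producer() {
  std::unique_lock<std::mutex> lock(mu);
  space_free.wait(lock, [] { return !has_data; });  // wait for room
  has_data = true;
  data_ready.notify_one();  // wakes only threads waiting for data
}

void Consumer() {
  std::unique_lock<std::mutex> lock(mu);
  data_ready.wait(lock, [] { return has_data; });  // wait for data
  has_data = false;
  space_free.notify_one();  // wakes only threads waiting for room
}
```

This is the same shape as the Writer's cv: each Writer waits on its own condition variable, all guarded by the one DBImpl mutex.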

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, a data structure merging the properties of a list and a queue. Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look at where else this value is being used. Back now, and it appears that the only places where writers_ is used are this method and the methods it calls.

What this means in turn is that, unlike what I thought, there isn't a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access, as I understand it. Calls to the Write() method block on the mutex, then wait until their write is the current one (that is what &w != writers_.front() means). The code also seems to suggest that the thread at the front of the queue may pick up pending writes from other threads and batch multiple writes to disk at the same time. We will discuss this later on.
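Here is a toy model of that queueing discipline, with my own names and with everything simplified (in particular, leveldb releases the mutex during the actual disk write, while this sketch holds it for brevity):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// Sketch of the combining pattern: each caller enqueues itself; whoever
// reaches the front flushes its own payload plus everything queued behind
// it, then marks those absorbed writers as done.
struct Waiter {
  std::string payload;
  bool done = false;
};

class GroupCommit {
 public:
  // Returns how many payloads this thread flushed (0 if a front writer
  // already flushed on our behalf).
  size_t Write(const std::string& payload) {
    Waiter w{payload};
    std::unique_lock<std::mutex> lock(mu_);
    queue_.push_back(&w);
    cv_.wait(lock, [&] { return w.done || queue_.front() == &w; });
    if (w.done) return 0;  // someone else wrote our payload for us

    // We are the front writer: absorb everything currently queued.
    std::vector<Waiter*> group(queue_.begin(), queue_.end());
    queue_.clear();
    for (Waiter* g : group) {
      log_.push_back(g->payload);  // stand-in for log + memtable work
      g->done = true;
    }
    cv_.notify_all();  // wake absorbed writers and the next front
    return group.size();
  }

  const std::vector<std::string>& log() const { return log_; }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Waiter*> queue_;
  std::vector<std::string> log_;
};
```

Under contention, many callers return 0 because the front writer did their work; with a single caller at a time, each Write flushes exactly its own payload.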

Right now, let us move to line 17 and MakeRoomForWrite. This appears to make sure that we have enough room for the next write. I don't really follow the code there yet, so I'll ignore it for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why; I think it is possible this is for the log. The next interesting bit is BuildBatchGroup. This method merges existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.
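The "but not too big a write" part can be sketched like this. This is my own simplification, not BuildBatchGroup's actual logic (which also special-cases small and sync writes), but it shows the capping idea:

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Sketch: absorb queued batches into one group, but stop before the group
// grows so large that a small writer is stuck behind a huge flush. The cap
// and names are mine, for illustration only.
std::vector<std::string> BuildGroup(std::deque<std::string>& pending,
                                    size_t max_bytes) {
  std::vector<std::string> group;
  size_t total = 0;
  while (!pending.empty()) {
    size_t next = pending.front().size();
    // Always take at least one batch; otherwise respect the size cap.
    if (!group.empty() && total + next > max_bytes) break;
    total += next;
    group.push_back(pending.front());
    pending.pop_front();
  }
  return group;
}
```

So with a cap of 8 bytes and three 4-byte batches queued, the first call takes two of them and leaves the third for the next round.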

The rest of the code deals with the actual write to the log / memtable (lines 20 – 45), then updates the status of the other writers we might have batched in, as well as signaling the existing writers that may not have gotten into the current batch so they can start their own writes.
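The log-first ordering in lines 31 – 37 is worth calling out: a record is appended to the log (and optionally synced) before it is applied to the memtable. A minimal sketch of that ordering, with all names mine rather than leveldb's:

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch of log-then-memtable ordering (illustrative types, not leveldb's):
// a record is durable in the log before it becomes visible in memory, so a
// crash between the two steps loses nothing that was acknowledged.
struct MiniDB {
  std::vector<std::string> log;            // stand-in for log_->AddRecord
  std::map<std::string, std::string> mem;  // stand-in for the memtable

  bool Put(const std::string& key, const std::string& value, bool sync) {
    log.push_back(key + "=" + value);  // 1. append to the write-ahead log
    if (sync) { /* fsync the log file here; elided in this sketch */ }
    mem[key] = value;                  // 2. only then apply to the memtable
    return true;
  }
};
```

Note how the sync happens on the log file only; the memtable is purely in memory and never needs syncing.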

And I think that this is enough for now. We haven't gotten to disk yet, I admit, but we did get a lot done. In my next post, I'll dig even deeper and try to see how the data is actually structured. I think that will be interesting…

Comments

tobi, 03/21/2013 10:28 AM

I don't follow the locking strategy, partially because it is not entirely visible, but it looks like a single writer at a time aka "global mutex for writers". Not so scalable (as in "not at all").

Ayende Rahien, 03/21/2013 10:29 AM

Tobi, There is not a single writer mutex. All concurrent writers put their data in the writers collection. A single thread goes into the part where you actually write the data to disk. After all the current writes have been completed, all the writers are freed at once.

Duarte Nunes, 03/21/2013 11:29 AM

The locking strategy is a typical combining strategy. It's particularly interesting to make sure that if sync is true, then you only fsync after a batch write to disk.

The structure they use is the SSTable strategy based on LSM-trees.

tobi, 03/21/2013 11:39 AM

It is a viable approach, no question. Batching is often a powerful technique. Yet, as long as a significant amount of work is done by one thread at a time, Amdahl's Law kills scalability very quickly: 10% single-threaded means you can never go beyond 10x in scale.

Ayende Rahien, 03/21/2013 11:40 AM

Tobi, You are missing something important here. We are writing to disk. Writing to disk (especially to the log file) is something that is pretty much always going to be sequential.

Simon, 03/21/2013 01:35 PM

Really like this post and the narrative approach you've taken. It's almost like pair programming - great to get an insight into how you work through the code, and that you are also content to just skip over code you don't understand immediately and come back to it later, rather than trying to figure it all out in one go.

JDice, 03/21/2013 04:13 PM

Love the narrative as well and great to see this series of posts on LevelDB.

I admit, I'm impatiently waiting for these posts like a kid on Christmas Eve. Can't wait to see the performance gains.

One question though: What about making transactions "optional" for the user? That way you could write to memory really fast and it'll flush to disk later on? Since it goes to memory, then all queries could come from in-memory. This is the approach that Redis takes.

Ayende Rahien, 03/21/2013 04:14 PM

JDice, Note that the code here already does this. You can see it when you use an async write: they don't wait for the flush to disk.

Sampy, 03/21/2013 05:15 PM

"Watching" someone read code is surprisingly fascinating.

I think reading code is an undervalued skill.

Welch, 03/21/2013 06:53 PM

Something I don't get. Everything around the middle block from line 16 to 43 seems to assume it can write multiple batches to disk, but I simply don't get how that is possible. 16-43 seems to only write the current block to disk.

Pop Catalin, 03/22/2013 08:35 AM

"I don't follow the locking strategy, partially because it is not entirely visible, but it looks like a single writer at a time aka "global mutex for writers". Not so scalable (as in "not at all")."

The disk is as single threaded as it gets: a single needle writing on a spinning disk, one bit after another. (Even SSDs don't have multiple channel writes ... yet). Using multiple threads for IO on a single disk never made IO scalable; this was never the case.

In such a scenario, multiple concurrent writers simply mean that the writers don't block each other. All can write at the same time, and finish at the same time (even though they will complete only when the actual physical write has finished on the disk).
