Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database


Toys for geeks

time to read 1 min | 137 words

I just got myself a UFO Mini Helicopter. It looks like this:

[image: Mini Helicopter UFO Aircraft With Remote Control]

This is the first helicopter that I got, and for a $30 toy, it is an awesome amount of fun. The only complaint that I have is that it has only about 5 minutes of battery life.

I am really bad at flying it, too.

As mentioned, this is the very first helicopter that I bought, and I think that I would like to have a better one for the next time. Any recommendations from you guys?

  • I would like a better battery life. 30 minutes – 1 hour would be what I want.
  • Should be pretty resistant to crashes. I know that I am going to crash it a lot.

Any recommendations?

time to read 7 min | 1272 words

Okay, so far I have written 6 parts, and the only thing that happened is that we wrote some stuff to the log file. That is cool, but I am assuming that there has got to be more. I started tracing through the code, and I think that what happens is that we have compactions of the MemTable, at which point we flush it to disk.

I think that what happens is this: we have a method called MaybeScheduleCompaction, in db_impl.cc, which is kicking off the actual process for the MemTable compaction. This is getting called from a few places, but most importantly, it is called during the Write() call. Reading the code, it seems that before we can get to the actual compaction work, we need to look at something called a VersionSet. This looks like it holds all of the versions of the database at a particular point in time, including all the files that it is using, etc.

A lot of what it (and its associate, the Version class) does is manage lists of this structure:

[image: the FileMetaData structure]
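For reference, this is roughly what that structure looks like in version_edit.h (reproduced from memory, so treat it as a sketch rather than the exact source):

// Per-file metadata tracked by Version / VersionSet (sketch).
struct FileMetaData {
  int refs;                // how many Versions still reference this file
  int allowed_seeks;       // seeks allowed before this file is nominated for compaction
  uint64_t number;         // the SSTable file number (e.g. 000123.sst)
  uint64_t file_size;      // file size in bytes
  InternalKey smallest;    // smallest internal key served by this table
  InternalKey largest;     // largest internal key served by this table

  FileMetaData() : refs(0), allowed_seeks(1 << 30), file_size(0) { }
};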

I am not sure what allowed_seeks means; I assume it is there to force compaction for the next level.

Okay, moving on to Version, it looks like this is where all the actual lookups are done. We have a list of file metadata, including the smallest & largest keys in each file. That allows us to find the appropriate files to look at quite easily. There seems to be some interaction between Version and TableCache, but I’m not going into that now.

A version is holding an array of 7 levels, and at each level we have the associated files. I am going to continue digging into Version & VersionSet for the moment.

Side Note: In fact, I got frustrated enough with trying to figure out leveldb on Windows that I set up an Ubuntu machine with KDevelop just to get things done. This blog post is actually written on the Ubuntu machine (later to be copied into Live Writer :-)).

I am still in the process of going through the code. It is really much easier to do this in an IDE that can actually build & understand the code.

One thing that I can tell you right now is that C++ programmers are strange. I mean, take a look at the following code, from Version::LevelFileNumIterator:

[image: code from Version::LevelFileNumIterator]
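The screenshot didn't survive the archive, but the gist of it, as I remember it (a sketch, not the verbatim source), is that value() packs the file number and file size into a 16 byte buffer using fixed-width encoding:

// Sketch of Version::LevelFileNumIterator::value(): instead of returning a
// struct or a std::pair, it encodes the file number and size into a raw buffer.
Slice value() const {
  assert(Valid());
  EncodeFixed64(value_buf_, (*flist_)[index_]->number);
  EncodeFixed64(value_buf_ + 8, (*flist_)[index_]->file_size);
  return Slice(value_buf_, sizeof(value_buf_));
}

// ...
mutable char value_buf_[16];  // holds the encoded file number + file size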

This returns a byte array containing encoded file num & size in a buffer. Would it be so hard to create a struct for that or use std::pair? Seems like this would complicate the client code. Then again, maybe there is a perf reason that I am not seeing?

Then again, here is the client code:

[image: the client code that decodes the file number and size]

And that seems pretty clear.

So far, it appears as if the Version is the current state of all of the files at a particular point in time. I think that this is how leveldb implements snapshots. The files are SSTables, which are pretty much write once only. A version belongs to a set (not sure exactly what that means yet) and is part of a linked list. Again, I am not sure what the purpose of that is yet.

I'll need to do a deeper dive into snapshots in general, later on, because it is interesting to see how that is implemented with regards to the memtable.

Moving back to the actual code, we have this code:

[image: code that uses the table_cache]

This seems to me to indicate that the table_cache is the part of the code that actually manages the SSTables, probably using some variant of a page pool.

Now, let us get to the good parts, Version::Get:

[image: Version::Get]
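The screenshot is gone as well, so here is my heavily abridged paraphrase of what Version::Get does (FilesToSearch and the Saver fields are my own shorthand for logic the real code does inline): walk the levels from newest to oldest, pick the files whose key range may contain the key, and probe each one through the table cache.

// Abridged paraphrase of Version::Get, not the verbatim source.
Status Version::Get(const ReadOptions& options, const LookupKey& k,
                    std::string* value, GetStats* stats) {
  const Slice ikey = k.internal_key();
  const Slice user_key = k.user_key();
  for (int level = 0; level < config::kNumLevels; level++) {
    // Level-0 files may overlap each other, so every overlapping file is a
    // candidate (newest first); deeper levels have at most one candidate.
    std::vector<FileMetaData*> files = FilesToSearch(level, user_key);
    for (size_t i = 0; i < files.size(); i++) {
      Saver saver;  // filled in by the SaveValue callback
      Status s = vset_->table_cache_->Get(options, files[i]->number,
                                          files[i]->file_size, ikey,
                                          &saver, SaveValue);
      if (!s.ok()) return s;
      if (saver.state == kFound) return s;                    // value copied out
      if (saver.state == kDeleted) return Status::NotFound(Slice());
      // kNotFound: keep looking in older files and deeper levels.
    }
  }
  return Status::NotFound(Slice());
}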

This looks like it is actually doing something useful. In fact, it finds the relevant files to look at for that particular key, and once it has done that, it calls:

[image: the call into the table cache]

So the data is actually retrieved from the cache, as expected. But there was an interesting comment there about “charging” seeks for files, so I am going to be looking at who is calling Version::Get right now, then come back to the cache in another post.

What is interesting is that we have this guy:

[image: Version::UpdateStats]
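The screenshot is missing, but this is more or less the code in question (from memory, so consider it a sketch): every Get reports the first file it had to seek into, and once that file exhausts its allowed_seeks budget it gets nominated for compaction.

// Sketch of Version::UpdateStats: decrement the seek budget of the first file
// we had to look at; when it runs out, mark that file for compaction.
bool Version::UpdateStats(const GetStats& stats) {
  FileMetaData* f = stats.seek_file;
  if (f != NULL) {
    f->allowed_seeks--;
    if (f->allowed_seeks <= 0 && file_to_compact_ == NULL) {
      file_to_compact_ = f;
      file_to_compact_level_ = stats.seek_file_level;
      return true;  // the caller may now call MaybeScheduleCompaction()
    }
  }
  return false;
}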

And that in turn all makes sense now. allowed_seeks is something that is set when we apply a VersionEdit, it seems. No idea what this is yet, but there is a comment there that explains that we use this as a way to trigger compaction when it is cheaper to do the compaction than to continue doing those seeks. Interestingly enough, seeks are only counted if we have to go through more than one file to find a value, which makes sense, I guess.

Okay, now let us back up a bit and see who is calling Version::Get. And as it turned out, it is our dear friend, DBImpl::Get().

There, we first look in the current memtable, then in the immutable memtable (which is probably on its way to becoming an SSTable now). And then we look at the current Version, calling Version::Get. If we actually got a hit from the version, we also call Version::UpdateStats, and if we need to, we then call MaybeScheduleCompaction(), which is where we started this post.

And... that is it for this post. We still haven't managed to find where we actually save to disk (they hid it really deep), but I think I'll probably be able to figure this out in this sitting; watch out for the next post.

time to read 23 min | 4442 words

Here we are starting to get into the interesting bits. How do we actually write to disk? There are two parts to that. The first part is the log file. This is where all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes which are involved here: leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

// A file abstraction for sequential writing.  The implementation
// must provide buffering since callers may append small fragments
// at a time to the file.
class WritableFile {
 public:
  WritableFile() { }
  virtual ~WritableFile();

  virtual Status Append(const Slice& data) = 0;
  virtual Status Close() = 0;
  virtual Status Flush() = 0;
  virtual Status Sync() = 0;

 private:
  // No copying allowed
  WritableFile(const WritableFile&);
  void operator=(const WritableFile&);
};

Pretty simple, overall. There is the buffering requirement, but that is easy enough. Note that this is a C++ interface. There are a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is POSIX code that I am reading, and I have to do a lot of lookups into the man pages. The implementation isn’t that interesting, to be fair, and it is full of mmap-on-POSIX minutiae. So I am going to skip it.

I wonder why the choice was made to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Append() just writes the values to the mmap file, Flush() is a no-op, and Sync() actually asks the file system to write the values to disk and waits on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the code base (for reads), and that gives leveldb the benefit of using the OS memory manager as the buffer.
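To give an idea of what that buys, here is my own condensed illustration of the idea (not the actual PosixMmapFile code, which also deals with partial copies across region boundaries, error handling and truncation on Close()): Append() boils down to copying into the currently mapped region and mapping a fresh chunk when it fills up.

// Illustration only: the gist of an mmap-backed Append().
Status Append(const Slice& data) {
  if (dst_ + data.size() > limit_) {       // current mapped region is full
    UnmapCurrentRegion();                  // msync / munmap what we have so far
    MapNewRegion();                        // ftruncate + mmap the next chunk
  }
  memcpy(dst_, data.data(), data.size());  // the OS page cache acts as our buffer
  dst_ += data.size();
  return Status::OK();
}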

Now that we know what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick; it has a single public method:

Status AddRecord(const Slice& slice);

As a reminder, those two are used together in the DBImpl::Write() method, like so:

status = log_->AddRecord(WriteBatchInternal::Contents(updates));
if (status.ok() && options.sync) {
  status = logfile_->Sync();
}

From the API look of things, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

   1: Status Writer::AddRecord(const Slice& slice) {
   2:   const char* ptr = slice.data();
   3:   size_t left = slice.size();
   4:  
   5:   // Fragment the record if necessary and emit it.  Note that if slice
   6:   // is empty, we still want to iterate once to emit a single
   7:   // zero-length record
   8:   Status s;
   9:   bool begin = true;
  10:   do {
  11:     const int leftover = kBlockSize - block_offset_;
  12:     assert(leftover >= 0);
  13:     if (leftover < kHeaderSize) {
  14:       // Switch to a new block
  15:       if (leftover > 0) {
  16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
  17:         assert(kHeaderSize == 7);
  18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
  19:       }
  20:       block_offset_ = 0;
  21:     }
  22:  
  23:     // Invariant: we never leave < kHeaderSize bytes in a block.
  24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
  25:  
  26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
  27:     const size_t fragment_length = (left < avail) ? left : avail;
  28:  
  29:     RecordType type;
  30:     const bool end = (left == fragment_length);
  31:     if (begin && end) {
  32:       type = kFullType;
  33:     } else if (begin) {
  34:       type = kFirstType;
  35:     } else if (end) {
  36:       type = kLastType;
  37:     } else {
  38:       type = kMiddleType;
  39:     }
  40:  
  41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
  42:     ptr += fragment_length;
  43:     left -= fragment_length;
  44:     begin = false;
  45:   } while (s.ok() && left > 0);
  46:   return s;
  47: }

Let us see... we do a lot here, but I don’t know yet what is going on. At first glance, it appears that we are looking at fragmenting the value into multiple records, and we might want to emit zero length records (no idea what that is for? Maybe compactions?).

It appears that we write in blocks of 32KB at a time. Lines 12 – 21 deal with how to finalize the block when you have no more space (basically, fill it in with nulls).
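For my own notes, the on-disk layout of each physical record is spelled out in doc/log_format.txt; paraphrasing it, a block is a sequence of records plus optional trailer padding, and each record has a 7 byte header:

// Log block / record layout (paraphrased from doc/log_format.txt):
//
//   block  := record* trailer?        // blocks are kBlockSize = 32768 bytes
//   record :=
//     checksum: uint32   // crc32c of type + data[], little-endian
//     length:   uint16   // payload length, little-endian
//     type:     uint8    // kFullType, kFirstType, kMiddleType or kLastType
//     data:     uint8[length]
//
// kHeaderSize is therefore 4 + 2 + 1 = 7 bytes, which is why the trailer
// padding above relies on that constant.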

Lines 26 – 40 just figure out the type of the record that we are going to write (a full record, all of which can sit in a single buffer; a first record, which is the start of a sequence of items; or a middle / last record, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?

I looked at EmitPhysicalRecord to see what we have there, and it is nothing much; it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I start looking at the reading portion of the code. But that will be in another post.

time to read 2 min | 263 words

You can read about the theory of Sorted Strings Tables and Memtables here. In this case, what I am interested in is going a bit deeper into the leveldb codebase, and understanding how the data is actually kept in memory and what it is doing there.

In order to do that, we are going to investigate MemTable. As it turns out, this is actually a very simple data structure. A MemTable just holds a SkipList, which is a sorted data structure that allows O(log N) access and modification. The interesting thing about skip lists, in contrast to binary trees, is that it is much easier to create a performant concurrent skip list (either with or without locks) than a concurrent binary tree.

The data in the table is just a list of key & value (or delete marker). And that means that searches through this can give you one of three results (there is a rough sketch of the lookup right after this list):

  • Here is the value for the key (exists)
  • The value for the key was removed (deleted)
  • The value is not in the memory table (missing)
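A minimal sketch of how that three-way answer falls out of the lookup (my paraphrase; UserKeyMatches, TagIsValue and ExtractValue are shorthands I made up for decoding that the real MemTable::Get does inline):

// Sketch only: the lookup reports "I have an answer for this key" via the
// return value, and distinguishes a live value from a tombstone via the Status.
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Table::Iterator iter(&table_);              // iterator over the skip list
  iter.Seek(key.memtable_key().data());
  if (iter.Valid() && UserKeyMatches(iter.key(), key)) {  // hypothetical helper
    if (TagIsValue(iter.key())) {             // kTypeValue: here is the value (exists)
      *value = ExtractValue(iter.key());      // hypothetical helper returning std::string
      return true;
    }
    *s = Status::NotFound(Slice());           // kTypeDeletion: the key was deleted
    return true;
  }
  return false;                               // not in this memtable at all (missing)
}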

It is the last of those results where we get involved with the more interesting aspect of LevelDB (and the reason it is called leveldb in the first place): the notion that you have multiple levels. The memtable is the first one, and then you spill the output out to disk (the Sorted Strings Table). Now that I have figured out how simple MemTable really is, I am going to take a look at the leveldb log, and then dive into the Sorted Strings Table.

time to read 3 min | 449 words

This is a bit of a side track. One of the things that is quite clear to me when I am reading the leveldb code is that I was never really any good at C++. I was a C/C++ developer. And that is a pretty derogatory term. C & C++ share a lot of the same syntax and underlying assumptions, but the moment you want to start writing non-trivial stuff, they are quite different. And no, I am not talking about OO or templates.

I am talking about things that came out of that. In particular, throughout the leveldb codebase, they very rarely, if at all, allocate memory directly. Pretty much the whole codebase relies on std::string to handle buffer allocation and management. This makes sense, since RAII is still the watchword for good C++ code. Being able to utilize std::string for memory management also means that the memory will be properly released without having to deal with it explicitly.

More interestingly, the leveldb codebase is also using std::string as a general buffer. I wonder why it is std::string vs. std::vector<char>, which would be more reasonable, but I guess that this is because most of the time, users will want to pass strings as keys, and likely this is easier to manage, given the type of operations available on std::string (such as append).

It is actually quite fun to go over the codebase and discover those sorts of things. Especially if I can figure them out on my own. :-)

This is quite interesting because from my point of view, buffers are a whole different set of problems. We don’t have to worry about the memory just going away in .NET (although we do have to worry about someone changing the buffer behind our backs), but we have to worry a lot about buffer size. This is because at some point (80Kb), buffers graduate to the large object heap, and stay there. Which means, in turn, that every time that you want to deal with buffers you have to take that into account, usually with a buffer pool.

Another aspect that is interesting with regards to memory usage is the explicit handling of copying. There are various places in the code where the copy constructor was made private, to avoid this. Or a comment is left about making a type copyable intentionally. I get the reason why, because it is a common failing point in C++, but I forgot (although I am pretty sure that I used to know) the actual semantics of when / how you want to do that in all cases.

time to read 5 min | 859 words

We have just released the next stable build 2330 of RavenDB 2.0. You can find it here. This release contains a lot of bug fixes, improvements, streamlining and some interesting new stuff.

The full change log is actually here, because we found a bug in 2325 (ironically, it was a bug in how it reported its build number).

Breaking Changes:

  • SQL Replication script / configuration change (more below).

Features:

  • More debug / visibility endpoints (user info, changes traffic, map/reduce data, etc).
  • Better highlighting support.
  • Spatial Search will sort by distance by default.
  • Better indexing for TimeSpan values.
  • Can do more Parallel Work in Map/Reduce indexes now.

Improvements:

  • Map/Reduce indexes tune themselves automatically.
  • Better Periodic Backup behavior when there are no new writes.
  • Better handling of transactions during document puts with a high number of referencing documents.
  • Better use of alerts.
  • Better float support.

Studio:

  • Better import/export UI.

Bug fixes:

  • Can backup & restore even in the presence of corrupt / missing indexes.
  • Fixed LoadDocument with map/reduce indexes causing issues.
  • Allow changing the number of cached requests on the client side without an NRE.
  • Fixed the Unique Constraints bundle with null unique properties.
  • Fixed a Forbidden error when running as a non-admin user in the studio.
  • Better support for indexing nullable properties with HasValue.
  • Fixed a problem with replication of deleted documents when adding a new node in the topology.
  • Support export / import with versioning bundle.

 

SQL Replication Breaking Changes

With SQL Replication, it became apparent that we missed a pretty big use case.  Deletions.

Deletions are something that we didn’t handle, and couldn’t handle using the existing format. It was a tough call, but we decided to make a breaking change here.

Now, you need to define all the tables that you’ll be working with (as well as the order we will be writing to them). Assuming that we have a User document, and we want to replicate to Users and UsersGroups tables, we would have:

replicateToUsers({
   Name: this.Name
})

for(var i = 0; i < this.Groups.length; i++) {
  replicateToUsersGroups({
      Group: this.Groups[i]
  });
}

This replaces the sqlReplicate calls. Note that this is a hard breaking reset. When you upgrade, you’ll need to update all of your SQL Replication definitions (but you keep the replication state; you won’t have to start replicating from scratch).

time to read 26 min | 5198 words

One of the key external components of leveldb is the idea of WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

leveldb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
s = db->Write(leveldb::WriteOptions(), &batch);

As we have learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into a single WriteBatch, then there is some batching involved across multiple batches, but that is beside the point right now.

I dove into the code for WriteBatch, and immediately I realized that this isn’t really what I bargained for. In my mind, WriteBatch was supposed to be something like this:

public class WriteBatch
{
   List<Operation> Operations;
}

Which would hold the in memory operations until they get written down to disk, or something.

Instead, it appears that leveldb took quite a different route. The entire data is stored in the following format:

// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring         |
//    kTypeDeletion varstring
// varstring :=
//    len: varint32
//    data: uint8[len]

This is the in memory value, mind. So we are already storing this in a single buffer. I am not really sure why this is the case, to be honest.
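To make the format concrete, this is roughly what Put() and Delete() do to that buffer (my recollection of write_batch.cc, so take it as a sketch): bump the count in the header, then append a tagged, length-prefixed record to rep_.

// Sketch of WriteBatch::Put / Delete: everything goes straight into the rep_ string.
void WriteBatch::Put(const Slice& key, const Slice& value) {
  WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
  rep_.push_back(static_cast<char>(kTypeValue));    // record tag
  PutLengthPrefixedSlice(&rep_, key);               // varint32 length + bytes
  PutLengthPrefixedSlice(&rep_, value);
}

void WriteBatch::Delete(const Slice& key) {
  WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
  rep_.push_back(static_cast<char>(kTypeDeletion));
  PutLengthPrefixedSlice(&rep_, key);
}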

WriteBatch is pretty much a write only data structure, with one major exception:

// Support for iterating over the contents of a batch.
class Handler {
 public:
  virtual ~Handler();
  virtual void Put(const Slice& key, const Slice& value) = 0;
  virtual void Delete(const Slice& key) = 0;
};
Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

Status WriteBatch::Iterate(Handler* handler) const {
  Slice input(rep_);
  if (input.size() < kHeader) {
    return Status::Corruption("malformed WriteBatch (too small)");
  }

  input.remove_prefix(kHeader);
  Slice key, value;
  int found = 0;
  while (!input.empty()) {
    found++;
    char tag = input[0];
    input.remove_prefix(1);
    switch (tag) {
      case kTypeValue:
        if (GetLengthPrefixedSlice(&input, &key) &&
            GetLengthPrefixedSlice(&input, &value)) {
          handler->Put(key, value);
        } else {
          return Status::Corruption("bad WriteBatch Put");
        }
        break;
      case kTypeDeletion:
        if (GetLengthPrefixedSlice(&input, &key)) {
          handler->Delete(key);
        } else {
          return Status::Corruption("bad WriteBatch Delete");
        }
        break;
      default:
        return Status::Corruption("unknown WriteBatch tag");
    }
  }
  if (found != WriteBatchInternal::Count(this)) {
    return Status::Corruption("WriteBatch has wrong count");
  } else {
    return Status::OK();
  }
}

So we write it directly to a buffer, then read from that buffer. The interesting bit is that the actual writing to leveldb itself is done in a similar way, see:

class MemTableInserter : public WriteBatch::Handler {
 public:
  SequenceNumber sequence_;
  MemTable* mem_;

  virtual void Put(const Slice& key, const Slice& value) {
    mem_->Add(sequence_, kTypeValue, key, value);
    sequence_++;
  }
  virtual void Delete(const Slice& key) {
    mem_->Add(sequence_, kTypeDeletion, key, Slice());
    sequence_++;
  }
};

Status WriteBatchInternal::InsertInto(const WriteBatch* b,
                                      MemTable* memtable) {
  MemTableInserter inserter;
  inserter.sequence_ = WriteBatchInternal::Sequence(b);
  inserter.mem_ = memtable;
  return b->Iterate(&inserter);
}

As I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.

It took me a while to figure it out, but I think that I finally got it. The reason this is the case is that leveldb is a C++ application. As such, memory management is something that it needs to worry about explicitly.

In particular, you can’t just rely on the memory you were passed to be held; the user may release that memory after they call Put. This means, in turn, that you must copy the memory into memory that leveldb allocated, so leveldb can manage its lifetime itself. This is a foreign concept to me because it is such a strange thing to do in the .NET land, where memory cannot just disappear underneath you.
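To illustrate the problem, here is a contrived example of mine (not leveldb code): if Put() only kept the Slice it was handed, the data could be gone before it ever made it to the memtable.

// Contrived illustration: why leveldb must copy the bytes it is handed.
#include <string>
#include "leveldb/db.h"

void StoreGreeting(leveldb::DB* db) {
  {
    std::string value = "Hello World";
    // Put() receives Slices that point into `value`. If leveldb merely kept
    // those pointers, they would dangle as soon as this scope ends...
    db->Put(leveldb::WriteOptions(), "greeting", value);
  }
  // ...which is why Put() copies the bytes into memory that leveldb owns
  // (the WriteBatch buffer, and from there the memtable).
}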

On my next post, I’ll deal a bit more with this aspect, buffers management and memory handling in general.

time to read 27 min | 5257 words

I think that the very first thing that we want to do is to actually discover how exactly leveldb saves the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

leveldb::DB* db;
leveldb::DB::Open(options, "play/testdb", &db);
status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

// Default implementations of convenience methods that subclasses of DB
// can call if they wish
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
  WriteBatch batch;
  batch.Put(key, value);
  return Write(opt, &batch);
}

Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  WriteBatch batch;
  batch.Delete(key);
  return Write(opt, &batch);
}

I included the Delete method as well, because this code teaches us something important: all the modification calls always go through the same WriteBatch call. Let us look at that now.

   1: Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
   2:   Writer w(&mutex_);
   3:   w.batch = my_batch;
   4:   w.sync = options.sync;
   5:   w.done = false;
   6:  
   7:   MutexLock l(&mutex_);
   8:   writers_.push_back(&w);
   9:   while (!w.done && &w != writers_.front()) {
  10:     w.cv.Wait();
  11:   }
  12:   if (w.done) {
  13:     return w.status;
  14:   }
  15:  
  16:   // May temporarily unlock and wait.
  17:   Status status = MakeRoomForWrite(my_batch == NULL);
  18:   uint64_t last_sequence = versions_->LastSequence();
  19:   Writer* last_writer = &w;
  20:   if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
  21:     WriteBatch* updates = BuildBatchGroup(&last_writer);
  22:     WriteBatchInternal::SetSequence(updates, last_sequence + 1);
  23:     last_sequence += WriteBatchInternal::Count(updates);
  24:  
  25:     // Add to log and apply to memtable.  We can release the lock
  26:     // during this phase since &w is currently responsible for logging
  27:     // and protects against concurrent loggers and concurrent writes
  28:     // into mem_.
  29:     {
  30:       mutex_.Unlock();
  31:       status = log_->AddRecord(WriteBatchInternal::Contents(updates));
  32:       if (status.ok() && options.sync) {
  33:         status = logfile_->Sync();
  34:       }
  35:       if (status.ok()) {
  36:         status = WriteBatchInternal::InsertInto(updates, mem_);
  37:       }
  38:       mutex_.Lock();
  39:     }
  40:     if (updates == tmp_batch_) tmp_batch_->Clear();
  41:  
  42:     versions_->SetLastSequence(last_sequence);
  43:   }
  44:  
  45:   while (true) {
  46:     Writer* ready = writers_.front();
  47:     writers_.pop_front();
  48:     if (ready != &w) {
  49:       ready->status = status;
  50:       ready->done = true;
  51:       ready->cv.Signal();
  52:     }
  53:     if (ready == last_writer) break;
  54:   }
  55:  
  56:   // Notify new head of write queue
  57:   if (!writers_.empty()) {
  58:     writers_.front()->cv.Signal();
  59:   }
  60:  
  61:   return status;
  62: }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seem to create a new Writer, not sure what that is yet, and register it in a class variable. Maybe it is actually being written on a separate thread?

I am going to switch over and look at that line of thinking. The first thing to do is to look at the Writer implementation. This Writer looks like this:

struct DBImpl::Writer {
  Status status;
  WriteBatch* batch;
  bool sync;
  bool done;
  port::CondVar cv;

  explicit Writer(port::Mutex* mu) : cv(mu) { }
};

So this is just a data structure with no behavior. Note that we have CondVar, whatever that is, which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deep into this, but it appears to be similar to the .NET lock variable, except that there seems to be the ability to associate multiple variables with a single mutex, which could be a useful way to signal on specific conditions. The basic idea is that you can wait for a specific operation, not just a single variable.
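For reference, this is roughly the wrapper in port/port_posix.h (a sketch from memory; the real code also wraps the pthread calls with error checking): a thin shim over pthread_cond_t that is tied to a specific Mutex.

// Sketch of leveldb's condition variable wrapper: Wait() releases the
// associated mutex while blocked and re-acquires it before returning.
class CondVar {
 public:
  explicit CondVar(Mutex* mu) : mu_(mu) { pthread_cond_init(&cv_, NULL); }
  ~CondVar() { pthread_cond_destroy(&cv_); }

  void Wait()      { pthread_cond_wait(&cv_, &mu_->mu_); }
  void Signal()    { pthread_cond_signal(&cv_); }
  void SignalAll() { pthread_cond_broadcast(&cv_); }

 private:
  pthread_cond_t cv_;
  Mutex* mu_;
};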

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque (a data structure merging the properties of a list & a queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look at where else this value is being used now. Back now, and it appears that the only place where writers_ is used is in this method or methods that it calls.

What this means, in turn, is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access, as I understand it. Calls to the Write() method block on the mutex access, then each waits until its write is the current one (that is what the &w != writers_.front() check means). Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.

Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room for the next write. I don’t really follow the code there yet, so I’ll ignore that for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why that is; I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup; this method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.
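My mental model of BuildBatchGroup, condensed (a sketch, not the verbatim code): starting from the front writer, keep appending the batches of the writers queued behind it into one temporary batch, stopping at a size cap so that a small write doesn't get stuck behind a huge group.

// Condensed sketch of the group commit logic in DBImpl::BuildBatchGroup.
WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
  Writer* first = writers_.front();
  WriteBatch* result = first->batch;
  size_t size = WriteBatchInternal::ByteSize(first->batch);

  // Cap the merged batch, and be gentler if the original write was small.
  size_t max_size = 1 << 20;
  if (size <= (128 << 10)) max_size = size + (128 << 10);

  *last_writer = first;
  std::deque<Writer*>::iterator iter = writers_.begin();
  ++iter;  // advance past "first"
  for (; iter != writers_.end(); ++iter) {
    Writer* w = *iter;
    if (w->sync && !first->sync) break;  // don't fold a sync write into a non-sync group
    if (w->batch != NULL) {
      size += WriteBatchInternal::ByteSize(w->batch);
      if (size > max_size) break;        // don't make the group too big
      if (result == first->batch) {
        // Switch to the shared tmp_batch_ instead of disturbing the caller's batch.
        result = tmp_batch_;
        WriteBatchInternal::Append(result, first->batch);
      }
      WriteBatchInternal::Append(result, w->batch);
    }
    *last_writer = w;
  }
  return result;
}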

The rest of the code is dealing with the actual write to the log / memtable (lines 20 – 45), then updating the status of the other writers we might have modified, as well as starting the writes for existing writers that may not have gotten into the current batch.

And I think that this is enough for now. We haven’t gotten to disk yet, I admit, but we did get a lot of stuff done. In my next post, I’ll dig even deeper, and try to see how the data is actually structured. I think that this would be interesting…

time to read 1 min | 104 words

Just thought that you might appreciate a peek into what we have been working on:

[image: screenshot of the new RavenDB installer]

You can consider the bright pink background a bug, by the way. But the installer is real, and it will guide you through an install of RavenDB using the “Yes, Dear” model.

This is mostly for clients that don’t like xcopy installs (honestly, this is to make sure that setting up in IIS is no longer a set of manual steps).

time to read 2 min | 285 words

LevelDB is…

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

That is the project’s own definition. Basically, it is a way for users to store data in an efficient manner. It isn’t a SQL database. It isn’t even a real database in any sense of the word. What it is is a building block for building databases. It handles writing and reading to disk, and it supports atomicity. But anything else is on you (from transaction management to more complex items).

As such, it appears perfect for the kind of things that we need to do. I decided that I wanted to get to know the codebase, especially since at this time, I can’t even get it to compile. :-( The fact that this is a C++ codebase, written by people who eat & breathe C++ for a living, is another reason why. I expect that this would be a good codebase, so I might as well sharpen my C++ foo at the same time that I grok what this is doing.

The first thing to do is to look at the interface that the database provides us with:

[image: the leveldb::DB public interface]
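Since the screenshot is gone, here is roughly the public surface I am talking about, as declared in include/leveldb/db.h (abridged from memory):

// Abridged sketch of the leveldb::DB public interface.
class DB {
 public:
  static Status Open(const Options& options, const std::string& name, DB** dbptr);
  virtual ~DB();

  virtual Status Put(const WriteOptions& options, const Slice& key, const Slice& value) = 0;
  virtual Status Delete(const WriteOptions& options, const Slice& key) = 0;
  virtual Status Write(const WriteOptions& options, WriteBatch* updates) = 0;
  virtual Status Get(const ReadOptions& options, const Slice& key, std::string* value) = 0;

  virtual Iterator* NewIterator(const ReadOptions& options) = 0;
  virtual const Snapshot* GetSnapshot() = 0;
  virtual void ReleaseSnapshot(const Snapshot* snapshot) = 0;

  virtual bool GetProperty(const Slice& property, std::string* value) = 0;
  virtual void GetApproximateSizes(const Range* range, int n, uint64_t* sizes) = 0;
  virtual void CompactRange(const Slice* begin, const Slice* end) = 0;
};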

That is a very small surface area, and as you can imagine, this is something that I highly approve of. It makes it much easier to understand and reason about. And there is some pretty complex behavior behind this, which I’ll be exploring soon.
