Ayende @ Rahien

Refunds available at head office

Toys for geeks

I just got myself a UFO Mini Helicopter, it looks like this:

Mini Helicopter UFO Aircraft With Remote Control

This is the first helicopter that I got, and for a $30 toy, it is an awesome amount of fun. The only complaint that I have is that it has only about 5 minutes of battery life.

I am really bad at flying it, too.

As mentioned, this is the very first helicopter that I bought, and I think that I would like to have a better one for the next time. Any recommendations from you guys?

  • I would like better battery life; 30 minutes to an hour is what I want.
  • Should be pretty resistant to crashes. I know that I am going to crash it a lot.

Any recommendations?

Reviewing LevelDB: Part VII–The version is where the levels are

Okay, so far I have written 6 parts, and the only thing that has happened is that we wrote some stuff to the log file. That is cool, but I am assuming that there has got to be more. I started tracing the code, and I think that what happens is that we have compactions of the MemTable, at which point we flush it to disk.

I think that what happens is this: we have a method called MaybeScheduleCompaction, in db_impl.cc, which kicks off the actual process of MemTable compaction. It is called from a few places, but most importantly, it is called during the Write() call. Reading the code, it seems that before we can get to the actual compaction work, we need to look at something called a VersionSet. This looks like it holds all of the versions of the database at a particular point in time, including all the files that it is using, etc.

A lot of what it (and its associate, the Version class) does is manage lists of this structure:

image
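The screenshot of that structure did not survive here, but it is roughly the FileMetaData struct from version_edit.h; the following is a paraphrase from memory, so field order and defaults may differ slightly:

    struct FileMetaData {
      int refs;              // how many versions still reference this file
      int allowed_seeks;     // seeks allowed before we trigger a compaction
      uint64_t number;       // the SSTable file number
      uint64_t file_size;    // file size in bytes
      InternalKey smallest;  // smallest internal key served by this table
      InternalKey largest;   // largest internal key served by this table
    };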

I am not sure what allowed_seeks means; I assume it is there to force compaction for the next level.

Okay, moving on to Version, it looks like this is where all the actual lookups are done. We have a list of file metadata, including the smallest & largest keys in each file. That allows us to find the appropriate files to look at quite easily. There seems to be some interaction between Version and TableCache, but I’m not going into that now.

A Version holds an array of 7 levels, and at each level we have the associated files. I am going to continue digging into Version & VersionSet for the moment.
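In other words, the interesting piece of Version boils down to something like this (again a paraphrase of version_set.h, with kNumLevels being 7 by default):

    class Version {
      // ...
      // One list of file metadata per level; each level holds the
      // FileMetaData entries for the SSTables that belong to it.
      std::vector<FileMetaData*> files_[config::kNumLevels];
    };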

Side Note: In fact, I got frustrated enough with trying to figure out leveldb on Windows that I set up an Ubuntu machine with KDevelop just to get things done. This blog post is actually written on the Ubuntu machine (later to be copied into Live Writer :-)).

I am still in the process of going through the code. It is really much easier to do this in an IDE that can actually build & understand the code.

One thing that I can tell you right now is that C++ programmers are strange. I mean, take a look at the following code, from Version::LevelFileNumIterator:

image
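The screenshot is missing from this copy, but the method in question is roughly this (paraphrased from version_set.cc, so take the details with a grain of salt):

    // Packs the current file's number and size into a 16 byte buffer,
    // as two fixed64 values, and hands that back as the iterator value.
    Slice value() const {
      assert(Valid());
      EncodeFixed64(value_buf_, (*flist_)[index_]->number);
      EncodeFixed64(value_buf_ + 8, (*flist_)[index_]->file_size);
      return Slice(value_buf_, sizeof(value_buf_));
    }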

This returns a byte array containing the encoded file number & size in a buffer. Would it be so hard to create a struct for that, or use std::pair? It seems like this would complicate the client code. Then again, maybe there is a perf reason that I am not seeing?

Then again, here is the client code:

image

And that seems pretty clear.

So far, it appears as if the Version is the current state of all of the files at a particular point in time. I think that this is how leveldb implements snapshots. The files are SSTables, which are pretty much write once only. A version belongs to a set (not sure exactly what that means yet) and is part of a linked list. Again, I am not sure what the purpose of that is yet.

I'll need to do a deeper dive into snapshots in general, later on, because it is interesting to see how that is implemented with regards to the memtable.

Moving back to the actual code, we have this:

image

This seems to me to indicate that the table_cache is the part of the code that actually manages the SSTables, probably using some variant of a page pool.

Now, let us get to the good parts, Version::Get:

image

This looks like it is actually doing something useful. In fact, it finds the relevant files to look at for that particular key, and once it has done that, it calls:

image

So the data is actually retrieved from the cache, as expected. But there was an interesting comment there about “charging” seeks for files, so I am going to be looking at who is calling Version::Get right now, then come back to the cache in another post.

What is interesting is that we have this guy:

image

And that in turn all makes sense now. allowed_seeks is something that is set when we apply a VersionEdit, it seems. No idea what this is yet, but there is a comment there that explains that we use this as a way to trigger compaction when it is cheaper to do the compaction than to continue doing those seeks. Interestingly enough, seeks are only counted if we have to go through more than one file to find a value, which makes sense, I guess.
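As far as I can tell, the place that sets it (when new files are applied from a VersionEdit) looks more or less like the sketch below; the exact constants are from memory, so treat them as approximate:

    // Roughly: one seek is "worth" compacting about 16KB of data, so each
    // new file gets a seek budget proportional to its size, with a floor.
    // When Version::UpdateStats() exhausts the budget, the file is marked
    // as a compaction candidate.
    f->allowed_seeks = static_cast<int>(f->file_size / 16384);
    if (f->allowed_seeks < 100) {
      f->allowed_seeks = 100;
    }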

Okay, now let us back up a bit and see who is calling Version::Get. And as it turns out, it is our dear friend, DBImpl::Get().

There, we first look in the current memtable, then in the immutable memtable (which is probably on its way to becoming an SSTable now), and then we look at the current Version, calling Version::Get. If we actually hit the version, we also call Version::UpdateStats, and if we need to, we then call MaybeScheduleCompaction(), which is where we started this post.
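Condensed, the lookup order in DBImpl::Get() is something like the following sketch (error handling and snapshot plumbing elided, so this is not the verbatim source):

    Status s;
    if (mem->Get(lkey, value, &s)) {
      // hit (or a deletion marker) in the active memtable
    } else if (imm != NULL && imm->Get(lkey, value, &s)) {
      // hit in the immutable memtable that is waiting to become an SSTable
    } else {
      s = current->Get(options, lkey, value, &stats);  // go to the SSTables
      have_stat_update = true;
    }
    if (have_stat_update && current->UpdateStats(stats)) {
      MaybeScheduleCompaction();  // some file used up its allowed_seeks budget
    }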

And... that is it for this post. We still haven't managed to find where we actually save to disk (they hid it really deep), but I think I'll probably be able to figure this out in this sitting. Watch out for the next post.

Reviewing LevelDB: Part VI, the Log is base for Atomicity

Here we are starting to get into the interesting bits: how do we actually write to disk? There are two parts to that. The first part is the log file. This is where all the recent values are stored, and it is an unsorted backup for the MemTable in case of crashes.

Let us see how this actually works. There are two classes involved here: leveldb::log::Writer and leveldb::WritableFile. I think that WritableFile is the leveldb file abstraction, so it is bound to be simpler. We’ll take a look at that first.

Here is what it looks like:

   1: // A file abstraction for sequential writing.  The implementation
   2: // must provide buffering since callers may append small fragments
   3: // at a time to the file.
   4: class WritableFile {
   5:  public:
   6:   WritableFile() { }
   7:   virtual ~WritableFile();
   8:  
   9:   virtual Status Append(const Slice& data) = 0;
  10:   virtual Status Close() = 0;
  11:   virtual Status Flush() = 0;
  12:   virtual Status Sync() = 0;
  13:  
  14:  private:
  15:   // No copying allowed
  16:   WritableFile(const WritableFile&);
  17:   void operator=(const WritableFile&);
  18: };

Pretty simple, overall. There is the buffering requirement, but that is pretty easy to satisfy. Note that this is a C++ interface. There are a bunch of implementations, but the one that I think will be relevant here is PosixMmapFile. So much for it being simple. As I mentioned, this is POSIX code that I am reading, and I have to do a lot of lookups in the man pages. The implementation isn’t that interesting, to be fair, and is full of mmap-on-POSIX minutiae, so I am going to skip it.

I wonder why the choice was made to use memory mapped files, since the API exposed here is pretty much perfect for streams. As you can imagine from the code, calling Append() just writes the values to the mmap file, Flush() is a no-op, and Sync() actually asks the file system to write the values to disk and waits on that. I am guessing that the use of mmap files is related to the fact that mmap files are used extensively in the rest of the codebase (for reads), and that gives leveldb the benefit of using the OS memory manager as the buffer.
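To give a feel for it, Append() on the mmap implementation is conceptually just a memcpy into the mapped region, remapping a new chunk when the current one fills up. This is a simplified sketch, not the verbatim PosixMmapFile code:

    Status Append(const Slice& data) {
      const char* src = data.data();
      size_t left = data.size();
      while (left > 0) {
        size_t avail = limit_ - dst_;       // room left in the current mapping
        if (avail == 0) {
          // unmap the filled chunk and map the next one, growing the file
          if (!UnmapCurrentRegion() || !MapNewRegion()) {
            return IOError(filename_, errno);
          }
          avail = limit_ - dst_;
        }
        size_t n = (left <= avail) ? left : avail;
        memcpy(dst_, src, n);               // the "write" is just a memory copy
        dst_ += n;
        src += n;
        left -= n;
      }
      return Status::OK();
    }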

Now that we know what a WritableFile is like, let us see what the leveldb::log::Writer is like. In terms of the interface, it is pretty slick; it has a single public method:

   1: Status AddRecord(const Slice& slice);

As a reminder, those two are used together in the DBImpl::Write() method, like so:

   1: status = log_->AddRecord(WriteBatchInternal::Contents(updates));
   2: if (status.ok() && options.sync) {
   3:  status = logfile_->Sync();
   4: }

From the look of the API, it appears that this is a matter of simply forwarding the call from one implementation to another. But a lot more is actually going on:

   1: Status Writer::AddRecord(const Slice& slice) {
   2:   const char* ptr = slice.data();
   3:   size_t left = slice.size();
   4:  
   5:   // Fragment the record if necessary and emit it.  Note that if slice
   6:   // is empty, we still want to iterate once to emit a single
   7:   // zero-length record
   8:   Status s;
   9:   bool begin = true;
  10:   do {
  11:     const int leftover = kBlockSize - block_offset_;
  12:     assert(leftover >= 0);
  13:     if (leftover < kHeaderSize) {
  14:       // Switch to a new block
  15:       if (leftover > 0) {
  16:         // Fill the trailer (literal below relies on kHeaderSize being 7)
  17:         assert(kHeaderSize == 7);
  18:         dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
  19:       }
  20:       block_offset_ = 0;
  21:     }
  22:  
  23:     // Invariant: we never leave < kHeaderSize bytes in a block.
  24:     assert(kBlockSize - block_offset_ - kHeaderSize >= 0);
  25:  
  26:     const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
  27:     const size_t fragment_length = (left < avail) ? left : avail;
  28:  
  29:     RecordType type;
  30:     const bool end = (left == fragment_length);
  31:     if (begin && end) {
  32:       type = kFullType;
  33:     } else if (begin) {
  34:       type = kFirstType;
  35:     } else if (end) {
  36:       type = kLastType;
  37:     } else {
  38:       type = kMiddleType;
  39:     }
  40:  
  41:     s = EmitPhysicalRecord(type, ptr, fragment_length);
  42:     ptr += fragment_length;
  43:     left -= fragment_length;
  44:     begin = false;
  45:   } while (s.ok() && left > 0);
  46:   return s;
  47: }

We do a lot here, and I don’t know yet what is going on. From a first glance, it appears that we are fragmenting the value into multiple records, and we might want to emit zero length records (no idea what that is for; maybe compactions?).

It appears that we write in blocks of 32KB at a time. Lines 12 – 21 deal with how to finalize the block when you have no more space (basically, fill it in with nulls).

Lines 26 – 40 just figure out the type of the record that we are going to write (a full record, all of which can sit in a single block; a first record, which is the start of a sequence of fragments; or a middle / last record, which is obvious).

And then we just emit the physical record to disk, and move on. I am not really sure what the reasoning is behind it. It may be to avoid having to read records that are far too big?
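For reference, the physical record layout being emitted, as I understand it from the comments in log_format.h (sizes inferred from the kHeaderSize == 7 assertion above), is:

    // block  := record* trailer?
    // record := checksum: uint32    // crc32c of the type byte + payload
    //           length:   uint16    // payload size
    //           type:     uint8     // kFullType / kFirstType / kMiddleType / kLastType
    //           payload:  uint8[length]
    //
    // Blocks are kBlockSize (32KB) long, and a record never spans a block
    // boundary; that is why AddRecord() fragments large slices.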

I looked at EmitPhysicalRecord to see what we have there, and it is nothing much; it writes the header, including CRC computation, but that is pretty much it. So far, a lot of questions, but not a lot of answers. Maybe I’ll get them when I start looking at the reading portion of the code. But that will be in another post.

Reviewing LevelDB: Part V, into the MemTables we go

You can read about the theory of Sorted Strings Tables and MemTables here. In this case, what I am interested in is going a bit deeper into the leveldb codebase and understanding how the data is actually kept in memory and what it is doing there.

In order to do that, we are going to investigate MemTable. As it turns out, this is actually a very simple data structure. A MemTable just holds a SkipList, which is a sorted data structure that allows O(log N) access and modifications. The interesting thing about skip lists, in contrast to binary trees, is that it is much easier to create a performant concurrent skip list (either with or without locks) than a concurrent binary tree.

The data in the table is just a list of keys & values (or delete markers). And that means that searches through this can give you one of three results (sketched in code right after this list):

  • Here is the value for the key (exists)
  • The value for the key was removed (deleted)
  • The value is not in the memory table (missing)
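In code, those three outcomes map roughly to the following shape; this is a simplified paraphrase of MemTable::Get, where UserKeyMatches, ExtractTag and ExtractValue are stand-ins for the inline key decoding that the real code does:

    // Entries in the skip list are encoded as:
    //   [varint32 key length][user key][8 byte tag][varint32 value length][value]
    bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
      Table::Iterator iter(&table_);
      iter.Seek(key.memtable_key().data());
      if (iter.Valid() && UserKeyMatches(iter.key(), key.user_key())) {
        const uint64_t tag = ExtractTag(iter.key());
        switch (static_cast<ValueType>(tag & 0xff)) {
          case kTypeValue: {                     // exists: copy the value out
            Slice v = ExtractValue(iter.key());
            value->assign(v.data(), v.size());
            return true;
          }
          case kTypeDeletion:                    // deleted: we know the answer
            *s = Status::NotFound(Slice());
            return true;
        }
      }
      return false;                              // missing: check older levels
    }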

It is the last part where we get involved with the more interesting aspect of LevelDB (and the reason it is called leveldb in the first place): the notion that you have multiple levels. The mem table is the first one, and then you spill the output out to disk (the Sorted Strings Table). Now that I have figured out how simple MemTable really is, I am going to take a look at the leveldb log, and then dive into the Sorted Strings Table.

Reviewing LevelDB: Part IV: On std::string, buffers and memory management in C++

This is a bit of a side track. One of the things that is quite clear to me when I am reading the leveldb code is that I was never really any good at C++. I was a C/C++ developer. And that is a pretty derogatory term. C & C++ share a lot of the same syntax and underlying assumptions, but the moment you want to start writing non trivial stuff, they are quite different. And no, I am not talking about OO or templates.

I am talking about things that came out of that. In particular, throughout the leveldb codebase, they very rarely, if at all, allocate memory directly. Pretty much the whole codebase relies on std::string to handle buffer allocations and management. This makes sense, since RAII is still the watchword for good C++ code. Being able to utilize std::string for memory management also means that the memory will be properly released without having to deal with it explicitly.

More interestingly, the leveldb codebase is also using std::string as a general buffer. I wonder why it is std::string vs. std::vector<char>, which would seem more reasonable, but I guess that this is because most of the time users will want to pass strings as keys, and this is likely easier to manage, given the type of operations available on std::string (such as append).
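A tiny example of the pattern, in the spirit of leveldb’s util/coding.h (this is my own simplified version, not the library’s code): std::string doubles as a growable byte buffer, and serializing a value is just appending raw bytes to it.

    #include <string>
    #include <cstdint>

    inline void PutFixed32(std::string* dst, uint32_t value) {
      char buf[4];
      buf[0] = static_cast<char>(value & 0xff);
      buf[1] = static_cast<char>((value >> 8) & 0xff);
      buf[2] = static_cast<char>((value >> 16) & 0xff);
      buf[3] = static_cast<char>((value >> 24) & 0xff);
      dst->append(buf, sizeof(buf));  // the string owns the memory, RAII frees it
    }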

It is actually quite fun to go over the codebase and discover these sorts of things. Especially if I can figure them out on my own.

This is quite interesting because from my point of view, buffers are a whole different set of problems. We don’t have to worry about the memory just going away in .NET (although we do have to worry about someone changing the buffer behind our backs), but we have to worry a lot about buffer size. This is because at some point (85,000 bytes), buffers graduate to the large object heap, and stay there. Which means, in turn, that every time you want to deal with buffers you have to take that into account, usually with a buffer pool.

Another aspect that is interesting with regards to memory usage is the explicit handling of copying. There are various places in the code where the copy constructor was made private to avoid this, or a comment is left about making a type copyable intentionally. I get the reason why, because it is a common failure point in C++, but I forgot (although I am pretty sure that I used to know) the actual semantics of when / how you want to do that in all cases.

RavenDB 2.0.3 Stable Release!

We have just released the next stable build 2330 of RavenDB 2.0. You can find it here. This release contains a lot of bug fixes, improvements, streamlining and some interesting new stuff.

The full change log is actually here, because we found a bug in 2325 (ironically, it was a bug in how it reported its build number).

Breaking Changes:

  • SQL Replication script / configuration change (more below).

Features:

  • More debug / visibility endpoints (user info, changes traffic, map/reduce data, etc).
  • Better highlighting support.
  • Spatial Search will sort by distance by default.
  • Better indexing for TimeSpan values.
  • Can do more Parallel Work in Map/Reduce indexes now.

Improvements:

  • Map/Reduce indexes tune themselves automatically.
  • Better Periodic Backup behavior when there are no new writes.
  • Better handling of transactions during document puts with a high number of referencing documents.
  • Better use of alerts.
  • Better float support.

Studio:

  • Better import/export UI.

Bug fixes:

  • Can backup & restore even in the presence of corrupt / missing indexes.
  • Fixed LoadDocument with map/reduce indexes causing issues.
  • Allow changing the number of cached requests on the client side without an NRE.
  • Fixing Unique Constraints bundle with null unique properties.
  • Fixed a Forbidden error when running as a non-admin user in the studio.
  • Better support for indexing nullable properties with HasValue.
  • Fixed a problem with replication of deleted documents when adding a new node in the topology.
  • Support export / import with versioning bundle.

 

SQL Replication Breaking Changes

With SQL Replication, it became apparent that we missed a pretty big use case: deletions.

Deletions are something that we didn’t handle, and couldn’t handle, using the existing format. It was a tough call, but we decided to make a breaking change here.

Now, you need to define all the tables that you’ll be working with (as well as the order we will be writing to them). Assuming that we have a User document, and we want to replicate to Users and UsersGroups tables, we would have:

   1: replicateToUsers({
   2:    Name: this.Name
   3: })
   4:  
   5: for(var i = 0; i < this.Groups.length; i++) {
   6:   replicateToUsersGroups({
   7:       Group: this.Groups[i]
   8:   });
   9: }

This replaces the sqlReplicate calls. Note that this is a hard breaking reset. When you upgrade, you’ll need to update all of your SQL Replication definitions (but you keep the replication state, so you won’t have to start replicating from scratch).


Reviewing LevelDB: Part III, WriteBatch isn’t what you think it is

One of the key external components of leveldb is the idea of WriteBatch. It allows you to batch multiple operations into a single atomic write.

It looks like this, from an API point of view:

   1: leveldb::WriteBatch batch;
   2: batch.Delete(key1);
   3: batch.Put(key2, value);
   4: s = db->Write(leveldb::WriteOptions(), &batch);

As we have learned in the previous post, WriteBatch is how leveldb handles all writes. Internally, any call to Put or Delete is translated into a single WriteBatch, then there is some batching involved across multiple batches, but that is beside the point right now.

I dove into the code for WriteBatch, and immediately I realized that this isn’t really what I bargained for. In my mind, WriteBatch was supposed to be something like this:

   1: public class WriteBatch
   2: {
   3:    List<Operation> Operations;
   4: }

Which would hold the in memory operations until they get written down to disk, or something.

Instead, it appears that leveldb took quite a different route. The entire data is stored in the following format:

   1: // WriteBatch::rep_ :=
   2: //    sequence: fixed64
   3: //    count: fixed32
   4: //    data: record[count]
   5: // record :=
   6: //    kTypeValue varstring varstring         |
   7: //    kTypeDeletion varstring
   8: // varstring :=
   9: //    len: varint32
  10: //    data: uint8[len]

This is the in memory value, mind. So we are already storing this in a single buffer. I am not really sure why this is the case, to be honest.
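To make that format concrete, Put() and Delete() simply append to that single rep_ string; the following is paraphrased from write_batch.cc, so the details may be slightly off:

    void WriteBatch::Put(const Slice& key, const Slice& value) {
      WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
      rep_.push_back(static_cast<char>(kTypeValue));   // record tag
      PutLengthPrefixedSlice(&rep_, key);              // varstring key
      PutLengthPrefixedSlice(&rep_, value);            // varstring value
    }

    void WriteBatch::Delete(const Slice& key) {
      WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
      rep_.push_back(static_cast<char>(kTypeDeletion));
      PutLengthPrefixedSlice(&rep_, key);
    }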

WriteBatch is pretty much a write only data structure, with one major exception:

   1: // Support for iterating over the contents of a batch.
   2: class Handler {
   3:  public:
   4:   virtual ~Handler();
   5:   virtual void Put(const Slice& key, const Slice& value) = 0;
   6:   virtual void Delete(const Slice& key) = 0;
   7: };
   8: Status Iterate(Handler* handler) const;

You can iterate over the batch. The problem is that we now have this implementation for Iterate:

   1: Status WriteBatch::Iterate(Handler* handler) const {
   2:   Slice input(rep_);
   3:   if (input.size() < kHeader) {
   4:     return Status::Corruption("malformed WriteBatch (too small)");
   5:   }
   6:  
   7:   input.remove_prefix(kHeader);
   8:   Slice key, value;
   9:   int found = 0;
  10:   while (!input.empty()) {
  11:     found++;
  12:     char tag = input[0];
  13:     input.remove_prefix(1);
  14:     switch (tag) {
  15:       case kTypeValue:
  16:         if (GetLengthPrefixedSlice(&input, &key) &&
  17:             GetLengthPrefixedSlice(&input, &value)) {
  18:           handler->Put(key, value);
  19:         } else {
  20:           return Status::Corruption("bad WriteBatch Put");
  21:         }
  22:         break;
  23:       case kTypeDeletion:
  24:         if (GetLengthPrefixedSlice(&input, &key)) {
  25:           handler->Delete(key);
  26:         } else {
  27:           return Status::Corruption("bad WriteBatch Delete");
  28:         }
  29:         break;
  30:       default:
  31:         return Status::Corruption("unknown WriteBatch tag");
  32:     }
  33:   }
  34:   if (found != WriteBatchInternal::Count(this)) {
  35:     return Status::Corruption("WriteBatch has wrong count");
  36:   } else {
  37:     return Status::OK();
  38:   }
  39: }

So we write it directly to a buffer, then read from that buffer. The interesting bit is that the actual writing to leveldb itself is done in a similar way, see:

   1: class MemTableInserter : public WriteBatch::Handler {
   2:  public:
   3:   SequenceNumber sequence_;
   4:   MemTable* mem_;
   5:  
   6:   virtual void Put(const Slice& key, const Slice& value) {
   7:     mem_->Add(sequence_, kTypeValue, key, value);
   8:     sequence_++;
   9:   }
  10:   virtual void Delete(const Slice& key) {
  11:     mem_->Add(sequence_, kTypeDeletion, key, Slice());
  12:     sequence_++;
  13:   }
  14: };
  15:  
  16: Status WriteBatchInternal::InsertInto(const WriteBatch* b,
  17:                                       MemTable* memtable) {
  18:   MemTableInserter inserter;
  19:   inserter.sequence_ = WriteBatchInternal::Sequence(b);
  20:   inserter.mem_ = memtable;
  21:   return b->Iterate(&inserter);
  22: }

As far as I can figure it so far, we have the following steps:

  • WriteBatch.Put / WriteBatch.Delete gets called, and the values we were sent are copied into our buffer.
  • We actually save the WriteBatch, at which point we unpack the values out of the buffer and into the memtable.

It took me a while to figure it out, but I think that I finally got it. The reason this is the case is that leveldb is a C++ application. As such, memory management is something that it needs to worry about explicitly.

In particular, you can’t just rely on the memory you were passed being kept around; the user may release that memory after they call Put. This means, in turn, that you must copy the memory into memory that leveldb allocated, so leveldb can manage its lifetime on its own. This is a foreign concept to me because it is such a strange thing to do in .NET land, where memory cannot just disappear underneath you.

In my next post, I’ll deal a bit more with this aspect: buffer management and memory handling in general.

Reviewing LevelDB: Part II, Put some data on the disk, dude

I think that the very first thing that we want to do is to actually discover how exactly leveldb saves the information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

   1: leveldb::DB* db;
   2: leveldb::DB::Open(options, "play/testdb", &db);
   3: status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

   1: // Default implementations of convenience methods that subclasses of DB
   2: // can call if they wish
   3: Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
   4:   WriteBatch batch;
   5:   batch.Put(key, value);
   6:   return Write(opt, &batch);
   7: }
   8:  
   9: Status DB::Delete(const WriteOptions& opt, const Slice& key) {
  10:   WriteBatch batch;
  11:   batch.Delete(key);
  12:   return Write(opt, &batch);
  13: }

I included the Delete method as well, because this code teaches us something important: all the modification calls always go through the same Write(WriteBatch) path. Let us look at that now.

   1: Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
   2:   Writer w(&mutex_);
   3:   w.batch = my_batch;
   4:   w.sync = options.sync;
   5:   w.done = false;
   6:  
   7:   MutexLock l(&mutex_);
   8:   writers_.push_back(&w);
   9:   while (!w.done && &w != writers_.front()) {
  10:     w.cv.Wait();
  11:   }
  12:   if (w.done) {
  13:     return w.status;
  14:   }
  15:  
  16:   // May temporarily unlock and wait.
  17:   Status status = MakeRoomForWrite(my_batch == NULL);
  18:   uint64_t last_sequence = versions_->LastSequence();
  19:   Writer* last_writer = &w;
  20:   if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
  21:     WriteBatch* updates = BuildBatchGroup(&last_writer);
  22:     WriteBatchInternal::SetSequence(updates, last_sequence + 1);
  23:     last_sequence += WriteBatchInternal::Count(updates);
  24:  
  25:     // Add to log and apply to memtable.  We can release the lock
  26:     // during this phase since &w is currently responsible for logging
  27:     // and protects against concurrent loggers and concurrent writes
  28:     // into mem_.
  29:     {
  30:       mutex_.Unlock();
  31:       status = log_->AddRecord(WriteBatchInternal::Contents(updates));
  32:       if (status.ok() && options.sync) {
  33:         status = logfile_->Sync();
  34:       }
  35:       if (status.ok()) {
  36:         status = WriteBatchInternal::InsertInto(updates, mem_);
  37:       }
  38:       mutex_.Lock();
  39:     }
  40:     if (updates == tmp_batch_) tmp_batch_->Clear();
  41:  
  42:     versions_->SetLastSequence(last_sequence);
  43:   }
  44:  
  45:   while (true) {
  46:     Writer* ready = writers_.front();
  47:     writers_.pop_front();
  48:     if (ready != &w) {
  49:       ready->status = status;
  50:       ready->done = true;
  51:       ready->cv.Signal();
  52:     }
  53:     if (ready == last_writer) break;
  54:   }
  55:  
  56:   // Notify new head of write queue
  57:   if (!writers_.empty()) {
  58:     writers_.front()->cv.Signal();
  59:   }
  60:  
  61:   return status;
  62: }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seem to create a new Writer, not sure what that is yet, and register it in a class variable. Maybe it is actually being written on a separate thread?

I am going to switch over and look at that line of thinking. The first thing to do is to look at the Writer implementation, which looks like this:

   1: struct DBImpl::Writer {
   2:   Status status;
   3:   WriteBatch* batch;
   4:   bool sync;
   5:   bool done;
   6:   port::CondVar cv;
   7:  
   8:   explicit Writer(port::Mutex* mu) : cv(mu) { }
   9: };

So this is just a data structure with no behavior. Note that we have a CondVar, whatever that is, which accepts a mutex. Following the code, we see this is a pthread condition variable. I haven’t dug too deeply into this, but it appears to be similar to the .NET lock primitive, except that there seems to be the ability to associate multiple condition variables with a single mutex. That could be a useful way to signal on specific conditions; the basic idea is that you can wait for a specific event, not just on a single variable.
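Conceptually, port::CondVar is a thin wrapper over the POSIX primitives, along these lines (a sketch of the idea rather than the actual leveldb port layer, which ties the condition variable to its own Mutex class):

    class CondVar {
     public:
      explicit CondVar(pthread_mutex_t* mu) : mu_(mu) {
        pthread_cond_init(&cv_, NULL);
      }
      ~CondVar() { pthread_cond_destroy(&cv_); }
      // Atomically releases the mutex and sleeps until signaled,
      // then reacquires the mutex before returning.
      void Wait() { pthread_cond_wait(&cv_, mu_); }
      void Signal() { pthread_cond_signal(&cv_); }
      void SignalAll() { pthread_cond_broadcast(&cv_); }
     private:
      pthread_cond_t cv_;
      pthread_mutex_t* mu_;
    };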

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque (a data structure merging the properties of a list & a queue). Thread safety is achieved via the call to MutexLock on line 7. I am going to continue ignoring the rest of the function and look at where else this value is being used now. Back now, and it appears that the only place where writers_ is used is in this method or methods that it calls.

What this means, in turn, is that unlike what I thought, there isn’t a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access, as I understand it. Calls to the Write() method would block on the mutex access, then wait until their write is the current one (that is what the &w != writers_.front() check means). Although the code also seems to suggest that another thread may pick up on this behavior and batch multiple writes to disk at the same time. We will discuss this later on.

Right now, let us move to line 17, and MakeRoomForWrite. This appears to try to make sure that we have enough room for the next write. I don’t really follow the code there yet, so I’ll ignore it for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why that is; I think it is possible this is for the log. The next interesting bit is in BuildBatchGroup. This method will merge existing pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of IO into a single disk access, without introducing latency in the common case.
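Condensed, BuildBatchGroup() does something like the sketch below (heavily simplified; the real code also shrinks the size cap when the first write is tiny, so small writes keep their low latency):

    WriteBatch* DBImpl::BuildBatchGroup(Writer** last_writer) {
      Writer* first = writers_.front();
      WriteBatch* result = first->batch;
      size_t size = WriteBatchInternal::ByteSize(first->batch);
      const size_t max_size = 1 << 20;             // cap the group at ~1MB
      std::deque<Writer*>::iterator iter = writers_.begin();
      ++iter;                                      // "first" is already included
      for (; iter != writers_.end(); ++iter) {
        Writer* w = *iter;
        if (w->sync && !first->sync) break;        // don't piggyback a sync write
        if (w->batch != NULL) {
          size += WriteBatchInternal::ByteSize(w->batch);
          if (size > max_size) break;
          if (result == first->batch) {            // lazily switch to the temp batch
            result = tmp_batch_;
            WriteBatchInternal::Append(result, first->batch);
          }
          WriteBatchInternal::Append(result, w->batch);
        }
        *last_writer = w;
      }
      return result;
    }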

The rest of the code deals with the actual write to the log / memtable (lines 20 – 45), then updates the status of the other writers we might have included, as well as kicking off the writes for waiting writers that may not have gotten into the current batch.

And I think that this is enough for now. We haven’t gotten to disk yet, I admit, but we did get a lot of stuff done. In my next post, I’ll dig even deeper, and try to see how the data is actually structured. I think that this will be interesting…

RavenDB vNext: It is so pink!

Just thought that you might appreciate a peek into what we have been working on:

image

You can consider the bright pink background a bug, by the way. But the installer is real, and it will guide you through an install of RavenDB using the “Yes, Dear” model.

This is mostly for clients that don’t like xcopy installs (honestly, this is to make sure that setting up in IIS is no longer a set of manual steps).


Reviewing LevelDB, Part I: What is this all about?

LevelDB is…

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

That is the project’s own definition. Basically, it is a way for users to store data in an efficient manner. It isn’t a SQL database. It isn’t even a real database in any sense of the word. What it is is a building block for building databases. It handles writing and reading to disk, and it supports atomicity. But anything else is on you (from transaction management to more complex items).

As such, it appears perfect for the kind of things that we need to do. I decided that I wanted to get to know the codebase, especially since at this time I can’t even get it to compile. The fact that this is a C++ codebase, written by people who eat & breathe C++ for a living, is another reason why. I expect that this is a good codebase, so I might as well sharpen my C++ foo at the same time that I grok what this is doing.

The first thing to do is to look at the interface that the database provides us with:

image
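Since the screenshot is missing from this copy, this is roughly the public surface of leveldb::DB, paraphrased from include/leveldb/db.h (a few methods and comments elided):

    class DB {
     public:
      static Status Open(const Options& options, const std::string& name, DB** dbptr);
      virtual ~DB();
      virtual Status Put(const WriteOptions& options, const Slice& key, const Slice& value) = 0;
      virtual Status Delete(const WriteOptions& options, const Slice& key) = 0;
      virtual Status Write(const WriteOptions& options, WriteBatch* updates) = 0;
      virtual Status Get(const ReadOptions& options, const Slice& key, std::string* value) = 0;
      virtual Iterator* NewIterator(const ReadOptions& options) = 0;
      virtual const Snapshot* GetSnapshot() = 0;
      virtual void ReleaseSnapshot(const Snapshot* snapshot) = 0;
      // ... plus GetProperty, GetApproximateSizes and CompactRange
    };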

That is a very small surface area, and as you can imagine, this is something that I highly approve of. It makes it much easier to understand and reason about. And there is some pretty complex behavior behind this, which I’ll be exploring soon.

RavenDB vNext? Just the same, a little bit better, all over the place

We have pretty much completed all of the major work that was going to get into the next major RavenDB release. There is the indexing stuff that I already talked about on this blog, but most of the rest is a lot of seemingly minor things. Nothing that would blow your mind in isolation.

Things like better aggressive caching invalidation, faster errors in network timeout scenarios, a better bucket selection algorithm for map/reduce, automatic conflict resolution, or even more improvements to the studio.

Hell, it is even things like this:

image

We output a lot of information that can help you figure out what is going on in your system. But now we are actually starting to take it up a notch and also provide good UX for handling that.


Aggressive caching: Pacified

RavenDB’s aggressive caching allows RavenDB Clients to skip going to the server and  serve requests directly off the client cache. That means that you can answer queries very  quickly, because you never even have to leave your process memory space.

The downside to that was, of course, that you might be showing information that has changed behind your back.

That is why we usually wrote aggressive caching code like this:

using (session.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
{
   var user =  session.Load<User>("users/1");
   Console.WriteLine(user.Name);
}

We gave it some duration in which it was okay to skip going to the server. So we had a maximum of 5 minutes during which we served the cached information.

That was nice, but it was awkward, and not a really nice thing to do in general. In particular, it meant that you had to wait 5 minutes for things to actually show up in the application. That can be... frustrating, because it looks like the system isn’t really doing anything. On the other hand, it also means that you still have to query the server when the duration is over, and by necessity, the durations are relatively short.

We decided to change that. Now, when you are using aggressive caching, RavenDB will automatically subscribe to changes from the server, and use those notifications to know when we need to re-check with the server.

That means that you can set a much longer aggressive cache duration, and we will know when to refresh, automatically and without you needing to do anything.

It is a small touch, but an important one. Things just get better.


RavenDB and storing large number of entities types in a single database

By default, we expect to have a rather small number of entity types in a database, unlike relational databases, where you typically see hundreds or thousands of tables, because everything gets dumped into a single database (and because a single document in RavenDB would typically reside in many relational tables).

That said, we got a few complaints about this story, because the studio UI becomes hard to use when this happens. We decided to support it in an interesting fashion. This is what your documents will look like when you have ~60 entity types in a single database:

image

And here is where you go if you have hundreds of them. Psychedelic days are here again!

image


Reviewing RavenBurgerCo: What could be improved?

There are two things that I would change in the RavenBurgerCo sample app.

The first would be session management, I dislike code like this:

image

I would much rather do that in a base controller and avoid manual session management. But that is mostly a design choice, and it isn’t really that important.

But what is important is the number of indexes that the application uses. We have:

  • LocationIndex
  • DeliveryIndex
  • DriveThruIndex

And I am not really sure that we need all three. In fact, I am pretty sure that we don’t. What we can do is merge them all into a single index. I am pretty sure that the reason there were three of them is that there was a bug in RavenDB that made it error if you gave it a null WKT (vs. just recognizing this as a valid opt-out). I fixed that bug, but even with that issue in place, we can get things working:

   1: public class SpatialIndex : AbstractIndexCreationTask<Restaurant>
   2: {
   3:     public SpatialIndex()
   4:     {
   5:         Map = restaurants =>
   6:               from restaurant in restaurants
   7:               select new
   8:                   {
   9:                       _ = SpatialGenerate(restaurant.Latitude, restaurant.Longitude),
  10:                       __ = restaurant.DriveThruArea == null ? 
  11:                                     new object[0] : 
  12:                                     SpatialGenerate("drivethru", restaurant.DriveThruArea),
  13:                       ___ = restaurant.DeliveryArea == null ? 
  14:                                     new object[0] : 
  15:                                     SpatialGenerate("delivery", restaurant.DeliveryArea)
  16:                   };
  17:     }
  18: }

And from there, it is just a matter of updating the queries, which now look like the following:

Getting the restaurants near my location (for Eat In page):

   1: return session.Query<Restaurant, SpatialIndex>()
   2:     .Customize(x =>
   3:                    {
   4:                        x.WithinRadiusOf(25, latitude, longitude);
   5:                        x.SortByDistance();
   6:                    })
   7:     .Take(250)
   8:     .Select( ... );

Getting the restaurants that deliver to my location (Delivery page):

   1: return session.Query<Restaurant, SpatialIndex>()
   2:     .Customize(x => x.RelatesToShape("delivery", point, SpatialRelation.Intersects))
   3:     // SpatialRelation.Contains is not supported
   4:     // SpatialRelation.Intersects is OK because we are using a point as the query parameter
   5:     .Take(250)
   6:     .Select( ... ) ;

Getting the restaurants inside a particular rectangle (Map page):

   1: return session.Query<Restaurant, SpatialIndex>()
   2:     .Customize(x => x.RelatesToShape(Constants.DefaultSpatialFieldName, rectangle, SpatialRelation.Within))
   3:     .Take(512)
   4:     .Select( ... );

Note that we use DefaultSpatialFieldName, instead of indexing the location twice.

And finally, getting the restaurants that are applicable for drive through for my route (Drive Thru page):

   1: return session.Query<Restaurant, SpatialIndex>()
   2:     .Customize(x => x.RelatesToShape("drivethru", lineString, SpatialRelation.Intersects))
   3:     .Take(512)
   4:     .Select( ... );

And that is that.

Really great project, and quite amazing, both client & server code. It is simple, it is elegant and it is effective. Well done Simon!


Reviewing RavenBurgerCo

This is a review of RavenBurgerCo, created as a sample app for RavenDB’s spatial support by Simon Bartlett. This is by no means an unbiased review, if only because I laughed out loud and crazily when I saw the first page:

image

What is this about?

Raven Burger Co is a chain of fast food restaurants, based in the United Kingdom. Their speciality is burgers made with raven meat. All their restaurants offer eat-in/take-out service, while some offer home delivery, and others offer a drive thru service.

This sample application is their online restaurant locator.

Good things about this project? Here is how you get started:

  1. Clone this repository
  2. Open the solution in Visual Studio 2012
  3. Press F5
  4. Play!

And it actually works! It uses embeddable RavenDB to make it super easy and stupidly simple to run, right out of the box.

We will start this review by looking at the infrastructure for this project, starting, as usual, from Global.asax:

image

Let us see how RavenDB is setup:

   1: public static void ConfigureRaven(MvcApplication application)
   2: {
   3:     var store = new EmbeddableDocumentStore
   4:                         {
   5:                             DataDirectory = "~/App_Data/Database",
   6:                             UseEmbeddedHttpServer = true
   7:                         };
   8:  
   9:     store.Initialize();
  10:     MvcApplication.DocumentStore = store;
  11:  
  12:     IndexCreation.CreateIndexes(typeof(MvcApplication).Assembly, store);
  13:  
  14:     var statistics = store.DatabaseCommands.GetStatistics();
  15:  
  16:     if (statistics.CountOfDocuments < 5)
  17:         using (var bulkInsert = store.BulkInsert())
  18:             LoadRestaurants(application.Server.MapPath("~/App_Data/Restaurants.csv"), bulkInsert);
  19: }

So it uses embedded RavenDB, and if there isn’t enough data in the db, it loads the default data set using RavenDB’s new Bulk Insert feature.

Note that we set the MvcApplication.DocumentStore property; let us see how this is used.

Simon did a really nice thing here. Note that UseEmbeddedHttpServer is set to true, which means that RavenDB will find an open port and use it, this is then exposed in the UI:

image

So you can click on the link and land right in the studio for your embedded database, which gives you the ability to view, debug & modify how things are actually going. This is a really nice way to expose it.

Now, let us move to the actual project code itself. Exploring the options in this project, we have map browsing:

image

And here I have to admit ignorance. I have no idea how to use maps, so this is quite nice for me, something new to learn. The core of this page is this script:

   1: $(function () {
   2:  
   3:     var gmapLayer = new L.Google('ROADMAP');
   4:     var resultsLayer = L.layerGroup();
   5:  
   6:     var map = L.map('map', {
   7:         layers: [gmapLayer, resultsLayer],
   8:         center: [51.4775, -0.461389],
   9:         zoom: 12,
  10:         maxBounds: L.latLngBounds([49, 15], [60, -25])
  11:     });
  12:  
  13:     var loadMarkers = function() {
  14:         if (map.getZoom() > 9) {
  15:             var bounds = map.getBounds();
  16:             $.get('/api/restaurants', {
  17:                 north: bounds.getNorthWest().lat,
  18:                 east: bounds.getSouthEast().lng,
  19:                 south: bounds.getSouthEast().lat,
  20:                 west: bounds.getNorthWest().lng,
  21:             }).done(function(restaurants) {
  22:                 resultsLayer.clearLayers();
  23:                 $.each(restaurants, function(index, value) {
  24:                     var marker = L.marker([value.Latitude, value.Longitude])
  25:                         .bindPopup(
  26:                             '<p><strong>' + value.Name + '</strong><br />' +
  27:                                 value.Street + '<br />' +
  28:                                 value.City + '<br />' +
  29:                                 value.PostCode + '<br />' +
  30:                                 value.Phone + '</p>'
  31:                         );
  32:                     resultsLayer.addLayer(marker);
  33:                 });
  34:             });
  35:         } else {
  36:             resultsLayer.clearLayers();
  37:         }
  38:     };
  39:  
  40:     loadMarkers();
  41:     map.on('moveend', loadMarkers);
  42: });

You can see the loadMarkers method, which is getting called whenever the map is moved, and on startup. This ends up calling the following method on the server, with the boundaries of the visible map:

   1: public IEnumerable<object> Get(double north, double east, double west, double south)
   2: {
   3:     var rectangle = string.Format(CultureInfo.InvariantCulture, "{0:F6} {1:F6} {2:F6} {3:F6}", west, south, east, north);
   4:  
   5:     using (var session = MvcApplication.DocumentStore.OpenSession())
   6:     {
   7:         return session.Query<Restaurant, LocationIndex>()
   8:             .Customize(x => x.RelatesToShape("location", rectangle, SpatialRelation.Within))
   9:             .Take(512)
  10:             .Select(x => new
  11:                             {
  12:                                 x.Name,
  13:                                 x.Street,
  14:                                 x.City,
  15:                                 x.PostCode,
  16:                                 x.Phone,
  17:                                 x.Delivery,
  18:                                 x.DriveThru,
  19:                                 x.Latitude,
  20:                                 x.Longitude
  21:                             })
  22:             .ToList();
  23:     }
  24: }

Note that in this case, we are doing a search for items inside the rectangle. But the search options are a bit funky; you have to send the data in WKT format. Luckily, Simon has already created a better solution (in this case, he is using the long hand method to make sure that we all understand what he is doing). The better method would be to use his Geo library, in which case the code would look like:

   1: .Geo("location", x => x.RelatesToShape(new Rectangle(west, south, east, north), SpatialRelation.Within))

So that was the map, now let us look at another example, the Eat In example. In that case, we are looking for restaurants near our location to figure out where to eat. It looks like this:

image

Right in the bull’s eye!

Here is the server side code:

   1: public IEnumerable<object> Get(double latitude, double longitude)
   2: {
   3:     using (var session = MvcApplication.DocumentStore.OpenSession())
   4:     {
   5:         return session.Query<Restaurant, LocationIndex>()
   6:             .Customize(x =>
   7:                            {
   8:                                x.WithinRadiusOf(25, latitude, longitude);
   9:                                x.SortByDistance();
  10:                            })
  11:             .Take(250)
  12:             .Select(x => new
  13:                             {
  14:                                 x.Name,
  15:                                 x.Street,
  16:                                 x.City,
  17:                                 x.PostCode,
  18:                                 x.Phone,
  19:                                 x.Delivery,
  20:                                 x.DriveThru,
  21:                                 x.Latitude,
  22:                                 x.Longitude
  23:                             })
  24:             .ToList();
  25:     }
  26: }

And on the client side, we just do the following:

   1: $('#location').change(function () {
   2:     var latlng = $('#location').locationSelector('val');
   3:  
   4:     var outerCircle = L.circle(latlng, 25000, { color: '#ff0000', fillOpacity: 0 });
   5:     map.fitBounds(outerCircle.getBounds());
   6:  
   7:     resultsLayer.clearLayers();
   8:     resultsLayer.addLayer(outerCircle);
   9:     resultsLayer.addLayer(L.circle(latlng, 15000, { color: '#ff0000', fillOpacity: 0.1 }));
  10:     resultsLayer.addLayer(L.circle(latlng, 10000, { color: '#ff0000', fillOpacity: 0.3 }));
  11:     resultsLayer.addLayer(L.circle(latlng, 5000, { color: '#ff0000', fillOpacity: 0.5 }));
  12:     resultsLayer.addLayer(L.circleMarker(latlng, { color: '#ff0000', fillOpacity: 1, opacity: 1 }));
  13:  
  14:  
  15:     $.get('/api/restaurants', {
  16:         latitude: latlng[0],
  17:         longitude: latlng[1]
  18:     }).done(function (restaurants) {
  19:         $.each(restaurants, function (index, value) {
  20:             var marker = L.marker([value.Latitude, value.Longitude])
  21:                 .bindPopup(
  22:                     '<p><strong>' + value.Name + '</strong><br />' +
  23:                     value.Street + '<br />' +
  24:                     value.City + '<br />' +
  25:                     value.PostCode + '<br />' +
  26:                     value.Phone + '</p>'
  27:                 );
  28:             resultsLayer.addLayer(marker);
  29:         });
  30:     });
  31: });

We define several circles of different opacities, and then show the returned markers.

It is all pretty simple code, but the result is quite stunning. I am getting really excited by this thing. It is simple, beautiful and quite powerful. Wow!

The delivery tab does pretty much the same thing as the eat-in mode, but it does so in a different way. First, you might have noticed the LocationIndex in the previous two examples, which looks like this:

   1: public class LocationIndex : AbstractIndexCreationTask<Restaurant>
   2: {
   3:     public LocationIndex()
   4:     {
   5:         Map = restaurants => from restaurant in restaurants
   6:                              select new
   7:                                         {
   8:                                             restaurant.Name,
   9:                                             _ = SpatialGenerate(restaurant.Latitude, restaurant.Longitude),
  10:                                             __ = SpatialGenerate("location", restaurant.LocationWkt)
  11:                                         };
  12:     }
  13: }

Before we look at this, we need to look at a sample document:

image

I am not quite sure why we have both SpatialGenerate() and SpatialGenerate("location") in LocationIndex. I think that this is just part of the demo, because the data is the same and both lines should produce the same results.

However, for deliveries, the situation is quite different. We don’t just deliver within a certain distance; as you can see, we have a polygon that determines where we actually deliver to. On the map, this looks like this:

image

The red circle is where I am located, the blue markers are the restaurants that deliver to my location, and the blue polygon is the delivery area for the selected burger joint. Let us see how this works, okay? We will start from the index:

   1: public class DeliveryIndex : AbstractIndexCreationTask<Restaurant>
   2: {
   3:     public DeliveryIndex()
   4:     {
   5:         Map = restaurants => from restaurant in restaurants
   6:                              where restaurant.DeliveryArea != null
   7:                              select new
   8:                                         {
   9:                                             restaurant.Name,
  10:                                             _ = SpatialGenerate("delivery", restaurant.DeliveryArea, SpatialSearchStrategy.GeohashPrefixTree, 7)
  11:                                         };
  12:     }
  13: }

So we are indexing just the restaurants that have a delivery polygon, and then we query it like this:

   1: public IEnumerable<object> Get(double latitude, double longitude, bool delivery)
   2: {
   3:     if (!delivery)
   4:         return Get(latitude, longitude);
   5:  
   6:     var point = string.Format(CultureInfo.InvariantCulture, "POINT ({0} {1})", longitude, latitude);
   7:  
   8:     using (var session = MvcApplication.DocumentStore.OpenSession())
   9:     {
  10:         return session.Query<Restaurant, DeliveryIndex>()
  11:             .Customize(x => x.RelatesToShape("delivery", point, SpatialRelation.Intersects))
  12:             // SpatialRelation.Contains is not supported
  13:             // SpatialRelation.Intersects is OK because we are using a point as the query parameter
  14:             .Take(250)
  15:             .Select(x => new
  16:                             {
  17:                                 x.Name,
  18:                                 x.Street,
  19:                                 x.City,
  20:                                 x.PostCode,
  21:                                 x.Phone,
  22:                                 x.Delivery,
  23:                                 x.DriveThru,
  24:                                 x.Latitude,
  25:                                 x.Longitude,
  26:                                 x.DeliveryArea
  27:                             })
  28:             .ToList();
  29:     }
  30: }

This basically says: give me all the restaurants that deliver to an area that includes my location. And then the rest all happens on the client side.

Quite cool.

The final example is the drive thru mode, which looks like this:

image

Given that I am driving from the green dot to the red dot, what restaurants can I stop at?

Here is the index:

   1: public class DriveThruIndex : AbstractIndexCreationTask<Restaurant>
   2: {
   3:     public DriveThruIndex()
   4:     {
   5:         Map = restaurants => from restaurant in restaurants
   6:                              where restaurant.DriveThruArea != null
   7:                              select new
   8:                                         {
   9:                                             restaurant.Name,
  10:                                             _ = SpatialGenerate("drivethru", restaurant.DriveThruArea)
  11:                                         };
  12:     }
  13: }

And now the code for this:

   1: public IEnumerable<object> Get(string polyline)
   2: {
   3:     var lineString = PolylineHelper.ConvertGooglePolylineToWkt(polyline);
   4:  
   5:     using (var session = MvcApplication.DocumentStore.OpenSession())
   6:     {
   7:         return session.Query<Restaurant, DriveThruIndex>()
   8:             .Customize(x => x.RelatesToShape("drivethru", lineString, SpatialRelation.Intersects))
   9:             .Take(512)
  10:             .Select(x => new
  11:                             {
  12:                                 x.Name,
  13:                                 x.Street,
  14:                                 x.City,
  15:                                 x.PostCode,
  16:                                 x.Phone,
  17:                                 x.Delivery,
  18:                                 x.DriveThru,
  19:                                 x.Latitude,
  20:                                 x.Longitude
  21:                             })
  22:             .ToList();
  23:     }
  24: }

We get the driving direction from the map, convert it to a line string, and then just check if our path intersects with the drive thru area for the restaurants.

Pretty cool application, and some really nice UI.

Okay, enough with the accolades, next time, I’ll talk about the things that can be better.


RavenDB’s Querying Streaming: Unbounded results

By default, RavenDB makes it pretty hard to shoot yourself in the foot with unbounded result sets. Pretty much every single feature has rate limits on it, and that is a good thing.

However, there are times when you actually do want to get all, where all actually means everything, damn it, really all of them. That has been somewhat tough to do, because it requires you to do paging, and if you are trying to do that on a running system, it is possible that incoming data will impact the way you are exporting, causing you to get duplicates or miss items.

We got several suggestions about how to handle that, but most of those were pretty complex. Instead, we decided to go with the following approach:

  • We will utilize our existing infrastructure to handle exports.
  • We don’t want to do that in multiple requests, because that means that state has to be kept on both client & server.
  • The model has to be a streaming based model, because otherwise we might get memory errors if you are trying to load millions of records out.
  • The stream you get out is frozen; what you read (both indexes and data) is a snapshot of the data as it was when you started reading it.

And now, let me show you the API for that:

   1: using (var session = store.OpenSession())
   2: {
   3:     var query = session.Query<User>("Users/ByActive")
   4:                        .Where(x => x.Active);
   5:     var enumerator = session.Advanced.Stream(query);
   6:     int count = 0;
   7:     while (enumerator.MoveNext())
   8:     {
   9:         Assert.IsType<User>(enumerator.Current.Document);
  10:         count++;
  11:     }
  12:  
  13:     Assert.Equal(1500, count);
  14: }

As you can see, we use standard Linq to limit our search, and the new method we have is Stream(), which allows us to get an IEnumerator, which will scan through the data.

You can see that we are able to get more than the default 1,024 limit of items from RavenDB. There are overloads there for getting additional information about the query as well (total results, timestamps, etags, etc).

Note that the values returned from Stream() are not tracked by the session. And if we had a user, users/1231, that was deleted 2 ms after the export began, you would still get it, since the data is frozen at the time the export started.

If you want, you can still specify paging, by the way, and all the other querying options are available to you as well (result transformers, for example).

RavenDB Feature Request Analysis: Filtered Replication ain’t what you looking for

Every so often we get a request for filtered replication. “I want to replicate to this node, but only those documents.” We explain that replication is a whole-database kind of thing; you can’t just pick & choose what you want. That isn’t actually true, we do have facilities to do filtering, and it would be fairly easy to expose them.

We don’t intend to do so. The reason is that the customer asking the question is usually starting the conversation from midway. He read about replication, thought that it would be a good fit for a particular scenario, if only it had that feature. Except that this is completely the wrong feature for the scenario at hand, and usually it takes a little back & forth to figure out what the scenario actually is.

For the most part, the scenarios for this feature are all about synchronizing data between two nodes*. In particular, the use case is often: “I have a mobile client and I want to replicate some of the data to that laptop”, or some such.

And this is where things get complex. To start with, you say, let us just filter the data where CustomerId = “customers/5”. Except that you need to apply this logic for each entity type in the database, and they usually have different rules. For example, you may have common reference data that you would want to replicate, even though it doesn’t belong to customers/5. And invoices may have a CustomerId property, but customers do not, so you need to define that for customers, it is the Id that you want to filter by, etc.

To make things even more interesting, you need to consider the case where the sync filter has changed (this user now has access to “customers/5” and “customers/6”). At which point, you pretty much have to go through the entire data set again.

Then we move to the question of updates: how are those handled? What about conflicts? How do you handle disconnected clients that may move between addresses and IPs all the time? Who maintains this operation? The client? The server? How about disconnected updates?

In short, it is a very different discussion that you need to have, and just exposing the replication filters won’t give you that.

* Nitpicker corner: yes, I know about MS Sync.

Rob’s Sprint: The cost of getting data from LevelDB

We are currently investigating the usage of LevelDB as a storage engine in RavenDB. Some of the things that we feel very strongly about are transactions (LevelDB doesn’t have them) and performance (for a different definition than the one usually bandied about).

LevelDB does have atomicity, and the rest of ACID (the C, I, and D) can be built on top of that without too much complexity (already done, in fact). But we ran into an issue when looking at read performance. I am not sure if our scenario is unique, but we typically deal with relatively large values; documents of several MB are quite common. That means that we are pretty sensitive to memory allocations. It doesn’t help that we have very little control over the Large Object Heap, so it was with great interest that we looked at how LevelDB did things.

Reading the actual code makes a lot of sense (more on that later, I will probably do a big review of it). But there was one story that really didn’t make any sense to us: reading a value by key.

We started out using LevelDB Sharp:

Database.Get("users/1");

This in turn results in the following getting called:

image
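
Since the screenshot doesn’t reproduce here, this is roughly the shape of the wrapper in question, reconstructed from the points below rather than copied from the LevelDB Sharp source (the Native interop class, CheckError, and Handle names are assumptions):

// Approximate reconstruction -- not the actual LevelDB Sharp source.
public string Get(string key, ReadOptions options)
{
    IntPtr error;
    IntPtr lengthPtr;
    // The native leveldb_get allocates a buffer for the value on the C side.
    IntPtr valuePtr = Native.leveldb_get(Handle, options.Handle,
                                         key, (IntPtr)key.Length,
                                         out lengthPtr, out error);
    CheckError(error);
    if (valuePtr == IntPtr.Zero)
        return null;

    // Marshals the whole native buffer into a managed string: a full second
    // copy of the value, straight onto the Large Object Heap for multi-MB documents.
    // Note: valuePtr itself is never handed back to leveldb_free, so it leaks.
    return Marshal.PtrToStringAnsi(valuePtr, (int)lengthPtr);
}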

A few things to note here. All from the point of view of someone who deals with very large values.

  • valuePtr is never released, even though it was allocated on our behalf and we are responsible for freeing it.
  • We copy the value from valuePtr into a string, resulting in two copies of the data and twice the memory usage.
  • There is no way to get just partial data.
  • There is no way to get binary data (for example, encrypted data).
  • This is going to be putting a lot of pressure on the Large Object Heap.

But wait, it actually gets better. Let us look at the LevelDB method that gets called:

image

So we are actually copying the data multiple times now. For fun, the db->rep->Get() call also copies the data. And that is pretty much where we stopped looking.

We are actually going to need to write a new C API and export it in order to make use of LevelDB from our C# code. Fun, or not.
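
To make the direction concrete, one possible shape for such an export, purely as a sketch (none of these entry points exist in stock LevelDB or in our code), is a function that writes straight into a caller-supplied buffer, so the managed side can reuse pinned arrays and skip the intermediate copies:

using System;
using System.Runtime.InteropServices;

// Hypothetical custom export -- this entry point does not exist in stock LevelDB;
// it just illustrates the kind of copy-free API we would need from C#.
internal static class CustomLevelDbApi
{
    [DllImport("leveldb", CallingConvention = CallingConvention.Cdecl)]
    public static extern long leveldb_get_into_buffer(
        IntPtr db, IntPtr readOptions,
        byte[] key, IntPtr keyLength,
        byte[] valueBuffer, IntPtr bufferLength,
        out IntPtr error); // returns the value size, or -1 when the key is missing
}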

Rob’s Sprint: Result Transformers

By far the most confusing feature in RavenDB has been the index’s TransformResults. We introduced this feature to give the user the ability to do server-side projections, including getting data from other documents.

Unfortunately, when we introduced this feature, we naturally added it to the index, and that caused a whole lot of confusion. In particular, people seemed to have a very hard time distinguishing between what gets indexed (and is therefore searchable) and the output of the index. To make matters worse, we also had major issues with how to determine the input of the TransformResults function. In short, the entire thing works, but from the point of view of an external user, it is really finicky and hard to get right.

Instead, during Rob’s sprint, we have introduced a totally new concept: stand-alone Result Transformers.

Here is what they look like:

public class OrdersStatsTransformer : AbstractTransformerCreationTask<Order>
{
    public OrdersStatsTransformer()
    {
        TransformResults = orders =>
                           from order in orders
                           select new
                           {
                               order.OrderedAt,
                               order.Status,
                               order.CustomerId,
                               CustomerName = LoadDocument<Customer>(order.CustomerId).Name,
                               LinesCount = order.Lines.Count
                           };
    }
}

And yes, they are quite intentionally modeled to be very similar to the way you would have defined them up until now, just outside of the index.

Now, why is that important? Because now you can apply a result transformer on the server side without being tied to a particular index.

For example, let us see how we can make use of this new feature:

var customerOrders = session.Query<Order>()
    .Where(x => x.CustomerId == "customers/123")
    .TransformWith<OrdersStatsTransformer, OrderViewModel>()
    .ToList();
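
The OrderViewModel class isn’t shown in the post; presumably it just mirrors the transformer’s projection, along these lines (the property types are my guess):

// Assumed shape, matching the TransformResults projection above.
public class OrderViewModel
{
    public DateTime OrderedAt { get; set; }
    public string Status { get; set; }
    public string CustomerId { get; set; }
    public string CustomerName { get; set; }
    public int LinesCount { get; set; }
}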

This separation between the result transformer and the index means that we can apply it to things like automatic indexes as well.

In fact, we can apply it during load:

var ovm = session.Load<OrdersStatsTransformer, OrderViewModel>("orders/1");

There are a whole bunch of other goodies in there, as well. We made sure that now you don’t have to worry about the inputs to the transform. We will automatically use the right values when you access them, based on whether you stored the field in the index or it is accessible on the document.

All in all, this is a very major step forward, and it makes it drastically easier to use Result Transformers in various ways.

Rob’s Sprint: Query optimizer jumped a grade

RavenDB’s query optimizer is pretty smart: it knows how to find the appropriate index for your queries, and will even create a new index to match your query if one didn’t exist. But that was the limit of its abilities. A human could still go into the database and say, look at those:

image

Those all operate on Posts, and you should be able to merge them all into a single index. Reducing the number of indexes is a good thing, as it reduces the amount of IO on the system, which is typically our limiting factor.

Now, there was no real reason why we couldn’t make the query optimizer smart enough so that when it creates a new index, it uses all of the properties that have previously been indexed.

However, doing so would have made no difference to us until now, because we didn’t have a way to stop an index. With the new index idling feature, we can have the query optimizer create a new merged index, and then the database will just mark the extra indexes as idle after a while.

Almost, there is still another issue that we have to resolve. What happens when we have a big database, and we introduce a new (and wider) index? By default, all matching queries would actually hit that new index, and not the previously existing one. That is great, except… the new index is stale, and might remain stale for a few minutes. During that time, we have a perfectly serviceable index that is just sitting there.

The query optimizer can now take into account the staleness level of an index as well when selecting it, meaning that there should be no interruption from the point of view of other queries. The new index will be introduced, go through all the documents, and then take over as the serving index for all queries. The existing index will wither away and die.

Rob’s Sprint: Faster index creation

RavenDB previously had a really nice feature for temporary indexes. Since we expected most of them to be temporary, we indexed them directly into memory, greatly saving on IO costs. With the removal of temporary indexes, we had the option of just removing the entire code path and moving on to other things.

But we sat down and thought about this for a while. Typically, the busiest part of an index’s life is its creation, because the database needs to go through all the documents in the db and index them. We have changed things so that during this creation period, we will actually index to memory, without hitting the disk. Only when we reach a configurable size, or finish indexing everything, do we spill it all to disk.

This, in turn, gives us the best of both worlds. We get a really nice optimization for new indexes, and we don’t have to hit the disk for indexes that would soon go away. And, of course, we get the perf boost for all indexes now.

Rob’s Sprint: Indexes and the death of temporary indexes

RavenDB’s ability to analyze your queries and generate the required indexes on the fly has always been a great boon. Rob Ashton was involved in the original implementation and during his visits to Hibernating Rhinos’ secret lair, he got to whack that thing on the head a few more times.

We need to separate two important things:

  • Automatically generating the indexes based on your queries.
  • The temporary indexes model itself.

The first part is a really important feature. The second is just an implementation detail. In particular, temporary indexes had a few problems.

Most importantly, they were temporary, and there was an explicit step for promoting those indexes from one stage to the other. That caused some confusion, and there was a period of time, exactly when we decided that the index was important enough to keep, during which the index would effectively reset itself. The other problem was that the moment an index was upgraded to an auto index, it was there forever.

What Rob has done was to remove the concept of temporary indexes altogether, which got rid of a whole bunch of code. Instead, we have just standard auto indexes. And now we have a drastically simplified story: no more drastic jump from temp to auto, with its irrecoverable implications.

Of course, this leads to a lot of interesting questions. Temporary indexes had the benefit of being indexed directly to memory, and they would go away after a database restart, among a whole lot of other things. Not having special code for all that actually made things a lot simpler for us.

Automatic indexes have an age, which is tracked internally by RavenDB. If an automatic index isn’t being used, it will become idle and eventually abandoned. If it is a very young index, we will decide it was a temporary index after all, and remove it from the system completely.

This feature, along with idling indexes, opened up the door for the next important feature, index merging. But before that, we need to upgrade the smarts for the query optimizer… which happens to be our next topic.

LevelDB & Windows: It ain’t a love story

I have been investigating the LevelDB project for the purpose of adding another storage engine to RavenDB. The good news is that there is a very strong likelihood that we can actually use that as a basis for what we want.

The bad news is that while it is insanely easy to get LevelDB to compile and work on Linux, doing the same on Windows appears to be an insurmountable barrier.

Yes, I know that I can get it working by just using a precompiled binary, but that won’t do. I actually want to make some changes there (mostly in the C API, right now).

These instructions appear to be no longer current. And this thread was promising, but didn’t lead anywhere.

I am going to go over the codebase with a fine-tooth comb, but I am no longer a C++ programmer, and the intricacies of the build system are putting up a very high roadblock of frustration.
