Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 330 words

Well, this is still only a high-level list, but there is a lot of stuff there. In many cases, I posted full blog entries about each new feature, but I’ll post a few words about those that I didn’t.





Operations - As usual, we have a few goodies for the ops people. Some of them aren’t really interesting to devs, but they are part of creating production quality software. We now allow Backup Operators (and not just admins) to initiate backups, and if you are restoring a db from another machine with different settings, RavenDB will automatically set things up so you don’t have to do anything manually to get things working. We also added a bunch more endpoints for debug & analysis and added some more information to our existing endpoints.

time to read 2 min | 249 words

As the author of a schema-less database, I find myself in the strange position of the barefoot shoemaker. I need to explain a bit. Our current storage engines, Esent and Munin (which was designed mostly to be like Esent), have rigid schemas. There are tables, indexes, etc. This means that features that touch the storage layer tend to be much more complex. They require migrations: adding a feature that needs storage means that we have to create new storage tables, or modify indexes, or do any number of the things that we developed RavenDB so our users wouldn’t have to.

I have been working with our managed implementation of LevelDB quite a lot lately. In order to do more than merely write tests for it, I tried to build a complete feature on top of it, an aggregation engine. The code is not production worthy (yet!), but what struck me quite hard was that, even though the storage layer is full of bugs (well, that is why I was writing stuff on top of it, to expose them), I had a blast actually working with it.

I could make modifications & changes with hardly any friction, and it was a real pleasure to start working with things in this fashion.

time to read 2 min | 231 words

Something that no one really seems to be asking is why we started doing all this work with LevelDB all of a sudden. We already have an industry grade solution for storage with Esent. Sure, it doesn’t work on Linux, and that is somewhat important for us. But you would be wrong if you thought that this was why we are pursuing this.

To be perfectly honest, we want a storage solution that is under our control. Something that we can work with, tweak and understand all on our own. Just about every other part of RavenDB is something that we have full transparency into. Sure, I’ll be the first to admit that I haven’t been through every single line of code in the Lucene codebase, but when we ran into problems there, I was able to go into the code and fix stuff.

Esent has been incredibly useful for us, and it has been very stable. But to be able to squeeze out every last erg of performance and speed, we really need to have a lot more control over the storage layer than we currently have. And no, that doesn’t mean that we are going to abandon Esent. It merely means that we want to have options. And even in our early spikes, the sort of things our managed implementation of LevelDB provides make implementing things a real breeze.

time to read 2 min | 317 words

One of the advantages that keeps showing up with leveldb is the notion that it compresses the data on disk by default. Since reading data from disk is way more expensive than the CPU cost of compression & decompression, that is a net benefit.

Or is it? In the managed implementation we are currently working on, we chose to avoid this for now, for a very simple reason. Storing compressed data on disk means that you cannot just give a user the memory mapped buffer and be done with it. You actually have to decompress the data yourself, then hand the user a buffer pointing to the decompressed memory. In other words, instead of having a single read only buffer that the OS will manage / page / optimize for you, you are going to have to allocate memory over & over again, and you’ll pay the cost of decompressing again and again.

I think that it would be better to have the client make that decision. They can send us data that is already compressed, so we won’t need to do anything else, and we would still be able to just hand them a buffer of data. Sure, it sounds like we are just moving the cost around, doesn’t it? But what actually happens is that you have a better chance to do optimizations. For example, if I am storing the data compressed via gzip and I’m exposing the data over the wire, I can just stream the results from the storage directly to the HTTP stream, without having to do anything about it. It can be decompressed on the client.

On the other hand, if I have storage-level compression, I am paying the cost of reading the compressed data from disk, then allocating a new buffer, decompressing the data, then going right ahead and compressing it again for sending over the wire.
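To make the pass-through idea concrete, here is a minimal sketch, assuming the client gzips a value before storing it and the storage layer treats the bytes as opaque. The class and method names are hypothetical and are not part of RavenDB or the managed LevelDB port; a `MemoryStream` stands in for the HTTP response stream.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical names, for illustration only.
public static class PassThroughSketch
{
    // Client side: compress once, before the data ever reaches storage.
    public static byte[] Gzip(byte[] raw)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray() still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Client side: decompress whatever came back over the wire.
    public static byte[] Gunzip(byte[] compressed)
    {
        using (var input = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            input.CopyTo(output);
            return output.ToArray();
        }
    }

    // Server side: the stored bytes are opaque, so serving them is a plain copy.
    // No buffer is allocated for decompression and nothing is recompressed.
    public static void Serve(byte[] stored, Stream response)
    {
        response.Write(stored, 0, stored.Length);
    }
}
```

In a real HTTP handler the response would also need a `Content-Encoding: gzip` header so that the client knows to decompress the body.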

time to read 4 min | 683 words

After taking a look at HyperLevelDB, it is time to see what Basho has changed in leveldb. They were kind enough to write a blog post detailing those changes. Unfortunately, unlike HyperLevelDB, they have been pretty general and focused on their own product (which makes total sense). They have called out the reduction of “stalls”, which may or may not be related to issues with the write delay that leveldb intentionally introduces under load.

Okay, no choice about it, I am going to go over the commit log and see if I can find interesting stuff. The first tidbit that caught my eye is improving the compaction process when you have on disk corruption. Instead of stopping, it would move the bad data to the “lost” directory and move on. Note that there is some data loss associated with this, of course, but that won’t necessarily be felt by the users.

As a note, I dislike this code formatting:


Like HyperLevelDB, Basho made a lot of changes to compaction, apparently for performance reasons:

  • No compactions triggered by reads, that is too slow.
  • There are multiple threads now handling compactions, with various levels of priorities between them. For example, flushing the immutable mem table is high priority, as is level 0 compaction, but standard compactions can wait.
  • Interestingly, when flushing data from memory to level 0, no compression is used.
  • After those were done, they also added additional logic to enforce locks that would give flushing from memory to disk and from level 0 downward much higher priority than everything else.

As an aside, another interesting thing I noticed: Basho also moved closing files and unmapping memory to a background thread. I am not quite sure why that is the case, I wouldn’t expect that to be very expensive.

Next on the list, improving caching. Mostly by taking into account actual file sizes and by introducing a reader/writer lock.

Like HyperLevelDB, they also went for larger files, although I think that in this case, they went for significantly larger files than even HyperLevelDB did. Throttling is another difference. Unlike HyperLevelDB, which did away with write throttling altogether in favor of concurrent writes, Basho’s leveldb went for a much more complex system of write throttling based on the current load, pending work, etc. The idea is to gain better load distribution overall. (Or maybe they didn’t think about the concurrent write strategy).

I wonder (but didn’t check) if some of the changes were pulled back into the leveldb project, because there is some code here that I am pretty sure duplicates work already done in leveldb. In this case, the retiring of data that has already been superseded.

There is a lot of stuff that appears to relate to maintenance. Scanning SST files for errors, perf counters, etc. It also looks like they decided to go to assembly for actually implementing CRC32. In fact, I am pretty sure that the asm is for calling the hardware CRC instruction inside the CPU. But I am unable to decipher that.
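For a sense of what calling the hardware CRC looks like without reading assembly, here is a minimal sketch in C#, not Basho’s code. It assumes .NET Core 3.0 or later, where `Sse42.Crc32` exposes the SSE4.2 `crc32` instruction, and it computes CRC-32C (the Castagnoli polynomial), which is the variant leveldb uses. The class name is hypothetical, and the table driven path is only there as a fallback and a cross check.

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using System.Text;

// Hypothetical name, for illustration only.
public static class Crc32CSketch
{
    private const uint Polynomial = 0x82F63B78; // reflected CRC-32C (Castagnoli)
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint c = i;
            for (int bit = 0; bit < 8; bit++)
                c = (c & 1) != 0 ? (c >> 1) ^ Polynomial : c >> 1;
            table[i] = c;
        }
        return table;
    }

    // Portable path: one table lookup per byte.
    public static uint Software(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
            crc = Table[(crc ^ b) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }

    // Uses the CPU instruction when available, otherwise falls back to the table.
    public static uint Compute(byte[] data)
    {
        if (Sse42.IsSupported == false)
            return Software(data);

        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
            crc = Sse42.Crc32(crc, b);
        return ~crc;
    }
}
```

The standard check value for CRC-32C over the ASCII string "123456789" is 0xE3069283, which is a handy way to confirm that both paths agree. A production version would feed the instruction 4 or 8 bytes at a time rather than one.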

What I find funny is that another change I just ran into is the introduction of a way to avoid copying data when Get()ing data from leveldb. If you’ll recall, I pointed that out as an issue a while ago in my first review of leveldb.

And here is another pretty drastic change. In leveldb, only level 0 can have overlapping files, but Basho changed things so the first 3 levels can have overlapping files. The idea is that you can do cheaper compactions this way, I am guessing.

I am aware that this review is a bit of a mess, but I just went over the code and wrote down the notes as I saw them. Overall, I think that I like HyperLevelDB’s changes better, but they have the advantage of using a much later codebase.

time to read 15 min | 2884 words

It has been a while since I actually posted some code here, and I thought that this implementation was quite nice, in that it is simple & works for what it needs to do.


    using System.Collections.Concurrent;
    using System.Diagnostics;
    using System.Linq;

    // Reference<T> is not shown in this post; this minimal definition is an assumption.
    // It boxes the long so it can sit behind a volatile field (C# does not allow 'volatile long').
    public class Reference<T>
    {
        public T Value;
    }

    public class LruCache<TKey, TValue>
    {
        private readonly int _capacity;
        private readonly Stopwatch _stopwatch = Stopwatch.StartNew();

        private class Node
        {
            public TValue Value;
            public volatile Reference<long> Ticks;
        }

        private readonly ConcurrentDictionary<TKey, Node> _nodes = new ConcurrentDictionary<TKey, Node>();

        public LruCache(int capacity)
        {
            Debug.Assert(capacity > 10);
            _capacity = capacity;
        }

        public void Set(TKey key, TValue value)
        {
            var node = new Node
            {
                Value = value,
                Ticks = new Reference<long> { Value = _stopwatch.ElapsedTicks }
            };

            _nodes.AddOrUpdate(key, node, (_, __) => node);
            if (_nodes.Count > _capacity)
            {
                // Over capacity: evict the least recently used 10% of the entries.
                // Order by the tick value itself; Reference<long> is not comparable.
                foreach (var source in _nodes.OrderBy(x => x.Value.Ticks.Value).Take(_nodes.Count / 10))
                {
                    Node _;
                    _nodes.TryRemove(source.Key, out _);
                }
            }
        }

        public bool TryGet(TKey key, out TValue value)
        {
            Node node;
            if (_nodes.TryGetValue(key, out node))
            {
                // Touch the entry so it counts as recently used.
                node.Ticks = new Reference<long> {Value = _stopwatch.ElapsedTicks};
                value = node.Value;
                return true;
            }
            value = default(TValue);
            return false;
        }
    }
time to read 5 min | 814 words

So, here I am writing some really fun code, when I found out that I am running into deadlocks in the code. I activated emergency protocols and went into deep debugging mode.

After being really thorough in figuring out several possible causes, I was still left with what is effectively a WTF @!(*!@ DAMN !(@*#!@* YOU !@*!@( outburst and a sudden longing for something to repeatedly hit.

Eventually, however, I figured out what was going on.

I have the following method: Aggregator.AggregateAsync(), inside which we have a call to the PulseAll method. That method will then go and execute the following code:

    public void PulseAll()
    {
        Interlocked.Increment(ref state);
        TaskCompletionSource<object> result;
        while (waiters.TryDequeue(out result))
        {
            result.SetResult(null);
        }
    }

After that, I return from the method. In another piece of the code (Aggregator.Dispose) I am waiting for the task that is running the AggregateAsync method to complete.

Nothing worked! It took me a while before I figured out that I wanted to check the stack, where I found this:


Basically, I had a deadlock because when I called SetResult on the completion source (which freed the Dispose code to run), I actually switched over to that task and allowed it to run. Still on the same thread, but in a different task, I ran through the rest of the code and eventually got to Aggregator.Dispose(). Now, I could only get to it if the PulseAll() method was called. But, because we are on the same thread, that task hasn’t been completed yet!

In the end, I “solved” that by introducing a DisposeAsync() method, which allowed us to yield the thread, and then the AggregateAsync task was completed, and then we could move on.

But I am really not very happy about this. Any ideas about the proper way to handle async & IDisposable?
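One possible shape for an answer, on newer versions of the platform: .NET Core 3.0 added the `IAsyncDisposable` interface and C# 8 added `await using`, which together make a DisposeAsync() method a first class pattern rather than a workaround. The following is a minimal sketch under those assumptions. The class is a hypothetical stand-in for the Aggregator, not its real code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the Aggregator, for illustration only.
public sealed class AggregatorSketch : IAsyncDisposable
{
    private readonly CancellationTokenSource _shutdown = new CancellationTokenSource();
    private readonly Task _loop;

    public AggregatorSketch()
    {
        _loop = Task.Run(() => AggregateAsync());
    }

    public bool IsCompleted
    {
        get { return _loop.IsCompleted; }
    }

    private async Task AggregateAsync()
    {
        while (_shutdown.IsCancellationRequested == false)
        {
            // Real aggregation work would go here.
            await Task.Delay(10);
        }
    }

    // Awaiting the loop yields the calling thread instead of blocking it,
    // so the loop can finish even if it needs that thread to make progress.
    public async ValueTask DisposeAsync()
    {
        _shutdown.Cancel();
        await _loop.ConfigureAwait(false);
        _shutdown.Dispose();
    }
}
```

Callers would then write `await using (var aggregator = new AggregatorSketch()) { ... }` and the compiler awaits DisposeAsync() at the end of the block.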

time to read 21 min | 4088 words

As I mentioned, I ran into a very nasty issue with the TPL. I am not sure if it is me doing things wrong, or an actual issue.

Let us look at the code, shall we?

We start with some very simple code:

    public class AsyncEvent
    {
        private volatile TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();

        public Task WaitAsync()
        {
            return tcs.Task;
        }

        public void PulseAll()
        {
            var taskCompletionSource = tcs;
            tcs = new TaskCompletionSource<object>();
            taskCompletionSource.SetResult(null);
        }
    }

This is effectively an auto reset event. All the waiters will be released when PulseAll is called. Then we have this runner, which just executes work:

    public class Runner : IDisposable
    {
        private readonly ConcurrentQueue<TaskCompletionSource<object>> items =
            new ConcurrentQueue<TaskCompletionSource<object>>();
        private readonly Task<Task> _bg;
        private readonly AsyncEvent _event = new AsyncEvent();
        private volatile bool _done;

        public Runner()
        {
            _bg = Task.Factory.StartNew(() => Background());
        }

        private async Task Background()
        {
            while (_done == false)
            {
                TaskCompletionSource<object> result;
                if (items.TryDequeue(out result) == false)
                {
                    await _event.WaitAsync();
                    continue;
                }

                //work here, note that we do NOT use await!

                result.SetResult(null);
            }
        }

        public Task AddWork()
        {
            var tcs = new TaskCompletionSource<object>();
            items.Enqueue(tcs);

            _event.PulseAll();

            return tcs.Task;
        }

        public void Dispose()
        {
            _done = true;
            _event.PulseAll();
            _bg.Wait();
        }
    }

And finally, the code that causes the problem:

    public static async Task Run()
    {
        using (var runner = new Runner())
        {
            await runner.AddWork();
        }
    }

So far, it is all pretty innocent, I think you would agree. But this hangs with a deadlock. Here is why:


Because tasks can share threads, we are in the Background task thread, and we are trying to wait on that background task completion.

Result, deadlock.

If I add the following, the problem goes away:

    await Task.Yield();

That is because it forces this method to be completed on another thread, but it looks more like something that you add after you discover the bug, to be honest.
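Another option on newer versions of .NET (4.6 and later) is to create the completion sources with `TaskCreationOptions.RunContinuationsAsynchronously`, so that SetResult queues the continuations to the thread pool instead of running them inline on the calling thread. Below is a minimal sketch of the same event and runner under that assumption. The type names are hypothetical. The loop also grabs the wait task before checking its state, and `Task.Run` is used so that `_bg` covers the whole loop, which are my changes rather than part of the code above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class AsyncEventSketch
{
    private volatile TaskCompletionSource<object> _tcs = NewSource();

    // Continuations of these sources never run inline inside SetResult.
    public static TaskCompletionSource<object> NewSource()
    {
        return new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
    }

    public Task WaitAsync() { return _tcs.Task; }

    public void PulseAll()
    {
        var previous = _tcs;
        _tcs = NewSource();
        previous.TrySetResult(null);
    }
}

public class RunnerSketch : IDisposable
{
    private readonly ConcurrentQueue<TaskCompletionSource<object>> _items =
        new ConcurrentQueue<TaskCompletionSource<object>>();
    private readonly AsyncEventSketch _event = new AsyncEventSketch();
    private readonly Task _bg;
    private volatile bool _done;

    public RunnerSketch()
    {
        // Task.Run unwraps the inner task, so _bg completes only when the loop exits.
        _bg = Task.Run(() => Background());
    }

    private async Task Background()
    {
        while (true)
        {
            // Take the wait task before checking state, so a pulse in between is not lost.
            var signal = _event.WaitAsync();
            if (_done) return;

            TaskCompletionSource<object> item;
            if (_items.TryDequeue(out item))
            {
                item.SetResult(null); // the awaiting caller resumes on another thread
                continue;
            }
            await signal;
        }
    }

    public Task AddWork()
    {
        var tcs = AsyncEventSketch.NewSource();
        _items.Enqueue(tcs);
        _event.PulseAll();
        return tcs.Task;
    }

    public void Dispose()
    {
        _done = true;
        _event.PulseAll();
        _bg.Wait();
    }
}

public static class RunnerDemo
{
    public static async Task Run()
    {
        using (var runner = new RunnerSketch())
        {
            await runner.AddWork();
        }
    }
}
```

With this option the code after `await runner.AddWork()` no longer runs inside the background task’s call stack, so the blocking wait in Dispose() is not waiting on its own thread.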

