Ayende @ Rahien

Refunds available at head office

Shared dictionary generation

As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. And let us dig in and figure out what is going how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip is doing is to concat all of those entries together and generate a suffix array. A suffix array is all the suffixes from the combined string, and part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are the largest, so our dictionary entries will show be as useful as possible.

That gives us a list of potential entries:


One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:


This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

On my next post, I’ll talk on how to make use of this dictionary to actually handle compression.


Published at

Originally posted at

Comments (5)

Building a shared dictionary

This turned out to be a pretty hard problem. I wanted to do my own thing, but for reference, femtozip is considered to be the master source for such things.

The idea of a shared dictionary system is that you have a training corpus, that you use to extract common elements from the training corpus, which you can then use to build a dictionary, which you’ll then be able to use to compress all the other data.

In order to test this, I generated 100,000 users using Mockaroo. You can find the sample data here: RandomUsers.

The data looks like this:

{"id":1,"name":"Ryan Peterson","country":"Northern Mariana Islands","email":"rpeterson@youspan.mil"},
{"id":2,"name":"Judith Mason","country":"Puerto Rico","email":"jmason@quatz.com"},
{"id":3,"name":"Kenneth Berry","country":"Pakistan","email":"kberry@wordtune.mil"},
{"id":4,"name":"Judith Ortiz","country":"Cuba","email":"jortiz@snaptags.edu"},
{"id":5,"name":"Adam Lewis","country":"Poland","email":"alewis@muxo.mil"},
{"id":6,"name":"Angela Spencer","country":"Poland","email":"aspencer@jabbersphere.info"},
{"id":7,"name":"Jason Snyder","country":"Cambodia","email":"jsnyder@voomm.net"},
{"id":8,"name":"Pamela Palmer","country":"Guinea-Bissau","email":"ppalmer@rooxo.name"},
{"id":9,"name":"Mary Graham","country":"Niger","email":"mgraham@fivespan.mil"},
{"id":10,"name":"Christopher Brooks","country":"Trinidad and Tobago","email":"cbrooks@blogtag.name"},
{"id":11,"name":"Anna West","country":"Nepal","email":"awest@twinte.gov"},
{"id":12,"name":"Angela Watkins","country":"Iceland","email":"awatkins@izio.com"},
{"id":13,"name":"Gregory Coleman","country":"Oman","email":"gcoleman@browsebug.net"},
{"id":14,"name":"Andrew Hamilton","country":"Ukraine","email":"ahamilton@rhyzio.info"},
{"id":15,"name":"James Patterson","country":"Poland","email":"jpatterson@skippad.net"},
{"id":16,"name":"Patricia Kelley","country":"Papua New Guinea","email":"pkelley@meetz.biz"},
{"id":17,"name":"Annie Burton","country":"Germany","email":"aburton@linktype.com"},
{"id":18,"name":"Margaret Wilson","country":"Saudia Arabia","email":"mwilson@brainverse.mil"},
{"id":19,"name":"Louise Harper","country":"Poland","email":"lharper@skinder.info"},
{"id":20,"name":"Henry Hunt","country":"Martinique","email":"hhunt@thoughtstorm.org"}

And what I want to do is to run over the first 1,000 records and extract a shared dictionary. Actually generating the dictionary is surprisingly hard. The first thing I tried is a prefix tree of all the suffixes. That is, given the following entries:


You would have the following tree:

  • b
    • ba
      • ban
        • bana
          • banan
            • banana
  • a
    • an
      • ana
        • anan
          • anana
      • ang
        • ange
  • n
    • na
      • nan
        • nana
      • nag
        • nage
  • l
    • le
      • lem
        • lemo
          • lemon
  • o
    • or
      • ora
        • oran
          • orang
            • orange
  • r
    • ra
      • ran
        • rang
          • range
  • g
    • ge
  • e

My idea was that this will allow me to easily find all the common substrings, and then rank them. But the problem is how do I select the appropriate entries that are actually useful? That is the part where I gave up my simple to follow and explain code and dived into the real science behind it. More on that in my next entry, but in the meantime, I would love it if someone could show me simple code to find the proper terms for the dictionary.


Published at

Originally posted at

Comments (4)

Pretty graphs and operational concerns in RavenDB 3.0

It has been a while since I have shown you some of the cool stuff that we have been doing with RavenDB 3.0.

Here is a quick taste:



We have turned a lot of the previously mysterious operational endpoints into pretty graphs, and you can see how the database is behaving in a pretty detailed way.


Published at

Originally posted at

Comments (2)

Decompression code & Discussion

As I said, a good & small interview question is this one, it is a good one because it is short, relatively simple to handle, but it should show a lot of things about your code. To start with, being faced with a non trivial task that most people are not that familiar with.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

I couldn’t resist doing this myself, and I came up with the following:

public void Decompress(Stream input, Stream output, byte[][] dictionary)
    var tmp = new byte[128];
    while (true)
        var readByte = input.ReadByte();
        if (readByte == -1)
        var prefix = (byte) readByte;
        var compressed = (prefix & 0x80) != 0;
        var len = prefix & 0x7f;

        if (compressed == false)
            while (len > 0)
                var read = input.Read(tmp, 0, len);
                if(read == 0)
                    throw new InvalidDataException("Not enough data to read from compressed input stream");
                len -= read;
                output.Write(tmp, 0, read);
            readByte = input.ReadByte();
            if(readByte == -1)
                throw new InvalidDataException("Not enough data to read from compressed input stream");
            output.Write(dictionary[readByte], 0, len);

Things to pay attention to: Low memory allocations, error handling, and handling of partial reads from the stream.

But that is just part of the question. After reading the protocol, and implementing it. The question now turns to what does the protocol says about this kind of compression scheme. The use of just 7 bits to store len drastically limit the compression utility in a general format. It also requires an external dictionary, which most compression formats don’t use, they use the actual compressed text itself as the dictionary.  Of course, I’ve been reading compression algorithms for a while now, so that isn’t that fair. But I would expect people to note that that 7 bit limits the compression usability.

And hopefully, with a bit of a hint, they should note that the external dictionary is useful for small data sets where the repetitions are actually between entry, not per entry.

Interview challenges: Decompress that

I’m always looking for additional challenges that I can ask people who interview at Hibernating Rhinos, and I run into an interesting challenge just now.

Implement a Decompress(Stream input, Stream output, byte[][] dictionary) routine for the following protocol:

Prefix byte – composed of the highest bit + 7 bits len

If the highest bit is not set, this is a uncompressed data, copy the next len from the input to the output.

If the highest bit is set, this is compressed data, the next byte will be the index in the dictionary, copy len bytes from that dictionary entry to the output.

After writing the code, the next question is going to be, what are the implication of this protocol? What is it going to be good for? What is it going to be bad for?

Code Challenge: Make the assertion fire

Can you create a calling code that would make this code fire the assertion?

public static IEnumerable<int> Fibonnaci(CancellationToken token)
    yield return 0;
    yield return 1;

    var prev = 0;
    var cur = 1;

        while (token.IsCancellationRequested == false)
            var tmp = prev + cur;
            prev = cur;
            cur = tmp;
            yield return tmp;

Note that cancellation token cannot be changed, once cancelled, it can never revert.


Published at

Originally posted at

Comments (38)

Your customer isn’t a single entity

An interesting issue came up in the comments for my modeling post.  Urmo is saying:

…there are no defined processes, just individual habits (even among people with same set of obligations) with loose coupling on the points where people need to interact. In these companies a software can be a boot that kicks them into more defined and organized operating mode.

This is part of discussion of software modeling and the kind of thinking you have to do when you approach a system. The problem with Urmo’s approach is that there is a set implicit assumptions, and that is that the customer is speaking with a single voice, that they actually know what they are doing and that they have the best interests. Yes, it is really hard to create software (or anything, actually) without those, but that happens more frequently than one might desire.

A few years ago I was working on a software to manage what was essentially long term temp workers. Long term could be 20 years, and frequently was a number of years. The area in question was caring for invalids,  and most of the customers for that company were the elderly. That meant that a worker might not be required on a pretty sudden basis (the end customer died, care no longer required).

Anyway, that is the back story. The actual problem we run into was that by the time the development team got into place there was already a very detailed spec, written by a pretty good analyst after many sessions at a luxury hotel conference room. In other words, the spec cost a lot of money to generate, and involved a lot of people from the company’s management.

What it did not include, however, was feedback from the actual people who had to place the workers at particular people’s homes, and eventually pay them for their work. Little things like the 1st of the month (you have 100s of workers coming in to get their hours approved and get paid) weren’t taken into account. The software was very focused on the individual process, and there were a lot of checks to validate input.

What wasn’t there were things like: “How do I efficiently handle many applicants at the same time?’'

The current process was paper form based, and they were basically going over the hours submitted, ask minimal questions, and provisionally approve it. Later on, they would do a more detailed scan of the hours, and do any fixups needed. That would be the time that they would also input the data to their old software. In other words, there was an entire messy process going on that the higher ups didn’t even realize was happening.

This include decisions such as “you need an advance, we’ll register that as 10 extra hours you worked this month, and we’ll deduct it next month” and “you weren’t supposed to go to Mrs. Xyz, you were supposed to go to Mr. Zabc! We can’t pay for all your hours there” , etc.

When we started working on the software, we happened to do a demo to some of the on site people, and they were horrified by what they saw. The new & improved software would end up causing them much more issues, and it would actually result in more paperwork that they have to manage just so they can make the software happy.

Modeling such things was tough, and at some point (with the client reluctant agreement) we essentially threw aside the hundreds of pages of well written spec, and just worked directly with the people who would end up using our software. The solution in the end was to codify many of the actual “business processes” that they were using. Those business processes made sense, and they were what kept the company working for decades. But management didn’t actually realize that they were working in this manner.

And that is leaving aside the “let us change the corporate structure through software” endeavors, which are unfortunately also pretty common.

To summarize, assuming that your client is a single entity, which speaks with one voice and actually know what they are talking about? Not going to fly for very long. In another case, I had to literally walk a VP of Sales through the process of how a sale is actually happening in his company versus what he thought was happening.

Sometimes this job is likely playing a shrink, but for corporations.

Playing with compression

Compression is a pretty nifty tool to have, and it can save a lot in both I/O and overall time (CPU tends to have more cycles to spare then I/O bandwidth). I decided that I wanted to investigate it more deeply.

The initial corpus was all the orders documents from the Northwind sample database. And I tested this by running GZipStream over the results.

  • 753Kb - Without compression
  •   69Kb - With compression

That is a pretty big difference between the two options. However, this is when we compress all those documents together. What happens if we want to compress each of them individually? We have 830 orders, and the result of compressing all of them individually is:

  • 752Kb - Without compression
  • 414Kb - With compression

I did a lot of research into compression algorithms recently, and it the reason for the difference is that when we compress more data together, we can get better compression ratios. That is because compression works by removing duplication, and the more data we have, the more duplication we can find. Note that we still manage to do

What about smaller data sets? I created 10,000 records looking like this:

{"id":1,"name":"Gloria Woods","email":"gwoods@meemm.gov"}

And gave it the same try (compressing each entry independently):

  • 550KB – Without compression
  • 715KB – With compression

The overhead of compression on small values is very significant. Just to note, compressing all entries together will result in a data size of 150Kb in size.

So far, I don’t think that I’ve any surprising revelations. That is pretty much exactly inline with my expectations. And the reason that I’m actually playing with compression algorithms rather than just use an off the shelve one.

This particular scenario is exactly where FemtoZip is supposed to help, and that is what I am looking at now. Ideally, what I want is a shared dictionary compression that allows me to manage the dictionary myself, and doesn’t have a static dictionary that is created once, although that seems a lot more likely to be the way we’ll go.


Published at

Originally posted at

Comments (10)

Modeling exercise: Flights & Travelers

I just got a really interesting customer inquiry, and I got their approval to share it. The basic problem is booking flights, and how to handle that.

The customer suggested something like the following:

{   //customers/12345
    "Name" : "John Doe",
    "Bookings" : [{
        "FlightId": "flights/1234",
        "BookingId": "1asifyupi",
        "Flight": "EA-4814",
        "From": "Iceland",
        "To" : "Japan", 
        "DateBooked" : "2012/1/1"

{ // flight/1234
   "PlaneId": "planes/1234"// centralized miles flown, service history
           "Seat": "F16"
           "BookedBy": "1asifyupi"

But that is probably a… suboptimal way to handle this. Let us go over the type of entities that we have here:

  • Customers / Passengers
  • Flights
  • Planes
  • Booking

The key point in here is that each of those is pretty independent. Note that for simplicity’s sake, I’m assuming that the customer is also the passenger (not true in many cases, a company may pay for your flight, so you the company in the customer and you the passenger).

The actual problem the customer is dealing with is that they have thousands of flights, tens or hundreds of thousands of seats and millions of customers competing for those seats.

Let us see if we can breaking it down to a model that can work for this scenario.  Customers deserve its own document, but I wouldn’t store the bookings directly in the customer document. There are many customers that fly a lot, and they are going to have a lot of booking there. At the same time, there are many bookings that are made for a lot of people at the same time (an entire family flying).

That leaves the Customer’s document with data about the customer (name, email, phone, passport #, etc) as well as details such as # of miles traveled, the frequent flyer status, etc.

Now, we have the notion of flights and bookings. A flight is a (from, to, time, plane), which contains the available seats number. Note that we need to explicitly allow for over booking, since that is a common practice for airlines.

There are several places were we have contention here:

  • When ordering, we want to over book up to a certain limit.
  • When seating (usually 24 – 48 hours before the flight) we want to reserve seats.

The good thing about it is that we actually have a relatively small contention on a particular flight. And the way the airline industry works, we don’t actually need a transaction between creating the booking and taking a seat on the flight.

The usual workflows goes something like this:

  • A “reservation” is made for a particular itinerary.
  • That itinerary is held for 24 – 48 hours.
  • That itinerary is sent to the customer for approval.
  • Customer approve and a booking is made, flight reservations are turned into actual booked seats.

The good thing about this is that because a flight can have up to ~600 seats in it, we don’t really have to worry about contention on a single flight. We can just use normal optimistic concurrency and avoid more complex models. That means that we can just retry on concurrency errors and see where that leads us. The breaking of the actual order into reservation and booking also helps, since we don’t have to coordinate between the actual charge and the reservation on the flight.

Overbooking is handled by setting a limit of how much we allow overbooking, and managing the number of booked seats vs. reserved seats. When we look at the customer data, we show the customer document, along with the most recent orders and the stats. When we look at a particular flight, we can get pretty much all of its state from the flight document itself.

And the plane’s stats are usually just handled via a map/reduce index on all the flights for that plane.

Now, in the real world, the situation is a bit more complex. We might give out 10 economy seats and 3 business seats to Expedia for a 2 months period, so they manage that, and we have partnership agreements with other airlines, and… but I think that this is a good foundation to start building this on.

Mixing Sync & Async calls

Take a look at the following code:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();   
for (int i = 0; i < 500; i++)
    var t = Task.Run(() =>
        var webRequest = WebRequest.Create(new Uri(“http://google.com”));
Interlocked.Increment(ref count); }); tasks.Add(t); } var whenAll = Task.WhenAll(tasks.ToArray()); while (whenAll.IsCompleted == false && whenAll.IsFaulted == false) { Thread.Sleep(1000); Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x=> x.IsCompleted == false)); } Console.WriteLine(sp.Elapsed);

As you can see, it is a pretty silly example of making 500 queries to Google. There is some stuff here about reporting & managing, but the key points is that we start 500 tasks to do some I/O. How long do you think that this is going to run?

On my machine, this code completes in roughly 22 seconds. Now, WebRequest is old school, and we want to use HttpClient, because it has a much better interface, so we wrote:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();
for (int i = 0; i < 500; i++)
    var t = Task.Run(() =>
        var client = new HttpClient();
        client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com"))).Wait();
        Interlocked.Increment(ref count);
var whenAll = Task.WhenAll(tasks.ToArray());
while (whenAll.IsCompleted == false && whenAll.IsFaulted == false)
    Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x => x.IsCompleted == false));

I’ve been running this code on my machine for the past 7 minutes, and this code hasn’t yet issue a single request.

It took a while to figure it out, but the problem is in how this is structured. We are creating a lot of tasks, more than the available threads in the thread pool. Then we make an async call, and block on that. That means that we block the thread pool thread. Which means that to process the result of this call, we’ll need to use another thread pool thread. However, we have scheduled more tasks than we have threads for. So there is no thread available to handle the reply, so all the threads are stuck waiting for reply that there is no thread to handle to unstick them.

The thread pool notices that, and will decide to allocate more threads, but they are also taken up by the already existing tasks that will immediately block.

Now, surprisingly, eventually the thread pool will allocate enough threads (although it will take it several hours to do so, probably) to start handling the requests, and the issue is resolved. Expect that this ends up basically crippling the application while this is happening.

Obviously, the solution is to not wait on async calls inside a task like that, indeed, we can use the following code quite easily:

var t = Task.Run(async () =>
    var client = new HttpClient();
    await client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com")));

    Interlocked.Increment(ref count);

And this shows performance comparable to the WebRequest method.

However, that isn’t very helpful when we need to expose a synchronous API. In our case, in RavenDB we have the IDoctumentSession, which is a simple synchronous API. We want to use the a common codebase for sync and async operations. But we need to expose the same interface, even if we changed things. To make things worse, we have no control on how the http client is scheduling the async operations, and it has no sync operations.

That means that we are left with the choice of writing everything twice, once for synchronous operations and once for async ops or just living with this issue.

A work around we managed to find is to run the root tasks (which we do control) on our own task scheduler, so they won’t compete with the I/O operations. But that is hardly something that I am thrilled about.


Published at

Originally posted at

Comments (16)

The contracts of Lazy vs Task

There was a question about our use of Task<T> and Lazy<T> in RavenDB, and I thought that this is a subtle thing that deserve more than a short email.

The basic difference between Lazy<T> and Task<T> are the kind of contracts that they express.

  • Lazy<T> represent a potentially expensive operation that has been deferred. The promise given is that the lazy’s value will be generated the first time that it is needed.
  • Task<T> represent a potentially expensive operation that is currently executing. The promise is that the task’s value will be available on request, hopefully already there by the time you asked.

The major difference is when we are actually starting the operation. Most often, when we return a task, we return a reference to an scheduled / executing task, which will complete whatever or not the task’s result will be accessed. Conversely, a lazy’s whose value was never accessed is not something that will ever execute.

The use cases for them tend to be quite different.

Lazy<T> is used a lot of the time as a way to handle once and only once creation (basically, a safe singleton pattern), and for actually deferring the execution of work. Sometimes you can avoid having to do something, and it is easier to give a caller a lazy object and only have to pay for that additional work if they access the data.

Task<T> is primarily used to actually parallelize the work, so you’ll have it running while you are doing other stuff. Usually this is used for I/O, since that can be natively parallelized.

With RavenDB, we use Task<T> as the return type of all our async operations, allowing of to take advantage on the async nature of I/O. And we use Lazy<T> to setup a deferred call for the server. When someone actually access one of lazy’s values, we have to provide you with the answer. At this point, we can go to the server with all  the pending lazy operations, and save a lot of effort just making remote calls to the server.

In fact, in RavenDB 3.0, we have lazy support in the async API. That means that we have methods that return Lazy<Task<T>>, which means: Give me a deferred operation, that when required, will perform in an async manner. That gives me both the ability to combine requests to the server and avoid blocking up a thread while that I/O is in progress.

RavenDB Migration Assistance Services now available

Many people who want to gain the benefits of RavenDB are facing a lot of challenges when they look at their projects. This is especially the case when we are talking about moving existing projects to RavenDB, rather than doing green field development. Recently we have seen quite a bit of a surge in customers coming to us for assistance in that area.

As a result of that, I would like to announce that we are now offering those services as a core part of our offering. If you have an existing application and you want to move all or part of it to RavenDB, we now have completed training for staff to help you do just that. This can be just a standalone

In addition to that, because this is a new service, we are actually going to offer the first three new customers this service for free (contact support@ravendb.net for this).

We want to encourage people to use RavenDB, and I think that one of the key things we can do to help is reduce any barriers to entry, and this one is pretty big.


Published at

Originally posted at

Working on the Uber Profiler 3.0

One of the things that tend to get lost is the fact that Hibernating Rhinos is doing more than just working on RavenDB. The tagline we like to use is Easier Data, and the main goal we have is to provide users with better ways to access, monitor and manage their data.

We have started planning the next version of the Uber Profiler, that means Entity Framework Profiler, NHibernate Profiler, etc. Obviously, we’ll offer support for EF 7.0 and NHibernate 4.0, and we had gathered quite a big dataset for common errors when using an OR/M, which we intend to convert into useful guidance whenever our users run into such an issue.

But I realized that I was so busy working on RavenDB that I forgot to even mention the work we’ve been doing elsewhere. And I also wanted to solicit feedback about the kind of features you’ll want to see in the 3.0 version of the profilers.

Exploring the data, extracting series from my blog posts

I’m pretty bad when it comes to actually organizing my blog. I just like to write stuff out, I don’t like to do things like properly setting things up in series. Mostly because I usually think about one post at a time, or three at the most.

I did notice that I usually use something like “Series name: post name” convention when writing series of posts. So I decided to write the following index to check the data out:


As you can see, this is pretty simple way of doing things. And that lead to the following data.


Some of those are obviously false positives, and we have things like this, which are obviously out:


But it looks like important series are also spread over time:


I think that I’m going to have to do a new blog feature, to highlight those emergent series.

Published at

Originally posted at

Comments (3)

Smarter failure handling with RavenDB 3.0

One of the things that we have to deal with as a replicated database is how to handle failure. In RavenDB, we have dealt with that with replication, automatic failover and a smart backoff strategy that combined reducing the impact on the clients when a node was down with being able to detect that it was up quickly.

In RavenDB 3.0, we have improved on that. Before, we would ping the failed server every now and then, to check if it is up. However, that would mean that routine operations would slow down, even when the failover server was working. In order to handle this, we used to have a backoff strategy for checking the failed server. We would only check it once every 10th request, until we have 10 failed requests, then we would check it only once every 100th request, until we had a 100 failed requests, etc. That worked quite well. Mostly because a lot of the time, failures are transients, and you don’t want to fail to a secondary for too long, and if we had a long standing issue, we had a small hiccup, and then everything worked fine, with the occasional hold on a request while we checked things. In addition to that, we also have gossip mechanism that would inform clients that their server is up so they can figure out that the primary server is up even faster.

Anyway, that is what we used to do. In RavenDB 3.0, we have moved to using an async pinging approach. When we detect that the server is down, we will still do the checks as before, but unlike 2.x, we won’t do them in the current execution thread, instead, we have a background task that will ping the server to see if it is up. That means that after the first failed request, we will immediately switch over to the secondary, and we will keep on the secondary until the background ping process will report that everything is up.

That means that for a very short failure (less than 1 second, usually) we will switch over to the secondary, where before we’ll be able to figure out that the server is still up. But the upside here is that we won’t have any interruption in service just to check that the primary is up.


Published at

Originally posted at

Comments (4)

Voron & Graphs

One of our guys is having fun playing with graph databases, and we had a serious discussion on how we can use Voron for that.

No, we don’t have any plans to do a graph database. This is purely one of the guys playing with something that interest him.

For the purpose of this post, I’m only concerned with having the ability to store graph data and read them efficiently. I don’t care for the actual graph operations.

Let us look at the following graph:


Here is how we can define this in code:

var michael = db.CreateNode();
michael["Name"] = "Michael";

var graphs = db.CreateNode();
graphs["Name"] = "Graphs";

var edge = michael.RelatesTo(graphs, db.Relationship("Plays With"));

Now, how would we go about implementing something like this? Well, with Voron, that is pretty easy.

We’ll start with defining a Nodes tree, which is going to using an incremental 64 bits integer for the key, and a JSON object for the value. This means that on CreateNode, we’ll just allocate the id for it, and just have the node itself as a JSON object that can be as complex as you want.

We also have relationships, and here it gets a bit complex, a relationship is always from a node to a node, and it has a specific type. Because the types of relationships tend to be very few, we will limit them to 65,536 relationship types. I think that this would be more than enough. As a result, I can quickly get the id of a relationship type. This leads us to having another tree in Voron, the RelationshipTypes tree, with a key that is the string name of the relationship and the value is just an incremental short. The reason we need to do this will be obvious shortly.

After we have the relationship type, we need to record the actual relationships. That means that we need to consider how we want to record that. Relationships can have their own properties, so the actual relationship is going to be another JSON object as the value in a tree. But what about the key for this tree? The question here is how are we going to work with this? What sort of queries are we going to issue. Obviously, in a graph database, we are going to follow relationships a lot. And the kind of questions we are going to ask are almost always going to be “from node X, find all outgoing relations of type Y”. So we might as well do this properly.

The key for the relations tree would be 18 bytes, the first 8 bytes are the source node id, the next 2 bytes are the relationship type and the last 8 bytes are the destination node id. That means that on the disk, the data is actually sorted first by the node id, then by the relationship type. Which make the kind of queries that I was talking about very natural and fast.

And that is pretty much it. Oh, you’re going to need metadata tree for things like the last relationship type id, and probably other stuff. But that is it, when speaking from the point of view of the storage.

The overall structure is:

Nodes - (Key: Int64, Val: JSON)

RelationshipTypes – (Key: string, Val: Int16)

Relationships ( Key: Int64, Int16, Int64, Val: JSON)

And on top of that you’ll be able to write any sort of graph logic.


Published at

Originally posted at

Comments (11)

Bisecting RavenDB

One of the results of expanding the RavenDB’s team is that we have to deal with a lot more pull requests. We have individual developers working on their features, and submitting pull requests on a daily basis. That usually means that I have to merge anything between ten to thirty pull requests per day.

Given our test suite runtime, it isn’t really reasonable to run the full test suite after each pull request, so we merge them all, review them, and then run the tests. The problem then is that if there was a test failure… which pull request, and which commit cause it?

That is why git has the wonderful git bisect command. And we have taken advantage of that to enable the following development time feature:

./bisect.ps1 HEAD HEAD~10 General_Initialized_WithoutErrors

It will output something like:

631caa60719b1a65891eec37c4ccb4673eb28d02 is the first bad commit
commit 631caa60719b1a65891eec37c4ccb4673eb28d02
Author: Pawel Pekrol
Date: Fri May 9 08:57:55 2014 +0200
another temp commit

Pin pointing exactly where the issue is.

And yes, this is here as a post primarily because I might forget about it.

Compression & Memory Sizing in RavenDB

We had an interesting issue at a customer recently, and it took a while to resolve, so it is interesting. The underlying problem as stated by the customer was that they tried to reset an index, and RavenDB took all the memory on the machine. That is kinda rude, and we have taken steps to ensure that this wouldn’t happen, so that was surprising.

So we settled down to figure out what was going on, and we were able to reproduce this locally. That was strange, and quite annoying. We finally narrowed things down to very large documents. There was a collection on this database where most documents were several MB in size. We do a lot of rate limiting, to avoid overloading, but we do usually do that on the count of documents, not the size of them. Except, that we actually do limit the size where it matter.

When we load documents to be indexed, we specify a maximum size (usually 128MB) per batch. However… what actually happens is that we go to the storage and ask it, give us X amount of documents, up to Y amount in size. The question is, what size. And in this case, we are using the size on disk, which has a close correlation to the actual size taken when loading into a JSON object. Except… when we use compression.

When we have compression, what actually happens is that we limited to the maximum size of the compressed data, which can be 10% – 15% of the actual data size when we decompress it in memory. That was a large part of the problem. Another issue was what happened when we had prefetching enabled. We routinely prefetch data to memory so we won’t have to wait for I/O. The prefetcher only considered documents count, but when each document is multiple MB, we really needed to consider both count and size.

Both issues were fixed, and there are no longer any issues with this dataset. Interestingly enough, the customer had no issue when creating the database, because it only had to deal with whatever changed, and not the entire data set.


Published at

Originally posted at

C# Async programming pitfall

I really like the TPL, and I really like the async/await syntax. It is drastically better than any other attempt I’ve seen to handle concurrency.

But it also has a really major issue. There is no way to properly debug things. Imagine that I have the following code:

public Task DoSomething()
return new TaskCompletionSource<object>().Task;

And now imagine that I have some code that is going to do an await DoSomething();

What is going to happen? This is a never ending task, so we’ll never return. And that is fine, except that there is absolutely no way to see that. I have no way of seeing which task didn’t return, and I have no way of seeing all the pending tasks, and what they are all waiting for, etc. I’ve run into something like that (obviously a lot harder to figure out) too many times.

If I was using non async code, it would be obvious that there is this thread that is stopped on this thing, and I could figure it out. For us, this make it a lot harder to work with.


Published at

Originally posted at

Comments (15)

A map to Reviewing RavenDB

Afif asked me a very interesting question:


And then followed up with a list of detailed questions. I couldn’t be happier. Well, maybe if I found this place, but until then…

First question:

I am curious about how much Lucene concepts does one need to be aware of to grasp the RavenDB code base? For e.g. will not knowing Lucene concept X limit you from understanding module Y, or is the concept X interviewed in such a manner that you simply won't get why the code is doing what its doing in dribs and drabs over many places.

In theory, you don’t need to know any Lucene to make use of RavenDB. We do a pretty good job of hiding it when you are using the client API. And we have a lot of code (tens of thousands of lines, easy) dedicated to making sure that you needn’t be aware of that. The Linq query providers, in particular, do a lot of that work. You can just think in C#, and we’ll take care of everything.

In practice, however, you need to know some concepts if you want to be able to really make full use of RavenDB. Probably the most important and visible concept is the notion of the transformations we do from the index output to the Lucene entries, and the impact of analyzers on that. This is important for complex searching, full text search, etc. A lot of that is happening in the AnonymousObjectToLuceneDocumentConverter class, and then you have the Lucene Analyzers, which allow to do full text searches. The Lucene query syntax is also good to know, because this is how we are actually processing queries. And understanding how the actual queries work can be helpful as well. But usually that isn’t required.

Some advanced features (Suggestions, More Like This, Facets, Dynamic Aggregation) are built on top of Lucene features directly, and understanding how they work is helpful, but not mandatory to making real use of them.

Second question:

Oren has often referred to RavenDB as an ACID store on top of which is an eventually consistent indexing store built. Is this a mere conceptual separation or is it clearly manifested in code? If so, how tight is the coupling, can you understand and learn one without caring much about the other?

Yes, there is a quite clear separation in the code between the two. We have ITransactionalStorage which is how the ACID store is accessed. Note that we use the concept of a wrapper method, Batch(Action<IStorageActionsAccessor>) to actually handle transactions. In general, the DocumentDatabase class is responsible for coordinating a lot of that work, but it isn’t actually doing most of it. Indexing are handled by the IndexStorage, which is mostly about maintaining the Lucene indexes properly. Then you have the IndexingExecuter, which is responsible for actually indexing all the documents. The eventual consistency aspect of RavenDB comes into play because we aren’t processing the indexes in the same transaction as the writes, we are doing that in a background thread.

In general, anything that comes from the transactional store is ACID, and anything that comes from the indexes is BASE.

Third question:

Often operational concerns add a lot of complexity (think disaster recovery, optimizations for low end machines). When looking at feature code, will I know constructs a,b,c are intermingled here to satisfy operational feature y, so I can easily separate the wheat from the chaff.

Wow, that is really hard to answer.  It is also one of our recurring pain points. Because .NET is a managed language, it is much harder to manage some things with it. I would love to be able to just tell the indexing to use this size limited heap, instead of having to worry about it using too much memory. Because of that, we’re often having to do second order stuff and guesstimate.

A lot of the code for that is in auto tuners, like this one. And we have a lot of code in the indexing related to handling that. For example, catching an OutOfMemoryException and trying to handle that.  Disaster recover is mostly handled by the transactional storage, but we do have a lot of stuff that is meant to help us with indexing. Commit points in indexes is a case where we try to be prepared for crashing, and store enough information to recover more quickly. At startup we also do checks for all indexes, to make sure that they aren’t corrupted.

A lot of other stuff is already there an exposed, the page size limits, the number of requests per session, etc. We also have a lot of configuration options that allow the users on low end machines to instruct us how to behave, but we usually want to have that handled automatically. You can also see us taking into account size & count for documents when loading them. Usually we try to move them out of the mainline code, but we can’t always do so. But it is hard for me to point at a feature code and say, this is there to support operational concern X.

That said, metrics is operational concern, an important one, and you can see how we use that throughout the code, by trying to measure how certain things are going, we get a major benefit down the road when we need to figure out what is actually going on. Another aspect of operational concern is the debug endpoints, which expose a lot of information about RavenDB internal behavior. This is also the case for debugging how an index is built, for which we have a dedicated endpoint as well.

In the replication code, you can see a lot of error handling, since we expect and handle the other side to be down a lot. A lot of thought an experience has gone into the replication, and you can see a lot of that there. The notion of batches, back off strategies, startup notifications, etc.

One thing you’ll notice that match your question is that we have a lot of stuff like this:

using (LogManager.OpenMappedContext("database", Name ?? Constants.SystemDatabase))

This is there to provide us with context for the logs, and it is usually required when we do things in the background.

Fourth question:

Is there a consistent approach to handle side effects, for e.g when x is saved, update y. Pub/sub, task queue, something else? I am hoping if I am made aware of these patterns I will more easily be able to discover/decipher such interactions.

Yes, there is! It is an interesting one, too. The way it works, every transaction has a way of saying that it did something that other pieces of RavenDB that something happened. This is done via the WorkContext’s method ShouldNotifyAboutWork, which will eventually raise the work notification when the transaction is completed. The other side there is the waiting for work, obviously.

That means that a lot of the code in RavenDB is actually sitting in a loop, like so:

while (context.DoWork)
    var isIdle = context.WaitForWork();
    if(context.DoWork == false)

There are variants, obviously, but this is how we handle everything, from indexing to replication.

Note that we don’t bother to distinguish between work types. Work can be a new document, a deleted index or indexing completing a run. Either one of those would generate a work notification. There is some code there to optimize for the case where we have a lot of work, and we won’t take a lock if we can avoid it, but the concept is pretty simple.

Fifth question:

Should I expect to see a lot of threading/synchronization code for fulfilling multi core processing concerns, or true to .NET's more recent promises, are these concerns mostly addressed by good usage of Tasks, async await, reactive extensions etc.

Well, this is a tough one. In general, I want to answer no, we don’t. But it is a complex answer. Client side, ever since 3.0, we are pretty much all async. All our API calls are purely running via the async stuff. We have support for reactive extensions, in the sense of our Changes() API, but we don’t make any use of it internally. Server side, we have very little real async code, mostly because we found that it made it very hard to debug the system under certain conditions. Instead, we do a lot of producer consumer kind of stuff (see transaction merging) and we try to avoid doing any explicit multi threading. That said, however, we do have a lot of work done in parallel. Indexing, both per index and inside each index. A lot of that work is done through our own parallel primitives, because we need to be aware of what is actually going on in the system.

We have some places where we have to be aware of multi threaded concerns, sometimes in a very funny ways (how we structure the on disk data to avoid concurrency issues) and sometimes explicitly (see how we handle transport state).

For the most part, we tend to have a dedicated manager thread per task, such as replication, indexing, etc. Then that actual work is done in parallel (for each index, destination, etc).

I hope that this give you some guidance, and I’ll be very happy to answer any additional questions, either from Afif or from anyone else.


Published at

Originally posted at

Comments (2)

NSBCon 2014 & Distributed Systems with RavenDB

I’m going to be talking in that 2014 NService Bus Conference. More specifically, I’m going to be talking about how to build proper distributed systems with RavenDB. What sort of things you need to watch out for, what features in RavenDB you can take advantage of and general distributed data design.

But there is more going on with RavenDB in NServiceBus. A lot of the new features in the Particular Service Platform are built based on RavenDB, and you’ll have the chance to hear all about it there.




Published at

Originally posted at