Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

I WILL have order: How Lucene sorts query results

time to read 3 min | 537 words

In this series of posts, I am going to take a look at a single feature across several search engine libraries. Given three documents, sort them by State and then by City. This is a pretty trivial query, but there is a lot that is going on behind the scenes that needs to happen for this to actually work. Let’s look at how this is implemented, shall we?

The first library to look at is Lucene, because it is so prevalent. Here is the relevant code that I’m executing:
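The gist of it, sketched here with Lucene.NET (the actual snippet differs; the MatchAllDocsQuery, the directory variable and the field names are illustrative): search with a Sort that lists State first and City second.

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Sort all matching documents by State, then by City.
    var sort = new Sort(
        new SortField("State", SortField.STRING),
        new SortField("City", SortField.STRING));

    using (var reader = IndexReader.Open(directory, true /* read-only */))
    {
        var searcher = new IndexSearcher(reader);
        TopFieldDocs results = searcher.Search(new MatchAllDocsQuery(), null, 10, sort);
    }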

A key part of the way Lucene executes sorting is this piece of code:
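With the Lucene.NET 3.x FieldCache API, the call looks roughly like this (a sketch, not the verbatim source):

    // Ask the FieldCache for this reader's view of the "State" field:
    // order[docId] is an index into lookup[], and lookup[] holds the
    // field's terms in sorted order.
    StringIndex stringIndex = FieldCache_Fields.DEFAULT.GetStringIndex(reader, "State");
    int[] order = stringIndex.order;      // per-document ordinal
    string[] lookup = stringIndex.lookup; // sorted terms, by ordinal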


As you can see, we ask the reader (a single file in a Lucene directory) to get the list of field values and matches for a particular field.

In this case, what this means is that doc #0 has the value in lookup[2], doc #1 does as well, and doc #2 has the value in lookup[1]. This means that when we compare, we can do it using the following code:
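Simplified from Lucene’s string ordinal comparator (StringOrdValComparator), it boils down to something like this (the slot bookkeeping is sketched in):

    using System;

    class StringOrdComparatorSketch
    {
        // Per-slot bookkeeping, filled in as hits are collected:
        private int[] ords;       // ordinal of the slot's value
        private string[] values;  // the actual term for the slot
        private int[] readerGen;  // which reader the ordinal came from

        public int Compare(int slot1, int slot2)
        {
            if (readerGen[slot1] == readerGen[slot2])
            {
                // Same reader generation: the ordinals are directly
                // comparable, no string comparison needed.
                return ords[slot1] - ords[slot2];
            }

            // Ordinals from different readers aren't comparable;
            // fall back to comparing the actual string values.
            string val1 = values[slot1];
            string val2 = values[slot2];
            if (val1 == null)
                return val2 == null ? 0 : -1;
            if (val2 == null)
                return 1;
            return String.CompareOrdinal(val1, val2);
        }
    }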


And this is called for each field independently, like so:
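In simplified form, the multi-field comparison consults each field’s comparator in the order the sort was specified (State first, then City) and stops at the first field that disagrees:

    using Lucene.Net.Search;

    // Sketch of the multi-comparator logic in Lucene's hit queue.
    static int CompareHits(FieldComparator[] comparators, int[] reverseMul,
                           int slot1, int slot2)
    {
        for (int i = 0; i < comparators.Length; i++)
        {
            // reverseMul[i] is -1 for a descending field, 1 for ascending.
            int c = reverseMul[i] * comparators[i].Compare(slot1, slot2);
            if (c != 0)
                return c; // this field decides the order
        }
        return 0; // equal on all sort fields; ties go to the lower doc id
    }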


All of which is pretty simple and straightforward. There is a nice optimization here in the sense that in most cases, if the readerGen is the same, we can compare the ordinals directly, without comparing the actual string values.

The problem here is that we need to hold those arrays in memory. In particular, I’m talking about the FieldCache.GetStringIndex() (and its related friends). The way Lucene stores the values on disk means that on first read, it needs to reconstruct the terms from the index. Here is the core of the work that is done in GetStringIndex.
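In spirit, it does the following (a condensed sketch, not the actual implementation):

    using System.Collections.Generic;
    using Lucene.Net.Index;

    // Walk every term of the field in sorted order, hand each term an
    // ordinal, and record that ordinal for every document that has it.
    var lookupList = new List<string> { null }; // ordinal 0 = "no value"
    var order = new int[reader.MaxDoc];

    TermEnum termEnum = reader.Terms(new Term("State", ""));
    TermDocs termDocs = reader.TermDocs();
    try
    {
        do
        {
            Term term = termEnum.Term;
            if (term == null || term.Field != "State")
                break;

            // Terms arrive in sorted order, so the ordinal is simply the
            // position at which we saw the term; no comparisons needed.
            lookupList.Add(term.Text);
            int ord = lookupList.Count - 1;

            // Record the ordinal for every document containing the term.
            termDocs.Seek(termEnum);
            while (termDocs.Next())
                order[termDocs.Doc] = ord;
        } while (termEnum.Next());
    }
    finally
    {
        termDocs.Dispose();
        termEnum.Dispose();
    }
    string[] lookup = lookupList.ToArray();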

As you can see, this rips through the entire file, reading each term and then getting all the documents for that term. The code is quite clever: we don’t need to compare anything, because we know the terms are sorted, so we can take advantage of that when assigning the ordinals.

What this code isn’t very helpful about, though, is the fact that it allocates a lot of memory. In particular, it will allocate arrays with a value per document in the index. On large indexes, these can be very large. The good thing is that there is caching going on here, so you’ll typically not need to run this code all too often. The bad thing is that this runs per segment. If you index in a lot of small batches, you’ll have a lot of such values floating around; then the segments will get merged and you’ll have to run through this all over again. Since these per-document arrays are indexed by a 32-bit int, this is also one of the primary reasons Lucene is limited to about 2.1 billion documents per index.

The good thing about it is that this is really flexible and gives us great performance when sorting.

So now that we know how Lucene does it, let’s look at other libraries.

Distributed compare-exchange operations with RavenDB

time to read 4 min | 754 words

RavenDB uses a consensus protocol to manage much of its distributed state. The consensus is used to ensure consistency in a distributed system, and it is open to users as well. You can use this feature to enable some interesting scenarios.

The idea is that you can piggyback on RavenDB’s existing consensus engine, which gives you the ability to create robust and consistent distributed operations. RavenDB exposes these operations using a pretty simple interface: compare-exchange.

At the most basic level, you have a key/value interface on which you can perform distributed atomic operations, knowing that they are completely consistent. This is great, in the abstract, but it is a bit hard to grasp without a concrete example.

Consider the following scenario. We have a bunch of support engineers, ready and willing to take on any support call that comes. At the same time, an engineer can only handle a certain number of support calls. In order to manage this, we allow engineers to register when they are available to take a new support call. How would we handle this in RavenDB, assuming that we want absolute consistency? An engineer may never be assigned too much work, and work may never be lost. Assume that we need this to be robust in the face of network and node failures.

Here is how an engineer can register in the pool of available engineers:
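A minimal sketch of what this could look like with the RavenDB client API; the key name (“engineers/available”) and the method itself are my own illustration, not the original code:

    using System.Linq;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Operations.CompareExchange;

    public static void RegisterEngineerAvailability(IDocumentStore store, string engineer)
    {
        while (true)
        {
            // Read the current list of available engineers
            // (null if the key doesn't exist yet).
            var current = store.Operations.Send(
                new GetCompareExchangeValueOperation<string[]>("engineers/available"));

            var engineers = current?.Value ?? new string[0];
            var index = current?.Index ?? 0; // index 0 means "create if missing"

            // Atomic swap: succeeds only if the value still has the index
            // we read, i.e. nobody changed it in the meantime.
            var result = store.Operations.Send(
                new PutCompareExchangeValueOperation<string[]>(
                    "engineers/available",
                    engineers.Concat(new[] { engineer }).ToArray(),
                    index));

            if (result.Successful)
                return;

            // Another client won the race; read the new value and retry.
        }
    }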

The code above is very similar to how you would write multi-threaded code. You first get the value, then attempt an atomic operation to swap the old value with the new one. If we are successful, the operation is done. If not, we retry. Concurrent calls to RegisterEngineerAvailability will race each other; one of them will succeed and the others will have to retry.

The actual data that we store in the compare exchange value in this case is an array. You can see an example of how that would look here:
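Something along these lines, with made-up engineer document ids:

    ["engineers/1-A", "engineers/3-A", "engineers/8-A"]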


Compare exchange values can be simple values (numbers, strings), arrays or even objects. Any value that can be represented as JSON is valid there. However, the only operation that is allowed on a compare exchange value is a wholesale replacement.

The code above is only doing half of the job. We still need to be able to get an engineer to help us handle a support call. The code to complete this task is shown below:
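Here is a sketch under the same assumptions as before (again, not the original code):

    using System.Linq;
    using System.Threading;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Operations.CompareExchange;

    public static string PullAvailableEngineer(IDocumentStore store)
    {
        while (true)
        {
            var current = store.Operations.Send(
                new GetCompareExchangeValueOperation<string[]>("engineers/available"));

            if (current == null || current.Value.Length == 0)
            {
                // Nobody is available right now; wait a bit and try again.
                Thread.Sleep(500);
                continue;
            }

            var engineer = current.Value[0];

            bool accepted;
            if (current.Value.Length == 1)
            {
                // We are pulling the last engineer: delete the value entirely.
                accepted = store.Operations.Send(
                    new DeleteCompareExchangeValueOperation<string[]>(
                        "engineers/available", current.Index)).Successful;
            }
            else
            {
                // Remove the first engineer and write the rest back.
                accepted = store.Operations.Send(
                    new PutCompareExchangeValueOperation<string[]>(
                        "engineers/available",
                        current.Value.Skip(1).ToArray(),
                        current.Index)).Successful;
            }

            if (accepted)
                return engineer; // our write was accepted; we own this engineer

            // Another client raced us; retry with the fresh value.
        }
    }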

The code for pulling an engineer from the pool is a bit more complex. Here we read the available engineers from the server. If there are none, we’ll wait a bit and try again. If there are available engineers, we’ll remove the first one and then try to update the value. This can happen for multiple clients at the same time, so we check whether our update was successful and only return the engineer if our change was accepted.

Note that in this case we use two different modes to update the value. If there are still more engineers in the available pool, we’ll just remove our engineer and update the value. But if our engineer is the last one, we’ll delete the value entirely. In either case, this is an atomic operation that will first check the index of the pre-existing value before performing the write.

It is important to note that when using compare exchange values, you’ll typically not act on read. In other words, in PullAvailableEngineer, even if we see an available engineer, we won’t use that knowledge until we have successfully written the new value. The whole idea with compare exchange values is that they give you an atomic operation primitive in the cluster. So the typical usage is always to try to do something on write until it is accepted, and only then use whatever value you read.

The acceptance of the write indicates the success of your operation and the ability to rely on whatever values you read. However, it is important to note that compare exchange operations are atomic and independent. That means an operation that modifies a compare exchange value and then does something else needs to take into account that these will run in separate transactions.

For example, if a client pulls an engineer from the available pool but doesn’t hand over any work (maybe because the client crashed), the engineer will not magically return to the pool. In such cases, the idle engineer should periodically check that the pool still contains their username, and add it back if it is missing.

Daisy chaining data flow in RavenDB

time to read 4 min | 685 words

I have talked before about RavenDB’s MapReduce indexes and their ability to output results to a collection, as well as RavenDB’s ETL processes and how we can use them to push some data to another database (a RavenDB database or a relational one).

Bringing these two features together can be surprisingly useful when you start talking about global distributed processing. A concrete example might make this easier to understand.

Imagine a shoe store (we’ll go with Gary’s Shoes) that needs to track sales across a large number of locations. Because sales must be processed regardless of the connection status, each store hosts a RavenDB server to record its sales, and the stores are spread over a wide geographic area.


To properly manage this chain of stores, we need to be able to look at data across all stores. One way of doing this is to set up external replication from each store location to a central server. This way, all the data is aggregated into a single location. In most cases, this would be the natural thing to do. In fact, you would probably want two-way replication of most of the data so you could figure out if a given store has a specific shoe in stock by just looking at the local copy of its inventory. But for the purpose of this discussion, we’ll assume that there are enough shoe sales that we don’t actually want to have all the sales replicated.

We just want some aggregated data. But we want this data aggregated across all stores, not just at one individual store. Here’s how we can handle this: we’ll define an index that aggregates the sales across the dimensions that we care about (model, date, demographic, etc.). This index can answer the kind of queries we want, but it is defined on each store’s database, so it can only provide information about local sales, not what happens across all the stores. Let’s fix that. We’ll change the index to have an output collection. This will cause it to write all its output as documents to a dedicated collection.
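Here is a sketch of what such an index could look like with the C# client; the Sale entity and its fields are invented for the example:

    using System;
    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class Sale
    {
        public string Model { get; set; }
        public DateTime SoldAt { get; set; }
    }

    // MapReduce index aggregating sales per model per day. Setting
    // OutputReduceToCollection makes RavenDB write each reduce result
    // as a document in the "SalesTotals" collection.
    public class Sales_Totals : AbstractIndexCreationTask<Sale, Sales_Totals.Result>
    {
        public class Result
        {
            public string Model { get; set; }
            public DateTime Date { get; set; }
            public int Count { get; set; }
        }

        public Sales_Totals()
        {
            Map = sales => from sale in sales
                           select new Result
                           {
                               Model = sale.Model,
                               Date = sale.SoldAt.Date,
                               Count = 1
                           };

            Reduce = results => from result in results
                                group result by new { result.Model, result.Date } into g
                                select new Result
                                {
                                    Model = g.Key.Model,
                                    Date = g.Key.Date,
                                    Count = g.Sum(x => x.Count)
                                };

            OutputReduceToCollection = "SalesTotals";
        }
    }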

Why does this matter? These documents will be written solely by the index, but given that they are documents, they obey all the usual rules and can be acted upon like any other document. In particular, this means that we can apply an ETL process to them. Here is what this ETL script would look like:
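The script itself is a small JavaScript transform; here is a sketch of how such an ETL task might be defined from the client, with the script embedded as a string (the connection string name, URL, database name and store id are all illustrative):

    using Raven.Client.Documents.Operations.ConnectionStrings;
    using Raven.Client.Documents.Operations.ETL;

    // Tell the store's server where the central database lives.
    store.Maintenance.Send(new PutConnectionStringOperation<RavenConnectionString>(
        new RavenConnectionString
        {
            Name = "central",
            Database = "GarysShoesHQ",
            TopologyDiscoveryUrls = new[] { "https://hq.garys-shoes.example" }
        }));

    // Push every document in the index's output collection to the
    // central server, stamping it with the store it came from.
    store.Maintenance.Send(new AddEtlOperation<RavenConnectionString>(
        new RavenEtlConfiguration
        {
            Name = "SalesTotals to HQ",
            ConnectionStringName = "central",
            Transforms =
            {
                new Transformation
                {
                    Name = "PushSalesTotals",
                    Collections = { "SalesTotals" },
                    Script = @"
    loadToSalesTotals({
        Model: this.Model,
        Date: this.Date,
        Count: this.Count,
        Store: 'stores/chicago' // static field identifying the source store
    });"
                }
            }
        }));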


The script sends the aggregated sales (the collection generated by the MapReduce index) to a central server. Note that we also added some static fields that will be helpful on the remote server, so we can tell which store each aggregated sale came from. At the central server, you can work with these aggregated sales documents to see each store’s details, or you can aggregate them again to see the state across the entire chain.

The nice thing about this approach is the combination of features and their end result. At the local level, you have independent servers that can work seamlessly with an unreliable network. They also give store managers a good overview of their local state and what is going on inside their own stores.

At the same time, across the entire chain, we have ETL processes that will update the central server with details about sales statuses on an ongoing basis. If there is a network failure, there will be no interruption in service (except that the sales details for a particular store will obviously not be up to date). When the network issue is resolved, the central server will accept all the missing data and update its reports.

The entire process relies entirely on features that already exist in RavenDB and are easily accessible. The end result is a distributed, highly reliable and fault tolerant MapReduce process that gives you an aggregated view of sales across the entire chain at very little cost.

RavenDB 4.1 Features: Highlighting

time to read 2 min | 238 words

This is actually an old feature that didn’t make the cut to enter 4.0. It is now back, and it is roaring. This is the kind of feature that is useful if you are utilizing RavenDB’s search capabilities. Let us assume that you want to search for something, but instead of querying for “give me all the active users” you want to actually… search. For example, you want to search for all employees with a BA in their bio. However, you don’t want to just get the matches, you want to show the user why this was a match.

That is the problem that highlighting is meant to solve. Consider the following query:
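For example, with the C# client (the Employee class and its Notes field are assumptions for the example):

    using System.Linq;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Linq;

    using (var session = store.OpenSession())
    {
        // Full-text search: find employees whose bio mentions "BA".
        var employees = session.Query<Employee>()
            .Search(x => x.Notes, "BA")
            .ToList();
    }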


This returns the matching employees.


Why did we get these particular employees? Let’s find out:
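Roughly, we attach a highlighting request to the same query; a sketch with the 4.1 client API (the fragment sizes and names are illustrative):

    using System;
    using System.Linq;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Linq;
    using Raven.Client.Documents.Queries.Highlighting;

    using (var session = store.OpenSession())
    {
        // Same search, but also ask the server for highlighted fragments
        // that show why each document matched.
        var employees = session.Query<Employee>()
            .Highlight(x => x.Notes, 128, 1, out Highlightings highlightings)
            .Search(x => x.Notes, "BA")
            .ToList();

        foreach (var employee in employees)
        {
            // Each fragment is a snippet of the Notes field with the
            // matching terms wrapped in highlight tags.
            foreach (var fragment in highlightings.GetFragments(employee.Id))
                Console.WriteLine(fragment);
        }
    }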


Now we are asking the server to highlight for us the reason for the match. You can also see this in the studio directly, in the Highlight tab.


Using this approach, you can enrich the search results and provide a nicer experience for your users.

Inside RavenDB 4.0: Book update

time to read 1 min | 66 words

Just to let you know, the book is pretty much edited, which means that you won’t have to suffer through my horrible sentence structure.

You can read this here.

What remains to be done now is for me to go over the book again, verify that there aren’t any issues, and we are done.

In other words, we are now “Done, Done” in the “Done, Done, Done” scale.

Code that? It is cheaper to get a human

time to read 3 min | 597 words

Rafal had a great comment on my previous post:

Much easier with humans in the process - just tell them to communicate and they will figure out how to do it. Otherwise they wouldn't be in the shoe selling business. Might be shocking for the tech folk, but just imagine how many pairs of shoes they would have to sell to pay for a decent IT system with all the features you consider necessary. Of course at some point the cost of not paying for that system will get higher than that…

This relates to having a chain of shoe stores that needs to sync data and operations among the different stores.

Indeed, putting a human in the loop can in many cases be a great thing. A good example of that is order processing. If I can write just the happy path, I can be done very quickly. Anything not in the happy path? Let a human deal with it. That cuts down costs by a lot, and it allows you to make intelligent decisions on the spot, with full knowledge of the specific case. It is also quite efficient, since most orders fall into the happy path. It also means that I can come back in a few months, figure out what the most common reasons to fall off the happy path are, and add them to the software, significantly reducing the amount of work I shell out to humans.

I wish that more systems were built like that.

It is also quite easy to understand why they aren’t built with this approach. Humans are expensive. Let’s assume that we can pay minimum wage; in the States, that would translate to about 20,000 USD a year. Note that I’m talking about the actual cost of employment; this calculation includes the salary, taxes, insurance, facilities, etc. If I need this to be 24/7, I have to at least triple it (without accounting for vacation, sick leave, etc.).

At the same time, an x1e.16xlarge machine on AWS with 64 cores and 2 TB of memory will set me back about 40,000 USD a year. And it will likely be able to process transactions much faster than the two minimum wage employees that the same amount of money will get me.

Consider the misdirected check scenario from the shoe store example. We need to ensure that the people actually receiving the check understand that it was meant for another store and take some form of action. That means that we can’t just take Random Joe Teenager off the street. So another aspect to consider is the training costs. That usually means getting higher quality people and training them on your policies, all of which takes quite a bit of time and effort, especially if you want to have consistent behavior across the board.

Such a system, taken to the extreme, results in a rigid policy without a lot of room for independent action on the part of the people doing the work. I wish I could say that taking it to the extreme was rare, but all you have to do is visit the nearest government office, bank or post office to see common examples of people working within a very narrow set of parameters. The metric for that, by the way, is the number of times per hour that you hear: “There is nothing I can do, these are the rules.”

In such a system, it is much cheaper to have a rigid and inflexible process running on a computer, even with the cost of actually building the system itself.

Data ownership: The story of an invoice

time to read 3 min | 478 words

Let’s talk about Gary, and Gary’s Shoes. Gary runs a chain of shoe stores across the nation. As part of refreshing their infrastructure, Gary wants to update all the software across the entire chain. The idea is to have unified billing, inventory, sales and time tracking for the entire chain.

Gary doesn’t spend a lot of time on this (after all, he has to sell shoes); he just installed a sync service between all the stores and HQ to sync up all the data. Well, I call it a sync service. What it actually turned out to be is that the unified system is a set of Excel files in a shared Dropbox folder.

Feel free to go and wash your face, have a drink, take a Xanax. I know it might be a shock to see something like this.

Surprisingly enough, this isn’t the topic of my post. Instead, I want to talk about data ownership here.

Imagine that one of Gary’s stores in Chicago sold a bunch of shoes, then issued an invoice to the customer. They dutifully recorded the order in the Orders.xlsx file with the status “Pending Payment”.

That customer, however, accidentally sent the check to the wrong store. No biggie, right? The clerk at the second store can just go ahead and update the order in the shared system, marking it as “Paid in full”.

As it turns out, this is a problem. And the easiest way to explain why is data ownership. The owner of this particular record is the original store. You might say that this doesn’t matter; after all, the change happened in the same system. But the problem is that this is almost always not the case.

In addition to the operational “system” where the order is recorded, there are other things. The store manager still has a Post-it note to call that customer and ask about the missing payment. The invoice that was generated needs to be closed, etc. Just updating the order in the system isn’t going to cause all of that to happen.

The proper way to handle this is to call the owner of the data (the original store) and let them know that the check arrived at the wrong store. At this point, the data owner can decide how to handle that new information, apply whatever workflows need to be applied, etc.

I intentionally used what looks like a toy example, because it is easy to get bogged down in the details. But in any distributed system, there are local processes that can be quite important. If you go ahead and update the owner’s information behind their back, you are guaranteed to break something. And I haven’t even begun to talk about the chance of conflicts… of course.

How to really fail a coding interview

time to read 1 min | 154 words

Our current interview question is from this post. We use it between the phone interview and the actual interview to get a feel for a candidate’s abilities. You can learn a surprising amount of information from even a small amount of code.

Note that one of the primary goals of such a question isn’t to tell you “You should really hire this candidate” but to tell you “You really shouldn’t”. To clarify, this is a “do it on your own, and you’ve got the whole internet at your disposal” kind of question. Typically we give a week or so to answer it.

Sometimes we get a very clear signal from the code, like in the case of this code:

But I think the crowning glory was this code:

I picked two of the worst offenders, but there were more. Some things I can sort of let slide, and some things I’ll just say no to.

DotNetRocks show on RavenDB with Kamran Ayub

time to read 1 min | 107 words

Kamran Ayub did a great DotNetRocks show about RavenDB 4.0. Kamran is also doing the RavenDB 4.0 course on PluralSight, so he knows his stuff.

I’ve got to say, it is… strange to listen to a podcast about RavenDB. I found myself nodding along quite often, and the outside perspective is pretty awesome.

Kamran also tested the same application on RavenDB 3.5 and RavenDB 4.0, seeing a 20x performance improvement. The best quote from the show, as far as I’m concerned:

So fast you aren’t sure it actually worked.

Kamran also has a follow-up post with some numbers and more details here.

Listen to the show here.

RavenDB online bootcamp is now updated to 4.0

time to read 1 min | 136 words

In addition to the book and the documentation, we are also working on making it easier to get started with RavenDB. The RavenDB Bootcamp is a self-directed course meant to give you an easy way to start using RavenDB.

This is a guided tour, walking you through the fundamentals of getting RavenDB up and running, how to put data in and query it, and how you can use indexing and MapReduce. These are short lessons, providing practical experience and guidance on how to start using RavenDB.

You can also register to get a lesson a day.

This is now updated to RavenDB 4.0, smoothing the learning curve and making it even simpler to get started.

