Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:


+972 52-548-6969

, @ Q j

Posts: 6,878 | Comments: 49,254

filter by tags archive
time to read 8 min | 1502 words

One of the silent features of people moving to the cloud is that it make it the relation between performance and $$$ costs evident. In your on data center, it is easy to lose track between the yearly DC budget and the application performance. But on the cloud, you can track it quite easily.

The immediately lead people to try to optimize their costs and use the minimal amount of resources that they can get away with. This is great for people who believe in efficient software.

One of the results of the focus on price and performance has been the introduction of burstable cloud instances. These instances allow you to host machines that do not need the full machine resources to run effectively. There are many systems whose normal runtime cost is minimal, with only occasional spikes. In AWS, you have the T series and Azure has the B series. Given how often RavenDB is deployed on the cloud, it shouldn’t surprise you that we are frequently being run on such instances. The cost savings can be quite attractive, around 15% in most cases. And in the case of small instances, that can be even more impressive.

RavenDB can run quite nicely on a Raspberry PI, or a resource starved container, and for many workloads, that make sense. Looking at AWS in this case, consider the t3a.medium instance with 2 cores and 4 GB at 27.4$ / month vs. a1.large (the smallest non burstable instance) with the same spec (but ARM machine) at 37.2$ per month. For that matter, a t3a.small with 2 cores and 2 GB of memory is 13.7$. As you can see, the cost savings adds up, and it make a lot of sense to want to use the most cost effective solution.

Enter the problem with burstable instances. They are bursty. That means that you have two options when you need more resources. Either you end up with throttling (reducing the amount of CPU that you can use) or you are being charged for the additional extra CPU power you used.

Our own production systems, running all of our sites and backend systems are running on a cluster of 3 t3.large instances. As I mentioned, the cost savings are significant. But what happens when you go above the limit? In our production systems, for the past 6 months, we have never had an instance where RavenDB used more than the provided burstable performance. It helps that the burstable machines allows us to accrue CPU time when we aren’t using it, but overall it means that we are doing a good job of  handling requests efficiently. Here are some metrics from one of the nodes in the cluster during a somewhat slow period (the range is usually 20 – 200 requests / sec).


So we are pretty good in terms of latency and resource utilizations, that’s great.

However, the immediately response to seeing that we aren’t hitting the system limits is… to reduce the machine size again, to pay even less. There is a lot of material on cost optimization in the cloud that you can read, but that isn’t the point of this post. One of the more interesting choices you can make with burstable instances is to ask to not go over the limit. If your system is using too much CPU, just take it away until it is done. Some workloads are just fine for this, because there is no urgency. Some workloads, such as servicing requests, are less kind to this sort of setup. Regardless, this is a common enough deployment model that we have to take it into account.

Let’s consider the worst case scenario from our perspective. We have a user that runs the cheapest possible instance a t3a.nano with 2 cores and 512 MB of RAM costing just 3.4$ a month. The caveat with this instance is that you have just 6 CPU credits / hour to your name. A CPU credit is basically 100% CPU for 1 minute. Another way of looking at this is that t t3a.nano instance has a budget of 360 CPU seconds per hour. If it uses more than that, it is charged a hefty fee (about ten times per hour than the machine cost). So we have users that disable the ability to go over the budget.

Now, let’s consider what is going to happen to such an instance when it hits an unexpected load. In the context of RavenDB, it can be that you created a new index on a big collection. But something seemingly simple such as generating a client certificate can also have a devastating impact on such an instance.RavenDB generates client certificates with 4096 bits keys. On my machine (nice powerful dev machine), that can take anything from 300 – 900 ms of CPU time and cause a noticeable spike in one of the cores:


On the nano machine, we have measured key creation time of over six minutes.

The answer isn’t that my machine is 800 times more powerful than the cloud machine. The problem is that this takes enough CPU time to consume all the available credits, which cause throttling. At this point, we are in a sad state. Any additional CPU credits (and time) that we earn goes directly to the already existing computation. That means that any other CPU time is pushed back. Way back. This happens at a level below the operating system (at the hypervisor), so there it isn’t even aware of it. What is happening from the point of view of the OS is that the CPU is suddenly much slower.

All of this make sense and expected given the burstable nature of the system. But what is the observed behavior?

Well, the instance appears to be frozen. All the cloud metrics will tell you that everything is green, but you can’t access the system (no CPU to accept new connections) you can’t SSH into it (same) and if you have an existing SSH connection, it appears to have frozen. Measuring performance from inside the machine shows that everything is cool. One CPU is busy (expected, generating the certificate, doing indexing, etc). Another is idle. But the system behaves as if it has no CPU available, which is exactly what is going on, except in a way that we can’t tell.

RavenDB goes to a lot of trouble to be a nice citizen and a good neighbor. This includes scheduling operations so the underlying OS will always have the required resources to handle additional load. For example, background tasks are run with lowered priority, etc. But when running on a burstable machine that is being throttled… well, from the outside it looked like certain trivial operations would just take the entire machine down and it wouldn’t be recoverable short of hard reboot.

Even when you know what is going on, it doesn’t really help you. Because from inside the machine, there is no way to access the cloud metrics in a good enough precision to take action.

We have a pretty strong desire to not get into these kind of situation, so we implemented what I can only refer to as counter measures. When you are running on a burstable instance, you can let RavenDB know what is your CPU credits situation, at which point RavenDB will actively monitor the machine and compute its own tally of the outstanding CPU credits situation. When we notice that the CPU credits are running short, we’ll start pro-actively halting internal processes to give the machine more space to recover from the CPU credits shortage.

We’ll move ETL processes and ongoing backups to other nodes in the cluster and even pause indexing in order to recover the CPU time we need. Here is an example of what you’ll see in such a case:


One of the nice things about the kind of promises RavenDB make about its indexes is that it is able to make these kind of decisions without impacting the guarantees that we provide.

If the situation is still bad, we’ll start proactively rejecting requests. The idea is that instead of getting of getting to the point where we are being throttled (and effectively down to the world), we’ll process a request by simply sending 503 Service Unavailable response back. This is going to be very cheap to do, so won’t put additional strain on our limited budget. At the same time, the RavenDB’s client treat this kind of error as a trigger for a failover and will use another node to service this request.

This was a pretty long post to explain that RavenDB is going to give you a much nicer experience on burstable machines even if you don’t have bursting capabilities.

time to read 4 min | 609 words

About five years ago, my wife got me a present, a FitBit. I didn’t wear a watch for a while, and I didn’t really see the need, but it was nice to see how many steps I took and we had a competition about who has the most steps a day. It was fun. I had a few FitBits since then and I’m mostly wearing one. As it turns out, FitBit allows you to get an export of all of your data, so a few months ago I decided to see what kind of information I have stored there, and what kind of data I can get from it.

The export process is painless and I got a zip with a lot of JSON files in it. I was able to process that and get a CSV file that had my heartrate over time. Here is what this looked like:


The file size is just over 300MB and it contains 9.42 million records, spanning the last 5 years.

The reason I looked into getting the FitBit data is that I’m playing with timeseries right now, and I wanted a realistic data set. One that contains dirty data. For example, even in the image above, you can see that the measurements aren’t done on a consistent basis. It seems like ten and five second intervals, but the range varies.  I’m working on a timeseries feature for RavenDB, so that was perfect testing ground for me. I threw that into RavenDB and I got the data to just under 40MB in side.

I’m using Gorilla encoding as a first pass and then LZ4 to further compress the data. In a data set where the duration between measurement is stable, I can stick over 10,000 measurements in a single 2KB segment. In the case of my heartrate, I can store an average of 672 entries in each 2KB segment. Once I have the data in there, I can start actually looking at interesting patterns.

For example, consider the following query:


Basically, I want to know how I’m doing on a global sense, just to have a place to start figuring things out. The output of this query is:


These are interesting numbers. I don’t know what I did to hit 177 BPM in 2016, but I’m not sure that I like it.

What I do like is this number:


I then run this query, going for a daily precision on all of 2016:


And I got the following results in under 120 ms.


These are early days for this feature, but I was able to take that and generate the following (based on the query above).


All of the results has been generated on my laptop, and we haven’t done any performance work yet. In fact, I’m posting about this feature because I was so excited to see that I got queries to work properly now. This feature is early stages yet.

But it is already quite cool.

time to read 4 min | 679 words

RavenDB uses X509 certificates for many purposes. One of them is to enable authentication by using clients certificates. This create a highly secured authentication method with quite a lot to recommend it. But it does create a problem. Certificates, by their very nature, expire. Furthermore, certificates usually have relatively short expiration times. For example, Let’s Encrypt certificates expire in 3 months. We don’t have to use the same cert we use for server authentication for client authentication as well, but it does create a nice symmetry and simplify the job of the admin.

Except that every cert replacement ( 3 months, remember? ) the admin will now need to go to any of the systems that we talk to and update the list of allowed certificates whenever we update the Let’s Encrypt certificate. One of the reasons behind this 3 months deadline is to ensure that you’ll automate the process of cert replacement, so it is obvious that we need a way to automate the process of updating third parties about cert replacements.

Our current design goes like this:

  • This design applies only to the nodes for which we authenticate using our own server certificate (thus excluding Pull Replication, for example).
  • Keep track of all the 3rd parties RavenDB instances that we talk to.
  • Whenever we have an updated certificate, contact each of those instances and let them know about the cert change. This is done using a request that authenticate using the old certificate and providing the new one.
  • The actual certificate replacement is delayed until all of those endpoints have been reached or until the expiration of the current certificate is near.

Things to consider:

  • Certificate updates are written to the audit log. And you can always track the chain of updates backward.
  • Obviously, a certificate can only register a replacement as long as it is active.
  • The updated certificate will have the exact same permissions as the current certificate.
  • A certificate can only ever replace itself with one other certificate. We allow to do that multiple times, but the newly updated cert will replace the previous updated cert.
  • A certificate cannot replace a certificate that it updated if that certificate has updated certificate as well.

In other words, consider certificate A that is registered in a RavenDB instance:

  • Cert A can ask the RavenDB instance to register updated certificate B, at which point users can connect to the RavenDB instance using either A or B. Until certificate A expires. This is to ensure that during the update process, we won’t see some nodes that we need to talk to using cert A and some nodes that we need to talk to using cert B.
  • Cert A can ask the RavenDB instance to register updated certificate C, at which point, certificate B is removed and is no longer valid. This is done in case we failed to update the certificate and need to update with a different certificate.
  • Cert C can then ask the RavenDB instance to register updated certificate D. At this point, certificate A become invalid and can no longer be used. Only certs C and D are now active.

More things to consider:

  • Certain certificates, such as the ones exposing Pull Replication, are likely going to be used by many clients. I’m not sure if we should allow certificate replacement there. Given that we usually won’t use the server cert for authentication in Pull Replication, I don’t see that as a problem.
  • The certificate update process will be running on only a single node in the cluster, to avoid concurrency issues.
  • We’ll provide a way to the admin to purge all expired certificates (although, with one update every 3 months, I don’t expect there to be many).
  • We are considering limiting this to non admin certificates only. So you will not be able to update a certificate if it has admin privileges in an automated manner. I’m not sure if this is a security feature or a feel good feature.
  • We’ll likely provide administrator notification that this update has happened on the destination node, and that might be enough to allow updating of admin certificates.

Any feedback you have would be greatly appreciated.

time to read 5 min | 962 words

I got into an interesting discussion about Event Sourcing in the comments for a post and that was interesting enough to make a post all of its own.

Basically, Harry is suggesting (I’m paraphrasing, and maybe not too accurately) a potential solution to the problem of having the computed model from all the events stored directly in memory. The idea is that you can pretty easily get machines with enough RAM to store stupendous amount of data in memory. That will give you all the benefits of being able to hold a rich domain model without any persistence constraints. It is also likely to be faster than any other solution.

And to a point, I agree. It is likely to be faster, but that isn’t enough to make this into a good solution for most problems. Let me to point out a few cases where this fails to be a good answer.

If the only way you have to build your model is to replay your events, then that is going to be a problem when the server restarts. Assuming a reasonably size data model of 128GB or so, and assuming that we have enough events to build something like that, let’s say about 0.5 TB of raw events, we are going to be in a world of hurt. Even assuming no I/O bottlenecks, I believe that it would be fair to state that you can process the events at a rate of 50 MB/sec. That gives us just under 3 hours to replay all the events from scratch. You can try to play games here, try to read in parallel, replay events on different streams independently, etc. But it is still going to take time.

And enough time that this isn’t a good technology to have without a good backup strategy, which means that you need to have at least a few of these machines and ensure that you have some failover between them. But even ignoring that, and assuming that you can indeed replay all your state from the events store, you are going to run into other problems with this kind of model.

Put simply, if you have a model that is tens or hundreds of GB in size, there are two options for its internal structure. On the one hand, you may have a model where each item stands on its own, with no relations to other items. Or if there are any relations to other items, they are well scoped to the a particular root. Call it the Root Aggregate model, with no references between aggregates. You can make something like that work, because you have a good isolation between the different items in memory, so you can access one of them without impacting another. If you need to modify it, you can lock it for the duration, etc.

However, if your model is interconnected, so you may traverse between one Root Aggregate to another, you are going to be faced with a much harder problem.

In particular, because there are no hard breaks between the items in memory, you cannot safely / easily mutate a single item without worrying about access from another item to it. You could make everything single threaded, but that is a waste of a lot of horsepower, obviously.

Another problem with in memory models is that they don’t do such a good job of allowing you to rollback operations. If you run your code mutating objects and hit an exception, what is the current state of your data?

You can resolve that. For example, you can decide that you have only immutable data in memory and replace that atomically. That… works, but it requires a lot of discipline and make it complex to program against.

Off the top of my head, you are going to be facing problems around atomicity, consistency and isolation of operations. We aren’t worried about durability because this is purely in memory solution, but if we were to add that, we would have ACID, and that does ring a bell.

The in memory solution sounds good, and it is usually very easy to start with, but it suffer from major issues when used in practice. To start with, how do you look at the data in production? That is something that you do surprisingly often, to figure out what is going on “behind the scenes”. So you need some way to peek into what is going on. If your data is in memory only, and you haven’t thought about how to explore it to the outside, your only option is to attach a debugger, which is… unfortunate. Given the reluctance to restart the server (startup time is high) you’ll usually find that you have to provide some scripting that you can run in process to make changes, inspect things, etc.

Versioning is also a major player here. Sooner or later you’ll probably put the data inside a memory mapped to allow for (much) faster restarts, but then you have to worry about the structure of the data and how it is modified over time.

None of the issues I have raised is super hard to figure out or fix, but in conjunction? They turn out to be a pretty big set of additional tasks that you have to do just to be in the same place you were before you started to put everything in memory to make things easier.

In some cases, this is perfectly acceptable. For high frequency trading, for example, you would have an in memory model to make decisions on as fast as possible as well as a persistent model to query on the side. But for most cases, that is usually out of scope. It is interesting to write such a system, though.

time to read 3 min | 505 words

Computation during indexes open up some nice  features when we are talking about data modeling and working with your data. In this post, I want to discuss predicting the future with it. Let’s see how we can do that, shall we?

Consider the following document, representing a (simplified) customer model:


We have a customer that is making monthly payments. This is a pretty straightforward model, right?

We can do a lot with this kind of data. We can obviously compute the lifetime value of a customer, based on how much they paid us. We already did something very similar in a previous post, so that isn’t very interesting.

What is interesting is looking into the future. Let’s see how we can start simple, but figuring out what is the next charge rate for this customer. For now, the logic is about as simple as it can be. Monthly customers pay by month, basically. Here is the index:


I’m using Linq instead of JS here because I’m dealing with dates and JS support for dates is… poor.

As you can see, we are simply looking at the last date and the subscription, figuring out how much we paid the last three times and use that as the expected next payment amount. That can allow us to do nice things, obviously. We can now do queries on the future. So finding out how many customers will (probably) pay us more than 100$ on the 1st of Feb both easy and cheap.

We can actually take this further, though. Instead of using a simple index, we can use a map/reduce one. Here is what this looks like:


And the reduce:


This may seem a bit dense at first, so let’s de-cypher it, shall we?

We take the last payment date and compute the average of the last three payments, just as we did before. The fun part now is that we don’t compute just the single next payment, but the next three. We then output all the payments, both existing (that already happened) and projected (that will happen in the future) from the map function. The reduce function is a lot simpler, and simply sum up the amounts per month.

This allows us to effectively project data into the future, and this map reduce index can be used to calculate expected income. Note that this is aggregated across all customers, so we can get a pretty good picture of what is going to happen.

A real system would probably have some uncertainty factor, but that touches on business strategy more than modeling, so I don’t think we need to go into that here.

time to read 4 min | 613 words

imageIn my last post on the topic, I showed how we can define a simple computation during the indexing process. That was easy enough, for sure, but it turns out that there are quite a few use cases for this feature that go quite far from what you would expect. For example, we can use this feature as part of defining and working with business rules in our domain.

For example, let’s say that we have some logic that determine whatever a product is offered with a warranty (and for how long that warranty is valid). This is an important piece of information, obviously, but it is the kind of thing that changes on a fairly regular basis. For example, consider the following feature description:

As a user, I want to be able to see the offered warranty on the products, as well as to filter searches based on the warranty status.

Warranty rules are:

  • For new products made in house, full warranty for 24 months.
  • For new products from 3rd parties, parts only warranty for 6 months.
  • Refurbished products by us, full warranty, for half of new warranty duration.
  • Refurbished 3rd parties products, parts only warranty, 3 months.
  • Used products, parts only, 1 month.

Just from reading the description, you can see that this is a business rule, which means that it is subject to many changes over time. We can obviously create a couple of fields on the document to hold the warranty information, but that means that whenever the warranty rules change, we’ll have to go through all of them again. We’ll also need to ensure that any business logic that touches the document will re-run the logic to apply the warranty computation (to be fair, these sort of things are usually done as a subscription in RavenDB, which alleviate that need).

Without further ado, here is the index to implement the logic above:

You can now query over the warranty types and it’s duration, project them from the index, etc. Whenever a document is updates, we’ll re-compute the warranty status and update the index.

This saves you from having additional fields in your model and greatly diminish the cost of queries that need to filter on warranty or its duration (since you don’t need to do this computation during the query, only once, during indexing).

If the business rule definition changes, you can update the index definition and RavenDB will effectively roll out your change to the entire dataset. That is nice, but even though I’m writing about cool RavenDB features, there are some words of cautions that I want to mention.

Putting queryable business rules in the database can greatly ease your life, but be wary of putting too much business logic in there. In general, you want your business logic to reside right next to the rest of your application code, not running in a different server in a mode that is much harder to debug, version and diagnose. And if the level of complexity involved in the business rule exceed some level (hard to define, but easy to know when you hit it), you should probably move from defining the business rules in an index to a subscription.

A RavenDB subscription allow you to get all changes to documents and apply your own logic in response. This is a reliable way to process data in RavenDB, this runs in your own code, under your own terms, so it can enjoy all the usual benefits of… well, being your code, and not mine. You can read more about them in this post and of course, the documentation.

time to read 4 min | 734 words

imageThe title of this post is pretty strange, I admit. Usually, when we think about modeling, we think about our data. If it is a relational database, this mostly mean the structure of your tables and the relations between them. When using a document database, this means the shape of your documents. But in both cases, indexes are there merely to speed things up. Oh, a particular important query may need an index, and that may impact how you lay out the data, but these are relatively rare cases. In relational databases and most non relational ones, indexes do not play any major role in data modeling decisions.

This isn’t the case for RavenDB. In RavenDB, an index doesn’t exist merely to organize the data in a way that make it easier for the database to search for it. An index is actually able to modify and transform the data, on the current document or full related data from related documents. A map/reduce index is even able aggregate data from multiple documents as part of the indexing process. I’ll touch on the last one in more depth later in this series, first, let’s tackle the more obvious parts. Because I want to show off some of the new features, I’m going to use JS for most of the indexes definitions in these psots, but you can do the same using Linq / C# as well, obviously.

When brain storming for this post, I got so many good ideas about the kind of non obvious things that you can do with RavenDB’s indexes that a single post has transformed into a series and I got two pages of notes to go through. Almost all of those ideas are basically some form of computation during indexing, but applied in novel manners to give you a lot of flexibility and power.

RavenDB prefers to have more work to do during indexing (which is batched and happen on the background) than during query time. This means that we can push a lot more work into the background and just let RavenDB handle it for us. Let’s start from what is probably the most basic example of computation during query, the Order’s Total. Consider the following document:


As you can see, we have the Order document and the list of the line items in this order. What we don’t have here is the total order value.

Now, actually computing how much you are going to pay for an order is complex. You have to take into account taxation, promotions, discounts, shipping costs, etc. That isn’t something that you can do trivially, but it does serve to make an excellent simple example and similar requirements exists in many fields.

We can obvious add an Total field to the order, but then we have to make sure that we update it whenever we update the order. This is possible, but if we have multiple clients consuming the data, this can be fairly hard to do. Instead, we can place the logic to compute the property in the index itself. Here is how it would look like:


The same index in JavaScript is almost identical:

In this case, they are very similar, but as the complexity grow, I find it is usually easier to express logic as a JavaScript index rather than use a single (complex) Linq expression.

Such an index give us a computed field, Total, that has the total value of an order. We can query on this field, sort it and even project it back (if this field is stored). It allow us to always have the most up to date value and have the database take care of computing it.

This basic technique can be applied in many different ways and affect the way we shape and model our data. Currently I have at least three more posts planned for this series, and I would love to hear your feedback. Both on the kind of stuff you would like me to talk about and the kind of indexes you are using RavenDB and how it impacted your data modeling.

time to read 13 min | 2462 words

imageMost developers have been weaned on relational modeling and have to make a non trivial mental leap when the times comes to model data in a non relational manner. That is hard enough on its own, but what happens when the data store that you use actually have multi model capabilities. As an industry, we are seeing more and more databases that take this path and offer multiple models to store and query the data as part of their core nature. For example, ArangoDB, CosmosDB, Couchbase, and of course, RavenDB.

RavenDB, for example, gives you the following models to work with:

  • Documents (JSON) – Multi master with any node accepting reads and writes.
    • ACID transactions over multiple documents.
    • Simple / full text queries.
    • Map/Reduce and aggregation queries.
  • Binary data – Attachments to documents.
  • Counters (Map<string, int64>) – CRDT multi master distributed counters.
  • Key/Value – strong distributed consistency via Raft protocol.
  • Graph queries – on top of the document model.
  • Revisions – built in audit trail for documents

With such a wealth of options, it can be confusing to select the appropriate tool for the job when you need to model your data. In this post, I aim to make sense of the options RavenDB offers and guide you toward making the optimal choices.

The default and most common model you’ll use is going to be the document model. It is the one most appropriate for business data and you’ll typically follow the Domain Driven Design approach for modeling your data and entities. In other words, we are talking about Aggregates, where each document is a whole aggregate. References between entities are either purely local to an aggregate (and document) or only between aggregates. In other words, a value in one document cannot point to a value in another document. It can only point to another document as a whole.

Most of your business logic will be focused on the aggregate level. Even when a single transaction modify multiple documents, most of the business logic is done at each aggregate independently. A good way to handle that is using Domain Events. This allow you to compose independent portions of your domain logic without tying it all in one big knot.

We talked about modifying documents so far, but a large part of what you’ll do with your data is query and present it to users. In these cases, you need to make a conscious and explicit decision. Whatever your display model is going to be based on your documents or a different source. You can use RavenDB ETL to project the data out to a different database, changing its shape to the appropriate view model along the way. RavenDB ETL allow you to replicate a portion of the data in your database to another location, with the ability to modify the results as they are being sent. This can be a great tool to aid you in bridging the domain model and the view model without writing a lot of code.

This is useful for applications that have a high degree of complexity in their domain and business rules. For most applications, you can simply project the relevant data out at query time, but for more complex systems, you may want to have strict physical separation between your read model and the domain model. In such a scenario, Raven ETL  can greatly simplify the task of facilitating the task of moving (and transforming) the data from the write side to the read side.

When it comes to modeling, we also need to take into account RavenDB’s map/reduce indexes. These allow you to define aggregation that will run in the background, in other words, at query time, the work has already been done. This in turn leads to blazing fast aggregation queries and can be another factor in the design of the system. It is common to use map/reduce indexes to aggregation the raw data into a more usable form and work with the results. Either directly from the index or use the output collection feature to turn the results of the map/reduce index to real documents (which can be further indexes, aggregated, etc).

Thus far, we have only touched on document modeling, mind. There are a bunch of other options as well. We’ll start from the simplest option, attachments. At its core, an attachment is just that. A way to attach some binary data to the document. As simple as it sounds, it has some profound implications from a modeling point of view. The need to store binary data somewhere isn’t new, obviously and there have been numerous ways to resolve it. In a relational database, a varbinary(max) column is used. In a document database, I’ve seen the raw binary data stored directly in the document (either as raw binary data or as BASE64 encoded value). In most cases, this isn’t a really good idea. It blow up the size of the document (and the table) and complicate the management of the data. Storing the data on the file system lead to a different set of problems, coordinating transactions between the database and the file system, organizing the data, securing paths such as “../../etc/passwd”, backups and restore, and many more.


These are all things that you want your database to handle for you. At the same time, binary data is related to but not part of the document. For those reasons, we use the attachment model in RavenDB. This is meant to be viewed just like attachments in email. The binary data is not stored inside the document, but it is strongly related to it. Common use cases for attachments include profile picture for a user’s document, the signatures on a lease document, the excel spreadsheet with details about a loan for a payment plan document or the associated pictures from a home inspection report. A document can have any number of attachments, and an attachment can be of any size. This give you a lot of freedom to attach (pun intended) additional data to your documents without any hassle. Like documents, attachments also work in multi master mode and will be replicated across the cluster with the document.


While attachments can be any raw binary data and has only a name (and optional mime type) for structure, counters are far more strictly defined. A counter is… well, a counter. It count things. On the most basic level, it is just a named 64 bits integer that is associated with a document. And like attachments, a document may have any number of such counters. But why is it important to have a 64 bits integer attached to the document? How could something so small can be important enough that we would need a whole new concept for it? After all, couldn’t we just store the same counter more simply as a property inside the document?

To understand why RavenDB have counters, we need to understand what they aren’t. The are related to the document, but not of the document. That means that an update to the counter is not going to modify the document as a whole. This, in turn, means that operations on the counters can be very cheap, regardless of how many counters you have in a document or how often you modify the counter. Having the counter separate from the document allow us to do several important things:

  • Cheap updates
  • Distributed modifications

In a multi master cluster, if any node can accept any write, you need to be aware of conflicts, when two updates to the same value were made on two disjoint servers. In the case of documents, RavenDB detect and resolve it according to the pre-defined policy. In the case of counters, there is no such need. A counter in RavenDB is stored using a CRDT. This is a format that allow us to handle concurrent modifications to the same value without losing data or expensive locks. This make counter suitable for for values that changes often. Good examples of counters is tracking views on a page or an ad, you can distribute the operations on a number of servers and still reach the correct final tally. This works both for increment and decrement, obviously.

Given that counters are basically just map<string, int64> you might expect that there isn’t any modeling work to be done here, right? But it turns out that there is actually quite a bit that can be done even with that simple an interface. For example, when tracking views on a page or downloads for a particular package, I’m interested not only in the total number of downloads, but also in the downloads per day. In such a case, whenever we want to note another download, we’ll increment both the counter for overall download and another counter for downloads for that particular day. In other words, the name of the counter can hold meaningful information.


So far, all the data we have talked about were stored and accessed in a multi-master manner. In other words, we could chose any node in the cluster and make a write to it and it would be accepted. Data that is modified on multiple nodes at the same time would either be merged (counters), stored (attachments) or resolved (documents). This is great when you care about the overall availability of your system, we are always accepting writes and always proceed forward. But that isn’t always the case, there are situations where you might need to have a higher degree of consistency in your operations. For example, if you are selling a fixed number of items, you want to be sure that two buyers hitting “Purchase” at the same time don’t cause you problems just because their requests used different database servers.

In order to handle this situation, RavenDB offers the Cmp Xcng model. This is a cluster wide key/value store for your database, it allows you to store named values (integer, strings or JSON objects) in a consistent manner. This feature allow you to ensure consistent behavior for high value data. In fact, you can combine this feature with cluster wide transactions.

Cluster wide transactions allow you to combine operations on documents with Cmp Xcng ops to create a single consistent transaction. This mode enable you to perform conditional operations to modify your documents based on the globally consistent Cmp Xcng values.

A good example for Cmp Xcng values include pessimistic locks and their owners, to generate a cluster wide lock that is guaranteed to be consistent and safe regardless of what is going on with the cluster. Other examples can be to store global configuration for your system in all the nodes in the cluster.

Graph Queries

Graph data stores are used to hold data about nodes and the edges between them. They are a great tool to handle tasks such as social network, data mining and finding patterns in large datasets. RavenDB, as of release 4.2, has support for graph queries, but it does so in a novel manner. A node in RavenDB is a document, quite naturally, but unlike other features, such as attachments and counters, edges don’t have separate physical existence. Instead, RavenDB is able to use the document structure itself to infer the edges between documents. This dynamic nature means that when the time comes to apply graph queries on top of your existing database, you don’t have to do a lot of prep work. You can start issuing graph queries directly, and RavenDB will work behind the scenes to make sure that all the data is found, and quickly, too.

The ability to perform graph queries on your existing document structure is a powerful one, but it doesn’t alleviate the need to model your data properly to best take advantage of this. So what does this mean, modeling your data to be usable both in a document form and for graph operations? Usually, when you need to model your data in a graph manner, you think mostly in terms of the connection between the nodes.

One way of looking at graph modeling in RavenDB is to be explicit about the edges, but I find this awkward and limiting. It is usually better to express the domain model naturally and allow the edges to pop up from the underlying data as you work with it. Edges in RavenDB are properties (or nested objects) that contain a reference to another document. If the edge in a nested object, then all the properties of the object are also the properties on the edge and can be filtered upon.

For best results, you want to model your edge properties as a single nested object that can be referred to explicitly. This is already a best practice when modeling your data, for better cohesiveness, but graph queries make this a requirement.

Unlike other graph databases, RavenDB isn’t limited to just graph representation. A graph queries in RavenDB is able to utilize the full power of RavenDB queries, which means that you can start your graph operation with a spatial query and then proceed to the rest of the graph pattern match. You should aim to do most of the work in the preparatory queries and not spend most of the time in graph operations.

A common example of graph operation is fraud detection, with graph queries to detect multiple orders made using many different credit card for the same address. Instead of trying to do the matches using just graph operations, we can define a source query on a map/reduce index that would aggregate all the results for orders on the same address. This would dramatically cut down on the amount of work that the database is required to do to answer your queries.


The final topic that I want to discuss is this (already very long) post is the notion of Revisions. RavenDB allow the database administrator to define a revisions policy, in which case RavenDB will maintain, automatically and transparently, an immutable log of all changes on documents. This means that you have a built-in audit trail ready for use if you need to. Beyond just having an audit trail, revisions are also very important feature in several key capabilities in RavenDB.

For example, using revisions, you can get the tuple of (previous, current) versions of any change made in the database using subscriptions. This allow you to define some pretty interesting backend processes, which have full visibility to all the changes that happen to the document over time. This can be very interesting in regression analysis, applying business rules and seeing how the data changes over time.


I tried to keep this post at a high level and not get bogged down in the details. I’m probably going to have a few more posts about modeling in general and I would appreciate any feedback you may have or any questions you can raise.

time to read 6 min | 1079 words

I’m certain that the archangel that is responsible for statistical grouping has a strange fondness for jokes. If that isn’t the case, I can’t really explain how we routinely get clear clustering of bug reports on specific issues. We have noticed that several years ago, when we suddenly started to get a lot of reports of a particular issue. There was no recent release and no reason for all of these correlated errors to happen all at once. But they did. Luckily, the critical mass of reports gave us more than enough information to figure out what the bug was and fix it. When we checked, the bug was there, dormant, for 20 months.

The current issue we are handling is a spate of reports about hard errors in RavenDB. The kind of errors that use the terms invalid data, in particular. These kind of errors bring development to a screeching halt, as we direct all available resources to figure out what is going on. We didn’t have a reproduction, but we did have a few cases of people reporting it and the stack trace we got told us that the problem wasn’t isolated to a single location but seemed to be fairly wide spread. This is a Good News / Bad News kind of situation. It is good because it means that the problem is in a low level component, it is bad because it is going to be hard to narrow it down exactly.

The common theme around this issue popped up when the system was already in a bad state. We were able to see that when we run out of disk space or when the memory on the system was very low. The common thread to these was that in both cases these are expected and handled scenarios. What is more, these are the kind of errors that we want to keep on shouldering through.

In general, the error strategy we use in most of RavenDB is fairly simple. Abort the transaction and give the user an error. If the error is considered catastrophic (I/O failure when writing to disk, for example), we also force a restart of the database (which does full recovery). As it turned out, there were three completely separate issues that we found when investigating this issue.

First, and the most critical one was cleanup code that wasn’t aware of the state of the system. For example, consider:

Let’s assume that we have an error during the WriteData. For example, we run out of space on disk. We abort the transaction and throw an exception. The FlushIndex() method, however, will go into the catch clause and try to cleanup its resources. When doing so, it will access the transaction (that was just aborted) and try to use it, still.

The problem is that at this point the state of the transaction is known bad, the only thing you are allowed to do is to dispose it, you cannot access anything on it. The code in question, however, is meant to be used on non transactional resource and require compensating actions to revert to previous state.

For that matter, the code above has another problem, however. Do you see the using statement here? It seemed like we have quite a bit of code that does stuff in the Dispose method. And if that stuff also touch the transaction, you may get what is called in the profession: “hilarity ensued”.

The problem was that our transaction code was too trusting and assumed that the calling code will not call it if it is in invalid state. Indeed, the cases where we did such a thing were rare, but when they happened.

The second problem was when we had code like this:

Except that the real code is a lot more complex and it isn’t as each to see what is going on.

What you see here is that we run commands, and as we process them, we may get (expected) exceptions, which we should report to the caller. The problem is that we mixed expected exceptions with unexpected ones. And these unexpected exceptions could leave us in… a funny state. At which point we would continue executing future commands and even commit the transaction. As you can imagine, that isn’t a good place to be at.

We have gone over our error handling and injected faults and errors at any later of the stack that we could conceive of. We have been able to fix a lot of issues, most of them have probably never been triggered, but it was a sobering moment to look at some of those changes. The most likely cause for the errors was a change that was made (by me, as it turns out) over two years ago. And in all that time, we never seen neither hide nor hair of it. Until suddenly we got several simultaneous cases.

The third and final problem, by the way, was related to not crashing. By default, any severe enough error should cause us to shut down the database and restart it. In the process, we re-apply the journal and ensure that we are in a consistent state. The problem is that we don’t want to do that for certain expected errors. And the issue with staying up was that while Voron (at the storage layer) handled the error properly, the higher level code did not. At that point we had a divergence of the in memory state vs. the on disk state. That usually led to either NRE that would remain until the server was restarted or really scary messages that would typically go away when we recovered the database and reloaded the in memory state from the on disk state.

In short, handling errors in the error handling code is tough.

The primary changes we made were on the transaction side, we made it validate its own state when called, so code that erroneously try to use a transaction that has already failed will error early. We have also added additional validation of operations in several key points. That would allow us to catch errors much more quickly and allow us to pinpoint exactly where things are breaking apart sooner.

The current state is that we pushed these changes to our production system and are running the usual battery of tests to prove that the changes are valid. We’ll also be adding more faults into the process to ensure that we exercise our error handling a lot more often.

time to read 5 min | 944 words

imageWe test RavenDB as thoroughly as we can. Beyond just tests, longevity runs, load test and in general beating it with a stick and seeing if it bleats, we also routinely push early build of RavenDB to our own production systems to see how it behaves on real load.

This has been invaluable in catching several hard to detect bugs. A few days after we pushed a new build to production, we started getting errors about running out of disk space. The admin looked, and it seemed to have fixed itself. Then it repeated the issue.

It took a bit of time to figure out what was going on. A recent change caused RavenDB to hang on to what are supposed to be transient files until an idle moment. Where idle moment is defined as 5 seconds without any writes. Our production systems don’t have an idle moment, here is a typical load on our production system:


We’ll range between lows of high 30s requests per seconds to peeks of 200+ requests per second.

So we never have an idle moment to actually clean up those files, so they gather up. Eventually, we run out of disk space. At this point, several interesting things happen. Here is what our topology looks like:


We have three nodes in the cluster, and currently, node B is the leader of the cluster as a whole. Most of our operations relate to a single database, and for that database, we have the following topology:


Note that in RavenDB, the cluster as a whole and a particular database may have different topologies. In particular, the primary node for the database and the leader of the cluster can and are often different. In this case, Node A is the primary for the database and node B is the leader of the cluster.

Under this scenario, we have Node A accepting all the writes for this database, and then replicating the data to the other nodes. Here is what this looks like for node C.


In other words, Node A is very busy handling requests and writes. Node C is just handling replicated writes. Just for reference, here is what Node A’s CPU looks like, the one that is busy:


In other words, it is busy handling requests. But it isn’t actually busy. We got a lot of spare capacity at hand.

This make sense, our benchmark scenarios is starting with tens of thousands of requests per seconds. A measly few hundreds per second aren’t actually meaningful load. The servers in questions, by the way, are t2.large with 2 cores and 8GB of RAM.

So we have a bug in releasing files when we aren’t “idle”. The files just pile up and eventually we run out of disk space. Here is what this looks like in the studio:


And clicking on details gives us:


So this looks bad, but let’s us see how this actually turns out to work in practice.

We have a failure on one node, which causes the database to be unloaded. RavenDB is meant to run in an environment where failure is likely and is built to handle that. Both clients and the cluster as a whole know that when such things happen, either to the whole node or to one particular database on it, we should respond accordingly.

The clients automatically fail over to a secondary node. In the case of the database in question, we can see that if Node A is failure, the next in like would be Node C. The cluster will also detect that are re-arrange the responsibilities of the nodes to indicate that one of them failed.

The node on which the database has failed will attempt to recover from the error. At this point, the database is idle, in the sense that it doesn’t process any requests, and will be able to cleanup all the files and delete them. In other words, the database goes down, restarts and recover.

Because of the different write patterns on the different nodes, we’ll run into the out of disk space error at different times. We have been running in this mode for a few days now. The actual problem, by the way, has been identified and resolved. We just aren’t any kind of pressure to push the fix to production.

Even under constant load, the only way we can detect this problem is through the secondary affects, the disk space on the machines that is being monitored. None of our production systems has reported any failures and monitoring is great across the board. This is a great testament to the manner in which expecting failure and preparing for it at all levels of the stack really pays off.

My production database is routinely running out of space? No worries, I’ll fix that on the regular schedule, nothing is actually externally visible.


No future posts left, oh my!


  1. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  2. Reviewing mimalloc (2):
    22 Jul 2019 - Part II
  3. Production postmortem (26):
    07 Jun 2019 - Printer out of paper and the RavenDB hang
  4. Reviewing Sled (3):
    23 Apr 2019 - Part III
  5. RavenDB 4.2 Features (5):
    21 Mar 2019 - Diffing revisions
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats