Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 7 min | 1301 words

Last week I posted about some timeseries work that we have been doing with RavenDB. But I haven’t actually talked about the feature in this space before, so I thought that this would be a good time to present what we want to build.

The basic idea behind a timeseries is that it is a set of data points taken over time. We usually don’t care that much about an individual data point, but we care a lot about their aggregation. Common usages for timeseries include:

  • Heart beats per minute
  • CPU utilization
  • Central bank interest rate
  • Disk I/O rate
  • Height of ocean tide
  • Location tracking for a vehicle
  • USD / Bitcoin closing price

As you can see, the list of stuff that you might want to apply this to is quite diverse. In a world that keeps getting more and more IoT devices, timeseries storing sensor data are becoming increasingly common. We looked into quite a few timeseries databases to figure out what needs they serve when we set out to design and build timeseries support for RavenDB.

RavenDB is a document database, and we envision timeseries support as something that you use at the document boundary. A good example of that would be the heartrate scenario. Each person has their own timeseries that records their heartrate over time. In RavenDB, you would model this as a document for each person, and a heartrate timeseries on each document.

Here is how you would add a data point to my Heartrate’s timeseries:


I intentionally start from the Client API, because it allows me to show off several things at once.

  1. Appending a value to a timeseries doesn’t require us to create it upfront. It will be created automatically on first use.
  2. We use UTC date times for consistency and the timestamps have millisecond precision.
  3. We are able to record a tag (the source for this measurement) on a particular timestamp.
  4. The timeseries will accept an array of values for a single timestamp.

Each one of those items is quite important to the design of RavenDB timeseries, so let’s address them in order.
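Those four behaviors can be sketched with a tiny in-memory model (plain Python for illustration; this is not the RavenDB client API, and every name here is invented for the sketch):

```python
from datetime import datetime, timezone

class TimeSeriesDocument:
    """Illustrative stand-in for a document that owns timeseries."""
    def __init__(self):
        self.timeseries = {}  # series name -> list of (utc_ts, tag, values)

    def append(self, name, timestamp, values, tag=None):
        # 1. The series is created automatically on first use.
        series = self.timeseries.setdefault(name, [])
        # 2. Timestamps are normalized to UTC, millisecond precision.
        ts = timestamp.astimezone(timezone.utc)
        ts = ts.replace(microsecond=(ts.microsecond // 1000) * 1000)
        # 3. An optional tag records the source of the measurement.
        # 4. A single timestamp can carry an array of values.
        series.append((ts, tag, list(values)))

doc = TimeSeriesDocument()
doc.append("Heartrate", datetime.now(timezone.utc), [58.0], tag="watches/fitbit")
doc.append("Location", datetime.now(timezone.utc), [32.0853, 34.7818])  # lat, lng
```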

The first thing to address is that we don’t need to create timeseries ahead of time. Doing so would introduce a level of schema to the database, which is something that we want to avoid. We want to allow the user complete freedom and a minimum of fuss when they are building features on top of timeseries. That does lead to some complications on our end. We need to be able to support timeseries merging: allowing you to append values on multiple machines and merge them together into a coherent whole.

Given the nature of timeseries, we don’t expect to see conflicting values. While you might see the same values come in multiple times, we assume that in that case you’ll likely just get the same values for the same timestamps (duplicate writes). In the case of different writes on different machines with different values for the same timestamp, we’ll arbitrarily select the largest of those values and proceed.
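The merge rule can be sketched like this (illustrative Python over a timestamp → value map; not RavenDB’s actual implementation):

```python
def merge_timeseries(a, b):
    """Merge two replicas of a series (timestamp -> value).

    Duplicate writes carry identical values, so max() is a no-op for
    them; for a genuine conflict (same timestamp, different values)
    the larger value wins, as described above.
    """
    merged = dict(a)
    for ts, value in b.items():
        merged[ts] = max(merged.get(ts, value), value)
    return merged

node1 = {1000: 58.0, 2000: 60.0}
node2 = {2000: 61.0, 3000: 59.0}   # conflicting value at ts=2000
merge_timeseries(node1, node2)     # -> {1000: 58.0, 2000: 61.0, 3000: 59.0}
```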

Another implication of this behavior is that we need to handle out of order updates. Typically with timeseries, you’ll record values in increasing date order, but we need to be able to accept values out of order. This turns out to be pretty useful in general, not just for handling values from multiple sources, but also because you may need to load archived data into an already existing timeseries. The rule that guided us here was that we wanted to allow the user as much flexibility as possible, and we’ll handle any resulting complexity.

The second topic to deal with is the time zone and precision. Given the overall complexity of time zones in general, we decided that we don’t want to deal with any of that and want to store the times in UTC only. That allows you to work properly with timestamps taken from different locations, for example. Given the expected usage scenarios for this feature, we also decided to support millisecond precision. We looked at supporting only second level of precision, but that was far too limiting. At the same time, supporting lower resolution than millisecond would result in much lower storage density for most situations and is very rarely useful.

Using DateTime.UtcNow, for example, we get a resolution of 0.5 – 15 ms, so trying to represent time at a finer resolution isn’t really going to give us anything. Other platforms have similar constraints, which added to the consideration of only capturing the time at millisecond granularity.

The third item on the list may be the most surprising one. RavenDB allows you to tag individual timestamps in the timeseries with a value. This gives you the ability to record metadata about the value. For example, you may want to use this to record the type of instrument that supplied the value. In the code above, you can see that this is a value that I got from a FitBit watch. I’m going to assign it a lower confidence value than a value that I got from an actual medical device, even if both of those values go into the same timeseries.

We expect that the number of unique tags for values in a given time period is going to be small, and optimize accordingly. Because of the number of weasel words in the last sentence, I feel that I must clarify. A given time period is usually in the order of an hour to a few days, depending on the number of values and their frequencies. And what matters isn’t so much the number of values with a tag, but the number of unique tags. We can very efficiently store tags that we have already seen, but having each value tagged with a different tag is not something that we designed the system for.
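A sketch of why this matters (illustrative Python, not RavenDB’s on-disk format): each segment keeps a small table of the unique tags it has seen, and entries reference tags by index, so repeating the same tag costs almost nothing:

```python
class Segment:
    """Toy segment: unique tags are interned once per segment."""
    def __init__(self):
        self.tags = []          # unique tags seen in this segment
        self.entries = []       # (timestamp, tag_index, values)

    def append(self, timestamp, values, tag=None):
        if tag is None:
            idx = -1                         # untagged entry
        elif tag in self.tags:
            idx = self.tags.index(tag)       # already seen: reuse
        else:
            self.tags.append(tag)            # new unique tag grows the table
            idx = len(self.tags) - 1
        self.entries.append((timestamp, idx, values))

seg = Segment()
for ts in range(1000):
    seg.append(ts, [60.0], tag="watches/fitbit")
len(seg.tags)   # -> 1: a thousand values, one stored tag
```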

You can also see that the tag that we have provided looks like a document id. This is not accidental. We expect you to store a document id there, and use the document itself to store details about the value. For example, whether the type of the device that captured the value is medical grade or just a hobbyist one. You’ll be able to filter by the tag as well as by the related tag document’s properties. But I’ll show that when I post about queries, in a different post.

The final item on the list that I want to discuss in this post is the fact that a timestamp may contain multiple values. There are actually quite a few use cases for recording multiple values for a single timestamp:

  • Longitude and latitude GPS coordinates
  • Bitcoin value against USD, EUR, YEN
  • Systolic and diastolic reading for blood pressure

In each case, we have multiple values to store for a single measurement. You can make the case that Bitcoin vs. the various currencies may be stored as standalone timeseries, but GPS coordinates and blood pressure both produce two values that are not meaningful on their own. RavenDB handles this scenario by allowing you to store multiple values per timestamp, including support for each timestamp coming with a different number of values. Again, we are trying to make this feature as easy as possible to use.

The number of values per timestamp is going to be limited to 16 or 32, we haven’t made a final decision here. Regardless of the actual maximum size, we don’t expect to have more than a few of those values per timestamp in a single timeseries.

Then again, the point of this post is to get you to consider this feature in your own scenarios and provide feedback about the kind of usage you want to have for this feature. So please, let us know what you think.

time to read 4 min | 609 words

About five years ago, my wife got me a present, a FitBit. I hadn’t worn a watch for a while, and I didn’t really see the need, but it was nice to see how many steps I took, and we had a competition about who had the most steps a day. It was fun. I have had a few FitBits since then and I’m mostly wearing one. As it turns out, FitBit allows you to get an export of all of your data, so a few months ago I decided to see what kind of information I have stored there, and what kind of data I can get from it.

The export process is painless and I got a zip with a lot of JSON files in it. I was able to process that and get a CSV file that had my heartrate over time. Here is what this looked like:


The file size is just over 300MB and it contains 9.42 million records, spanning the last 5 years.

The reason I looked into getting the FitBit data is that I’m playing with timeseries right now, and I wanted a realistic data set, one that contains dirty data. For example, even in the image above, you can see that the measurements aren’t taken on a consistent basis. It seems like ten and five second intervals, but the range varies. I’m working on a timeseries feature for RavenDB, so that was a perfect testing ground for me. I threw the data into RavenDB and got it down to just under 40MB in size.

I’m using Gorilla encoding as a first pass and then LZ4 to further compress the data. In a data set where the duration between measurements is stable, I can fit over 10,000 measurements in a single 2KB segment. In the case of my heartrate, I can store an average of 672 entries in each 2KB segment. Once I have the data in there, I can start actually looking for interesting patterns.
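For intuition, here is a simplified sketch of the delta-of-delta idea that Gorilla-style timestamp compression relies on (the real encoding packs these into a variable-width bitstream; this Python version is only an illustration):

```python
def delta_of_delta(timestamps):
    """Encode timestamps as (first, first_delta, delta-of-deltas).

    For stable sampling intervals the delta-of-delta is almost always
    zero, which is why such series compress so well.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def reconstruct(first, first_delta, dods):
    timestamps = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

ts = [0, 5000, 10000, 15000, 25000, 30000]   # mostly 5s intervals, in ms
first, fd, dods = delta_of_delta(ts)
# dods == [0, 0, 5000, -5000]: long runs of zeros cost almost nothing
assert reconstruct(first, fd, dods) == ts
```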

For example, consider the following query:


Basically, I want to know how I’m doing in a global sense, just to have a place to start figuring things out. The output of this query is:


These are interesting numbers. I don’t know what I did to hit 177 BPM in 2016, but I’m not sure that I like it.

What I do like is this number:


I then run this query, going for a daily precision on all of 2016:


And I got the following results in under 120 ms.


These are early days for this feature, but I was able to take that and generate the following (based on the query above).


All of the results have been generated on my laptop, and we haven’t done any performance work yet. In fact, I’m posting about this feature because I was so excited to see that I got queries to work properly. This feature is in its early stages yet.

But it is already quite cool.

time to read 1 min | 81 words

Yesterday I talked about the design of the security system of RavenDB. Today I re-read one of my favorite papers ever about the topic.

This World of Ours by James Mickens

This is both one of the most hilarious papers I have ever read (I had someone check up on me while I was reading it, because of suspicious noises coming from my office) and a great insight into threat modeling and the kind of operating environment that your system will run in.

time to read 6 min | 1072 words

RavenDB stores (critical) data for customers. We have customers in pretty much every field imaginable: healthcare, finance, insurance and defense. They do very different things with RavenDB, some run a single cluster, some deploy to tens of thousands of locations. The one thing that they all have in common is that they put their data into RavenDB, and they really don’t want to put that data in the hands of an unknown third party.

Some of my worst nightmares are articles such as these:

That is just for the last six months, and just one site that I checked.

To be fair, none of these cases are because of a fault in MongoDB. It wasn’t some clever hack or a security vulnerability. It was someone who left a production database accessible over the public Internet with no authentication.

  1. Production database + Public Internet + No authentication
  2. ?
  3. Profit (for someone else, I assume)

When we set out to design the security model for RavenDB, we didn’t account only for bad actors and hostile networks. We had to account for users who did not care.

Using MongoDB as the example, by default it will only listen on localhost, which sounds like a good idea, because no one external can access it. Safe by default, flowers, parade, etc.

And then you realize that the first result for searching: “mongodb remote connection refused” will lead to this page:


Where you’ll get a detailed guide on how to change which IPs MongoDB will listen on. And guess what? If you follow that article, you’ll fix the problem. You would be able to connect to your database instance, as would everyone else in the world!

There is even a cool tip in the article, talking about how to enable authentication in MongoDB. Because everyone reads that, right?


Except maybe the guys at the beginning of this post.

So our threat model had to include negligent users. And that leads directly to the usual conundrum of security.

I’ll now pause this post to give you some time to reflect on the Wisdom of Dilbert:

In general, I find that the best security for a computer is to disconnect it from any power sources. That does present some challenges for normal operations, though. So we had to come up with something better.

In RavenDB, security is binary. You are either secured (encrypted communication and mutual authentication) or you are not (everything is plain text and everyone is admin). Because the Getting Started scenario is so important, we have to account for it, so you can start RavenDB without security. However, that will only work when you set RavenDB to bind to localhost.

How is that any different from MongoDB? Well, the MongoDB guys have a pretty big set of security guidelines. At one point I took a deep look at them and, excluding the links for additional information, the MongoDB security checklist consisted of about 60 pages. We decided to go a very different route with RavenDB.

If you try to change the binding of RavenDB from localhost without setting up security, it will work, and RavenDB will happily start up and serve an error page to all and sundry. That error page is very explicit about what is going on: you are doing something wrong, you don’t have security and you are exposed. So the only thing that RavenDB is willing to do at that point is to tell you what is wrong, and how to fix it.
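The behavior described above boils down to a simple decision at startup; here is a sketch of that logic (illustrative Python, not actual RavenDB code, and the mode names are invented):

```python
def startup_mode(bind_address, has_certificate):
    """Decide how the server runs: unsecured operation is only allowed
    on the loopback interface; an unsecured server bound to anything
    else serves nothing but an explanatory error page."""
    if has_certificate:
        return "secured"                       # TLS + mutual authentication
    if bind_address in ("127.0.0.1", "localhost", "::1"):
        return "unsecured-local"               # fine for Getting Started
    return "error-page-only"                   # exposed and unsecured: refuse

startup_mode("0.0.0.0", has_certificate=False)   # -> "error-page-only"
```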

That led us to the actual security mechanism in RavenDB. We use TLS 1.2, but it is usually easier to just talk about it as HTTPS. That gives us encrypted data over the wire and allows for mutual authentication at the highest level. It is also something that you can configure on your own, without requiring an administrator to intervene. The person setting up RavenDB is unlikely to have Domain Admin privileges or the ability to change organization-wide settings, nor should this be required. HTTPS relies on certificates, which can be deployed, diagnosed and debugged without any special requirements.

Certificates may not require you to have a privileged access level, but they are complex. One of the reasons we chose X509 certificates as our primary authentication system is that they are widely used. Many places already have policies and expertise on how to deal with them. And for the people who don’t know how to deal with them, we could automate a lot of that and still get the security properties that we wanted.

In fact, Let’s Encrypt integration allowed us to get to the point where we can set up a cluster from scratch, with security, in a few minutes. I actually got it on video, because it was so cool to be able to do this.

Using certificates also meant that we could get integration with pretty much anything. We got good support from browsers, we got command line integration, great tools, etc.

This isn’t a perfect system. If you need something that our automated setup doesn’t provide, you’ll need to understand how to work with certificates. That isn’t trivial, but it is also not a waste, it is both interesting and widely applicable.

The end result of RavenDB’s security design is a system that is meant to be deployed in hostile environments, prevents information leakage on the wire and allows strong mutual authentication of clients and servers. It is also a system that was designed to prevent abuses. If you really want to, you can get an unsecured instance on the public internet. Here is one such example: http://live-test.ravendb.net

In this case, we did it intentionally, because we wanted to get this in the browser:


But the easy path? The path that we expect most users to follow? That one ends up with a secured and safe system, without showing up on the news because all your data got away from you.

time to read 2 min | 361 words

I read this post about using an Object Relational Mapper in Go with great interest. I spent about a decade immersed deeply in the NHibernate codebase, and I worked on a bunch of OR/Ms in .NET and elsewhere. My first reaction when looking at this post can be summed up in a single picture:


This is a really bad modeling decision, and a really common one when people are using an OR/M. The problem is that this kind of model fails to capture a really important aspect of the domain: its size.

Let’s consider the Post struct. We have a couple of collections there: Tags and Comments. It is reasonable to assume that you’ll never have a post with more than a few tags, but a popular post can easily have a lot of comments. Using Reddit as an example, it took me about 30 seconds to find a post with over 30,000 comments on it.

On the other side, the Tag.Posts collection may contain many posts. The problem with such a model is that these properties are a trap. If you hit something that has a large number of results, accessing the property is going to cause you to use a lot of memory and put a lot of pressure on the database.

The good thing about Go is that it is actually really hard to play the usual tricks with lazy loading and proxies behind your back. So the GORM API, at least, is pretty explicit about what you load there. The problem with this, however, is that developers will explicitly call the “gimme the Posts” collection and it will work when tested locally (small dataset, no load on the server). It will fail in production in a very insidious way (slightly slower over time, until the whole thing capsizes).

I would much rather move those properties outside the entities and into standalone queries, ones that come with explicit paging that you have to take into account. That reflects the actual costs behind the operations much more closely.
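Such a standalone query might look like this (a Python sketch; `get_comments`, the `db` dict and the paging defaults are all invented for illustration):

```python
def get_comments(db, post_id, page=0, page_size=100):
    """Instead of a Post.Comments collection property, a standalone
    query where paging is part of the signature, so the caller cannot
    forget that the result set may be huge."""
    comments = db.get(post_id, [])
    start = page * page_size
    return comments[start:start + page_size]

# A post with 30,000 comments, as in the Reddit example above.
db = {"posts/1": [f"comment {i}" for i in range(30000)]}
len(get_comments(db, "posts/1"))            # -> 100, never 30,000 at once
get_comments(db, "posts/1", page=299)[0]    # -> "comment 29900"
```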

time to read 1 min | 142 words

Today my team tracked down, hunted and summarily executed a bug. It was a nasty one, because under certain (pretty rare) error conditions, we would get into a pretty nasty state and not recover from it.

The problem was that the code to handle this and manage the state properly was there, and when we tested it locally, it worked just fine. But in production, it didn’t work. Unless we tried to figure out why it wasn’t working, in which case it did work.

As you can imagine, this can be a pretty frustrating state of affairs. Eventually we tracked it down to the following problem. Here is the relevant code. The problem is that the exceptions aren’t thrown (and therefore not handled). Can you see the issue?


time to read 3 min | 568 words

Our test process occasionally crashed with an access violation exception. We consider these to be Priority 0 bugs, so we had one of the most experienced developers in the office sit on this problem.

Access violation errors are nasty, because they give you very little information about what is going on, and there is typically no real way to recover from them. We have a process for dealing with them, though. We know how to set things up so we’ll get a memory dump on error, so the very first thing that we work toward is reproducing the error.

After a fair bit of effort, we managed to get to a point where we can semi-reliably reproduce this error. This means, if you want to know, that we do “stuff” and get the error in under 15 minutes. That’s the reason we need the best people on these kinds of investigations. Actually getting to the point where this fails is the most complex part of the process.

The goal here is to get to two really important pieces of information:

  • A few memory dumps of the actual crash – these are important to be able to figure out what is going on.
  • A way to actually generate the crash – in a reasonable time frame, mostly because we need to verify that we actually fixed the issue.

After a bunch of work, we were able to look at the dump file and found that the crash originated in Voron’s code. The developer in charge then faulted, because they tried to increase the priority of an issue that was already at Priority 0, and P-2147483648 didn’t quite work out.

We also figured out that this can only occur on 32 bits, which is really interesting. 32 bits is a constrained address space, so it is a great way to run into memory bugs.

We started to look even more closely at this. The problem happened while running memcpy(), and looking at the addresses that were passed to the function, one of them was Voron allocated memory, whose state was just fine. The second value pointed to a MEM_RESERVE portion of memory, which didn’t make sense at all.

Up the call stack we went, to try to figure out what we were doing. Here is where we ended up (hint: the crash happened deep inside the Insert() call).


This is test code, mind you, exercising some really obscure part of Voron’s storage behavior. And once we actually looked at the code, it was obvious what the problem was.

We were capturing the addresses of an array in memory, using the fixed statement.

But then we used them outside the fixed block. If there happened to be a GC between these two lines, and if it happened to move the memory and free the segment, we would access memory that was no longer valid. This results in an access violation, naturally. I think we were only able to reproduce this in 32 bits because of the tiny address space. In 64 bits, there is a lot less pressure to move the memory, so it remains valid.

Luckily, this is an error only in our tests, so we reduced our DEFCON level to a more reasonable value. The fix was trivial (move the Insert calls into the fixed scope), and we were able to verify that this fixed the issue.

time to read 2 min | 245 words

I’ll be writing a lot more about our RavenDB C++ client, but today I was reviewing some code and I got a reply that made me go: “Ohhhhh! Nice”, and I just had to blog about it.


This is pretty much a direct translation of how you’ll write this kind of query in C#, and the output of this is an RQL query that looks like this:


The problem is that I know how the C# version works. It uses Reflection to extract the field names from the type, so we can figure out what fields you are interested in. In C++, you don’t have Reflection, so how can this possibly work?

What Alexander did was really nice. The user already has to provide us with the serialization routine for this type (so we can turn the JSON into the types that will be returned). Inside the select_fields() call, he constructed an empty object, serialized it, and then used the field names in the resulting JSON to figure out which fields we want to project from the Users documents.
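The same trick can be sketched in a few lines of Python (Python has reflection, of course; the point here is the serialize-an-empty-object approach itself, and `projected_fields` and `User` are made up for the illustration):

```python
import json

class User:
    def __init__(self):
        self.name = ""
        self.age = 0

def projected_fields(typ):
    """Build a default instance, serialize it, and read the field
    names off the resulting JSON -- no reflection over the type."""
    empty = json.loads(json.dumps(typ().__dict__))
    return list(empty.keys())

projected_fields(User)   # -> ["name", "age"]
```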

It makes perfect sense, it requires no additional work from the user and it gives us a consistent API. It is also something that I would probably never have thought to do.

time to read 6 min | 1078 words

Unusually for me, I had a bit of a pause in reviewing Sled. As a reminder, Sled is an embedded database engine written in Rust. I last stopped looking at the buffer management, but I still don’t really have a good grasp of what is going on.

The next file is the iterator. It looks like it translates between segments and messages in these segments. Here is the struct:


As you can see, the log iterator holds an iterator of segments, and iterating over it looks like it will go over all the messages in the segments in order. Yep, here is the actual work being done:


The next() method is fairly straightforward, I found. But I have to point out this:


First, the will need call is really interesting, mostly because this is a pretty obvious way to do conditional compilation that doesn’t really suck. #if is usually much more jarring in the code.

Second, I think that the style of putting really important function calls inside an if results in pretty dense code, especially since the if is entered only on error. I would have preferred to have it as a standalone variable, and then check whether it failed.

What I don’t understand is the read_segment call. Inside that method, we have:


There are also similar calls on the segment trailer. It looks like we have a single file for the data, but anything that is too large is held externally, in the blob files.

We then get to this guy, which I find a really elegant way to handle all the different states.


That is about it for the interesting bits in the iterator; the next fun bit is the Log. I do have to admit that I don’t like the term log. It is too easy to confuse with a debug log. In Voron, I used the term Journal or Write Ahead Journal (overheard in the office: “did we waj it yet?”).


The fact that you need to figure out where to get the offset of the data you are about to write is really interesting. This is the method that does the bulk of the work:


Note that it just reserves and completes the operation. This also does not flush the data to disk; that is handled by the flusher or by an explicit call. The reserve() method calls reserve_internal(), and there we find this gem:


I know what it does (conditional compilation), but I find it really hard to follow, especially because it looks like a mistake, with buf being defined twice. This is actually a case where an #if statement would be better, in my eyes.

Most of the code in there is there to manage calls to the iobuf, which I already reviewed. So I’m going to skip ahead and look at something more interesting: the page cache. Sled has an interesting behavior, in that it can shred a page into multiple locations, requiring some logic to bring it all back together. That is going to be really interesting to look at, I hope.

The file starts with this:


And this… takes a while to unpack. Remember that epoch is a manual GC pattern for concurrent data structures without a GC.

The cached_ptr value is a shared pointer to a Node (inside a lock free stack) that holds a CacheEntry over a generic argument that must have a static lifetime and be thread safe. And there is an unsigned long in there as well.

No idea yet what is going on. But here is the first method on this struct:


That is… a lot. The cache entry is a discriminated union with the following options:


There are some awesome documentation comments here, including full blown sample code that really help understand what is going on in the code.

There seems to be a few key methods that are involved here:

  • allocate(val) – create a new page and write an initial value, gets back a page id.
  • link(id, val) – make a write to a page id. Which simply write a value out.
  • get(id) – read all the values for a page id, and uses a materializer to merge them all to a single value.
  • replace(id, val) – set the page id to the new value, removing all the other values it had.
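The contract of those four methods can be sketched like this (illustrative Python, not sled’s actual code; the materializer is supplied by the caller, as in the get() description above):

```python
class PageCache:
    """Toy model: a page is a list of fragments; reads merge them."""
    def __init__(self):
        self.pages = {}
        self.next_id = 0

    def allocate(self, val):
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = [val]              # initial value
        return pid

    def link(self, pid, val):
        self.pages[pid].append(val)          # cheap sequential-style write

    def get(self, pid, materialize):
        return materialize(self.pages[pid])  # merge fragments on read

    def replace(self, pid, val):
        self.pages[pid] = [val]              # consolidate to a single value

pc = PageCache()
pid = pc.allocate([1])
pc.link(pid, [2, 3])
pc.get(pid, lambda frags: [x for f in frags for x in f])   # -> [1, 2, 3]
```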

The idea here, as I gather, is to allow sequential writes to the data while still allowing fast reads, mostly by leaning on SSDs’ fast random reads.

I’m trying to follow the code, but it is a bit complicated. In particular, we have:


This tries to allocate a free page, or allocate a new one. One of the things that really messes with me here is the use of the term Page. I’m used to B+Trees, where a page is literally some section of memory. Here it refers to something more nebulous. A key point: I don’t see where the size is specified. But given that you can link items to a page, that sort of makes sense. I just need to get used to Pages != storage.

The fact that all of this is generic also makes it hard to follow what is actually going on. I’m getting lost in here, so I think that I’ll stop for now.

