Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

Posts: 6,916 | Comments: 49,398

time to read 1 min | 102 words

We are running another set of RavenDB workshops in London, Dallas and New York.

These are two days of deep dive into RavenDB. The workshops cover everything from how to talk to a RavenDB database to how to model your data, and from how to set up a single node to building a geo-distributed cluster. It is also, at least in my opinion, a lot of fun.

Here are the details:

We’ll also be talking about some of the new features in 4.2 (and later) in the workshop. 

The early bird pricing ends tomorrow.

time to read 4 min | 704 words

One of the changes that we made in RavenDB 4.2 is a pretty radical one, even if we didn’t really talk about it. It is the fact that RavenDB now contains C code. Previously, we were completely managed (with a bunch of P/Invoke calls). But now we have some C code in the project. The question is why?

The answer isn’t actually about performance or the power of native C code. We are usually pretty happy with the kind of assembly instructions that we can get from C#. The actual problem was that we needed a proper abstraction. At this moment, RavenDB is running on the following platforms:

  • Windows x86-32 bits
  • Windows x86-64 bits
  • Linux x86-32 bits
  • Linux x86-64 bits
  • Linux ARM 32 bits
  • Linux ARM 64 bits
  • macOS 64 bits

And each of these platforms requires some changes in how we do things. The other problem is that .NET is a well-specified system; all the type sizes are well known, for example. The same isn’t true for the underlying APIs. Windows does a really good job of maintaining stable APIs across versions and 32/64-bit editions. Linux basically doesn’t seem to care. Type sizes change quite often, sometimes in unpredictable ways.

Probably the most fun part was figuring out that on x86, Linux syscall #224 is gettid(), but on ARM, the same number gets you gettime(). If you are using C, all of that is masked for you by the standard library, but doing it ourselves from C# got quite unwieldy. So we decided to create a PAL (platform abstraction layer) in C to handle these details.

The rules for the PAL are simple: we don’t make assumptions about types, sizes or the underlying platform. For example, let’s take a look at some declarations.

(image: PAL function declarations)

All the types are explicit about their size, and where we need to pass a complex type (SYSTEM_INFORMATION) we define it ourselves, rather than rely on any system type. And here are the P/Invoke definitions for these calls. Again, we are being explicit with the types even though in C# the size of types is fixed.

(image: P/Invoke definitions for the PAL calls)

You might have noticed that I care about error handling. And error handling in C is… poor. We use the following convention in these kinds of operations:

  • Each method does a single logical thing.
  • Each method returns either success or a flag indicating the internal step in which it failed.
  • On failure, the method also returns the system error code for the failure.

The idea is that on the calling side, we can reconstruct exactly where and why we failed and still get good errors.
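As a sketch of what this convention looks like in practice (the function, step flags and file path here are invented for illustration, not taken from the RavenDB PAL):

```c
#include <stdint.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Flags identifying the internal step that failed. */
enum
{
    SUCCESS       = 0,
    FAIL_OPEN     = 1,
    FAIL_ALLOCATE = 2,
    FAIL_SYNC     = 3
};

/* Does a single logical thing: create a file of a given size
   and sync it to disk. Returns SUCCESS or the step that failed;
   on failure, *detailed_error_code holds the system's errno. */
int32_t rvn_create_and_sync_file(const char *path, int64_t size,
                                 int32_t *detailed_error_code)
{
    *detailed_error_code = 0;

    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd == -1)
    {
        *detailed_error_code = errno;
        return FAIL_OPEN;
    }

    if (ftruncate(fd, size) == -1)
    {
        *detailed_error_code = errno;
        close(fd);
        return FAIL_ALLOCATE;
    }

    if (fsync(fd) == -1)
    {
        *detailed_error_code = errno;
        close(fd);
        return FAIL_SYNC;
    }

    close(fd);
    return SUCCESS;
}
```

On the managed side, the step flag plus the errno value is enough to build an exception message that says exactly which operation failed and why.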

Yesterday I ran into an issue where we hadn’t moved some code to the PAL, and we actually had a bug there. The problem was that when running on an ARM32 machine, we would pass a C# struct to a syscall, but we had defined that struct based on the values in 64-bit Linux. When called on a 32-bit system, the values went to the wrong locations. Luckily, this was a call whose result nothing acts on. It is used by our dashboard to let the admin know how much disk space is available, but no code actually takes action based on this information.

Which was great, because when we actually ran the code, we got this value in the Studio:

(image: the absurd free disk space value shown in the Studio)

When I dug deeper into the code, it gave really bad results. My Raspberry Pi thought it had 700 PB of disk space free. The actual reason we got this funny number? We send the number of bytes to the client, and under these conditions, the browser can only represent up to about 8 PB of free space.

I moved the code from a C# P/Invoke call to a simple PAL method that calculates this:

(image: the disk space calculation method in the PAL)
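The method in the screenshot isn’t reproduced here, but on POSIX systems a sketch of such a calculation (the function name and signature are my own) could be:

```c
#include <stdint.h>
#include <errno.h>
#include <sys/statvfs.h>

/* Returns 0 on success; on failure, returns 1 and stores errno
   in *detailed_error_code, following the PAL convention. */
int32_t rvn_get_path_disk_space(const char *path,
                                uint64_t *total_free_bytes,
                                uint64_t *total_size_bytes,
                                int32_t *detailed_error_code)
{
    struct statvfs buf;
    if (statvfs(path, &buf) != 0)
    {
        *detailed_error_code = errno;
        return 1;
    }

    /* f_bavail is blocks available to unprivileged users and
       f_frsize is the fragment size; the math is done in explicit
       64-bit types regardless of the platform's word size. */
    *total_free_bytes = (uint64_t)buf.f_bavail * (uint64_t)buf.f_frsize;
    *total_size_bytes = (uint64_t)buf.f_blocks * (uint64_t)buf.f_frsize;
    *detailed_error_code = 0;
    return 0;
}
```

Because libc fills in struct statvfs for us, we no longer need to mirror a platform-specific struct layout in C#, which was exactly the source of the ARM32 bug.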

Implementing this for all platforms means that we have a much nicer interface and our C# code is abstracted from the gory details of how we actually compute this.

time to read 1 min | 101 words

Just a reminder, next week I’ll be speaking at the CodeNode on: When Select() is Broken

"Select isn't broken" is an important bit of advice from the Pragmatic Programmer book. In most cases, you won't find bugs in your OS, programming languages or core libraries. It is almost always your code that is at fault. Except...

In this session, Oren will discuss how RavenDB has exposed numerous issues in the CoreCLR, the JIT and the underlying operating systems. Along the way, you'll learn deep debugging techniques and how to rule out your own code and pinpoint the actual underlying problem.

Backup goes wild

time to read 1 min | 143 words

(image: accumulated backup tasks spread across the cluster’s nodes)

We have a lot of internal test scenarios that run RavenDB through its paces.

One of them had an issue. It would test that RavenDB backups worked properly, but it would (sometimes) fail to clean up after itself.

The end result was that over time, the test database would accumulate more and more backup tasks that it had to execute (and all of them on the same schedule).

You can see here how RavenDB is allocating them to different nodes in an attempt to spread the load as much as possible.

We fixed the bug in the test case, but I also consider this a win because we now have protection from backup DoS attacks Smile. And I just love this image.

time to read 5 min | 960 words

What happens when you want to page through the result set of a query while the underlying data set is being constantly modified?

This seems like a tough problem, and one that you wouldn’t expect to encounter very often, right? But a really good example of exactly this issue is the notion of a feed in a social network. To take Twitter for simplicity, you have many people generating tweets, and other users browsing through their timelines.

What ends up happening is that the user browsing their timeline is actually trying to page through the results, but at the same time, more updates to the timeline arrive while they are reading it. One of the key requirements that we have here, then, is that we don’t create a lot of jitter for the user as they scroll through the timeline. Luckily, because this is a hard problem, users are already quite familiar with some of the side effects. It would surprise no one to see the same tweet multiple times in the timeline. It can be because of a retweet or a like by a user you follow, or it can be a result of the way paging is done.

Now that we understand what we want to achieve, let’s see how you can try getting there. The simplest way to handle this is to ask the database for some sort of stable reference for the query. Instead of executing the query and being done with it, you have the server maintain the results for a period of time and send you the pages as you need them. This is simple and easy to implement, but costly in terms of system resources. You’ll need to keep the query results in memory for each of your users, and that can be quite a lot of memory to keep around just in case, especially given the difference between human interaction times and the speed of modern systems.

Another way to do it is to ask the database engine to generate a way to re-create the query as it was at that time. This is sometimes called a continuation token. That usually works great, but comes with its own complications. For example, imagine that I’m doing this on the following query:

from Users order by LastName limit 5

Which gives us the following result:

(image: the first five users, ordered by last name)

And I got the first five users, and now I want to get the next five. Between the first and second query, a user whose last name is “Aardvark” was inserted into the system. At this point, what would you expect to get from the query? We have two choices here, as you can see below:

(image: the two possible second pages, marked in orange and green)

The problem is that from my perspective, both of those have serious issues. To compute the results shown in orange, you’ll need to jump through some serious hoops on the backend, and the result looks strange. Getting the results in green is quite easy, but it means that you missed out on seeing Aardvark.

You might have noticed that the key issue here isn’t so much the way we build the query as the order in which we need to traverse it. We ask to sort by last name, but we also want to get results as they arrive. As it turns out, the process becomes a whole lot simpler if we unify these concepts. If we issue the following query, for example, our complexity drops by a lot.

from Messages order by CreatedAt desc limit 5

This is because we now get the following results:

(image: the first five messages, ordered by CreatedAt descending)

And paging through this is now pretty easy. If we want to page down, we issue the following query to get the next page of data:

from Messages order by CreatedAt desc where CreatedAt < 217 limit 5

By making sure that we are paging and filtering on the same property, we can easily scroll through the results without having to do too much work, either in the application or in the database backend. We can also check whether there is new stuff that we missed by querying for CreatedAt > 222, of course.
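To make the paging mechanics concrete, here is a small sketch in C, with a sorted in-memory array standing in for the database index (the types and names are mine, not RavenDB code):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct
{
    int64_t created_at;  /* stand-in for a real timestamp */
    const char *text;
} message_t;

/* Fills 'page' with up to 'limit' messages whose CreatedAt is
   strictly below 'last_seen', preserving descending CreatedAt order.
   Assumes 'all' is already sorted descending, as an index would be.
   Returns the number of messages copied. */
size_t next_page(const message_t *all, size_t count,
                 int64_t last_seen, size_t limit,
                 const message_t **page)
{
    size_t n = 0;
    for (size_t i = 0; i < count && n < limit; i++)
    {
        if (all[i].created_at < last_seen)
            page[n++] = &all[i];
    }
    return n;
}
```

The query “where CreatedAt < 217 limit 5” is exactly this: scan the index in descending order, skip everything at or above the last value you saw, and take the next page.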

But there is one wrinkle here. I intentionally used the CreatedAt field but put numeric values there. Did you notice that there was no 220 value? That one was created on an isolated node and hasn’t arrived yet. When it shows up in the local database, we’ll need to decide whether to give it a new value (making sure it will show up in the timeline) or store it as is, meaning that it might get lost.

These types of questions are probably more relevant at the business level. You might want to apply different behaviors based on how many likes a tweet has, for example.

Another option is to have an UpdatedAt field as well, which allows you to quickly answer the question: “What items in the range I scanned have changed?”. This also allows a simpler model for handling updates, though much of that depends on the kind of behavior you want. It handles updates, including updates to the parts seen and unseen, in a reasonable way and at a predictable cost.

time to read 2 min | 214 words

I’m really happy to announce that RavenDB 4.2 has been RTMed. The newest version of RavenDB is now generally available.

This release brings to the table a host of new features. The major ones are:

We also have a host of minor ones, from theming support in the studio to revisions diffing. This release includes over 1,500 commits and represents over a year of work by our whole team.

But the headline for this release is actually an experimental feature, Graph Queries. This feature allows you to run graph queries against your existing data set. This is part of our strategy of allowing you to use the same data inside RavenDB from multiple viewpoints.

There is also a host of features that graduated from being experimental to stable:

And with that, I’m going to go off and bask in the glow of completing the release, paying no mind to the blinking cursor that talks about the next one, the big 5.0. Smile

time to read 7 min | 1301 words

Last week I posted about some timeseries work that we have been doing with RavenDB. But I haven’t actually talked about the feature in this space before, so I thought that this would be a good time to present what we want to build.

The basic idea with timeseries is that a timeseries is a set of data points taken over time. We usually don’t care much about an individual data point, but we care a lot about their aggregation. Common usages for time series include:

  • Heart beats per minute
  • CPU utilization
  • Central bank interest rate
  • Disk I/O rate
  • Height of ocean tide
  • Location tracking for a vehicle
  • USD / Bitcoin closing price

As you can see, the list of things that you might want to apply this to is quite diverse. In a world that keeps getting more and more IoT devices, timeseries storing sensor data are becoming increasingly common. We looked into quite a few timeseries databases to figure out what needs they serve when we set out to design and build timeseries support for RavenDB.

RavenDB is a document database, and we envision timeseries support as something that you use at the document boundary. A good example of that would be the heartrate example. Each person has their own timeseries that records their heartrate over time. In RavenDB, you would model this as a document for each person, and a heartrate timeseries on each document.

Here is how you would add a data point to my Heartrate’s timeseries:

(image: Client API code appending a value to the Heartrate timeseries)

I intentionally started from the Client API, because it allows me to show off several things at once.

  1. Appending a value to a timeseries doesn’t require us to create it upfront. It will be created automatically on first use.
  2. We use UTC date times for consistency and the timestamps have millisecond precision.
  3. We are able to record a tag (the source for this measurement) on a particular timestamp.
  4. The timeseries will accept an array of values for a single timestamp.

Each one of those items is quite important to the design of RavenDB timeseries, so let’s address them in order.

The first thing to address is that we don’t need to create timeseries ahead of time. Doing so would introduce a level of schema to the database, which is something that we want to avoid. We want to allow the user complete freedom and a minimum of fuss when they are building features on top of timeseries. That does lead to some complications on our end: we need to be able to support timeseries merging, allowing you to append values on multiple machines and merge them together into a coherent whole.

Given the nature of timeseries, we don’t expect to see conflicting values. While you might see the same values come in multiple times, we assume that in that case you’ll likely just get the same values for the same timestamps (duplicate writes). In the case of different writes on different machines with different values for the same timestamp, we’ll arbitrarily select the largest of those values and proceed.

Another implication of this behavior is that we need to handle out of order updates. Typically in timeseries, you’ll record values in increasing date order, but we need to be able to accept values out of order. This turns out to be pretty useful in general, not just for handling values from multiple sources, but also because you may need to load archived data into an already existing timeseries. The rule that guided us here was that we wanted to allow the user as much flexibility as possible, and we’ll handle any resulting complexity.

The second topic to deal with is time zones and precision. Given the overall complexity of time zones in general, we decided that we don’t want to deal with any of that and will store times in UTC only. That allows you to work properly with timestamps taken from different locations, for example. Given the expected usage scenarios for this feature, we also decided to support millisecond precision. We looked at supporting only second-level precision, but that was far too limiting. At the same time, supporting finer precision than milliseconds would result in much lower storage density in most situations and is very rarely useful.

Using DateTime.UtcNow, for example, we get a resolution of 0.5–15 ms, so trying to represent time at a finer resolution isn’t really going to give us anything. Other platforms have similar constraints, which added to the case for capturing time only at millisecond granularity.

The third item on the list may be the most surprising one. RavenDB allows you to tag individual timestamps in the timeseries with a value. This gives you the ability to record metadata about the value. For example, you may want to use this to record the type of instrument that supplied the value. In the code above, you can see that this is a value that I got from a FitBit watch. I’m going to assign it a lower confidence value than a value that I got from an actual medical device, even if both of those values go on the same timeseries.

We expect that the number of unique tags for values in a given time period is going to be small, and we optimize accordingly. Because of the number of weasel words in the last sentence, I feel that I must clarify. A given time period is usually on the order of an hour to a few days, depending on the number of values and their frequency. And what matters isn’t so much the number of values with a tag as the number of unique tags. We can very efficiently store tags that we have already seen, but having each value tagged with a different tag is not something that we designed the system for.

You can also see that the tag that we have provided looks like a document id. This is not accidental. We expect you to store a document id there, and use the document itself to store details about the value, for example, whether the device that captured the value is medical grade or just a hobbyist one. You’ll be able to filter by the tag as well as by the related tag document’s properties. But I’ll show that when I post about queries, in a different post.

The final item on the list that I want to discuss in this post is the fact that a timestamp may contain multiple values. There are actually quite a few use cases for recording multiple values for a single timestamp:

  • Longitude and latitude GPS coordinates
  • Bitcoin value against USD, EUR, YEN
  • Systolic and diastolic reading for blood pressure

In each case, we have multiple values to store for a single measurement. You can make the case that Bitcoin vs. currencies may be stored as standalone timeseries, but GPS coordinates and blood pressure both produce values that are not meaningful on their own. RavenDB handles this scenario by allowing you to store multiple values per timestamp, including support for each timestamp coming with a different number of values. Again, we are trying to make this feature as easy to use as possible.

The number of values per timestamp is going to be limited to 16 or 32, we haven’t made a final decision here. Regardless of the actual maximum size, we don’t expect to have more than a few of those values per timestamp in a single timeseries.

Then again, the point of this post is to get you to consider this feature in your own scenarios and provide feedback about the kind of usage you want to have for this feature. So please, let us know what you think.

time to read 4 min | 609 words

About five years ago, my wife got me a present, a FitBit. I hadn’t worn a watch for a while, and I didn’t really see the need, but it was nice to see how many steps I took, and we had a competition about who took the most steps a day. It was fun. I’ve had a few FitBits since then and I’m mostly wearing one. As it turns out, FitBit allows you to export all of your data, so a few months ago I decided to see what kind of information I have stored there, and what kind of data I can get from it.

The export process is painless and I got a zip with a lot of JSON files in it. I was able to process that and get a CSV file that had my heartrate over time. Here is what this looked like:

(image: a sample of the heartrate CSV data)

The file size is just over 300MB and it contains 9.42 million records, spanning the last 5 years.

The reason I looked into getting the FitBit data is that I’m playing with timeseries right now, and I wanted a realistic data set, one that contains dirty data. For example, even in the image above, you can see that the measurements aren’t taken on a consistent basis. It seems like ten and five second intervals, but the range varies. I’m working on a timeseries feature for RavenDB, so this was a perfect testing ground for me. I threw the data into RavenDB and got it down to just under 40MB in size.

I’m using Gorilla encoding as a first pass and then LZ4 to further compress the data. In a data set where the duration between measurements is stable, I can fit over 10,000 measurements in a single 2KB segment. In the case of my heartrate, I can store an average of 672 entries in each 2KB segment. Once I have the data in there, I can start actually looking for interesting patterns.
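To see why a stable interval compresses so well, consider the delta-of-delta trick that Gorilla-style encoding uses for timestamps. Here is a minimal sketch of just that computation (the real encoder then bit-packs these values, which I’m not showing):

```c
#include <stdint.h>
#include <stddef.h>

/* Computes delta-of-delta values for a sequence of timestamps.
   out[0] is the first delta; out[i] for i > 0 is the change in
   the delta from the previous step. Writes count - 1 values. */
void delta_of_delta(const int64_t *timestamps, size_t count, int64_t *out)
{
    int64_t prev_delta = 0;
    for (size_t i = 1; i < count; i++)
    {
        int64_t delta = timestamps[i] - timestamps[i - 1];
        out[i - 1] = delta - prev_delta;
        prev_delta = delta;
    }
}
```

With a perfectly regular interval, everything after the first delta is zero and packs down to almost nothing, which is why a clean data set fits over 10,000 measurements in a 2KB segment while my messy heartrate data averages 672.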

For example, consider the following query:

(image: the aggregation query)

Basically, I want to know how I’m doing on a global sense, just to have a place to start figuring things out. The output of this query is:

(image: the query results)

These are interesting numbers. I don’t know what I did to hit 177 BPM in 2016, but I’m not sure that I like it.

What I do like is this number:

(image: the result in question)

I then ran this query, going for daily precision over all of 2016:

(image: the daily-precision query)

And I got the following results in under 120 ms.

(image: the query results)

These are early days for this feature, but I was able to take that and generate the following (based on the query above).

(image: a chart generated from the query results)

All of the results have been generated on my laptop, and we haven’t done any performance work yet. In fact, I’m posting about this feature because I was so excited to see that I got queries to work properly now. This feature is still in its early stages.

But it is already quite cool.

time to read 1 min | 81 words

Yesterday I talked about the design of the security system of RavenDB. Today I re-read one of my favorite papers ever about the topic.

This World of Ours by James Mickens

This is both one of the most hilarious papers I have ever read (I had someone check up on me while I was reading it, because of the suspicious noises coming from my office) and a great insight into threat modeling and the kind of operating environment that your system will run in.

