Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969



time to read 3 min | 441 words

The upgrade process from RavenDB 3.5 and earlier to RavenDB 4.x is not easy. This is because I made a conscious decision to not have backward compatibility between these versions. I made that decision because we had to be able to make massive changes internally in order to get to the targets that we set for ourselves. I actually discussed that decision in detail in a previous blog post and a talk.

Four years later, I still stand by that decision, but I also regret the spanner that it threw into the works. Migrating RavenDB applications to 4.x from previous versions is harder than it should be. In retrospect, we probably should have invested the time in a compatibility layer that would make it easier to migrate.

I wanted to take a moment and talk about RavenDB 5.0, expected in 2020, and our plans for that release. We are going to be doing some minor cleanup of the API. Methods and classes that are marked as [Obsolete] will be removed. These tend to be at the very edge of the explored API and have been marked as such for quite some time. Beyond these changes (which you’ll have a clear and obvious alternative for), you aren’t going to need to do much at all.

Our goal for converting an application from RavenDB 4.x to 5.x is that, for 90% of the projects, the process is: update NuGet packages, compile, and you are done. For the other 10%, it may mean that you need to make some minor changes. For example, changing DisableEntitiesTracking to NoTracking if you are using the low level query API.
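For illustration, here is roughly what that kind of rename looks like against the low level query API. This is only a sketch: the Order class is a placeholder and the exact spelling of the old 4.x method is from memory, so check the release notes for the precise shape of the change.

    using System.Collections.Generic;
    using System.Linq;
    using Raven.Client.Documents.Session;

    public static class QueryExample
    {
        // "Order" is just a placeholder document class for this sketch.
        public class Order { public string Company { get; set; } }

        public static List<Order> LoadWithoutTracking(IDocumentSession session)
        {
            // 4.x spelled this DisableEntitiesTracking; in 5.x the low level
            // query API uses NoTracking instead.
            return session.Advanced.DocumentQuery<Order>()
                          .NoTracking()
                          .ToList();
        }
    }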

We also intend to allow at least the vast majority of operations to just work between a 4.x client and a 5.x server. In other words, even when you upgrade the server version, you aren’t going to have to upgrade the client version unless you want to use the new features.

There are also additional considerations that we have to take into account:

  • RavenDB now has official clients for .NET, JVM, Go, Python, Node.js and C++, as well as unofficial clients.
  • RavenDB Cloud instances are maintained by us, and will be upgraded to newer versions on a regular schedule.

The cost of making a backward incompatible change at this point is too high for us to take lightly, and we are going to try very hard to avoid it. The move from 3.5 to 4.x was a one time thing that we had to do in order to continue evolving the product, not something that we plan to do again anytime soon.

We are also offering migration services for clients who want to move their applications from 3.x to 4.x.

time to read 1 min | 180 words

We were asked about best practices for managing the RavenDB session (unit of work) in a .NET Core MVC application. I thought it was interesting enough to warrant its own post.

RavenDB’s client API is divided into the Document Store, which holds the overall configuration required to access RavenDB, and the Document Session, which is a short lived object implementing the Unit of Work pattern and typically only used for a single request.

We’ll start by adding the RavenDB configuration to the appsettings.json file, like so:
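The original snippet isn’t reproduced here, but a minimal configuration section could look something like this (the section name “Raven”, the keys and the URL are my assumptions, not necessarily what the post used):

    {
      "Raven": {
        "Urls": [ "http://localhost:8080" ],
        "DatabaseName": "Tasks"
      }
    }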

We bind it to the following strongly typed configuration class:
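Again, the original class isn’t shown here; a sketch that matches the JSON above would be:

    public class RavenSettings
    {
        public string[] Urls { get; set; }
        public string DatabaseName { get; set; }
    }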

The last thing to do is to register this with the container for dependency injection purposes:
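Something along these lines, inside Startup.ConfigureServices (this assumes the RavenSettings class above and an unsecured server; the actual registration code in the post may differ in details):

    using Microsoft.Extensions.Configuration;
    using Microsoft.Extensions.DependencyInjection;
    using Microsoft.Extensions.Options;
    using Raven.Client.Documents;

    public void ConfigureServices(IServiceCollection services)
    {
        services.Configure<RavenSettings>(Configuration.GetSection("Raven"));

        // One Document Store for the entire application.
        services.AddSingleton<IDocumentStore>(sp =>
        {
            var settings = sp.GetRequiredService<IOptions<RavenSettings>>().Value;
            var store = new DocumentStore
            {
                Urls = settings.Urls,
                Database = settings.DatabaseName
            };
            return store.Initialize();
        });

        // One session (unit of work) per request.
        services.AddScoped(sp =>
            sp.GetRequiredService<IDocumentStore>().OpenAsyncSession());

        services.AddControllersWithViews();
    }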

We register both the Document Store and the Document Session in the container, but note that the session is registered in scoped mode, so each request will get a new session.

Finally, let’s make actual use of the session in a controller:
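A sketch of such a controller, with the scoped session injected (the TaskItem model and the actions are invented for the example):

    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Session;

    public class TaskItem
    {
        public string Id { get; set; }
        public string Title { get; set; }
    }

    public class TasksController : Controller
    {
        private readonly IAsyncDocumentSession _session;

        public TasksController(IAsyncDocumentSession session) => _session = session;

        public async Task<IActionResult> Index()
        {
            var tasks = await _session.Query<TaskItem>().ToListAsync();
            return View(tasks);
        }

        [HttpPost]
        public async Task<IActionResult> Create(TaskItem task)
        {
            await _session.StoreAsync(task);
            await _session.SaveChangesAsync(); // explicit, see the note below
            return RedirectToAction(nameof(Index));
        }
    }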

Note that we used to recommend having SaveChangesAsync run for you automatically, but at this time, I think it is probably better to do this explicitly.

time to read 2 min | 368 words

Let’s consider the following data (which is actually from RavenDB’s sample database). We have a collection of employees, and each one of them has an attachment with the employee’s photo. We want to display a table of the employees as well as the employees’ photos.

The problem is how to do that, exactly. One way of doing that is to loop over the employees, get the relevant attachments and send them all to the client for display. That works, but there are much better ways to go about doing this.

Instead of doing everything ourselves, we can rely on RavenDB and the browser to do things for us. Let’s look at the metadata we have for the employee in question, to see how RavenDB exposes the attachments to us:

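The screenshot isn’t reproduced here, but the relevant part of the document metadata looks roughly like this (the Name and Hash match the URL used below; the ContentType and Size values are made up):

    {
      "@metadata": {
        "@collection": "Employees",
        "@attachments": [
          {
            "Name": "photo.jpg",
            "Hash": "97S5UrejdZqHfel4i+/ts5orhNlp92DItxOUVow0maI=",
            "ContentType": "image/jpeg",
            "Size": 44000
          }
        ]
      }
    }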

This is interesting, because the hash means that we can do some interesting stuff. Instead of loading the attachment directly, we’ll create an endpoint that will provide access to the attachment, like so:

GET /employees/photos?id=employees/7-A&name=photo.jpg&hash=97S5UrejdZqHfel4i+/ts5orhNlp92DItxOUVow0maI=

So far, this just looks like we moved the data around for no good reason. Instead of loading the attachments for the employees and sending them in one roundtrip to the client, we now force the client to generate N requests, one for each of the employees we have. Surely that is much worse, no?

The key here is that the endpoint that we expose is going to use the Cache-Control header to ask the browser to cache this request for us forever. Because we have the hash of the file, we know that if we update the employee’s photo, we will get a new hash (and therefore a new URL), so we don’t need to deal with cache invalidation issues.
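Here is a hedged sketch of such an endpoint in ASP.NET Core, reusing the scoped session idea from the previous post; the route, the Employee class, the cache lifetime and the exact attachments API calls are written from memory, so treat this as an illustration rather than production code:

    using System.IO;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc;
    using Raven.Client.Documents.Session;

    public class Employee
    {
        public string Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }

    public class EmployeePhotosController : Controller
    {
        private readonly IAsyncDocumentSession _session;

        public EmployeePhotosController(IAsyncDocumentSession session) => _session = session;

        // GET /employees/photos?id=employees/7-A&name=photo.jpg&hash=...
        [HttpGet("employees/photos")]
        public async Task<IActionResult> Get(string id, string name, string hash)
        {
            // The hash is only there to make the URL unique per attachment version;
            // a new photo means a new hash and therefore a new, uncached URL.
            var employee = await _session.LoadAsync<Employee>(id);
            if (employee == null)
                return NotFound();

            var attachment = await _session.Advanced.Attachments.GetAsync(employee, name);
            if (attachment == null)
                return NotFound();

            // Ask the browser to cache this response forever.
            Response.Headers["Cache-Control"] = "public, max-age=31536000, immutable";

            using var buffer = new MemoryStream();
            await attachment.Stream.CopyToAsync(buffer);
            return File(buffer.ToArray(), attachment.Details.ContentType);
        }
    }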

By making the browser cache the value, we can significantly speed up the system. Now showing the employee photo is much cheaper.

There is also another advantage: the browser will typically use multiple connections to get the data (either multiple TCP connections or multiple streams in HTTP/2), so we get additional benefits from this level of parallelization.

time to read 5 min | 946 words

This story started a few years ago, in a very non technical setting. We changed the accountant that we use for Hibernating Rhinos. We outgrew the office we were using at the time and needed better services. Among the changes that were implemented as a result of this move was the use of new accounting software. Nothing really that interesting, to be frank. I like that my accounting is boring. However, the new accounting software was an on-premise solution. In other words, we are the ones running it. Which is perfectly fine, we provisioned a VM in our data center (a fancy name for the single rack that we had at the time) and let it run.

As you can imagine, we consider our accounting data to be mission critical, so to speak. I don’t mind not being able to access it for an hour, for example, but losing it is going to be Bad. So we had a backup, nothing really that interesting. We have a backup that goes to a local disk on the VM, a remote disk in the office and, just to be safe, we upload the backup to S3. I asked one of our developers to take care of this, and aside from specifying that I want backups in triplicate, I didn’t really pay attention. That was around 2017, I believe. I made sure that if the backup failed, we would get notified of that, and that was pretty much it.

One of the reasons that I like my accounting boring is that it simplifies my life and reduces stress. Unfortunately, it seems like my accounting practices have a cost. In particular, it means that I favor paying a bit too much to the taxman. That means that all of the taxes are going out immediately, and the company doesn’t end the year with a large tax bill that we need to cover. But I overdid it a time or two, and we overpaid on our taxes. Well, that was by design, extra money showing up from the taxman is much better than a surprise bill. But at a certain point, we were supposed to get a refund for a non trivial amount. At which point the tax authorities came a-calling and audited us.

Remember that I talked about boring accounting practices. The day we started the audit, I was having dinner with my wife, and being audited was the third topic of the day, if I recall properly. They found a few things that we did wrong (we registered an invoice for the wrong currency, so we cancelled it and issued a new one, instead of refunding it and issuing a new one). That was a Thing, it seemed. But the end result was pretty much nothing. I loved it. Since then, we were audited a few more times, always with no repercussions.

Given that the next audit is a question of when (usually every 18 – 30 months or so, it seems), not if, I really care about my accounting data. Hence the triple backup policy. You might have been going through this post expecting to hear that we lost the accounting data, the backup failed, and now my accounting outlook is decidedly not boring. I’m afraid that this is only half true. We did have a failed backup, but we caught it before we actually needed it.

At one point, I looked at our backup policies, and I noticed that the accounting backup was months old at that point. That was concerning, I gotta say. Here is the timeline, as best I could piece it together:

  • Q2 2017 – Backup process is defined and tested. This is a one off process that we use only for the accounting database.
  • Q1 2018 – Routine key rotation is performed on some of our keys. Unbeknownst to us, the backup process loses the ability to report failures. But given that it doesn’t fail, no one notices.
  • Q4 2018 – The developer responsible for setting up the backup process leaves the company. As part of the outgoing employee process, we shut down the relevant user accounts.
  • Q1 2019 – The accounting server is rebooted. The backup process fails to start, because the user account is disabled.

You might notice the scale of this issue. The underlying problem was that the developer set up this one off process as a… well, a one off process. That meant that it wasn’t hooked into any of our usual monitoring / alert systems. It did have a way to report on errors, but the credentials for that went stale after a year. No one paid attention, since the backups continued to run.

The backup process was also running under the user account of the developer, not a service account. I guess it was easier than creating a user, but the end result was that when we deactivated the user account after the developer left the company, we also disabled the backup. But the process was already running, and it continued to run for months. Only much later, after a reboot, would the process fail to start, and by then there was no way to report errors, and we noticed it only because we looked for that during routine operations.

One of the reasons we had built backups directly into the core of RavenDB was exactly this sort of situation. A backup process is not something that you cobble together (that’s on us, to be fair), it is something that should be part and parcel of the operations of your database, and being able to do something like get backups in triplicate is essential for a good operations experience.
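Since the post ends on that note, here is a hedged sketch of what a multi destination backup task looks like with the RavenDB client API. The task name, schedule, paths and credentials are placeholders, and the property names are from memory, so double check them against the documentation:

    using System.Threading.Tasks;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Operations.Backups;

    public static class BackupSetup
    {
        public static Task ConfigureAsync(IDocumentStore store)
        {
            var config = new PeriodicBackupConfiguration
            {
                Name = "Accounting",
                BackupType = BackupType.Backup,
                FullBackupFrequency = "0 2 * * *", // cron: every night at 02:00
                LocalSettings = new LocalSettings
                {
                    // Local disk on the VM; a network share in the office would go here as well.
                    FolderPath = "/backups/accounting"
                },
                S3Settings = new S3Settings
                {
                    BucketName = "accounting-backups",
                    AwsRegionName = "us-east-1",
                    AwsAccessKey = "<key>",
                    AwsSecretKey = "<secret>"
                }
            };

            // The backup is an ongoing task of the database itself, monitored like
            // any other task, rather than a script running under someone's account.
            return store.Maintenance.SendAsync(new UpdatePeriodicBackupOperation(config));
        }
    }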

time to read 5 min | 911 words

Recently I had the chance to work on what one could term a “business app”. After a very long time dealing with system level software, I got my hands dirty writing business level code. You know the kind, logging in a user, showing some data on a page, etc. I have been doing that for a long time, but in the past few years I was mostly dealing with storage engines, distributed systems and the like. Even though I’m writing both kinds of systems in the same environment, the feeling is quite different.

This is a stream of consciousness type post.

With the business app code, I was using controllers and services that are dynamically composed via dependency injection. For system level code, I have manual dependency management. The business code tends to be fairly short until it hits the database, but the system code tends to do a lot more inline.

A feature in both systems is composed of UI, data and behavior, but the way they are structured is very different. For that matter, so is the way we build them.

For example, the business code accepts an order from a user by writing it to the database, and another component in the same system waits for such events and starts processing them in an asynchronous manner. This meant that we had a pretty good separation between the different parts. To the point where we pretty much built them in isolation and concurrently. The UI team was generally much faster, so they threw commands at the backend and had something that marked them as completed while the backend team (a hilarious term from our usual perspective, to be frank) worked on accepting the commands and actually implementing the functionality. When writing system code, we typically write the actual implementation first, and figure out what we want from the UI afterward. Sometimes the UI comes a few weeks or months after the code has already been written and merged.

The rate at which features got completed was also astounding. Some of them were minor stuff (this URL shouldn’t have a line break) but even major features got done much faster than I’m used to. Although, to be fair, implementing something such as “optimize I/O writes on Linux 32 bits” vs. “send an email when the user attempts to log in but doesn’t actually have an account” are tasks of a very different magnitude.

Along the same path, the capability for concurrent work was much higher. We could work on different parts of the app with a much reduced chance of conflicts and stepping on each other’s toes. Even when we were working on roughly the same areas.

Readability and maintainability matter a lot more in business software. Performance trumps those when dealing with system software. That isn’t to say that perf isn’t important for business software, but we have so much spare capacity for the things we want to do that it doesn’t usually matter.

I can’t write business level software without ReSharper, I can write system code without it, though.

JavaScript sucks regardless of the project type. There is something deeply wrong in the fact that building my JS based UI takes longer than it takes to compile my actual application.

There are a lot of things that are the same, of course. But probably the most important factor that I have to note is the sensitivity to pain.

What do I mean by that? For example, how fast can you go from hitting F5 to debugging your current issue? How much time does it take you to create a new thing and use it?

When using dependency injection, if you aren’t setting up automatic discovery, you have a recurring pitfall. Every time you add something new, you have to remember to register it. If you do have automatic discovery, you need to be clear about what the conventions are. It can seem like magic, and it is easy to lose that knowledge. Let’s take command execution as a good example. Once you have a command in the system, debugging it means hitting F5 and stepping through the code. If you need to make a change, go ahead and do that, hit F5 again, and you are back in the same location. As an added bonus, this also ensures that your commands are idempotent, since you are re-running them all the time while debugging.

The key is that you need to be able to hit F5 and get there. We initially had a setup where you had to run the app from the command line, attach the debugger (manually!), go do something in the UI, and only then could you debug what you were doing. Not a big deal, if you are doing that once in a blue moon. But during active development? That is horrendous for productivity. I couldn’t stand it, and it was the very first thing that I tackled. It only shaved about 20 – 30 seconds off the launch time, but it had a big impact on the way I approached things.

Because I didn’t have to do any work to get back to the debugging mindset, I found myself working in a very different manner. I would make a change, run it, make a change, etc. When I had to do (a bit) more work, I had a much more careful process. And that slowed things down.

I forgot how much fun you can have when working with business level software, because the challenges you face are so very different.

time to read 6 min | 1051 words

A process running on your system is typically a black box. You don’t have a lot of insight into what is going on inside it. Oh, there are all sorts of tools you can use to infer things (looking at system calls, memory consumption, network connections, etc), but by default… it is a mystery.

RavenDB is a database. It is meant to run unattended for long durations and is designed to mostly run itself. That means that when you look at it, you want to be able to figure out exactly what is going on with the system as soon as possible. To that end, we have included a lot of features inside RavenDB that expose the internal state of the system. From tracing each I/O and its duration to providing detailed statistics about costs and amount of effort invested in various tasks.

These features are invaluable to figure out exactly what is going on in RavenDB at a particular point in time. Of course, nothing beats the ability to open a debugger and inspect the state of the system. But that is something that you can only really do in development. It is not something that can be done in production, obviously. Or can it?

Since RavenDB 3.0, we actually had just this feature, being able to ask RavenDB to capture and display its own state in a format that should be very familiar to developers. When we created RavenDB 4.0, we were able to carry this feature over on Windows (at some cost), but it was a complete non starter on Linux.

On Windows, a process can debug another process if they belong to the same user (somewhat of an oversimplification, but good enough). On Linux, the situation is a lot more complex. A process can usually only debug another process if the debugger is running as root or is the parent process of the debuggee process.

Another complication was that we are using ClrMD, a wonderful library that allows us to introspect live processes (among many other things). It did not have support for Linux until about a month ago… as soon as we had the most basic of support there, we jumped into action, seeing how we could bring this feature to Linux as well. A lot of our users are running production systems on Linux, and the ability to look at the system and go: “Hmm. I wonder what this is doing” and then being able to tell is something that we consider a major boost to RavenDB.

It took a lot of fighting and learning a lot more about how debugging permissions work on Linux than I ever wanted to know. But we got it working (details below). You can see what this looks like on a live Linux server:

[Screenshot: the stack traces captured from a live Linux server, shown in the RavenDB Studio]

As you can see, there is an indexing thread here doing some work on spatial data. We are going to enhance this view further with the ability to see CPU times as well as job names. The idea is that this is something that you will look at and get enough insight to not need to check the logs or try to infer what is going on. You could just tell.

Now, for the gory details of how this works. We changed the implementation on both Windows and Linux to use passive attach to the process, which is much faster. The first thing we tried, once we moved to passive attach, was to debug ourselves.

This is a nice enough feature, and quite elegant. We debug ourselves, pull the stack traces and display the data. Unfortunately, this doesn’t work on Linux. A process cannot debug itself. All debugging in Linux is based on the ptrace() system call, and the permissions for that are as I described above. I can’t imagine the security implications of letting a process debug itself. After all, it can already do anything the process can do, because it is the process. But I guess that this is an esoteric enough scenario that no one noticed, and the reaction was: use a workaround.

The usual workaround is to have a process that would spawn RavenDB and would then be able to debug it. That is… possible, but it would be a major shift in how we deploy, not something that I wanted to do. There is also the ptrace_scope flag, which is supposed to control this behavior. In my tests, at least, disabling the security checks via this flag did absolutely nothing.

Running as root worked just fine, of course. And then the process crashed. On Linux, when trying to debug your own process, there seems to be an interesting interaction between the debugger and debuggee if an exception is thrown. To the point where it will corrupt the CoreCLR state and kill the process. That was a fun bug to trace, sort of. Linux has an escape hatch in the form of the PR_SET_PTRACER option that can be used. However, you can’t designate your own process, unfortunately. That, combined with the hard crashes, made self debugging a non starter.

But I still want this feature, and without changing too much about how we are doing things.

Here is what we ended up doing. We have a separate process just to capture the stack trace. When you ask RavenDB to get its stack trace, it will spawn this process, but ask it to wait. It will then grant the new process the permissions necessary to debug RavenDB and signal it to continue. At this point, the debugger child process will capture the stack trace and send it back to RavenDB. RavenDB will reset the permission, enhance the stack trace with additional information that we can provide from inside the process and display it to the user.
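To make that dance a bit more concrete, here is a hedged sketch of the parent side of the flow. The helper tool name, its arguments and the handshake protocol are invented for the example; this is not RavenDB’s actual code, just the shape of the idea:

    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Runtime.InteropServices;

    public static class StackTraceCapture
    {
        // From linux/prctl.h: allow a specific pid to ptrace() this process (Yama LSM).
        private const int PR_SET_PTRACER = 0x59616d61;

        [DllImport("libc", SetLastError = true)]
        private static extern int prctl(int option, ulong arg2, ulong arg3, ulong arg4, ulong arg5);

        public static string CaptureOwnStackTraces(string debuggerToolPath)
        {
            // 1. Spawn the helper process, telling it to wait for a signal.
            var psi = new ProcessStartInfo(debuggerToolPath,
                $"--pid {Process.GetCurrentProcess().Id} --wait")
            {
                RedirectStandardInput = true,
                RedirectStandardOutput = true
            };
            using var debugger = Process.Start(psi);

            // 2. Grant that specific child permission to debug us.
            if (prctl(PR_SET_PTRACER, (ulong)debugger.Id, 0, 0, 0) != 0)
                throw new IOException($"prctl failed, errno = {Marshal.GetLastWin32Error()}");

            try
            {
                // 3. Signal the child to continue, then read back the captured stack traces.
                debugger.StandardInput.WriteLine("go");
                return debugger.StandardOutput.ReadToEnd();
            }
            finally
            {
                // 4. Reset the permission once we are done.
                prctl(PR_SET_PTRACER, 0, 0, 0, 0);
                debugger.WaitForExit();
            }
        }
    }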

The actual debugger process is also marked with setcap to provide it with the additional permissions it needs. This separation means that we isolate these permissions to a single purpose tool that can be invoked and closed, without increasing the attack surface of RavenDB.

The end result is that you can walk to a production RavenDB server, running on Windows or Linux, and get better information than if you just attached to it with the debugger.

time to read 5 min | 823 words

Recently we got a couple of questions on the mailing list about running a RavenDB 4.x cluster with just two nodes in it. This was a fairly common topology in the RavenDB 3.x days, because each of the nodes was mostly independent, but that added a lot of operational complexity to the system. In RavenDB 3.x you had to do a lot of stuff on each of the nodes in the system. RavenDB 4.x brought unified cluster management and greatly simplified a lot of operational tasks. But one of the results of this change is that we now have a cluster rather than just a bunch of nodes.

In particular, in order to be able to operate correctly, a RavenDB cluster needs a majority of the voting nodes to operate successfully. In a typical cluster setup, you are going to have three to five nodes and you’ll need two or three of them to be accessible for the cluster to be healthy.

However, in a cluster of only two nodes, a curious problem arises. To get a majority of the nodes, you need… all the nodes. In other words, if you are running a cluster of just two nodes, and one of them is inaccessible, your cluster is not available.

RavenDB’s distributed nature is built on multiple layers. Even while the cluster itself is not available, you can still load and save documents to the database, perform queries, etc. Most of the normal operations that you would do on a day to day basis will work just fine without the cluster available.

However, management functions require that the cluster be up. These include operations such as adding or removing nodes from the cluster, creating or deleting a database, creating an index (including creating an auto index) or using advanced features such as ETL or Subscriptions.

In both of the cases that were raised on the mailing list, we had a two node cluster and one of the nodes was down (the VM was shut down). That led to the inability to remove the down node from the cluster (we have emergency operations that allow that, but they are not meant for normal use) or to errors during queries that required us to introduce a new index to the system.

It is important to understand that this isn’t actually an error. This is the system operating as designed and is a predictable (and desirable) part of how it is supposed to work in such failure modes.

However, that is true only if you are running your cluster with all the nodes as full voting member nodes. There are other alternatives. If you have a two node cluster, a single node being down will take the whole cluster down. At this point, you can usually designate one node as the primary and choose a different topology. Consider a cluster where we have the leader A as well as node B, and node B is marked with a W. Usually a member node in the cluster will be marked with M (for Member), but the W marking stands for Watcher.

In this case, a watcher node is a silent participant in the cluster. It can listen, but doesn’t interfere in the cluster itself. Node A is the sole node in the cluster that can make decisions. So if node B is down, the cluster is still functional. However, if node A is down, node B is going to operate without the cluster. Given that this would be the same situation anyway if you are running with both A and B as full members, that is a net benefit. And from experience, users who want to run a dual node cluster typically already have pretty firm ideas about which of the nodes is the primary.

You can demote a node from a full member to a watcher (and vice versa) dynamically, in the cluster management page in the Studio. However, remember that this is an operation that requires a majority of the cluster to be available.


You can also add a node to the cluster as a watcher immediately, which is probably a better idea.

Aside from not being counted for cluster votes, watcher nodes in RavenDB behave in the exact same manner as other nodes. You can assign them tasks, the cluster manages them as usual, they host databases and in general they behave just like any other RavenDB node.

The other use case for watcher nodes is in very large clusters. If your cluster grows beyond seven nodes, you’ll typically start adding watcher nodes to the cluster, instead of full member nodes. This is to avoid having to get a majority vote from a large number of nodes.

time to read 4 min | 704 words

One of the changes that we made in RavenDB 4.2 is a pretty radical one, even if we didn’t really talk about it. It is the fact that RavenDB now contains C code. Previously, we were completely managed (with a bunch of P/Invoke calls). But now we have some C code in the project. The question is why?

The answer isn’t actually about performance or the power of native C code. We are usually pretty happy with the kind of assembly instructions that we can get from C#. The actual problem was that we needed a proper abstraction. At this moment, RavenDB is running on the following platforms:

  • Windows x86-32 bits
  • Windows x86-64 bits
  • Linux x86-32 bits
  • Linux x86-64 bits
  • Linux ARM 32 bits
  • Linux ARM 64 bits
  • macOS 64 bits

And each of these platforms requires some changes in how we do things. The other problem is that .NET is a well specified system; all the type sizes are well known, for example. The same isn’t true for the underlying API. Windows does a really good job of maintaining proper APIs across versions and 32/64 bit editions. Linux basically doesn’t seem to care. Type sizes change quite often, sometimes in unpredictable ways.

Probably the most fun part was figuring out that on x86, Linux syscall #224 is gettid(), but on ARM, you’ll be calling gettime(). If you are using C, all of that is masked for you; handling it ourselves got quite unwieldy. So we decided to create a PAL (platform abstraction layer) in C to handle these details.

The rules for the PAL are simple, we don’t make assumptions about types, sizes or underlying platform. For example, let’s take a look at some declarations.

[Code: the C declarations for these PAL functions]

All the types are explicit about their size, and where we need to pass a complex type (SYSTEM_INFORMATION) we define it ourselves, rather than relying on any system type. And here are the P/Invoke definitions for these calls. Again, we are being explicit with the types, even though in C# the sizes of the types are fixed.

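The screenshots aren’t reproduced here; the declarations were along these lines. The library name and the function signatures below are hypothetical reconstructions, not the actual RavenDB PAL:

    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential)]
    public struct SYSTEM_INFORMATION
    {
        // Only explicitly sized fields; we define the struct ourselves
        // instead of relying on any OS provided type.
        public int PageSize;
        public int PrefetchStatus;
    }

    public static class Pal
    {
        [DllImport("librvnpal", CallingConvention = CallingConvention.Cdecl)]
        public static extern int rvn_get_system_information(
            out SYSTEM_INFORMATION info,
            out int errorCode);

        [DllImport("librvnpal", CallingConvention = CallingConvention.Cdecl)]
        public static extern int rvn_get_path_disk_space(
            string path,
            out ulong totalFreeBytes,
            out ulong totalDiskSpaceBytes,
            out int errorCode);
    }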

You might have noticed that I care about error handling. And error handling in C is… poor. We use the following convention in these kinds of operations:

  • Each method does a single logical thing.
  • Each method returns either success or a flag indicating the internal step in which it failed.
  • On failure, the method also returns the system error code for the failure.

The idea is that on the calling side, we can reconstruct exactly where and why we failed and still produce good errors.
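On the C# side, that convention translates into something like this, using the hypothetical Pal declarations from the sketch above:

    using System.IO;

    public static class DiskSpace
    {
        public static (ulong FreeBytes, ulong TotalBytes) Get(string path)
        {
            var step = Pal.rvn_get_path_disk_space(path,
                out var freeBytes, out var totalBytes, out var errorCode);

            if (step != 0) // non-zero tells us which internal step failed
                throw new IOException(
                    $"Failed to get disk space for '{path}' at step {step}, " +
                    $"system error code {errorCode}");

            return (freeBytes, totalBytes);
        }
    }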

Yesterday I ran into an issue where we hadn’t moved some code to the PAL, and we actually had a bug there. The problem was that when running on an ARM32 machine, we would pass a C# struct to a syscall, but we had defined that struct based on the values in 64 bit Linux. When called on a 32 bit system, the values went to the wrong locations. Luckily, this was a call whose result is never acted upon. It is used by our dashboard to let the admin know how much disk space is available, but there is no code that actually takes action based on this information.

Which was great, because when we actually ran the code, we got this value in the Studio:

[Screenshot: the Studio showing an absurdly large amount of free disk space]

When I dug deeper into the code, it gave really bad results. My Raspberry Pi thought it had 700 PB of disk space free. The actual reason we got this funny error? We send the number of bytes to the client, and under these conditions, we can only represent up to about 8 PB of free space in the browser.

I moved the code from C# P/Invoke to a simple method to calculate this:


Implementing this for all platforms means that we have a much nicer interface and our C# code is abstracted from the gory details of how we actually compute this.

time to read 5 min | 960 words

What happens when you want to page through the result set of a query while the underlying data set is being constantly modified?

This seems like a tough problem, and one that you wouldn’t expect to encounter very often, right? But a really good example of just this issue is the notion of a feed in a social network. To take Twitter for simplicity, you have many people generating tweets, and other users browsing through their timelines.

What ends up happening is that the user browsing through their timeline is actually trying to page through the results, but at the same time, you are going to get more updates to the timeline while you are reading it. One of the key requirements that we have here, then, is that we want to be sure that we aren’t actually creating a lot of jitter for the user as they scroll through the timeline. Luckily, because this is a hard problem, users are already quite familiar with some of the side effects. It would surprise no one to see the same tweet multiple times in the timeline. It can be because of a retweet or a like by a user you follow, or it can be a result of the way paging is done.

Now that we understand what we want to achieve, let’s see how you can try getting there. The simplest way to handle this is to ask the database for some sort of a stable reference for the query. So instead of executing the query and being done with it, you’ll have the server maintain it for a period of time and send you the pages as you need them. This is simple and easy to implement, but costly in terms of system resources. You’ll need to keep the query results in memory for each one of your users, and that can be quite a lot of memory to keep around just in case. Especially given the difference between human interaction times and the speed of modern systems.

Another way to do that is to ask the database engine to generate a way to re-create the query as it was at that time. This is sometimes called a continuation token or some such. That works great, usually, but comes with its own complications. For example, imagine that I’m doing this on the following query:

from Users order by LastName limit 5

Which gives us the following result:

[Results: the first five users, sorted by last name]

And I got the first five users, and now I want to get the next five. Between the first and second query, a user whose last name is “Aardvark” was inserted into the system. At this point, what would you expect to get from the query? We have two choices here, as you can see below:

[Results: the two possible second pages, one shown in orange and one in green]

The problem is that from my perspective, both of those have serious issues. To start with, to compute the results shown in orange, you’ll need to jump through some serious hoops on the backend, and the result looks strange. To get the results in green is quite easy, but it will mean that you missed out on seeing Aardvark.

You might have noticed that the key issue here isn’t so much the way we build the query, but the order in which we need to traverse it. We ask to sort it by last name, but we also want to get results as they come in. As it turns out, the process becomes a whole lot simpler if we unify these concepts. If we issue the following query, for example, our complexity threshold drops by a lot.

from Messages order by CreatedAt desc limit 5

This is because we now get the following results:

[Results: five messages ordered by CreatedAt descending, with CreatedAt values 222, 221, 219, 218 and 217]

And paging through this is now pretty easy. If we want to page down, we can issue the following query to get the next page of data:

from Messages order by CreatedAt desc where CreatedAt < 217 limit 5

By making sure that we are paging and filtering on the same property, we can easily scroll through the results without having to do too much work, either in the application or in our database backend. We can also check whether there is new stuff that we missed by querying for CreatedAt > 222, of course.
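In the client API, this kind of paging is straightforward. A sketch (the Message class here is illustrative):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Raven.Client.Documents;
    using Raven.Client.Documents.Session;

    public class Message
    {
        public string Id { get; set; }
        public long CreatedAt { get; set; }
        public string Text { get; set; }
    }

    public static class Timeline
    {
        // Page down: everything older than the last item we have already shown.
        public static Task<List<Message>> NextPageAsync(
            IAsyncDocumentSession session, long lastSeenCreatedAt)
        {
            return session.Query<Message>()
                .Where(m => m.CreatedAt < lastSeenCreatedAt) // e.g. 217
                .OrderByDescending(m => m.CreatedAt)
                .Take(5)
                .ToListAsync();
        }
    }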

But there is one wrinkle here. I intentionally used the CreatedAt field but put numeric values there. Did you notice that there was no 220 value? That one was created on an isolated node and hasn’t arrived yet. When it shows up in the local database, we’ll need to decide whether to give it a new value (making sure it will show up in the timeline) or store it as is, meaning that it might get lost.

These types of questions are probably more relevant at the business level. You might want to apply different behaviors based on how many likes a tweet has, for example.

Another option is to have an UpdatedAt field as well, which allows you to quickly answer the question: “What items in the range I scanned have changed?”. This also allows for a simpler model for handling updates, but much of that depends on the kind of behavior you want to get. It handles updates, including updates to the parts seen and unseen, in a reasonable way and at a predictable cost.

time to read 7 min | 1301 words

Last week I posted about some timeseries work that we have been doing with RavenDB. But I haven’t actually talked about the feature in this space before, so I thought that this would be a good time to present what we want to build.

The basic idea with timeseries is that this is a set of data points taken over time. We usually don’t care that much about an individual data point but care a lot about their aggregation. Common usages for time series include:

  • Heart beats per minute
  • CPU utilization
  • Central bank interest rate
  • Disk I/O rate
  • Height of ocean tide
  • Location tracking for a vehicle
  • USD / Bitcoin closing price

As you can see, the list of stuff that you might want to apply this to is quite diverse. In a world that keeps getting more and more IoT devices, timeseries storing sensor data are becoming increasingly common. When we set out to design and build timeseries support for RavenDB, we looked into quite a few timeseries databases to figure out what needs they serve.

RavenDB is a document database, and we envision timeseries support as something that you use at the document boundary. A good example of that would be the heartrate example. Each person has their own timeseries that records their own heartrate over time. In RavenDB, you would model this as a document for each person, and a heartrate timeseries on each document.

Here is how you would add a data point to my Heartrate’s timeseries:

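The original code was shown as an image; here is a sketch in the same spirit, written against the shape the timeseries API eventually took, so the exact names may differ from what the post showed. The “users/ayende” and “watches/fitbit” ids are illustrative:

    using System;
    using Raven.Client.Documents;

    public static class HeartrateRecorder
    {
        public static void Record(IDocumentStore store, double bpm)
        {
            using var session = store.OpenSession();

            session.TimeSeriesFor("users/ayende", "Heartrate")
                .Append(DateTime.UtcNow,      // UTC timestamps, millisecond precision
                        new[] { bpm },        // one or more values per timestamp
                        "watches/fitbit");    // tag: the source of this measurement

            session.SaveChanges();            // the timeseries is created on first use
        }
    }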

I intentionally started from the Client API, because it allows me to show off several things at once.

  1. Appending a value to a timeseries doesn’t require us to create it upfront. It will be created automatically on first use.
  2. We use UTC date times for consistency and the timestamps have millisecond precision.
  3. We are able to record a tag (the source for this measurement) on a particular timestamp.
  4. The timeseries will accept an array of values for a single timestamp.

Each one of those items is quite important to the design of RavenDB timeseries, so let’s address them in order.

The first thing to address is that we don’t need to create timeseries ahead of time. Doing so would introduce a level of schema to the database, which is something that we want to avoid. We want to allow the user complete freedom and a minimum of fuss when they are building features on top of timeseries. That does lead to some complications on our end. We need to be able to support timeseries merging, allowing you to append values on multiple machines and merge them together into a coherent whole.

Given the nature of timeseries, we don’t expect to see conflicting values. While you might see the same values come in multiple times, we assume that in that case you’ll likely just get the same values for the same timestamps (duplicate writes). In the case of different writes on different machines with different values for the same timestamp, we’ll arbitrarily select the largest of those values and proceed.

Another implication of this behavior is that we need to handle out of band updates. Typically in timeseries, you’ll record values in increasing date order. We need to be able to accept values out of order. This turns out to be pretty useful in general, not just for being able to handle values from multiple sources, but also because it is possible that you’ll need to load archived data into already existing timeseries. The rule that guided us here was that we wanted to allow the user as much flexibility as possible and we’ll handle any resulting complexity.

The second topic to deal with is time zones and precision. Given the overall complexity of time zones in general, we decided that we don’t want to deal with any of that and will store the times in UTC only. That allows you to work properly with timestamps taken from different locations, for example. Given the expected usage scenarios for this feature, we also decided to support millisecond precision. We looked at supporting only second level precision, but that was far too limiting. At the same time, supporting a finer resolution than milliseconds would result in much lower storage density for most situations and is very rarely useful.

Using DateTime.UtcNow, for example, we get a resolution of 0.5 – 15 ms, so trying to represent time at a finer resolution isn’t really going to give us anything. Other platforms have similar constraints, which added to the consideration of only capturing the time at millisecond granularity.

The third item on the list may be the most surprising one. RavenDB allows you to tag individual timestamps in the timeseries with a value. This gives you the ability to record metadata about the value. For example, you may want to use this to record the type of instrument that supplied the value. In the code above, you can see that this is a value that I got from a FitBit watch. I’m going to assign it a lower confidence value than a value that I got from an actual medical device, even if both of those values are going to go on the same timeseries.

We expect that the number of unique tags for values in a given time period is going to be small, and optimize accordingly. Because of the number of weasel words in the last sentence, I feel that I must clarify. A given time period is usually in the order of an hour to a few days, depending on the number of values and their frequencies. And what matters isn’t so much the number of values with a tag, but the number of unique tags. We can very efficiently store tags that we have already seen, but having each value tagged with a different tag is not something that we designed the system for.

You can also see that the tag that we have provided looks like a document id. This is not accidental. We expect you to store a document id there, and use the document itself to store details about the value. For example, whether the device that captured the value is medical grade or just a hobbyist device. You’ll be able to filter by the tag as well as by the related tag document’s properties. But I’ll show that when I post about queries, in a different post.

The final item on the list that I want to discuss in this post is the fact that a timestamp may contain multiple values. There are actually quite a few use cases for recording multiple values for a single timestamp:

  • Longitude and latitude GPS coordinates
  • Bitcoin value against USD, EUR, YEN
  • Systolic and diastolic reading for blood pressure

In each case, we have multiple values to store for a single measurement. You can make the case that the Bitcoin vs. currencies data could be stored as standalone timeseries, but GPS coordinates and blood pressure both produce two values that are not meaningful on their own. RavenDB handles this scenario by allowing you to store multiple values per timestamp, including support for each timestamp coming with a different number of values. Again, we are trying to make it as easy as possible to use this feature.
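For example, a GPS track with two values per timestamp might be appended like so (same caveats as the heartrate sketch earlier, and given the same store; the ids and names are made up):

    using (var session = store.OpenSession())
    {
        session.TimeSeriesFor("vehicles/23-A", "GpsTrack")
            .Append(DateTime.UtcNow,
                    new[] { 32.0853, 34.7818 },   // latitude, longitude in a single entry
                    "devices/gps-unit-4");

        session.SaveChanges();
    }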

The number of values per timestamp is going to be limited to 16 or 32; we haven’t made a final decision here. Regardless of the actual maximum size, we don’t expect to have more than a few of those values per timestamp in a single timeseries.

Then again, the point of this post is to get you to consider this feature in your own scenarios and provide feedback about the kind of usage you want to have for this feature. So please, let us know what you think.
