Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 5 min | 976 words

In this post, we are going to talk about the actual design for a time series feature, from a storage perspective. For the purpose of the discussion, I am going to assume the low level Voron as the underlying storage engine.

We are going to need to store the data in a series. And the actual data is going to be relatively ordered (time based, usually increasing).

As such, I am going to model the data itself as a set of Voron trees, one tree per series that is created. The actual data in each tree would be composed of a key and a value, each 8 bytes long.

  • Key (8 bytes): the number of 100-nanosecond intervals since Jan 1, 0001 (the range goes up to Dec 31, 9999).
  • Val (8 bytes): 64 bits double precision floating point number.

There can be no duplicate values for a specific time in a series. Since the key resolution is one ten millionth of a second, that should be granular enough, I think.
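A minimal sketch of the 16 byte entry layout described above. The big-endian byte order is an assumption (it makes a byte by byte key comparison agree with time order), and EntryCodec is a hypothetical name, not a Voron API.

```csharp
using System;
using System.Buffers.Binary;

public static class EntryCodec
{
    // Key: DateTime.Ticks (100-nanosecond intervals since Jan 1, 0001), 8 bytes.
    // Big-endian is assumed so that comparing keys byte by byte matches time order.
    public static byte[] EncodeKey(DateTime timestamp)
    {
        var buffer = new byte[8];
        BinaryPrimitives.WriteInt64BigEndian(buffer, timestamp.Ticks);
        return buffer;
    }

    public static DateTime DecodeKey(byte[] key)
    {
        return new DateTime(BinaryPrimitives.ReadInt64BigEndian(key));
    }

    // Value: the raw IEEE 754 bits of the double, 8 bytes.
    public static byte[] EncodeValue(double value)
    {
        var buffer = new byte[8];
        BinaryPrimitives.WriteInt64BigEndian(buffer, BitConverter.DoubleToInt64Bits(value));
        return buffer;
    }

    public static double DecodeValue(byte[] value)
    {
        return BitConverter.Int64BitsToDouble(BinaryPrimitives.ReadInt64BigEndian(value));
    }
}
```

Because the key is just the tick count, two readings for the same series at the same tick map to the same key, which is why a later write replaces the earlier one.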

Reading from the storage will always be done on a series basis. The idea is to essentially use the code from this post, but to simplify to a smaller key. This way, we can store roughly 250 data points in each Voron page, which should give us some really good performance for both reads and writes.

Note that we need to provide a storage API to do bulk writes, since there are some systems that would require it. Those can either be systems with high refresh rates (a whole set of sensors with very short refresh intervals) or, more likely, import scenarios. The import scenario can be either a one time thing (moving to a new system), or something like a nightly batch process where we get the aggregated data from multiple sources.

For our purposes, we aren’t going to distinguish between the two.

We are going to provide an API similar to:

    • void Add(IEnumerable<Tuple<string, IEnumerable<Tuple<DateTime, double>>>> data);

This ugly method can handle updates to multiple series at once. In human speak, this is an enumerable of updates to a series where each update is the time and value for that time. From the storage perspective, this creates a single transaction where all the changes are added at once, or not at all. It is the responsibility of higher level code to make sure that we optimize number of calls vs. in flight transaction data vs. size of transactions.

Adding data to a series that doesn’t exist will create it.

We also assume that series names are made of printable Unicode characters, up to 256 bytes (UTF8).

The read API is going to be composed of:

  • IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
  • IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string []series, DateTime start, DateTime end);

Anything else would have to be done at a higher level.

There is no facility for updates. You can just add again on the same time with a new value, and while this is supported, this isn’t something that is expected.

Deleting data can be done using:

  • void Delete(string series, DateTime start, DateTime end);
  • void DeleteAll(string series);

The first option will delete all the items in range. The second will delete the entire tree. The second is probably going to be much faster. We are probably better off checking to see if the requested range extends beyond the min / max items for this series, and falling back to DeleteAll if we can. An explicit DeleteAll will also delete all the series tags, while an implicit one, Delete("series-1", DateTime.MinValue, DateTime.MaxValue) for example, will delete the series’ tree, but keep the series tags.
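The range check mentioned above could look roughly like this. DeletePlanner is a hypothetical helper, and it assumes that the smallest and largest timestamps in the tree are cheap to get and that the range is inclusive on both ends.

```csharp
using System;

public static class DeletePlanner
{
    // Returns true when [start, end] covers every key stored in the series,
    // given the smallest and largest timestamps currently in its tree.
    // In that case we can drop the whole tree instead of deleting item by item.
    public static bool CoversWholeSeries(DateTime minInTree, DateTime maxInTree, DateTime start, DateTime end)
    {
        return start <= minInTree && end >= maxInTree;
    }
}
```

When this returns true for a Delete call, only the tree would be dropped; the series tags stay, as described above.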

A series can have tags attached to it. Those can be any string up to 256 bytes (UTF8). By convention, they would usually be in the form of “key:val”.

Series can be queried using:

  • IEnumerable<string> GetSeries(string start = null);
  • IEnumerable<string> GetSeriesBy(string tag, string start = null);
  • IEnumerable<string> GetSeriesTags(string series);

Series can be tagged using:

  • void TagSeries(string series, string []tags);

There can be no duplicate tags.

In summary, the interface we intend to use for storage would look roughly like the following:

public interface ITimeSeriesStorage
{
    void Add(IEnumerable<Tuple<string, IEnumerable<Tuple<DateTime, double>>>> data);
    IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
    IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string []series, DateTime start, DateTime end);
    void Delete(string series, DateTime start, DateTime end);
    void DeleteAll(string series);
    IEnumerable<string> GetSeries(string start = null);
    IEnumerable<string> GetSeriesBy(string tag, string start = null);
    IEnumerable<string> GetSeriesTags(string series);
    void TagSeries(string series, string []tags);
}
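To make the Add / ScanRange semantics concrete, here is a small in-memory sketch. InMemoryTimeSeries is a hypothetical stand-in, not the Voron backed implementation: it ignores transactions, tags and persistence, and it assumes that ScanRange is inclusive on both ends.

```csharp
using System;
using System.Collections.Generic;

public class InMemoryTimeSeries
{
    // One sorted dictionary per series, playing the role of one tree per series.
    private readonly Dictionary<string, SortedDictionary<DateTime, double>> _series =
        new Dictionary<string, SortedDictionary<DateTime, double>>();

    public void Add(IEnumerable<Tuple<string, IEnumerable<Tuple<DateTime, double>>>> data)
    {
        foreach (var update in data)
        {
            SortedDictionary<DateTime, double> tree;
            if (_series.TryGetValue(update.Item1, out tree) == false)
            {
                // Adding data to a series that doesn't exist creates it.
                _series[update.Item1] = tree = new SortedDictionary<DateTime, double>();
            }
            foreach (var point in update.Item2)
            {
                // Adding again at the same time replaces the previous value.
                tree[point.Item1] = point.Item2;
            }
        }
    }

    public IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end)
    {
        SortedDictionary<DateTime, double> tree;
        if (_series.TryGetValue(series, out tree) == false)
            yield break;

        // A linear filter keeps the sketch short; a real tree would seek to start.
        foreach (var kvp in tree)
        {
            if (kvp.Key < start) continue;
            if (kvp.Key > end) yield break;
            yield return Tuple.Create(kvp.Key, kvp.Value);
        }
    }
}
```

A caller builds one outer tuple per series, each holding that series' (time, value) pairs, and passes them all to a single Add call.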

Data sizes – assume 1 value per minute per series; that gives us an update rate of 1,440 updates per day, or 525,600 per year. That means that for 100,000 sensors (not an uncommon amount) we need to deal with 52,560,000,000 data items per year. At 16 bytes per entry (8 byte key, 8 byte value), this would end up being about 840 GB of raw data per year. Assuming 1 value per second, that gives us 86,400 values per day, 31,536,000 per year, and 3,153,600,000,000 values per year for the 100,000 sensors, which will use about 50 TB. Those are the data sizes that we need to plan for here.

Next, we’ll discuss what this is all going to look like over the wire…

time to read 2 min | 289 words

I mentioned before that there is actually quite a lot that goes into a major feature, usually a lot more than would seem on the surface. I thought that this could serve as a pretty good example of a series of blog posts that we can use to show how we are actually sitting down and doing high level design of features in RavenDB.

I’ve actually already spoken about time series data, and we have seen an implementation of the backend. But that was just the basic storage, and that was mostly to explore how to deal with Voron for more scenarios, to be honest.

Now, allow me to expand that into a full-fledged feature design.

What we want: the ability to store time series data and work with it.

Time series data is a value (double) that is recorded at a specific time for a specific series. A good explanation on that can be found here: https://tempo-db.com/docs/modeling-time-series-data/

We have a bunch of sensors that report data, and we want to store it, and get it back, and get rollups of the information, and probably do more, but that is enough for now.

At first glance, we need to support the following operations:

  • Add data (time, value) to a series.
  • Add bulk data to a set of series.
  • Read data (time,value) from a series or a set of them in a given range.
  • Read rollup (avg, mean, max, min) per period (minutes, hour, day, month) in a given range for a set of series.
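As a rough illustration of the rollup operation, here is a sketch that buckets points into fixed size periods and aggregates each bucket. The Rollup and RollupSketch names are hypothetical, and this says nothing about how the feature would compute rollups in storage.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Rollup
{
    public DateTime PeriodStart;
    public double Avg, Min, Max;
    public int Count;
}

public static class RollupSketch
{
    // Groups points into fixed-size periods (minute, hour, day) and aggregates each one.
    // Calendar months are not fixed-size and would need their own bucketing.
    public static IEnumerable<Rollup> Compute(IEnumerable<Tuple<DateTime, double>> points, TimeSpan period)
    {
        return points
            .GroupBy(p => p.Item1.Ticks / period.Ticks)
            .OrderBy(g => g.Key)
            .Select(g => new Rollup
            {
                PeriodStart = new DateTime(g.Key * period.Ticks),
                Avg = g.Average(p => p.Item2),
                Min = g.Min(p => p.Item2),
                Max = g.Max(p => p.Item2),
                Count = g.Count()
            });
    }
}
```

Calling RollupSketch.Compute(points, TimeSpan.FromHours(1)) gives one Rollup per hour that has at least one point; hours with no data are simply absent.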

In the next few posts, I’ll discuss storage, network, client api, user interface and distribution design for this. I would truly love to hear from you about any ideas you have.

time to read 3 min | 593 words

Just sit right back and you'll hear a tale, a tale of a fateful bug. That started from a simple request, about a feature that was just a bit too snug.

Okay, leaving aside my attempts at humor. This story is about a customer reporting an issue. “Most of the time we have RavenDB running really fast, but sometimes we have high latency requests”.

After a while, we managed to narrow it down to the following scenario:

  • We have multiple concurrent requests.
  • Those requests contain a Lazy request that has a facet query.
  • The concurrent requests appear to all halt and then complete together.

In other words, it looks like we had all those requests waiting on a lock, then when it is released, all of them are free to return. This makes sense, there is a cache lock in the facet code that should behave in this manner. But when we looked at that, we could see that this didn’t really behave in the way we expected it to.

Eventually we got to test this out on the client data, and that is when we were able to pinpoint the issue.

Usually, you have facets like this:

The one on the left is when searching Amazon for HD, the one on the right is when you search Amazon for TV.

[images: Amazon facets for the HD search (left) and the TV search (right)]

In RavenDB, you typically express this sort of query using:

session.Query<Product>().Search("Search", query).ToFacets(new Facet { Name = "Brand" });

And we expect the number of facets that you have in a query to be on the order of a few dozen.

However, the client in question has done something a bit different. I think that this is because they brought the system over from a relational database. Each product in the system had a list of facets associated with it. It looked something like:

"Facets": [13124,87324,32812,65743]

Obviously, this means that this product belongs to the “2,000 – 3,500” price range facet (electronics), the “Red” color facet (electronics, handheld), etc…

In total, we had over 70,000 facets in the database, and that is just something that we never really expected. Because we didn’t expect it, we reacted… poorly when we had to deal with it. In fact, that pattern of behavior meant that we were effectively worse off than if we had no cache at all: we would always have to do the work, and never really gain any benefit from it (there wasn’t enough sharing to actually trigger the benefits of the cache). And because we took locks on the cache to ensure that we don’t get into a giant mess… Well, you can figure out how it went from there.

The fix was to actually devolve the code into a simpler method. Instead of trying to be smart and just figure out what we needed to compute for this query, we can be aggressive and load everything we need. All the subsequent requests will result in no wait time, because the data is already there. The code became much simpler.
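The shape of that change, load everything once and then serve reads without locking, can be sketched like this. All the names here are hypothetical; this is not the actual RavenDB facet code, just an illustration of the idea using Lazy<T>.

```csharp
using System;
using System.Collections.Generic;

public class EagerFacetCache
{
    private readonly Lazy<Dictionary<int, string[]>> _all;

    public EagerFacetCache(Func<Dictionary<int, string[]>> loadEverything)
    {
        // Lazy<T> with the default thread-safety mode runs the loader once;
        // callers that arrive during the load wait, later callers never block.
        _all = new Lazy<Dictionary<int, string[]>>(loadEverything);
    }

    public string[] GetTerms(int facetId)
    {
        string[] terms;
        // The dictionary is never modified after loading, so concurrent reads are safe.
        return _all.Value.TryGetValue(facetId, out terms) ? terms : new string[0];
    }
}
```

The first request pays for the full load; after that, a lookup is a plain dictionary read no matter how many distinct facets the queries touch.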

Oh, and we got it deployed to production and saw a 400% decrease in the average request time, and no more sudden waits when we had requests piling up.

time to read 1 min | 124 words

This is one of those things that I had to read several times to realize what was actually going on.

The code for that is here: https://github.com/ayende/ravendb/blob/c0c9ccf98011fb64b5eb5406a900ec1338ea78e4/Raven.Tests/Issues/RavenDB_1603.cs#L32

And it appears that along with everything else, RavenDB also includes a proxy server.

Now, to be fair, this is required for our tests, to see what happens when we have a forced disconnect / timeout at the network level, so it makes sense. And the whole thing is under 100 lines of code.

This sort of thing explains why we really need to do a whole bunch of work on our tests. We want to get to 500 – 1,000 tests (currently we have close to 3,200) that run in under 5 minutes.

time to read 2 min | 217 words

One of the interesting* tidbits I learned is that just having an idea about a feature is far from being able to actually do something with it.  Hell, even implementing the feature is just the first step in a very long road.

* Read: annoying instead of interesting.

Here are the things that you need to do just in order to actually call a feature Ready To Show (vs. Done **):

  • Implement backend.
  • Expose over the wire.
  • Write client API.
  • Create user interface.
  • Distribution (replication/sharding)

Those things are just what you need so you can actually show something meaningful, rather than a bunch of code and a lot of hand waving. This can be pretty annoying at times, mostly because it puts a lot of work that has to be done before we can actually show it to people.

** And just because people will ask, to get to done you need:

  • Logging
  • Monitoring
  • The ability to test easily
  • Performance trials
  • Longevity trials
  • Production proofing

I’m probably forgetting a bunch of stuff, but those are what pop to mind.

The nice thing about getting to show something is that we can usually parallelize the work for all of those by handing this to different people.

time to read 2 min | 265 words

This is a review of the Metrics.NET project (commit cb52da325c0a88336e09412638f72620d9ba7992).

The project is supposed to give us a way to track metrics about our applications, and we want to make use of it in RavenDB instead of the highly unreliable performance counters. This is going to be a pretty short review, mostly because I don’t really have much to say. There are a few things that I take issue with (async tasks using Thread.Sleep instead of Task.Delay, but that is probably because it is targeting .NET 4.0, where Task.Delay isn’t available).

Other than that, most of the code is actually doing crazy math stuff, and there are proper links to the explanations, and you can see why it is doing so. The impressive thing is that pretty much everything that I wanted to do was already there, including an easy way to expose metrics over the wire, and the whole thing seems pretty seamless.

Very good work, all around.

For our purposes, however, I think we’ll need to do some other things. In particular, one of the major assumptions throughout the code is that there is always a type associated with a metric, which isn’t the case for our purposes.  More than that, the code now assumes a static and fixed set of metrics for the entire system. That doesn’t work very well for us when we have different metrics for each database, so we’ll probably need to change that as well.

But I am very impressed, this looks like it could sort out a lot of the things that we need to do very quickly.

time to read 3 min | 500 words

We got a report from a user about severe issues with RavenDB. It reports resource exhaustion with plenty of resources still available, and once that happens, it will refuse to even restart itself, forcing a process kill.

As you can imagine, that was a pretty big deal for us, so we set out to investigate. And we found some interesting results.

One of the things that we like to keep in mind with RavenDB is that it is a safe choice. Whenever we need to make a decision between various tradeoffs, we’ll always choose the safe choice. That means, among other things, that we are pretty careful about the way that we approach external input. And in this case, we are actively protecting ourselves from the outside world. One of the ways we do that is by limiting the number of requests that we will concurrently process.

The idea is that it is better to flat out reject requests than put such a load on the system that it will eventually crash. Indeed, that has been such a successful tactic that to this day, there have been exactly zero production issues with it. To my knowledge, it hasn’t ever been even noticed by any of our users.
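A minimal sketch of that kind of limit, rejecting instead of queueing, could look like the following. RequestLimiter is a hypothetical name and this is not the RavenDB implementation; it only shows the technique using SemaphoreSlim.

```csharp
using System;
using System.Threading;

public class RequestLimiter
{
    private readonly SemaphoreSlim _slots;

    public RequestLimiter(int maxConcurrentRequests)
    {
        _slots = new SemaphoreSlim(maxConcurrentRequests, maxConcurrentRequests);
    }

    // Returns false right away when all slots are taken, instead of queueing.
    public bool TryExecute(Action handler)
    {
        if (_slots.Wait(0) == false)
            return false; // the caller would answer with an error, such as a 503

        try
        {
            handler();
            return true;
        }
        finally
        {
            // Always give the slot back, even if the handler throws.
            _slots.Release();
        }
    }
}
```

Note that a limit like this counts physical requests only. A single request that fans out into many internal operations still takes just one slot, which is exactly the kind of gap described next.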

The actual issue is that we have an internal limit that is set by default to 256 concurrent transactions. And by default, we will accept up to 192 concurrent requests. Then I looked at the actual logs, and I found:

[image: excerpt from the logs]

And that explains much, but not nearly all. We had this in our code base for roughly 8 months. There are still other things that protect us from those issues, not the least of which is that it is actually hard to generate that number of requests against us (you really have to try very hard, usually from multiple machines). But there was one scenario that we didn’t consider for the purpose of protecting ourselves from the barbarians at the gate. Multi Get requests.

Multi Get requests allow you to package multiple requests to RavenDB into a single physical request. Those requests are going to cost you a single round trip to the server, and you can run as many of those as you want. In the dump we received, we could see 17 pending Multi Get requests, and about 400 queries being executed, each of them requiring their own session. No wonder we got out of session errors.

Final note: for what it is worth, I changed our limits to 1,024 concurrent sessions and 512 concurrent requests, which is more reasonable considering the kind of hardware we usually run on. Multi Get has another 192 sessions that it can utilize, and the rest are dedicated for background processes.

time to read 8 min | 1438 words

So far, we have just put the data in and out. And we have had a pretty good track record doing so. However, what do we do with the data now that we have it?

As you can expect, we need to read it out. Usually by specific date ranges. The interesting thing is that we usually are not interested in just a single channel, we care about multiple channels. And for fun, those channels might be synchronized or not. An example of the first might be the current speed and the current engine temperature in a car. They generally share the exact same timestamps. An example of out of sync is when you have a sensor on a rooftop measuring rainfall, and another sensor in the sewer measuring water flow rates. (Again, thanks to Dan for helping me with the domain).

This is interesting, because it presents quite a few interesting problems:

  • We need to merge different streams into a unified view.
  • We need to handle both matching and non matching sequences.
  • We need to handle erroneous data. What happens when we have two readings for the same time for the same sensor? Yes, that shouldn’t happen, but it does.

I solved this with the following API:

public class RangeEntry
{
    public DateTime Timestamp;
    public double?[] Values;
}

IEnumerable<RangeEntry> results = dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, new[]
{
    "6febe146-e893-4f64-89f8-527f2dbaae9b",
    "707dcb42-c551-4f1a-9203-e4b0852516cf",
    "74d5bee8-9a7b-4d4e-bd85-5f92dfc22edb",
    "7ae29feb-6178-4930-bc38-a90adf99cfd3",
});

This API gives me the results in time order, with the values in the same positions as the ids requested, and with nulls if there isn’t a value for that time in that particular sensor channel.

The actual implementation relies on this method:

IEnumerable<Entry> ScanRange(DateTime start, DateTime end, string id)

All this does is provide all the entries in a particular date range, for a particular channel. Let us see how we implement multi channel scanning on top of this:

private class PendingEnumerator
{
    public IEnumerator<Entry> Enumerator;
    public int Index;
}

private class PendingEnumerators
{
    private readonly SortedDictionary<DateTime, List<PendingEnumerator>> _values =
        new SortedDictionary<DateTime, List<PendingEnumerator>>();

    public void Enqueue(PendingEnumerator entry)
    {
        List<PendingEnumerator> list;
        var dateTime = entry.Enumerator.Current.Timestamp;
        if (_values.TryGetValue(dateTime, out list) == false)
        {
            _values.Add(dateTime, list = new List<PendingEnumerator>());
        }
        list.Add(entry);
    }

    public bool IsEmpty { get { return _values.Count == 0; } }

    public List<PendingEnumerator> Dequeue()
    {
        if (_values.Count == 0)
            return new List<PendingEnumerator>();

        var kvp = _values.First();
        _values.Remove(kvp.Key);
        return kvp.Value;
    }
}

public IEnumerable<RangeEntry> ScanRanges(DateTime start, DateTime end, string[] ids)
{
    if (ids == null || ids.Length == 0)
        yield break;

    var pending = new PendingEnumerators();
    for (int i = 0; i < ids.Length; i++)
    {
        var enumerator = ScanRange(start, end, ids[i]).GetEnumerator();
        if(enumerator.MoveNext() == false)
            continue;
        pending.Enqueue(new PendingEnumerator
        {
            Enumerator = enumerator,
            Index = i
        });
    }

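    // Note: this single RangeEntry (and its Values array) is reused for every
    // item we yield, so callers must copy it if they hold on to it.
    var result = new RangeEntry
    {
        Values = new double?[ids.Length]
    };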
    while (pending.IsEmpty == false)
    {
        Array.Clear(result.Values,0,result.Values.Length);
        var entries = pending.Dequeue();
        if (entries.Count == 0)
            break;
        foreach (var entry in entries)
        {
            var current = entry.Enumerator.Current;
            result.Timestamp = current.Timestamp;
            result.Values[entry.Index] = current.Value;
            if(entry.Enumerator.MoveNext())
                pending.Enqueue(entry);
        }
        yield return result;
    }
}

We are getting a single entry from each channel into the pending enumerators. Then, we collate all the entries that share the same time into a single entry.

We use the Index property to track the actual expected index of the entry in the output. And we handle duplicate times in the same channel by outputting multiple entries.

Testing this on my 1.1 million records data set, we can get 185 thousand records back in 0.15 seconds.

time to read 2 min | 259 words

I wonder what it says about what I am doing right now that I really wish that I could have the OS give me more control over virtual memory allocation. At any rate, the point of this post is to point out something quite important to people writing databases, especially databases that make use of virtual memory a lot.

There isn’t quite as much of it as I thought there would be. Oh, on 32 bits the 4GB limit is really hard to swallow. On 64 bits, the situation is much better, but still constrained.

On Windows, using x64, you are actually limited to merely 8TB of address space. In Windows 8.1, (and I assume, but couldn’t verify, Windows 2012 R2) you can use up to 128TB of virtual address space. With Linux, at least since 2.6.32, and probably earlier, the limit is 128TB per process.

The implication for Voron, by the way, is that the total size of all databases in a single process can be up to 8TB (probably somewhat less than that, the other stuff will also need memory). Currently the biggest RavenDB database that I am aware of was approaching the 1.5 – 2.0 TB mark last I checked (several months ago), and Esent, our current technology, is limited to 16TB per database.

So it isn’t great news, but it is probably something that I can live with. And at least I can give proper recommendations. In practice, I don’t think that this would be an issue. But that is good to know.

time to read 9 min | 1684 words

Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying details, but this is actually real world data, so we can learn a lot more from it.

This is what this looks like:

[image: sample of the time series data]

The first thing that I did was take the code in this post, and try it out for size. I wrote the following:

int i = 0;
using (var parser = new TextFieldParser(@"C:\Users\Ayende\Downloads\TimeSeries.csv"))
{
    parser.HasFieldsEnclosedInQuotes = true;
    parser.Delimiters = new[] {","};
    parser.ReadLine();//ignore headers
    var startNew = Stopwatch.StartNew();
    while (parser.EndOfData == false)
    {
        var fields = parser.ReadFields();
        Debug.Assert(fields != null);

        dts.Add(fields[1], DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture), double.Parse(fields[3]));
        i++;
        if (i == 25*1000)
        {
            break;
        }
        if (i%1000 == 0)
            Console.Write("\r{0,15:#,#}          ", i);
    }
    Console.WriteLine();
    Console.WriteLine(startNew.Elapsed);
}

Note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. But this simulates very well incoming events coming one at a time. We were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of “channels”. From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have an infinite number of separate trees. That means that I can use as many trees as I want, and we can model a channel as a tree in Voron. I also changed things so that instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster.
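The batching itself is simple enough to sketch. BatchingSketch is a hypothetical helper, and the actual transaction handling is left out; it only shows how a stream of lines can be split so that each batch is written in its own transaction.

```csharp
using System;
using System.Collections.Generic;

public static class BatchingSketch
{
    // Splits a stream of items into lists of at most batchSize items,
    // so the caller can commit one transaction per list.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }
        if (batch.Count > 0)
            yield return batch; // the final, partially filled batch
    }
}
```

With a batch size of 1,000, the 25,000 lines above turn into 25 transactions instead of 25,000, which is where the savings come from.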

That done, I inserted the full data set, which is 1,096,384 records. That took 36 seconds. In the data set I have, there are 35 channels.

I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That allows doing things like computing averages over time, comparing data, etc.

You can see the code implementing this in the following link.
