Voron & Time Series: Working with real data


Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying details, but it is still real world data, so we can learn a lot more from it.

This is what this looks like:


The first thing that I did was take the code in this post, and try it on for size. I wrote the following:

    // Needs: using Microsoft.VisualBasic.FileIO; using System.Diagnostics; using System.Globalization;
    // dts is the time series storage instance from the code in the earlier post.
    int i = 0;
    using (var parser = new TextFieldParser(@"C:\Users\Ayende\Downloads\TimeSeries.csv"))
    {
        parser.HasFieldsEnclosedInQuotes = true;
        parser.Delimiters = new[] {","};
        parser.ReadLine(); // ignore headers
        var startNew = Stopwatch.StartNew();
        while (parser.EndOfData == false)
        {
            var fields = parser.ReadFields();
            Debug.Assert(fields != null);
            // Each Add call runs in its own transaction.
            // The timestamp uses the round-trip ("o") format; the value is parsed culture-independently.
            dts.Add(fields[1],
                DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture),
                double.Parse(fields[3], CultureInfo.InvariantCulture));
            i++;
            if (i == 25*1000)
            {
                break; // only the first 25,000 lines for this test
            }
            if (i%1000 == 0)
                Console.Write("\r{0,15:#,#}          ", i); // progress indicator
        }
        Console.WriteLine();
        Console.WriteLine(startNew.Elapsed);
    }

Note that we are using a separate transaction per line, which means that we are doing a lot of extra work. But this simulates very well incoming events arriving one at a time. We were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of “channels”. From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have any number of separate trees. That means that I can use as many trees as I want, and we can model a channel as a tree in Voron. I also changed things so that instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster.
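To make the shape of that change concrete, here is a minimal sketch of the two ideas: one sorted collection per channel, and one commit per batch of entries instead of one per entry. It does not use Voron's API. `ChannelStore`, `Flush`, `Commits` and the in-memory `SortedList` are hypothetical stand-ins that I am assuming for illustration only; a real implementation would write to a Voron tree inside a transaction.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in, NOT Voron's API: one sorted collection
// per channel, and a commit counter in place of real transactions.
public class ChannelStore
{
    private readonly Dictionary<string, SortedList<DateTime, double>> _channels =
        new Dictionary<string, SortedList<DateTime, double>>();
    private readonly List<Tuple<string, DateTime, double>> _pending =
        new List<Tuple<string, DateTime, double>>();
    private readonly int _batchSize;

    public ChannelStore(int batchSize)
    {
        if (batchSize <= 0)
            throw new ArgumentOutOfRangeException("batchSize");
        _batchSize = batchSize;
    }

    // Number of batches applied so far (stands in for committed transactions).
    public int Commits { get; private set; }

    public int ChannelCount { get { return _channels.Count; } }

    // Buffer the entry and commit once the batch is full.
    public void Add(string channel, DateTime time, double value)
    {
        _pending.Add(Tuple.Create(channel, time, value));
        if (_pending.Count >= _batchSize)
            Flush();
    }

    // Apply all pending entries as a single unit of work.
    public void Flush()
    {
        if (_pending.Count == 0)
            return;
        foreach (var entry in _pending)
        {
            SortedList<DateTime, double> tree;
            if (_channels.TryGetValue(entry.Item1, out tree) == false)
            {
                tree = new SortedList<DateTime, double>();
                _channels[entry.Item1] = tree;
            }
            tree[entry.Item2] = entry.Item3; // last write wins on duplicate timestamps
        }
        _pending.Clear();
        Commits++;
    }

    // Entries of a single channel, ordered by time (committed entries only).
    public IEnumerable<KeyValuePair<DateTime, double>> Read(string channel)
    {
        SortedList<DateTime, double> tree;
        return _channels.TryGetValue(channel, out tree)
            ? tree
            : new SortedList<DateTime, double>();
    }
}
```

With a batch size of 1,000, the import loop would call `Add` for every line and `Flush` once at the end to commit the final partial batch. Keeping each channel in its own sorted structure is what makes reading a single channel in time order cheap.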

That done, I inserted the full data set, which is 1,096,384 records, just over a million. That took 36 seconds. In the data set I have, there are 35 channels.

I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That makes it practical to do things like computing averages over time, comparing data, etc.
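As an illustration of the kind of query this enables, here is a minimal sketch of averaging a channel's values over fixed time buckets. It assumes the channel's entries are available as a sequence of timestamp and value pairs. `ChannelStats` and `AveragePerBucket` are hypothetical names of my own, not part of Voron or of the linked code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Voron or the linked code.
public static class ChannelStats
{
    // Groups entries into fixed-size time buckets and averages each bucket.
    // Each result key is the start of its bucket.
    public static List<KeyValuePair<DateTime, double>> AveragePerBucket(
        IEnumerable<KeyValuePair<DateTime, double>> entries, TimeSpan bucket)
    {
        if (bucket.Ticks <= 0)
            throw new ArgumentOutOfRangeException("bucket");

        return entries
            // Round each timestamp down to the start of its bucket.
            .GroupBy(e => new DateTime(e.Key.Ticks - e.Key.Ticks % bucket.Ticks, e.Key.Kind))
            .OrderBy(g => g.Key)
            .Select(g => new KeyValuePair<DateTime, double>(g.Key, g.Average(e => e.Value)))
            .ToList();
    }
}
```

Calling `ChannelStats.AveragePerBucket(entries, TimeSpan.FromHours(1))` would give one average per hour for a channel. Since a single channel can be read in about 0.01 seconds, an aggregation like this can simply run over the whole channel in memory.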

You can see the code implementing this in the following link.