Ayende @ Rahien

Voron & Time Series: Working with real data

Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying information, but this is actual real world data, so we can learn a lot more from it.

This is what the data looks like:

image

The first thing that I did was take the code in this post, and try it out for size. I wrote the following:

    using System;
    using System.Diagnostics;
    using System.Globalization;
    using Microsoft.VisualBasic.FileIO; // TextFieldParser

    // dts is the time series storage instance from the code referenced above.
    int i = 0;
    using (var parser = new TextFieldParser(@"C:\Users\Ayende\Downloads\TimeSeries.csv"))
    {
        parser.HasFieldsEnclosedInQuotes = true;
        parser.Delimiters = new[] { "," };
        parser.ReadLine(); // ignore headers
        var stopwatch = Stopwatch.StartNew();
        while (parser.EndOfData == false)
        {
            var fields = parser.ReadFields();
            Debug.Assert(fields != null);

            // fields[1] = channel, fields[2] = ISO 8601 timestamp, fields[3] = value
            dts.Add(fields[1],
                DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture),
                double.Parse(fields[3], CultureInfo.InvariantCulture));
            i++;
            if (i == 25 * 1000)
                break; // stop after the first 25,000 events

            if (i % 1000 == 0)
                Console.Write("\r{0,15:#,#}          ", i); // progress indicator
        }
        Console.WriteLine();
        Console.WriteLine(stopwatch.Elapsed);
    }

Note that we are using a separate transaction per line, which means that we are doing a lot of extra work. But this simulates incoming events arriving one at a time very well. We were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of “channels”. From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have an effectively unlimited number of separate trees. That means that I can use as many trees as I want, and we can model a channel as a tree in Voron. I also changed things so that instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster.
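To make the shape of that change concrete, here is a minimal sketch of batching writes and keeping one sorted structure per channel. It is not Voron code: BatchedChannelWriter and all of its members are hypothetical names, and an in-memory SortedDictionary stands in for a tree. The only point is that Flush is the place where a real store would open and commit a single write transaction for the whole batch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for the storage layer: one sorted
// structure per channel, with writes buffered and applied in batches.
// None of these names come from Voron's API.
public sealed class BatchedChannelWriter
{
    private readonly int _batchSize;
    private readonly List<(string Channel, DateTime Time, double Value)> _pending =
        new List<(string Channel, DateTime Time, double Value)>();
    private readonly Dictionary<string, SortedDictionary<DateTime, double>> _channels =
        new Dictionary<string, SortedDictionary<DateTime, double>>();

    // Number of batches applied so far (one "transaction" per batch).
    public int CommitCount { get; private set; }

    public BatchedChannelWriter(int batchSize = 1000)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _batchSize = batchSize;
    }

    // Buffers one event and flushes automatically when the batch is full.
    public void Add(string channel, DateTime time, double value)
    {
        _pending.Add((channel, time, value));
        if (_pending.Count >= _batchSize)
            Flush();
    }

    // Applies all buffered entries at once; a real store would wrap
    // this loop in a single write transaction.
    public void Flush()
    {
        if (_pending.Count == 0) return;
        foreach (var (channel, time, value) in _pending)
        {
            if (!_channels.TryGetValue(channel, out var tree))
                _channels[channel] = tree = new SortedDictionary<DateTime, double>();
            tree[time] = value; // a later write for the same timestamp overwrites
        }
        _pending.Clear();
        CommitCount++;
    }

    // Entries of one channel in timestamp order, regardless of arrival order.
    public IEnumerable<KeyValuePair<DateTime, double>> Read(string channel)
    {
        return _channels.TryGetValue(channel, out var tree)
            ? tree
            : Enumerable.Empty<KeyValuePair<DateTime, double>>();
    }
}
```

The caller has to invoke Flush once more at the end of the input, otherwise the last partial batch is never applied.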

That done, I inserted the full data set, which has 1,096,384 records. That took 36 seconds. In the data set I have, there are 35 channels.
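Putting the three timings on the same scale makes them easier to compare. The helper below is only an illustration (Throughput and PerSecond are made-up names); it divides the event count by the elapsed seconds.

```csharp
using System;

public static class Throughput
{
    // Events per second for a run that handled `events` items in `seconds`.
    public static double PerSecond(long events, double seconds)
    {
        if (seconds <= 0) throw new ArgumentOutOfRangeException(nameof(seconds));
        return events / seconds;
    }
}
```

With the numbers from this post, Throughput.PerSecond(25000, 8.3) is roughly 3,000 events per second for a transaction per line, Throughput.PerSecond(25000, 0.8) is 31,250 for a transaction per 1,000 lines, and Throughput.PerSecond(1096384, 36) is roughly 30,500 for the full load. In other words, the batched rate holds up across the whole data set.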

I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That makes it practical to do things like computing averages over time, comparing data, etc.
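As an example of what a fast channel scan enables, here is a small sketch of averaging over fixed time windows. It assumes the channel's entries are available as (timestamp, value) pairs; ChannelStats and AverageBy are hypothetical names and not part of the code linked from this post.

```csharp
using System;
using System.Collections.Generic;

public static class ChannelStats
{
    // Groups (time, value) pairs into fixed-size windows and returns the
    // average value of each window, keyed by the window's start time.
    public static SortedDictionary<DateTime, double> AverageBy(
        IEnumerable<KeyValuePair<DateTime, double>> entries, TimeSpan window)
    {
        if (window <= TimeSpan.Zero) throw new ArgumentOutOfRangeException(nameof(window));

        var sums = new SortedDictionary<DateTime, (double Sum, int Count)>();
        foreach (var entry in entries)
        {
            // Round the timestamp down to the start of its window.
            long ticks = entry.Key.Ticks - entry.Key.Ticks % window.Ticks;
            var bucket = new DateTime(ticks, entry.Key.Kind);

            sums.TryGetValue(bucket, out var acc); // acc is (0, 0) when the bucket is new
            sums[bucket] = (acc.Sum + entry.Value, acc.Count + 1);
        }

        var result = new SortedDictionary<DateTime, double>();
        foreach (var pair in sums)
            result[pair.Key] = pair.Value.Sum / pair.Value.Count;
        return result;
    }
}
```

Calling ChannelStats.AverageBy(entries, TimeSpan.FromHours(1)) would give one average per hour. The input does not need to be sorted, since each entry is assigned to its window independently.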

You can see the code implementing this in the following link.

Posted By: Ayende Rahien

Comments

01/30/2014 03:27 PM by Scooletz

The very same results can be accomplished by writing each channel to a separate file in a directory, or by saving the entries in sstables, one for each channel. Both would produce similar results. Is the point you want to show that full collection scans in Voron do not take long?

01/30/2014 03:42 PM by Ayende Rahien

Scooletz, You can't do that in separate files / sstables. To start with, while generally true, you can't assume that the time series data is always increasing. Sometimes you get backdated data.

Next, sstables, in particular, require one run to build them. You can't add to them after the fact. (well, not without a whole lot of trouble).

In addition, doing this with one file / sstable per channel is going to result in really bad performance overall. Most file systems do not handle a lot of files very well.

01/30/2014 09:36 PM by Scooletz

I based my assumptions on the provided image, which shows a small fragment with increasing timestamps. My bad. If you consider one sstable for the whole load, that's certainly true. I was thinking of creating a file per channel and relying on increasing timestamps (see above). I'm aware of FS limitations. For the case you provided, I did the math assuming that one channel has ~10,000 entries, which, for a total of over 1 million, gives ~100 channels. That's feasible for any FS, I suppose. If the distribution is different, then my calculations are wrong. What I meant was that 1 million entries is not a big number, especially for measurements, event processing, etc. If we are talking about a test load of the db, that's OK; if we're talking about the measurements mentioned, it'd be nice to see much bigger numbers.

01/30/2014 09:40 PM by Ayende Rahien

Scooletz, You have access to the source, feel free to test it yourself. I'm playing around with what we can do with Voron, and I happened to have a 1 million records timeseries dataset handy. So I used that.