Time series feature design: Querying over large data sets

What happens when you want to do an aggregation query over a very large data set? Let us say that you have 1 million data points within the range you want to query, and you want to get a rollup of all the data in the range of a week.

As it turns out, there is a relatively simple solution for optimizing this and maintaining a roughly constant query speed, regardless of the actual size of the data set.

Time series data has the great benefit of being easily aggregated. Most often, the data looks like this:

[Image: a sample of time series data]
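
The original image is not reproduced here, but time series data of this kind is typically just timestamp/value pairs. A minimal sketch of that shape, assuming per-minute samples (the layout is an illustration, not taken from the post):

    from datetime import datetime

    # Each data point is simply a timestamp and a value, sampled once a minute.
    data_points = [
        (datetime(2014, 2, 20, 10, 0), 41.2),
        (datetime(2014, 2, 20, 10, 1), 41.5),
        (datetime(2014, 2, 20, 10, 2), 40.9),
        # ... one entry per minute, for millions of entries
    ]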

The catch is that you have a lot of it.

The set of aggregations that you can do over the data is also relatively simple: mean, max, min, standard deviation, etc.

The time ranges are also pretty fixed, and the nice thing about time series data is that the bigger the range you want to go over, the bigger your rollup window is. In other words, if you want to look at things over a week, you would probably use a day or hour rollup. If you want to see things over a month, you will use a week or a day; over a year, you’ll use a week or a month, etc.
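
A sketch of that range-to-rollup mapping might look like the following (the exact thresholds are my own illustrative choices, not anything specified in the post):

    from datetime import timedelta

    def pick_rollup_window(query_range: timedelta) -> timedelta:
        # The wider the queried range, the coarser the rollup we can afford to use.
        if query_range <= timedelta(days=7):
            return timedelta(hours=1)   # up to a week: hourly (or daily) rollups
        if query_range <= timedelta(days=31):
            return timedelta(days=1)    # up to a month: daily (or weekly) rollups
        return timedelta(days=7)        # a year or more: weekly (or monthly) rollups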

Let us assume that the cost of aggregation is 10,000 operations per second (just some number I pulled because it is round and nice to work with; the real number is likely several orders of magnitude higher), and that we have to run this over a set that is 1 million data points in size, with the data being entered on a per minute basis. With 1 million data points, we are going to wait almost two minutes for the reply. But there is really no need to actually check all those data points manually.

What we can do is actually prepare, ahead of time, the rollups on an hourly basis. That gives us a summary on a per hour basis of just over 16 thousand data points, and will result in a query that runs in under 2 seconds. If we also do a daily rollup, we move from having a million data points to less than a thousand.
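
To make the numbers concrete, here is a rough sketch of precomputing hourly rollups, with the cost comparison from the paragraphs above (the bucket layout and aggregate fields are assumptions for illustration, not RavenDB's actual implementation):

    from collections import defaultdict
    from datetime import datetime

    def hour_bucket(ts: datetime) -> datetime:
        # Truncate a timestamp to the start of its hour.
        return ts.replace(minute=0, second=0, microsecond=0)

    def build_hourly_rollups(data_points):
        # Group raw per-minute points into hourly buckets and precompute
        # the aggregates we care about (count, sum, min, max).
        buckets = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                       "min": float("inf"), "max": float("-inf")})
        for ts, value in data_points:
            b = buckets[hour_bucket(ts)]
            b["count"] += 1
            b["sum"] += value
            b["min"] = min(b["min"], value)
            b["max"] = max(b["max"], value)
        return buckets

    # Rough cost comparison at 10,000 aggregation operations per second:
    #   raw minutes:    1,000,000 / 10,000  ~= 100 seconds
    #   hourly rollups: 1,000,000 / 60      ~= 16,667 buckets -> ~1.7 seconds
    #   daily rollups:  1,000,000 / 1,440   ~=    695 buckets -> effectively instant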

Actually maintaining those computed rollups would probably be a bit complex, but it won’t be any more complex than how we are computing map/reduce indexes in RavenDB (this is a special, and simplified, case of map/reduce). And the result would be instant query times, even on very large data sets.
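
A minimal sketch of that maintenance step, assuming an out-of-order write only invalidates the single bucket whose time range covers it (this mirrors the recalculation described in the comments below; it is not RavenDB's actual code):

    def on_point_changed(ts, data_points, rollups):
        # Only the hourly bucket covering this timestamp is affected, so we
        # recompute just that one bucket instead of the whole series.
        bucket_start = ts.replace(minute=0, second=0, microsecond=0)
        values = [v for t, v in data_points
                  if t.replace(minute=0, second=0, microsecond=0) == bucket_start]
        rollups[bucket_start] = {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        # This recomputation could also be deferred (mark the bucket dirty)
        # and done lazily at query time instead of at write time.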

Comments

02/20/2014 12:32 PM by Khalid Abuhakmeh

I love Rollups, especially the red ones that leave a tattoo on your tongue.... oh wait that's not what you're talking about here. Still equally cool. I can't wait to see a working sample soon :)

02/21/2014 02:05 AM by Ryan

Doing agg rollups at write-time is a great optimization for relatively simple datasets. 1 timeseries w/ 1MM datapoints is no problem. How about 1MM timeseries each w/ 1MM datapoints and a handful of other meta-fields that users will want to aggregate over? This is closer to a real-world scenario (albeit a fairly large one) and makes write-time aggregates nearly impossible to maintain.

02/21/2014 06:53 AM by Carsten Hansen

I hope you add data to the end, else there might be a lot of cascades when adding old time values. One old value and you need to change all aggs after that.

02/21/2014 12:27 PM by Ayende Rahien

Ryan, You are still going to be doing work on each of them in turn. Having 1 million of them isn't any different. Each timeseries is handled independently.

02/21/2014 12:28 PM by Ayende Rahien

Carsten, Yes, you typically add at the end of the series, but you have to be ready to handle cases where you have an updated value midway through. That isn't really a problem, because you are just splitting the data based on a certain [start,end] - when the data for that range changes, you recalculate it and only it. You can even do that lazily so you won't have to do that at write time.
