Ayende @ Rahien


Time series feature design: Querying over large data sets


What happens when you want to run an aggregation query over a very large data set? Let us say that you have 1 million data points within the range you want to query, and you want to get a rollup of all that data at a weekly granularity.

As it turns out, there is a fairly simple solution for optimizing this and maintaining a roughly constant query speed, regardless of the actual size of the data set.

Time series data has the great benefit of being easily aggregated. Most often, the data looks like this:

[image: a sample of the raw time series data]
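A minimal sketch of that shape, assuming the usual timestamp/value pairs (the exact columns in the image may differ):

```python
# Raw time series data: one (timestamp, value) entry per minute.
# The values here are made up purely for illustration.
from datetime import datetime, timedelta

start = datetime(2014, 2, 1)
data_points = [(start + timedelta(minutes=i), float(i % 100)) for i in range(1000000)]

# data_points[0]  -> (datetime(2014, 2, 1, 0, 0), 0.0)
# data_points[-1] -> roughly two years later
```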

The catch is that you have a lot of it.

The set of aggregations that you can run over the data is also relatively simple. You have mean, max, min, standard deviation, etc.
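Part of why these particular aggregates are easy to roll up (my own gloss, not something spelled out here) is that each of them can be rebuilt from a small per-bucket summary; a sketch:

```python
import math

def summarize(values):
    """A per-bucket summary: count, total, total of squares, min, max."""
    return (len(values), sum(values), sum(v * v for v in values), min(values), max(values))

def merge(a, b):
    """Combine two bucket summaries into one covering both ranges."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2], min(a[3], b[3]), max(a[4], b[4]))

def stats(summary):
    """Recover mean, min, max and standard deviation from a summary alone."""
    count, total, total_sq, lo, hi = summary
    mean = total / count
    variance = max(total_sq / count - mean * mean, 0.0)
    return {"mean": mean, "min": lo, "max": hi, "std": math.sqrt(variance)}

# Aggregating summaries gives the same answer as aggregating the raw values:
# stats(merge(summarize([1.0, 2.0]), summarize([3.0, 4.0])))
# == stats(summarize([1.0, 2.0, 3.0, 4.0]))
```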

The time ranges are also pretty fixed, and the nice thing about time series data is that the bigger the range you want to go over, the bigger your rollup window is. In other words, if you want to look at things over a week, you would probably use a day or hour rollup. If you want to see things over a month, you would use a week or a day; over a year, a week or a month; and so on.
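As a rough sketch of that rule of thumb (the cutoffs below are my own guesses, not anything prescribed here):

```python
from datetime import timedelta

def pick_rollup_window(query_range: timedelta) -> timedelta:
    """Pick a rollup granularity that keeps the number of buckets small."""
    if query_range <= timedelta(days=7):
        return timedelta(hours=1)   # a week  -> hourly buckets (168 of them)
    if query_range <= timedelta(days=31):
        return timedelta(days=1)    # a month -> daily buckets  (~31 of them)
    return timedelta(days=7)        # a year+ -> weekly buckets (~52 of them)
```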

Let us assume that the cost of aggregation is 10,000 operations per second (just a number I pulled because it is round and nice to work with; the real number is likely several orders of magnitude higher). Now say we have to run this over a set that is 1 million data points in size, with the data being entered on a per-minute basis. With 1 million data points, we are going to wait almost two minutes for the reply (1,000,000 / 10,000 = 100 seconds). But there is really no need to actually check all those data points manually.

What we can do is prepare, ahead of time, rollups on an hourly basis. That gives us a per-hour summary of just over 16 thousand data points (1,000,000 minutes is roughly 16,667 hours), and a query that runs in under 2 seconds. If we also do a daily rollup, we move from a million data points to fewer than a thousand.
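A minimal sketch of what that looks like, with in-memory dictionaries standing in for whatever storage would actually hold the precomputed rollups:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Raw data: one point per minute, about 1 million points in total.
start = datetime(2014, 2, 1)
raw = [(start + timedelta(minutes=i), float(i % 100)) for i in range(1000000)]

def bucket_of(ts, size):
    """Truncate a timestamp down to the start of its rollup bucket."""
    if size == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)   # "day"

def build_rollup(points, size):
    """Precompute (count, sum, min, max) per bucket, ahead of query time."""
    buckets = defaultdict(lambda: [0, 0.0, float("inf"), float("-inf")])
    for ts, value in points:
        b = buckets[bucket_of(ts, size)]
        b[0] += 1
        b[1] += value
        b[2] = min(b[2], value)
        b[3] = max(b[3], value)
    return dict(buckets)

hourly = build_rollup(raw, "hour")   # ~16,667 summaries instead of 1,000,000 points
daily = build_rollup(raw, "day")     # ~695 summaries

# A query over the whole range now scans summaries only, e.g. the overall mean:
total = sum(b[1] for b in daily.values())
n = sum(b[0] for b in daily.values())
mean = total / n
```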

Actually maintaining those computed rollups would probably be a bit complex, but it won't be any more complex than how we compute map/reduce indexes in RavenDB (this is a special, simplified case of map/reduce). And the result would be instant query times, even on very large data sets.
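The post doesn't spell out the maintenance scheme, but a simplified sketch of the map/reduce-style idea (a write only dirties its own bucket, and a stale bucket is recomputed lazily, on the next query that touches it) might look like this:

```python
from datetime import datetime

def hour_bucket(ts):
    return ts.replace(minute=0, second=0, microsecond=0)

class HourlyRollup:
    """Per-hour summaries; a write only invalidates the bucket it falls into."""

    def __init__(self):
        self.raw = {}        # bucket start -> list of (timestamp, value)
        self.summary = {}    # bucket start -> (count, sum, min, max)
        self.dirty = set()   # buckets whose summary is stale

    def write(self, ts, value):
        b = hour_bucket(ts)
        self.raw.setdefault(b, []).append((ts, value))
        self.dirty.add(b)    # no recomputation at write time

    def query(self, start, end):
        # Lazily recompute only the stale buckets that this query touches.
        for b in [d for d in self.dirty if start <= d <= end]:
            values = [v for _, v in self.raw[b]]
            self.summary[b] = (len(values), sum(values), min(values), max(values))
            self.dirty.discard(b)
        return {b: s for b, s in self.summary.items() if start <= b <= end}

# Writing a value into an old hour dirties that one bucket only; everything
# else stays untouched until a query actually needs it.
```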

More posts in "Time series feature design" series:

  1. (04 Mar 2014) Storage replication & the bee’s knees
  2. (28 Feb 2014) The Consensus has dRafted a decision
  3. (25 Feb 2014) Replication
  4. (20 Feb 2014) Querying over large data sets
  5. (19 Feb 2014) Scale out / high availability
  6. (18 Feb 2014) User interface
  7. (17 Feb 2014) Client API
  8. (14 Feb 2014) System behavior
  9. (13 Feb 2014) The wire format
  10. (12 Feb 2014) Storage

Comments

Khalid Abuhakmeh

I love Rollups, especially the red ones that leave a tattoo on your tongue.... oh wait that's not what you're talking about here. Still equally cool. I can't wait to see a working sample soon :)

Ryan

Doing agg rollups at write-time is a great optimization for relatively simple datasets. 1 timeseries w/ 1MM datapoints is no problem. How about 1MM timeseries, each w/ 1MM datapoints and a handful of other meta-fields that users will want to aggregate over? This is closer to a real-world scenario (albeit a fairly large one) and makes write-time aggregates nearly impossible to maintain.

Carsten Hansen

I hope you add data to the end, else there might be a lot of cascades when adding old time values. One old value and you need to change all the aggs after that.

Ayende Rahien

Ryan, You are still going to be doing work on each of them in turn. Having 1 million of them isn't any different. Each timeseries is handled independently.

Ayende Rahien

Carsten, Yes, you typically add at the end of the series, but you have to be ready to handle cases where you have an updated value midway through. That isn't really a problem, because you are just splitting the data based on a certain [start, end] range - when the data for that range changes, you recalculate it and only it. You can even do that lazily, so you won't have to do it at write time.

