Ayende @ Rahien

Refunds available at head office

Time series feature design: System behavior

It is easy to forget that a database isn’t just about storing and retrieving data. A lot of work goes into the actual behavior of the system beyond the storage aspect.

In the case of time series database, just storing the data isn’t enough, we very rarely actually want to access all the data. If we have a sensor that send us a value once a minute, that comes to 43,200 data points per month. There is very little that we actually want to do for that amount of data. Usually we want to do things over some rollup of the data. For example, we might want to see the mean per day, or the standard deviation on a weakly basis, etc.

We might also want to do some down sampling. By that I mean that we take a series whose value is stored on a per minute / second basis and we want to store just the per day totals and delete the old data to save space.

The reason that I am using time series data for this series of blog posts is that there really isn’t all that much that you can do for a time series data, to be honest. You store it, aggregate over it, and… that is about it. Users might be able to do derivations on top of that, but that is out of scope for a database product.

Can you think about any other behaviors that the system needs to provide?

Comments

Jakub Konecki
02/14/2014 11:37 AM by
Jakub Konecki

Just some wild ideas:

  1. Querying I want to get all the points in a series where the value exceeded certain value.

  2. Aggregating multiple series I want to have a daily average of all series for sensors in certain area.

  3. Notifications I want to be notified when a moving average of last x minutes exceeds certain value.

Patrick Huizinga
02/14/2014 12:18 PM by
Patrick Huizinga

I suspect a certain pattern is an early indicator of a failure. When did this pattern occur before and how often was it actually followed by a failure?

This means you would need to keep all data. Yes, I specifically choose this as an argument against your aggregate & purge. :P

JPWKeeper
02/14/2014 12:50 PM by
JPWKeeper

As someone who works in the Telemetry industry and whose every day is consumed with processing and handling time series data, I can tell you that there are endless things that people want to do with that data.

My examples are flavored with flight test, but I've seen the exact same things in other areas.

Generally a single data file we have to be able to process can be in the 20s of GB, which may contain up to 60,000 different measurements (but 10-30K measurements is more common). A "slow" measurement is usually around once per second, and that can be for up to 8 hours. 5-10Hz is more common for most quality parameters (airspeed, altitude, etc). However, for some types of data like Vibration data the rate can be much, much higher. I've seen up to 100KHz, but that's rare. And you have to keep every single point. Most of our customers do various things such as Flutter Analysis, FFT computations to break out resonant frequencies, etc. If you decimate the data, you can fundamentally alter how those computations are going to work thus giving false results.

Prior comments are correct, though. You do need to be able search for pre-existing patterns that may indicate a failure, and those searches won't necessarily be against one single measurement. For example, and this is an extremely simple example, you may need to show when an aircraft exceeded 500 knots while below 10,000 feet, which is known as "Speeding". You can't always do that with aggregates.
You also need to break portions of the data out into files that applications like Matlab can ingest for further analysis.

It also helps if you can transform that data using customizable equations, not so much for analysis, but to help the customer identify issues that aren't always easily visible in the raw data stream.

You are not wrong that data aggregates are valuable for visualizing the big picture, but that doesn't mean you can throw out the rest of the data. I've also found that most people still want it to look like it's not aggregated, but want to look at an entire data run in one screen, so that is the level we generally do our aggregates (about 1,000 aggregates for each measurement of a single data run usually does the trick). Even that can get big, though, depending on how many measurements you are aggregating.

Admittedly, this may be a bit heavier than what you are writing your app for. We don't put our data into databases because of its sheer size.

kvleeuwen
02/14/2014 01:34 PM by
kvleeuwen

A query like 'give me the last X readings of sensors Y1, Y2 before moment Z'. A query that reports changes in a time series. Bulk update to correct clock skew of a sensor. Also, not sure whether it is in scope, interpolation to get 10 minute readings of a sensor that reports intermittently between 5-7 minutes. That's a can of worms though since there are many ways to do that.

Ayende Rahien
02/14/2014 02:17 PM by
Ayende Rahien

@Jakub, 1) What is the point in doing that, and in what time frame? 2) That is just aggregating a single series, in a batch, isn't it? 3) That might be a very good thing, yes.

@Patrick, You don't have to purge, but if you have 3 years of minute by minute data, you aren't going to look at it very often, except in aggregate.

JWPKeeper, That is facinating. Thanks for the detailed information.

Regarding patterns, how would you detect them? For that matter, what is the role of the database in such a scenario? Except serving up the data? I haven't consider such cases as what you are describing, very high frequency over relatively small time.

If you could put this in a database, what would you like to see it do?

kvleeuwen, - last X reading for sensor Y before Z is an interesting query, should be easy to do. - what do you mean, changes in a time series? Values out of a certain bound? Out of a moving average? - bulk update - probably not going to do that, easier to do a range delete & import. - interpolation - that is something that probably should be done out side the DB

Carsten Hansen
02/14/2014 02:52 PM by
Carsten Hansen

Provide graph and statistical package like SAS/SPSS or at least a good interface.

Outlier dectection as well as simple measures such as Max/Min/Median/Quartils/Mean/StdVar/R2 might be included as standard.

It might be faster to embed a statistical package than trying to import/export to another system.

There are already many statistical packages even Open Source. See http://en.wikipedia.org/wiki/Listofstatistical_packages

The DAP - GNU Project might be of interest: http://www.gnu.org/software/dap/

Jakub Konecki
02/14/2014 02:55 PM by
Jakub Konecki

AD 1) This is similar requirement to 3). For example, I want to find out when my CPU, memory, HTTP traffic spikes so I can scale.

AD 2) Yes, but obviously needs explicit support in API so one can pass a list of series to rollup.

AD 3) I'm thinking of 'Greg Young's GetEventStore like' projections that create new series by projecting existing series or rollups.

James Farrer
02/14/2014 04:05 PM by
James Farrer

This is really exciting work Ayende - Historians in the manufacturing industry are all about time series data and they suck - enterprise, expensive, clunky etc.

You might like to check out OpenHistorian on code plex and the GSF (Grid Solutions Framework) which is a bunch of open source projects created by some people in the power industry and is all about time series data.

Jacobs point 1 I see as essential - you want to look at when values for speed, temperature, pressure had exceeded set points.

Really excited about this and Voron / RavenDB and the future

kvleeuwen
02/14/2014 04:50 PM by
kvleeuwen

what do you mean, changes in a time series?

Hmm, that was a poorly formulated one :) I meant a 'delta' projection where the numerical difference between two adjacent values is returned. So when you have readings of 3,6,5 the delta query returns +3, -1. Even better to have the delta of the ticks too.

This is trivial to do client-side but if you're building notifications, sudden jumps in sensor values are interesting too.

Piers Lawson
02/14/2014 07:21 PM by
Piers Lawson

When you are considering allowing the mean to be found, don't forget the other averages which can be interesting as well, Mode and especially Median.

Doug Paice
02/17/2014 04:18 PM by
Doug Paice

In our system we receive periodic data from meters (gas, electric, water etc) and it's not particularly frequent, minute level. Most of the meters give us an accumulator value of usage and we have to subtract a previous value from the current to determine the power usage over time. Our most common calculation is to find the latest value in each period, and perform a rolling calculation subtracting each previous period from the current.

Given all that 2 features that would be useful for us would be "for a given periodicity, give me the latest value and the time it was recorded" and something that easily allows us to access the previous value.. sort of a windowed aggregate. The other thing is being able to project 2 series onto each other, joining based on time period.

The alerts side would be very helpful, but the alerts but you'd want to have the alerts based on a view or projection of the arriving data.

Ayende Rahien
02/18/2014 08:34 AM by
Ayende Rahien

Doug, We actually already did data joining from multiple series. I am not sure that I understand the first request, though.

Isn't this just an issue of "give me the first value previous to time X", storing that, then moving on?

Comments have been closed on this topic.