Ayende @ Rahien

Refunds available at head office

Time series feature design

I mentioned before that there is actually quite a lot that goes into a major feature, usually a lot more than appears on the surface. I thought that this could serve as a pretty good subject for a series of blog posts showing how we actually sit down and do the high level design of a feature in RavenDB.

I’ve actually already spoken about time series data, and we have seen an implementation of the backend. But that was just the basic storage, and to be honest, it was mostly there to explore how to deal with Voron in more scenarios.

Now, allow me to expand that into a full fledged feature design.

What we want: the ability to store time series data and work with it.

Time series data is a value (double) that is recorded at a specific time for a specific series. A good explanation of that can be found here: https://tempo-db.com/docs/modeling-time-series-data/

We have a bunch of sensors that report data; we want to store it, read it back, and get rollups of the information. We’ll probably do more, but that is enough for now.

At first glance, we need to support the following operations:

  • Add data (time, value) to a series.
  • Add bulk data to a set of series.
  • Read data (time, value) from a series or a set of them in a given range.
  • Read rollup (avg, mean, max, min) per period (minutes, hour, day, month) in a given range for a set of series.
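
To make that surface area concrete, here is a minimal in-memory sketch of those four operations. The class and method names below are purely illustrative, not the actual RavenDB API, and a real implementation would sit on top of Voron rather than a Python dictionary:

```python
from bisect import insort, bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timedelta


class TimeSeriesStore:
    """In-memory sketch of the four operations; real storage would sit on Voron."""

    def __init__(self):
        self._series = defaultdict(list)  # series name -> sorted list of (time, value)

    def add(self, series, time, value):
        insort(self._series[series], (time, value))

    def add_bulk(self, entries):
        # entries: iterable of (series, time, value) tuples
        for series, time, value in entries:
            self.add(series, time, value)

    def read_range(self, series, start, end):
        points = self._series[series]
        lo = bisect_left(points, (start,))
        hi = bisect_right(points, (end, float("inf")))
        return points[lo:hi]

    def rollup(self, series, start, end, period):
        """Aggregate avg/min/max/count per fixed period over [start, end]."""
        buckets = defaultdict(list)
        for time, value in self.read_range(series, start, end):
            buckets[int((time - start) / period)].append(value)
        return {
            start + i * period: {
                "avg": sum(vals) / len(vals),
                "min": min(vals),
                "max": max(vals),
                "count": len(vals),
            }
            for i, vals in sorted(buckets.items())
        }


ts = TimeSeriesStore()
ts.add("sensor/1", datetime(2014, 2, 11, 10, 0), 21.5)
ts.add_bulk([("sensor/1", datetime(2014, 2, 11, 10, 30), 22.1),
             ("sensor/1", datetime(2014, 2, 11, 11, 15), 22.9)])
print(ts.rollup("sensor/1", datetime(2014, 2, 11, 10, 0),
                datetime(2014, 2, 11, 12, 0), timedelta(hours=1)))
```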

In the next few posts, I’ll discuss storage, network, client API, user interface, and distribution design for this. I would truly love to hear from you about any ideas you have.

Comments

Rémi BOURGAREL
02/11/2014 10:51 AM

Is this linked to your interest in metrics-net?

Khalid Abuhakmeh
02/11/2014 01:35 PM

TempoDB makes the claim that although your data may be growing, your query time will remain constant. They claim they are able to take snapshots and do rollups automatically for you; that way, instead of adding up a million documents, you are adding 10 documents together. This reminds me of what EventStore does. That might be a feature you want to explore here. To sum it up, query time is important :)

Ayende Rahien
02/11/2014 03:53 PM

Remi, No, that is for doing metrics in RavenDB itself.

Ayende Rahien
02/11/2014 03:54 PM

Khalid, That is actually relatively easy to handle. All you need to do is automatic rollups based on the actual size of the data. You can do 1 : 10,000 or 1 : 100,000 rollups, and you'll get that behavior.
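
If I understand the suggestion correctly, a 1 : 10,000 rollup just means collapsing every 10,000 consecutive raw readings into one pre-aggregated summary; a toy sketch of that idea (the summary shape is my own assumption, not anything RavenDB-specific):

```python
def rollup_by_ratio(points, ratio):
    """Collapse every `ratio` consecutive (time, value) points, sorted by time,
    into one summary that keeps enough state (count/sum/min/max) to be merged
    again later."""
    summaries = []
    for i in range(0, len(points), ratio):
        chunk = points[i:i + ratio]
        values = [value for _, value in chunk]
        summaries.append({
            "start": chunk[0][0],
            "end": chunk[-1][0],
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        })
    return summaries
```

At a 1 : 10,000 ratio, a million raw readings shrink to a hundred summaries, which is what keeps the query cost roughly flat as the data grows.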

Khalid Abuhakmeh
02/11/2014 03:58 PM

Where would that responsibility lie? Would it be the database's responsibility, as a configurable option of the feature, or would it be my responsibility as a developer to write a tool to do that?

I only ask because although the rollups might be easy to create, taking them into account might be slightly trickier. For example: take 4 rollups, plus the rest as regular time series value documents, and then do the analysis.

That query behavior would be a nice abstraction, rather than having to reimplement it every time.

Ayende Rahien
02/11/2014 04:03 PM

Khalid, If I was building this? I would say that it is the DB's responsibility to do this. It should be relatively easy to handle in general, especially since, most of the time, a time series db is forward only. You get little to no past updates. That makes it a lot simpler.

As for queries, that is effectively just doing map/reduce over the whole set. It is pretty easy to merge them.
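
The merge is trivial as long as each rollup keeps count/sum/min/max rather than the finished average; a rough sketch of that reduce step (function names are mine, purely illustrative):

```python
def as_summary(value):
    # A raw point that has not been rolled up yet fits the same shape.
    return {"count": 1, "sum": value, "min": value, "max": value}


def merge_summaries(summaries):
    """Reduce step: combine pre-computed rollups and raw-tail points into one aggregate."""
    summaries = list(summaries)
    if not summaries:
        return {"count": 0, "avg": None, "min": None, "max": None}
    count = sum(s["count"] for s in summaries)
    return {
        "count": count,
        "avg": sum(s["sum"] for s in summaries) / count,
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }
```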

Khalid Abuhakmeh
02/11/2014 04:09 PM

Agreed. It would be cool to see this make it into RavenDB (or maybe even a spinoff project). With Voron and the ability to run embedded on multiple platforms, I could see mobile device developers and embedded systems developers wanting something like this. It would be great for Home Automation, Production facilities, and Enterprise Monitoring. Throw in a Dashboard sample and most people would be sold :)

Andrew Bryson
02/11/2014 04:27 PM

I'm looking forward to this blog series!

I've just started doing something very similar using RavenDB, where I want to provide an API onto time:value data I'll scrape from UK Environment Agency river level gauges (http://www.environment-agency.gov.uk/homeandleisure/floods/riverlevels/riverstation.aspx?StationId=7073&sensor=D) - there's lots of flooding around here at the moment!

Ayende Rahien
02/11/2014 04:28 PM

Andrew, Is there a lot of data there that we can play with?

Andrew Bryson
02/11/2014 04:41 PM

Not that I'm aware of, hence me planning to screen scrape frequently to get the current value for the hour.

But looking around I found:

http://www.geostore.com/environment-agency/WebStore?xml=environment-agency/xml/ogcDataDownload.xml, which has:

'Surface Water Temperature Archive up to 2007' (879 MB).

Reading http://www.geostore.com/environment-agency/WebStore?xml=environment-agency/xml/dataLayers_SWTA.xml I think it might be Access 2003 formatted...

hpcd
02/11/2014 10:26 PM

Hi Ayende,

I worked with embedded systems in the past and typically we've used data sources called "historians". There are many: PI, eDNA, InfoPlus, etc. Another thing that is painful is that people often want to transform the data from the (narrow) time series format:

    timestamp1, tag1, quality1, value1
    timestamp1, tag2, quality1, value2
    timestamp1, tag3, quality1, value3
    timestamp2, tag1, quality1, value11
    timestamp2, tag2, quality1, value22
    timestamp2, tag3, quality1, value33

to the (wide, tabular) format:

           tag1  tag2  tag3
    time1  v1    v2    v3
    time2  v11   v22   v33

Nice if you can build this into the feature.
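
For what it's worth, that narrow-to-wide pivot is easy enough to express client-side; a minimal sketch, assuming the narrow rows come back as (timestamp, tag, quality, value) tuples:

```python
from collections import defaultdict


def pivot_to_wide(rows):
    """Turn narrow (timestamp, tag, quality, value) rows into one row per
    timestamp with a column per tag; tags missing at a timestamp become None."""
    tags = sorted({tag for _, tag, _, _ in rows})
    by_time = defaultdict(dict)
    for timestamp, tag, _quality, value in rows:
        by_time[timestamp][tag] = value
    header = ["time"] + tags
    table = [[t] + [by_time[t].get(tag) for tag in tags] for t in sorted(by_time)]
    return header, table
```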

Hpcd
02/11/2014 10:28 PM

Formatting gone bad :) What formatting tags do you support?

Brent Seufert
02/12/2014 04:52 PM

Here would be an interesting time series data source... NOAA Hourly Climate Data. I thought a while ago to give it a go in RavenDB and compare to their use of SAP HANA, but alas, the time has not surfaced.

I bookmarked this post http://wattsupwiththat.com/2013/11/09/virginia-is-for-warmers-data-says-no/ which leads to this article http://scn.sap.com/community/lumira/blog/2013/11/04/big-data-geek--finding-and-loading-noaa-hourly-climate-data--part-1 containing the ftp link to the data.

Quote from WUWT post: Here are the facts!

  • 500,000 uncompressed sensor files and 500GB

  • 335GB of CSV files, once processed

  • 2.5bn sensor readings since 1901

  • 82GB of Hana Data

  • 31,000 sensor locations in 288 countries

Combines sensor and location data.

Ayende Rahien
02/12/2014 05:09 PM

Brent, Thanks, I'll be going over that.

Pete
02/12/2014 08:23 PM

@Ayende, I disagree and don't think you can make the assumption that the data will always be forward only.

In some cases here in Alberta, internet connectivity is not always available, especially in some of the oil patches, so imagine the following situation.

Two site supervisors collect the time metrics when on shift, and they both have their own laptops with data storage. Supervisor A finishes work and goes back to camp (no internet). Supervisor B takes over his shift, and when he finishes, his time in the field is done, so he goes home. Once Supervisor B reaches an area with internet access, his data gets submitted. Supervisor A finishes his time in the field and leaves later, but his data is both older and newer than Supervisor B's already submitted data, so the time series data cannot be assumed to arrive in order when synced.

The server at the company HQ would be the place where the analysis of the time series data takes place. An example of these metrics might be pressure/flow through a pipe over a specific period. Admittedly, the time entry submissions could be placed on hold until the previous entries are submitted, or processes could be changed to accommodate this; I just think forward-only is a dangerous assumption, especially if we consider disconnected architectures.

Ayende Rahien
02/13/2014 03:19 AM

Pete, That is why I said "most of the time". Yes, you need to handle out-of-order data, but most of the time, you do not. That means that it is much easier to handle such things as optimizing ranges, etc.
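
One cheap way to keep the forward-only fast path while still tolerating the occasional late arrival is to detect it on write and mark only the covering rollup as stale; a rough sketch of that idea (mine, not how RavenDB handles it):

```python
from bisect import insort


def add_point(points, dirty_buckets, time, value, bucket_of):
    """points: list of (time, value) kept sorted by time.
    dirty_buckets: set of rollup bucket keys that need recomputation.
    bucket_of: maps a timestamp to its rollup bucket key."""
    if not points or time >= points[-1][0]:
        points.append((time, value))        # the common, forward-only fast path
    else:
        insort(points, (time, value))       # rare out-of-order arrival
        dirty_buckets.add(bucket_of(time))  # only this rollup goes stale
```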

Bevan
02/13/2014 05:43 AM

Should we assume the posts are a demo, or is this for a real feature?

Why? Because time series data is often not a double value - the values in a series might be money, requiring accuracy to 0.001c (where a double introduces a potential for rounding error) or might be non-numeric, like a Standard and Poor's credit rating recorded each month.

With an S&P series, asking for an average might not make sense, but both Min & Max can be useful.

Also interested to see how you handle missing data vs null observations vs blanks vs zeros.

Ayende Rahien
02/13/2014 05:48 AM

Bevan, This is a good feature to discuss, mostly because it is simple conceptually, but is wide enough to go into details. For performance reasons, it is usually best to fix the size of the value. Decimal is 16 bytes, and S&P is usually one character, IIRC. Those things can be added on relatively easily, though.
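
To make the rounding concern concrete, a quick illustration, with Python's `Decimal` standing in here for a 16-byte fixed-point decimal:

```python
from decimal import Decimal

# Binary doubles cannot represent most decimal fractions exactly,
# so money-style values pick up rounding error:
print(0.1 + 0.2)                                          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                                   # False

# A fixed-point decimal keeps the value exact:
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```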

Alastair
02/17/2014 08:38 AM

Earlier commentators have mentioned the use of swinging door compression with data historians. Related to this, some systems would attempt to infill missing data via interpolation or some other method if no datapoint was received within an expected period (the interval value and replacement method were configured on a per-series basis). The historian we used would store an 8-bit integer value per datapoint that indicated whether the recorded value represented an actual value or was interpolated. There were other status codes apart from this, but I don't recall what they were exactly. I think they did all indicate that there was something unusual about the data quality, though.

Not sure if the above is out of scope, but you may want to consider it.
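
For context, the gap filling described above looks roughly like this (a sketch with made-up names; the 0/1 quality codes mirror the "actual vs interpolated" flag):

```python
ACTUAL, INTERPOLATED = 0, 1


def fill_gaps(points, interval):
    """points: (time, value) pairs sorted by time. Wherever consecutive points
    are more than `interval` apart, insert linearly interpolated values tagged
    with a quality flag, like the historian described above would store."""
    filled = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        filled.append((t0, v0, ACTUAL))
        t = t0 + interval
        while t < t1:
            fraction = (t - t0) / (t1 - t0)
            filled.append((t, v0 + (v1 - v0) * fraction, INTERPOLATED))
            t += interval
    if points:
        filled.append((points[-1][0], points[-1][1], ACTUAL))
    return filled
```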

Piers Lawson
03/05/2014 11:02 PM

How would you see us working with more complex objects? For example, if I wanted to track my investment in a fund over time, I might have a number of related data items: the fund's unit price, the number of units I own, and the percentage breakdown of the fund into different asset types (e.g. today it is 10% invested in UK Stocks and 90% in American Bonds, then tomorrow the split changes). I could imagine the unit price and the number of units being the equivalent of two "sensors" that I must remember are related to each other, but the asset type split is more difficult unless I have a "sensor" for each possible asset type.

I also like Bevan's question regarding type... decimal would be better for the types of data I track... so being able to dictate the type would be useful.

A couple of other things to consider... if I take my example of tracking my investment in a fund, I might also want to track my investment in a policy, where the policy is actually invested in a number of funds, with the amount in each fund changing over time. Again, this could be achieved with more streams of data... but I may want to aggregate across that product: queries such as "let me see how I've been invested in each asset type across all the funds in this product over time" (even though the raw data has been recorded at a fund-by-fund level).

This then leads on to scalability... not only in terms of the length of each data stream, but also in terms of the number of streams. For example, I may now want to track the performance of a million policies, so I have a million times the number of streams it takes to represent a single policy, each of which will be updated every day. This would be a far wider set of data than it is long! Or are you at this point thinking... that's not what I meant it to be used for?!?

Ayende Rahien
03/06/2014 08:00 AM

Piers, The current thinking is that you are usually looking at recording sensor data, or stock data, not trying to track complex data structures. For such complex structures, you would use RavenDB.

Comments have been closed on this topic.