Ayende @ Rahien


Time series feature design: Client API

We have gone over the system behavior, the wire protocol, and how we actually store the data on disk. Now, let us talk about the actual client API. The entry point is going to be the TimeSeries class, which will have the following behavior:

Stateless operations:

  • Queries:
    • timeSeries.Query("sensor1.heat", "sensor1.flow")
         .Range(start, end)
         .Rollup(Rollup.Weekly)
         .Aggregation(AggregateBy.Max, AggregateBy.Min, AggregateBy.Mean);
    • timeSeries.SeriesBy("temp:C");
  • Operations:
    • timeSeries.Delete("sensor1.heat", start, end);
    • timeSeries.Tag("sensor1.heat", "temp:C");

Those types of operations have no state. They require nothing beyond knowing where the server is located, and they can be executed immediately. The returned results aren’t tracked or managed by us in any way, so there is no need for a session.

Stateful operation - The only stateful operation we have (at least so far) is adding data to the database. We do that using the connection abstraction. This is very close to the actual on-the-wire representation, which is always good. We have something like:

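    // `time` is the starting timestamp and `value` the base reading; both come from the caller.
    using (var con = timeSeries.OpenConnection(waitForServerFlush: true))
    {
        // Every Add inside this scope targets a single series.
        using (var series = con.AddToSeries("sensor1.heat"))
        {
            for (var i = 0; i < 100; i++)
            {
                series.Add(time.AddMinutes(i), value + i);
            }
        }
    }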

This is a bit of an awkward API, but it serves a purpose: it is very close to the on-wire format, and it is optimized for performance, not for being nice.

We can also have:

con.Add("sensor1.heat", time, value);

But if you are mixing things up (add sensor1.heat, sensor1.flow and then sensor1.heat again, etc.), it probably won’t be as efficient. It is important to be able to expose those optimizations all the way from the disk to the wire to the client API. Most times, they don’t matter, which is why we have the higher level API, but when they do, they really do.
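To make the cost of mixing series concrete, here is a minimal sketch. It assumes the writer has to emit a series header every time the series name changes from the previous entry. That framing is an assumption for illustration, not the actual wire format, and WireSketch and CountSeriesHeaders are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class WireSketch
{
    // Counts how many series headers a writer would emit if it wrote a
    // header whenever the series name differs from the previous entry.
    // (Assumed framing, for illustration only.)
    public static int CountSeriesHeaders(IEnumerable<string> seriesNames)
    {
        int headers = 0;
        bool started = false;
        string current = "";
        foreach (var name in seriesNames)
        {
            if (!started || name != current)
            {
                headers++;
                current = name;
                started = true;
            }
        }
        return headers;
    }
}

public static class Program
{
    public static void Main()
    {
        var grouped = new[] { "sensor1.heat", "sensor1.heat", "sensor1.flow", "sensor1.flow" };
        var interleaved = new[] { "sensor1.heat", "sensor1.flow", "sensor1.heat", "sensor1.flow" };

        Console.WriteLine(WireSketch.CountSeriesHeaders(grouped));     // prints 2
        Console.WriteLine(WireSketch.CountSeriesHeaders(interleaved)); // prints 4
    }
}
```

Under that assumption, the same four values cost twice as many headers when interleaved, which is the kind of overhead the AddToSeries scope avoids.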

And… this is pretty much it.

The API will probably be an async one, to keep up with the times, but those are pretty much the high level things that we have here.

Comments

02/17/2014 12:33 PM by Khalid Abuhakmeh

The client API for tags doesn't make a lot of sense to me. You will rarely ever add tags on the fly, instead you are likely to just add all of them at the same time at creation time. "timeSeries.Tags.Add(key, value)" or "timeSeries.Tags.AddRange(Dictionary)". You said "This is a bit of an awkward API", well there is no need to be that awkward :)

02/17/2014 12:56 PM by Ayende Rahien

Khalid, you cannot assume that users will first create the series, then add values. Instead, we allow the series to be created implicitly, just by using it.

The awkward client API I was referring to was the batch operation stuff.

02/17/2014 01:08 PM by Khalid Abuhakmeh

I see what you are saying, you want series to "come online" even if they were never explicitly created. I guess to me it would still be useful not to have the string "temp:C" as a tag, but instead have a key value pair of "temp" and "C".

You might also want a metadata dictionary on a time series that you could pull and use for processing. Things you might store in metadata include Coordinates, sensor Id (External Database Id), Owner, Etc... You might never group by them, but you might pull the metadata and do something with the data.

A scenario for metadata is "A series hit a threshold, now notify the customer that they hit it."

Overall I like the API and it looks promising. Interested to read the rest of the posts.

02/17/2014 01:28 PM by Ayende Rahien

Khalid, It is much easier to handle tags if they are just arbitrary strings that the user brings meaning to. With conventions of temp:C, temp:F, etc.

The problem with "metadata" is that the moment you start doing that, you are starting to talk about doing more and more complex things. In that case, put the actual series behavioral aspects in a RavenDB document, and just use the timeseries for the time series stuff.
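As a rough illustration of tags staying arbitrary strings while the client applies a key:value convention on top, here is a small sketch. TagConvention and Split are hypothetical helper names, and splitting on the first ':' is an assumed convention, not something the server would know about.

```csharp
using System;
using System.Collections.Generic;

public static class TagConvention
{
    // Splits a tag such as "temp:C" on the first ':' into a key and a value.
    // A tag without ':' is returned as a key with an empty value.
    public static KeyValuePair<string, string> Split(string tag)
    {
        int idx = tag.IndexOf(':');
        if (idx < 0)
        {
            return new KeyValuePair<string, string>(tag, "");
        }
        return new KeyValuePair<string, string>(tag.Substring(0, idx), tag.Substring(idx + 1));
    }
}

public static class Program
{
    public static void Main()
    {
        var pair = TagConvention.Split("temp:C");
        Console.WriteLine(pair.Key + " = " + pair.Value); // prints "temp = C"
    }
}
```

The tag itself is still stored and queried as the plain string "temp:C"; only the caller interprets it.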

02/17/2014 02:23 PM by Juan Lopes

Hi, Ayende.

I work at a Brazilian company called Intelie (http://www.intelie.com/en/), and most of my work is to write an event processing language that does exactly what you are doing by writing an internal DSL.

This language is written in Java and unfortunately is closed source, but the main idea is to chain processing steps. The syntax looks like:

type:sensor => avg(temp1), avg(temp2) by sensor every week

Basically: lucene-ish query [=> transformation or aggregation]*

It's just the basic syntax, it has many more useful constructions. But our language is specialized in dealing with realtime data, so we have some builtins to deal with output rate vs time window.

Also, the whole language is designed to be distributed, with each aggregation storing not only its result, but also the information needed to merge it with other nodes' results.

If you wish, I can provide you with more details.

02/17/2014 03:14 PM by Rafal

Juan, be careful with announcing publicly that you're ready to give out company product details. Someone might overreact. Maybe it's not a secret, but you never know..

02/17/2014 03:43 PM by David Cuccia

Typo: aggregate/aggregation

02/17/2014 05:56 PM by Juan Lopes

Rafal, it's not a secret. In fact, talking about it is one of the company's objectives. I've been giving talks about it for the last year (all in Portuguese, unfortunately). If you're going to CeBIT this year, some of my colleagues will be there also talking about it (we're one of the CODE_n finalists).

We really want to opensource it, but still struggling with proper documentation and licensing.

02/17/2014 06:47 PM by Rafal

ok, looks like your company is not a corporation ;)

02/19/2014 08:31 PM by Hpcd

Are you guys planning to provide an OPC HDA driver for ravendb timeseries data?

02/20/2014 09:23 AM by Ayende Rahien

Hpcd, I don't understand what this is, so I don't know. Reading about it, this looks like a DCOM interface, which should be doable, but is probably complex / hard. I am not sure how worth it this would be.

02/20/2014 08:31 PM by hpcd

Hi Ayende:

Well, I am reading that you are implementing a timeseries--historian type functionality in RavenDB--which is great!

In the process industry (oil/gas, wood, chemical, etc.), a whole lot has been invested in the OPC standard for communicating data from devices. Typically, you would expose your ravendb as an OPC HDA/DA server, and then SCADA and other software can connect to it and expose the data or write data. Sometimes, the facility to execute the C# client api might not be available.

Check http://www.opcfoundation.org/. Most mainstream historians have HDA server for exposing their historians. Makes it more competitive.

I know there is timeseries data for more business data, etc., but having an OPC HDA interface to your ravendb historian would be appealing to the process industry.

02/20/2014 08:40 PM by Hpcd

I don't know what was the inspiration for the ravendb timeseries db, but I assumed it to be process historian data--since I came from that background.

Anyways, it's perhaps a whole other topic. A typical thing some OPC HDA servers will allow you to do is create virtual tags that generate historical computed data, as if it were coming from the device. So, as your "incoming", realtime device temperature and pressure change, it triggers a computation on another virtual tag to store some relevant computed value, which is reported as if it were a tag from the device. These extra bits may not be part of the OPC HDA standard, but vendors add extra functionality to one-up each other...

02/21/2014 12:17 PM by Ayende Rahien

hpcd, I am just talking about design principles at the moment, I am not really getting down to implementing this.

Besides, this looks expensive: http://www.advosol.com/pc-17-4-opc-hda-net-server-toolkit.aspx

At any rate, if/when we get around to doing this, that will be something to consider.

02/21/2014 12:18 PM by Ayende Rahien

Hpcd, how do you define/create the computation?

02/21/2014 10:05 PM by Hpcd

The computation is defined in your OPC HDA server--by the user/engineer.

This OPC server can connect to your ravendb historian to get the current tags. It will need to have some UI to configure simple math relations.

It is not uncommon to have some scripting support with predefined functions, attached to your virtual tag.

Leaving OPC servers, etc. aside, a useful feature for your api would be to allow:

a. Raw query
b. Delta query
c. Filter options for data quality

For a., if a sensor produces a 100 deg C value for 10000 samples, and then it changes to 102 deg C for one value, you will read 10001 values.

For b., you only read 2 values: time1 and 100 deg, time2 and 102 deg.
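The raw versus delta distinction can be sketched as a simple client-side filter that keeps a sample only when its value differs from the previous one. This is an illustration of the idea under that assumption, not a proposed server feature. DeltaSketch and Delta are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class DeltaSketch
{
    // Keeps the first sample and every later sample whose value differs
    // from the value of the sample immediately before it.
    public static List<KeyValuePair<DateTime, double>> Delta(
        IEnumerable<KeyValuePair<DateTime, double>> raw)
    {
        var result = new List<KeyValuePair<DateTime, double>>();
        bool hasPrevious = false;
        double previous = 0;
        foreach (var sample in raw)
        {
            if (!hasPrevious || sample.Value != previous)
            {
                result.Add(sample);
            }
            previous = sample.Value;
            hasPrevious = true;
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        var start = new DateTime(2014, 2, 21, 0, 0, 0, DateTimeKind.Utc);
        var raw = new List<KeyValuePair<DateTime, double>>();
        for (int i = 0; i < 10000; i++)
        {
            raw.Add(new KeyValuePair<DateTime, double>(start.AddSeconds(i), 100.0));
        }
        raw.Add(new KeyValuePair<DateTime, double>(start.AddSeconds(10000), 102.0));

        Console.WriteLine(raw.Count);                    // prints 10001
        Console.WriteLine(DeltaSketch.Delta(raw).Count); // prints 2
    }
}
```

The exact floating point comparison is deliberate for this example; a real delta query would probably need a tolerance or dead-band setting.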

02/24/2014 05:18 AM by Matt Johnson

Hi Oren. In regards to Rollup.Weekly, keep in mind that not everyone defines their start of week by the same criteria. For that matter, not everyone will agree on when the day begins and ends, not just because of time zones, but because a business day might roll over into a different calendar date depending on what kind of business you're operating. How will you tackle these issues in the time series API? Hopefully in some way that is highly configurable and/or extensible?

02/24/2014 08:55 AM by Ayende Rahien

Matt, Yes... you are quite correct. For that matter, it would be crazy to do this in a daily fashion as well. Considering things like daylight savings, etc.

Maybe I'll just do that on an hourly basis, you can define 24 hours, 168 hours, etc.

To my knowledge, we don't have a lot of issues with different definitions of what an hour is. And I don't care about Martian time.
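A rollup defined purely in hours could look roughly like the sketch below. It assumes timestamps are in UTC and that bucket boundaries are aligned to tick zero of DateTime, which is an arbitrary choice made for the example. RollupBucket and FixedHourRollup are hypothetical names, not part of any existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class RollupBucket
{
    public DateTime Start;
    public double Min;
    public double Max;
    public double Mean;
}

public static class FixedHourRollup
{
    // Groups samples into consecutive windows of `hours` hours
    // (24 for a day, 168 for a week) and computes min, max and mean
    // for every window that contains at least one sample.
    public static List<RollupBucket> Rollup(
        IEnumerable<KeyValuePair<DateTime, double>> samples, int hours)
    {
        if (hours <= 0) throw new ArgumentOutOfRangeException(nameof(hours));
        long width = hours * TimeSpan.TicksPerHour;
        return samples
            .GroupBy(s => s.Key.Ticks / width)
            .OrderBy(g => g.Key)
            .Select(g => new RollupBucket
            {
                Start = new DateTime(g.Key * width, DateTimeKind.Utc),
                Min = g.Min(s => s.Value),
                Max = g.Max(s => s.Value),
                Mean = g.Average(s => s.Value)
            })
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var day = new DateTime(2014, 2, 24, 0, 0, 0, DateTimeKind.Utc);
        var samples = new List<KeyValuePair<DateTime, double>>
        {
            new KeyValuePair<DateTime, double>(day.AddMinutes(10), 1.0),
            new KeyValuePair<DateTime, double>(day.AddMinutes(50), 3.0),
            new KeyValuePair<DateTime, double>(day.AddMinutes(65), 5.0)
        };
        foreach (var b in FixedHourRollup.Rollup(samples, 1))
        {
            Console.WriteLine(b.Start.ToString("o") + " min=" + b.Min + " max=" + b.Max + " mean=" + b.Mean);
        }
    }
}
```

Because the window is a fixed number of ticks, the caller decides what "a day" or "a week" means by picking the hour count and, if needed, shifting timestamps before rolling up.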
