Soliciting feedback: RavenDB time series queries
RavenDB 5.0 will come out with support for time series. I talked about this briefly in the past, and now we are at the point where the feature is almost ready to stand on its own. Before we get there, I have a few questions while the design is still open. Here is what a typical query is going to look like:
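Something along these lines (the collection, series name, and dates are illustrative; the exact keywords are part of what we want feedback on):

```
from Employees as e
where e.Address.City = 'London'
select timeseries(
    from e.HeartRate
    between '2020-01-01' and '2020-06-01'
    group by '1 hour'
    select min(), max(), avg()
)
```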
We intend to make the queries as obvious as possible, so I’m not going to explain it. If you can’t figure out what the query above is meant to do, I would like to know, though.
What sort of queries would you want to run on the data? For example, here is something that we expect users to want to do: compare and contrast different time periods, which you'll be able to do in the following manner:
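Something like this, assuming a hypothetical timeseries.diff() method (the method names and parameters are illustrative, and are exactly the sort of thing we want feedback on):

```
from Employees as e
select
    timeseries(from e.HeartRate between $lastMonth and now()
               group by '1 day' select avg()) as Current,
    timeseries(from e.HeartRate between $twoMonthsAgo and $lastMonth
               group by '1 day' select avg()) as Previous,
    timeseries.diff(Current, Previous) as Diff
```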
The output of this query will give you the daily summaries for the last two months, as well as a time-based diff between the two (meaning that it will match on the same dates, ignoring missing values, etc.).
What other “timeseries.*” methods would you need?
The other factor that we want to get feedback on is what sort of visualization do you want to see on top of this data in the RavenDB Studio?
Comments
Running average/slope over a specified number of points & extrapolation (Markov etc)
Interpolation would be nice, so you can get sensible data for all timestamps. A thermometer that logs every few minutes probably won't have data for exactly 12:00:00, but interpolation will give useful results without complicating the query with prev/next/avg... It doesn't have to be fancy, just linear and nearest neighbor are already very useful for most use cases.
Would it work with multiple group by values aka dimensions? For example group by 1d and activity. We are already doing something like this in a CrossFilter (similar to https://square.github.io/crossfilter/).
In general, my use cases for time series usually center around either (a) operational metrics (cpu, memory, network, RPS etc.) or (b) business activity metrics (e.g. the ratio of successful to unsuccessful logins). All are stored in time series with a number of dimensions and sometimes more than one metric.
In terms of queries, it usually centers around aggregations (mean etc.) or selectors (max, min, percentile), or some simple math (getting a ratio of one metric to another). Period over Period comparison (e.g. YoY or MoM) is also useful.
Filtering and grouping happen on the dimensions.
More advanced scenarios include things like finding outliers or anomaly detection (especially for seasonal data) - although I'm finding streaming analytics more useful for an alert mechanism for anomalies than querying historical sets.
A useful reference would be to look at InfluxDB and their query language. We currently use that, plus Grafana for most of the metrics generated by our infra and applications.
The topic of compression over time (i.e. achieving a constant size database for unlimited data points by decreasing the resolution and aggregating older data) is interesting, as is aggregating high frequency data before it gets written (e.g. statsd).
David,
The query for average / slope would be something like:
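Assuming a slope() aggregation alongside avg() (the aggregation names are still tentative):

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 hour'
    select avg(), slope()
)
```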
So you'll get hourly heartbeats average and the slope of each hour.
For extrapolation, you would do something like:
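A sketch, using a hypothetical extrapolate option:

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 minute'
    with extrapolate(prev, next)
    select avg()
)
```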
The idea is that with extrapolate, we fill in the missing values so we have at least one value every minute (based on previous / next values)
Does this make sense? Would it work for your scenario?
kpvleeuwen,
Something like this?
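A sketch; the nearest / within keywords are made up to show the intent:

```
from Sensors as s
select timeseries(
    from s.Temperature
    nearest '2020-01-15T10:00:00' within '1 hour'
    interpolate(prev, next)
)
```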
This will look for the value nearest to 10 AM (limited to a 1 hour window) and find the best match. If there isn't an exact match, it will adjust the value based on prev / next.
Ricky,
Grouping by dimensions other than time is supported, but you'll need to use a tag for that. Each timestamp has multiple values and has a tag associated with it. You can filter based on that.
For example:
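A sketch, assuming the tag can participate in the group by:

```
from Sensors as s
select timeseries(
    from s.Temperature
    group by '1 hour', tag
    select avg()
)
```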
In this way, each time we take the temp, we tag the thermometer used. Then we can group by the hour as well as the make of the thermometer.
Trevor,
See the other answers for samples of how some of these can be done, but you brought up some unique ones.
First of all, we are going to provide you with a way to write your own logic, so you can do something like:
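A sketch using a declared function (the body here is a placeholder for your own logic):

```
declare timeseries function arbitrage(e)
{
    from e.HeartRate
    group by '1 hour'
    select avg()
}
from Employees as e
select arbitrage(e)
```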
For checking different ranges against each other, you can do:
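For example (the range parameters are illustrative):

```
from Employees as e
select
    timeseries(from e.HeartRate between $from1 and $to1
               group by '1 day' select avg()) as RangeA,
    timeseries(from e.HeartRate between $from2 and $to2
               group by '1 day' select avg()) as RangeB
```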
Can you give some examples on what you mean by grouping and filtering by dimensions?
Outliers / anomalies are probably not going to be in the next release. As you mentioned, there are dedicated tools for that purpose.
We are going to allow downsampling on the fly, yes. See: https://issues.hibernatingrhinos.com/issue/RavenDB-14223
Aggregating high frequency data is interesting, but I'm not sure about the need. We can do more than 150K documents / sec on a single node. I'm pretty sure that we can go higher than that for time series, so I don't know how much that would be needed.
Yes, that would be very useful. The limited interpolation range is a nice touch. It could also be used to map/reduce two independent timeseries to the same timebase, without using the (min, max, avg) tuple. For example, some wireless sensors report their values at more or less random intervals between 50-70 seconds, but you want to unify that to a time series with a record at exactly every minute so they can be easily correlated. Using min/max/avg would just complicate things.
Does something like an explicit interpolation option in the query make sense?
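Sketching what I have in mind (the interpolate syntax is invented):

```
from Sensors as s
select timeseries(
    from s.Temperature
    group by '1 minute'
    interpolate(linear)
)
```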
I can understand if you decide that making up data points is outside the scope of the database.
kpvleeuwen,
The problem here is how you determine what to do. For example, if you have a minute with no values, do you invent one?
I think that something like that should be left for a user defined function, like the arbitrage example above.

How would timezones be handled?
Group by 1d would have different values if we query in GMT, WEST, PST, etc.
Postgresql, for instance, allows converting data to a specific timezone (or offset; I admit I don't have the details off the top of my head).
Bruno,
When you store the data, you must store it in UTC. But when you query, you can shift it to the desired time. This looks something like this:
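A sketch, using an offset clause on the query (the exact spelling may differ):

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 day'
    select avg()
    offset '+02:00'
)
```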
Makes sense. I was about to suggest 'Would you be willing to also support "as timezone X"', but I'm not exactly sure if that's a sane request, as aggregations over daylight savings might make no sense...
I am trying to understand the added value. Today you can achieve the same with the existing capabilities of RavenDB, right? Or not?
To persist data, just update an ICollection&lt;T&gt; of values with a UTC DateTime on an entity and persist it. You could add a nested structure. To read it, just query the object and group by if you want to aggregate the data. To squeeze out more performance, you can make use of a map/reduce index.
So perhaps I am missing something here.
Bruno,
Doing daylight savings sensitive computation is _expensive_. Leaving aside the fact that you'll need to do a crazy amount of work per date, it is also really complex.
Last week Israel moved its clocks back one hour, so twice in the same day we had a 1 AM to 2 AM hour. If you want an hourly heartbeat status on top of that, it is going to be quite confusing to figure out how to merge them into the same local timezone. That is why we simply do an offset.
M. Schopman,
You can do this today, yes. But it means that you have to keep the data as a document, and if you have a lot of data (and we tested this with millions of data points), this gets really expensive.
For example, see this post: https://ayende.com/blog/187297-C/analyzing-five-years-of-fitbit-data
As document data, it would take roughly 300 MB. When stored as a time series directly, it is less than 40 MB.
It also allows me to do fast queries and aggregations on date ranges, without having to predefine map/reduce indexes.
Basically, it is all about optimizations.
My god you are fast ... :)
On the 300 MB vs 40 MB: does this mean the data will not be persisted as part of the JSON document, e.g. the document holds a reference to a dedicated time series storage object?
M. Schopman,
Fast?
Yes, the data is not stored in the document, it is stored separately. Similar to how we deal with attachments or counters.