Soliciting feedback: RavenDB time series queries
RavenDB 5.0 will come out with support for time series. I talked about this briefly in the past, and now we are at the point where the feature is almost ready to stand on its own. Before we get there, I have a few questions while the design is still open. Here is what a typical query is going to look like:
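Something along these lines (the collection, series name, and dates are illustrative; the exact keywords are part of what we want feedback on):

```
from Employees as e
where e.Address.City = 'London'
select timeseries(
    from e.HeartRate
    between '2020-01-01' and '2020-06-01'
    group by '1 hour'
    select min(), max(), avg()
)
```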
We intend to make the queries as obvious as possible, so I’m not going to explain it. If you can’t figure out what the query above is meant to do, I would like to know, though.
What sort of queries would you want to run on the data? For example, here is something that we expect users to want to do: compare and contrast different time periods, which you'll be able to do in the following manner:
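Something like this, assuming a hypothetical timeseries.diff() method (the method names and parameters are illustrative, and are exactly the sort of thing we want feedback on):

```
from Employees as e
select
    timeseries(from e.HeartRate between $lastMonth and now()
               group by '1 day' select avg()) as Current,
    timeseries(from e.HeartRate between $twoMonthsAgo and $lastMonth
               group by '1 day' select avg()) as Previous,
    timeseries.diff(Current, Previous) as Diff
```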
The output of this query will give you the daily summaries for the last two months, as well as a time-based diff between the two (meaning that it will match on the same dates, ignoring missing values, etc.).
What other “timeseries.*” methods would you need?
The other factor that we want to get feedback on is what sort of visualization do you want to see on top of this data in the RavenDB Studio?
Comments
Running average/slope over a specified number of points & extrapolation (Markov etc)
Interpolation would be nice, so you can get sensible data for all timestamps. A thermometer that logs every few minutes probably won't have data for exactly 12:00:00, but interpolation will give useful results without complicating the query with prev/next/avg... It doesn't have to be fancy, just linear and nearest neighbor are already very useful for most use cases.
Would it work with multiple group by values aka dimensions? For example group by 1d and activity. We are already doing something like this in a CrossFilter (similar to https://square.github.io/crossfilter/).
In general, my use cases for time series usually center around either (a) operational metrics (cpu, memory, network, RPS etc.) or (b) business activity metrics (e.g. the ratio of successful to unsuccessful logins). All are stored in time series with a number of dimensions and sometimes more than one metric.
In terms of queries, it usually centers around aggregations (mean etc.) or selectors (max, min, percentile), or some simple math (getting a ratio of one metric to another). Period over Period comparison (e.g. YoY or MoM) is also useful.
Filtering and grouping happen on the dimensions.
More advanced scenarios include things like finding outliers or anomaly detection (especially for seasonal data) - although I'm finding streaming analytics more useful for an alert mechanism for anomalies than querying historical sets.
A useful reference would be to look at InfluxDB and their query language. We currently use that, plus Grafana for most of the metrics generated by our infra and applications.
The topic of compression over time (i.e. achieving a constant size database for unlimited data points by decreasing the resolution and aggregating older data) is interesting, as is aggregating high frequency data before it gets written (e.g. statsd).
David,
The query for average / slope would be something like:
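Assuming a slope() aggregation alongside avg() (the aggregation names are still tentative):

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 hour'
    select avg(), slope()
)
```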
So you'll get hourly heartbeats average and the slope of each hour.
For extrapolation, you would do something like:
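A sketch, using a hypothetical extrapolate option:

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 minute'
    with extrapolate(prev, next)
    select avg()
)
```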
The idea is that with extrapolate, we fill in the missing values so we have at least one value every minute (based on previous / next values)
Does this make sense? Would it work for your scenario?
kpvleeuwen,
Something like this?
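A sketch; the nearest / within keywords are made up to show the intent:

```
from Sensors as s
select timeseries(
    from s.Temperature
    nearest '2020-01-15T10:00:00' within '1 hour'
    interpolate(prev, next)
)
```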
This will look for the value nearest to 10 AM (limited to a 1 hour window) and find the best match. If there isn't an exact match, it will adjust the value based on prev / next.
Ricky,
Grouping by dimensions other than time is supported, but you'll need to use a tag for that. Each timestamp has multiple values and has a tag associated with it. You can filter based on that.
For example:
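A sketch, assuming the tag can participate in the group by:

```
from Sensors as s
select timeseries(
    from s.Temperature
    group by '1 hour', tag
    select avg()
)
```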
In this way, each time we take the temp, we tag the thermometer used. Then we can group by the hour as well as the make of the thermometer.
Trevor,
See the other answers for samples of how some of these can be done, but you brought up some unique ones.
First of all, we are going to provide you with a way to write your own logic, so you can do something like:
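A sketch using a declared function (the body here is a placeholder for your own logic):

```
declare timeseries function arbitrage(e)
{
    from e.HeartRate
    group by '1 hour'
    select avg()
}
from Employees as e
select arbitrage(e)
```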
For checking different ranges against each other, you can do:
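For example (the range parameters are illustrative):

```
from Employees as e
select
    timeseries(from e.HeartRate between $from1 and $to1
               group by '1 day' select avg()) as RangeA,
    timeseries(from e.HeartRate between $from2 and $to2
               group by '1 day' select avg()) as RangeB
```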
Can you give some examples on what you mean by grouping and filtering by dimensions?
Outliers / anomalies are probably not going to be in the next release. As you mentioned, there are dedicated tools for that purpose.
We are going to allow downsampling on the fly, yes. See: https://issues.hibernatingrhinos.com/issue/RavenDB-14223
Aggregating high frequency data is interesting, but I'm not sure about the need. We can do more than 150K documents / sec on a single node. I'm pretty sure that we can go higher than that for time series, so I don't know how much that would be needed.
Yes, that would be very useful. The limited interpolation range is a nice touch. It could also be used to map/reduce two independent timeseries to the same timebase, without using the (min, max, avg) tuple. For example, some wireless sensors report their values at more or less random intervals between 50-70 seconds, but you want to unify that to a time series with a record at exactly every minute so they can be easily correlated. Using min/max/avg would just complicate things.
Does something like an explicit interpolation option in the query make sense?
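Sketching what I have in mind (the interpolate syntax is invented):

```
from Sensors as s
select timeseries(
    from s.Temperature
    group by '1 minute'
    interpolate(linear)
)
```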
I can understand if you decide that making up data points is outside the scope of the database.
kpvleeuwen,
The problem here is how you determine what to do. For example, if you have a minute with no values, do you invent one?
I think that something like that should be left for a user defined function, like the arbitrage example above.

How would timezones be handled?
Group by 1d would have different values if we query in GMT, WEST, PST, etc.
Postgresql, for instance, allows converting data to a specific timezone (or offset; I admit I don't have the details off the top of my head).
Bruno,
When you store the data, you must store it in UTC. But when you query, you can shift it to the desired time. This looks something like this:
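A sketch, using an offset clause on the query (the exact spelling may differ):

```
from Employees as e
select timeseries(
    from e.HeartRate
    group by '1 day'
    select avg()
    offset '+02:00'
)
```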
Makes sense. I was about to suggest 'Would you be willing to also support "as timezone X"', but I'm not exactly sure if that's a sane request, as aggregations over daylight savings might make no sense...
I am trying to understand the added value. Today you can achieve the same with the existing capabilities of RavenDB, right? Or not?
To persist data, just update an ICollection&lt;T&gt; of values with a UTC DateTime on an entity and persist it. You could add a nested structure. To read it, just query the object and group by if you want to aggregate the data. To squeeze out more performance, you can make use of a map/reduce index.
So perhaps I am missing something here.
Bruno,
Doing daylight savings sensitive computation is _expensive_. Leaving aside the fact that you'll need to do a crazy amount of work per date, it is also really complex.
Last week Israel moved its clocks back one hour, so twice in the same day we had a 1 AM to 2 AM hour. If you want an hourly heartbeat status on top of that, it is going to be quite confusing to figure out how to merge them into the same local timezone. That is why we simply do an offset.
M. Schopman,
You can do this today, yes. But it means that you have to keep the data as a document, and if you have a lot of data (and we tested this with millions of data points), this gets really expensive.
For example, see this post: https://ayende.com/blog/187297-C/analyzing-five-years-of-fitbit-data
As document data, it would take roughly 300 MB. When stored as a time series directly, it is less than 40 MB.
It also allows me to do fast queries and aggregations on date ranges, without having to predefine map/reduce indexes.
Basically, it is all about optimizations.
My god you are fast ... :)
On the 300 MB vs 40 MB: does this mean the data will not be persisted as part of the JSON document, e.g. the document holds a reference to a dedicated time series storage object?
M. Schopman,
Fast?
Yes, the data is not stored in the document, it is stored separately. Similar to how we deal with attachments or counters.