Ayende @ Rahien

Refunds available at head office

Time series feature design: Storage

In this post, we are going to talk about the actual design for a time series feature, from a storage perspective. For the purpose of this discussion, I am going to assume low-level Voron as the underlying storage engine.

We are going to need to store the data per series, and the actual data is going to be relatively ordered (time based, usually increasing).

As such, I am going to model the data itself as a set of Voron trees, one tree per series. The actual data in each tree is composed of a key and a value, each 8 bytes long.

  • Key (8 bytes): the timestamp, stored as the number of 100-nanosecond intervals since Jan 1, 0001 (which covers dates up to Dec 31, 9999) – in other words, the DateTime tick value.
  • Value (8 bytes): a 64-bit double-precision floating point number.
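As a rough illustration (the type and member names here are mine, not the actual Voron storage code), each entry boils down to a 16-byte pair of tick count and value:

using System;

public struct TimeSeriesEntry
{
    public readonly long Ticks;   // key: 100-nanosecond intervals since Jan 1, 0001
    public readonly double Value; // value: the measurement, as a 64-bit double

    public TimeSeriesEntry(DateTime time, double value)
    {
        Ticks = time.Ticks;
        Value = value;
    }

    public DateTime Time
    {
        get { return new DateTime(Ticks); }
    }
}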

There can be no duplicate values for a specific time in a series. Since this represents one ten-millionth of a second, that should be granular enough, I think.

Reading from the storage will always be done on a series basis. The idea is to essentially use the code from this post, but simplified for the smaller key. That way we can store roughly 250 data points in each Voron page, which should give us really good performance for both reads & writes.
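As a back-of-the-envelope check on that figure (the 4 KB page size matches Voron's default, but the per-page overhead here is an assumption of mine, used only to show why the number is roughly 250 rather than a full 256):

class PageMath
{
    const int PageSize     = 4096;   // assumed Voron page size, in bytes
    const int PageOverhead = 96;     // assumed per-page header / bookkeeping
    const int EntrySize    = 8 + 8;  // 8-byte tick key + 8-byte double value

    static void Main()
    {
        int entriesPerPage = (PageSize - PageOverhead) / EntrySize;
        System.Console.WriteLine(entriesPerPage); // prints 250
    }
}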

Note that we need to provide a storage API for bulk writes, since there are some systems that would require it. Those can either be systems with a high overall update rate (even a whole set of sensors with very low individual refresh rates adds up) or, more likely, import scenarios. The import scenario can either be a one-time event (moving to a new system) or something like a nightly batch process where we get the aggregated data from multiple sources.

For our purposes, we aren’t going to distinguish between the two.

We are going to provide an API similar to:

  • void Add(IEnumerable<Tuple<string, IEnumerable<Tuple<DateTime, double>>>> data);

This ugly method can handle updates to multiple series at once. In human speak, this is an enumerable of updates to series, where each update is the time and the value for that time. From the storage perspective, this creates a single transaction where all the changes are added at once, or not at all. It is the responsibility of higher-level code to make sure that we optimize the number of calls vs. in-flight transaction data vs. the size of transactions.
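For example, a call that updates two series in one transaction might look like the following (the storage instance and the series names are purely illustrative):

// Assumes: ITimeSeriesStorage storage = ...;
storage.Add(new[]
{
    Tuple.Create("sensors/1/temperature", (IEnumerable<Tuple<DateTime, double>>)new[]
    {
        Tuple.Create(new DateTime(2014, 2, 12, 10, 0, 0), 21.5),
        Tuple.Create(new DateTime(2014, 2, 12, 10, 1, 0), 21.7)
    }),
    Tuple.Create("sensors/1/humidity", (IEnumerable<Tuple<DateTime, double>>)new[]
    {
        Tuple.Create(new DateTime(2014, 2, 12, 10, 0, 0), 52.0)
    })
});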

Adding data to a series that doesn't exist will create it.

We also assume that series names are printable Unicode strings of up to 256 bytes (UTF8).
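A minimal sketch of how that constraint could be enforced (the method name is mine, not part of the proposed API); note that the limit is on the encoded bytes, not on the character count:

using System.Text;

static class SeriesNames
{
    public static bool IsValidSeriesName(string name)
    {
        if (string.IsNullOrEmpty(name))
            return false;
        // 256 bytes of UTF8, which may be fewer than 256 characters
        return Encoding.UTF8.GetByteCount(name) <= 256;
    }
}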

The read API is going to be composed of:

  • IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
  • IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string[] series, DateTime start, DateTime end);

Anything else would have to be done at a higher level.
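For instance, reading one day of data from a single series might look like this (again, the series name is illustrative):

// Assumes: ITimeSeriesStorage storage = ...;
foreach (Tuple<DateTime, double> point in storage.ScanRange(
    "sensors/1/temperature",
    new DateTime(2014, 2, 12),
    new DateTime(2014, 2, 13)))
{
    Console.WriteLine("{0:o} -> {1}", point.Item1, point.Item2);
}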

There is no facility for updates. You can just add again at the same time with a new value, and while this is supported, it isn't something that is expected.

Deleting data can be done using:

  • void Delete(string series, DateTime start, DateTime end);
  • void DeleteAll(string series);

The first option will delete all the items in the given range, the second will delete the entire tree; the second is probably going to be much faster. We are probably better off checking whether the requested range covers the tree's min / max values and falling back to DeleteAll if we can. An explicit DeleteAll will also delete all of the series' tags, while an implicit Delete(“series-1”, DateTime.MinValue, DateTime.MaxValue), for example, will delete the series' tree but keep the series' tags.
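A rough sketch of that check, assuming the tree can tell us its minimum and maximum keys (all the helper names here are made up for illustration):

void Delete(string series, DateTime start, DateTime end)
{
    long min = TreeMinKey(series); // smallest tick stored for this series
    long max = TreeMaxKey(series); // largest tick stored for this series

    if (start.Ticks <= min && end.Ticks >= max)
    {
        // Fast path: the range covers everything, so discard the whole tree
        // (but, unlike an explicit DeleteAll, keep the series' tags).
        DropTree(series);
        return;
    }

    // Slow path: walk the range and delete item by item.
    DeleteRange(series, start.Ticks, end.Ticks);
}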

A series can have tags attached to it. Those can be any string of up to 256 bytes (UTF8). By convention, they would usually be in the form of “key:val”.

Series can be queried using:

  • IEnumerable<string> GetSeries(string start = null);
  • IEnumerable<string> GetSeriesBy(string tag, string start = null);
  • IEnumerable<string> GetSeriesTags(string series);

Series can be tagged using:

  • void TagSeries(string series, string[] tags);

There can be no duplicate tags.
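Putting the tagging and lookup APIs together, usage might look like this (the tag values are illustrative):

// Assumes: ITimeSeriesStorage storage = ...;
storage.TagSeries("sensors/1/temperature", new[] { "building:north", "unit:celsius" });

// Find every series carrying a given tag.
foreach (string series in storage.GetSeriesBy("building:north"))
{
    Console.WriteLine(series);
}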

In summary, the interface we intend to use for storage would look roughly like the following:

public interface ITimeSeriesStorage
{
    void Add(IEnumerable<Tuple<string, IEnumerable<Tuple<DateTime, double>>>> data);
    IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
    IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string[] series, DateTime start, DateTime end);
    void Delete(string series, DateTime start, DateTime end);
    void DeleteAll(string series);
    IEnumerable<string> GetSeries(string start = null);
    IEnumerable<string> GetSeriesBy(string tag, string start = null);
    IEnumerable<string> GetSeriesTags(string series);
    void TagSeries(string series, string[] tags);
}

Data sizes – assume 1 value per minute per series; that gives us an update rate of 1,440 updates per day, or 525,600 per year. That means that for 100,000 sensors (not an uncommon number) we need to deal with 52,560,000,000 data items per year, which at 16 bytes per raw entry comes to roughly 840 GB or so. Assuming 1 value per second instead, that gives us 86,400 values per day and 31,536,000 per year, or 3,153,600,000,000 values per year for the 100,000 sensors, which will use about 50 TB or so. Those seem to be eminently reasonable values for the data size that we are talking about here.
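The raw math behind those numbers, assuming 16 bytes per entry and ignoring tree overhead and compression:

class StorageEstimate
{
    static void Main()
    {
        const long series = 100000;
        const long bytesPerEntry = 16;            // 8-byte key + 8-byte value

        long perMinute = series * 1440L * 365L;   // 52,560,000,000 entries per year
        long perSecond = series * 86400L * 365L;  // 3,153,600,000,000 entries per year

        System.Console.WriteLine(perMinute * bytesPerEntry / 1e9);  // ~841 GB
        System.Console.WriteLine(perSecond * bytesPerEntry / 1e12); // ~50.5 TB
    }
}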

Next, we’ll discuss how this is all going to look over the wire…

Comments

Rafal
02/12/2014 10:55 AM by
Rafal

One question: why do you think such functionality needs to be abstracted into a dedicated database type with a dedicated API? Basically, you are talking about storing key-value pairs with an 8 byte key and an 8 byte value, and with key range lookup/delete functionality. There's nothing inherently specific to time series here and any key-value store can do that. Is it the right moment to design the record structure, time resolution, or database API if you don't discuss any higher level functionality that will rely on that?

Ayende Rahien
02/12/2014 11:49 AM by
Ayende Rahien

Rafal, Primarily because right now what we are doing is just storing the data as efficiently as possible. The next step on top of that is to actually expose behavior and semantics that would allow you to handle this in a more natural manner. And in general, starting with some idea about the actual storage is a really good idea. You can change things around, but you need to start somewhere.

Federico Lois
02/12/2014 12:30 PM by
Federico Lois

From my experience dealing with time series I vouch for Ayende's insight. Storing time series in a format that is efficient for both read/write and processing is a must.

In fact, I was going to ask Ayende why he decided to go for DateTime instead of the tick (long) value. We were able to cut the query time in half with Raven using the tick value. And we would probably be able to go even further if we cut it to the ms level.

Ayende Rahien
02/12/2014 12:32 PM by
Ayende Rahien

Federico, The actual 8 bytes is the tick value, we'll show it to the user as a date time, probably. The idea is to give them a format that is easy to work with.

HarryDev
02/12/2014 03:06 PM by
HarryDev

I would discourage use of Tuple as this is a reference type and would induce a lot of GC activity for long time series. Create a custom value type (always recommended) or use a KeyValuePair.

Daniel
02/12/2014 03:34 PM by
Daniel

Ayende, I'm really interested in this series of posts, as it's one of those repeating patterns I've seen in several projects and want to 'get right'. One thing I think would be really awesome to explore is fun stuff you can do with time series once you have it. For example, I'd really love to see your take on using the series data to "predict" future values. I've been tinkering with code to record disk space and database size as time series and predict what date the disk will run out or the db will reach an unmanageable size. This is key data that OPs need to plan, but DBs leave up to you to figure on your own. It seems like a simple slope formula calculation would do, but it turns out there are many ways to approach this.

Ayende Rahien
02/12/2014 03:55 PM by
Ayende Rahien

Daniel, I'm just interested in storing and giving you the data in a nice way. How / what you do with that is out of scope :-)

Drazen
02/12/2014 07:56 PM by
Drazen

Just an observation... While 10 million discrete data points per second sounds overkill, there is at least one system which (incredibly) can process (and users of that system might want to ultimately store this data) up to 25 million operations per second.

The system is called Disruptor, it's coded in Java but there is a .NET port.

FYI: http://lmax-exchange.github.io/disruptor/

Karhgath
02/12/2014 09:15 PM by
Karhgath

Ayende, are you planning on creating more IStore interfaces for data other than time series? Or provide facilities to easily create our own storage by using low level Voron methods?

For example, aggregating event streams containing different document types, but all pertaining to a specific stream. If we keep the time series subject, it could be events that record changes in each sensor (instead of reading an absolute value every second). You'd have to aggregate the changes (TempChangeEvent.Difference) and apply them on an initial value (SensorStartupEvent.Initial) in sequential order to get the temp reading at a specific time.

Eli Weinstock-HErman
02/13/2014 12:09 AM by
Eli Weinstock-HErman

It's been a while, but I worked with some of the dedicated industrial historians in a prior job. At the time, disk was a lot more expensive and different historians dealt with this in different ways.

At least one of the commercial ones and one in-house one I saw (sitting on SQL Server, on a single core in those days, processing tons of sensors) would store only relevant data points, based on options like a deadband setting (any change within this was considered line noise) and a delta, slope, or other line function evaluation. One method I remember seeing a paper on would collect a series of values, evaluating as it received each one whether any point fell more than X distance from a line from the first to the last point. When this distance was violated, it would enter the first value in permanent storage with a timestamp and slope from the 1st point to point LAST-1. It would then flush the temporary points, keeping only the last two (LAST-1 and LAST), and start building up another series.

By defining the bounds of what was relevant data (as opposed to line noise or small changes that weren't relevant to the use of the data), you could get an accurate picture of the time series for much smaller storage and lower memory and network traffic on the query side. For a sensor value that often didn't change for minutes at a time or had long periods of inactivity, this would condense the data significantly.

When the data was retrieved, the series could be rebuilt easily from those relevant points + slopes and an interpolation method, which also gave you some interesting options for zooming in and out and predicting forward a short ways.

Anyway, I'm throwing it out there because I wasn't sure if you had run into these types of systems in the past or not. I've seen at least two projects implement their own time-series collection in a "store all the points!" method and quickly start drowning in data. I suspect drive size, IO, and network pushed a lot of the research in those areas, but even though storage has gotten far cheaper, I wonder if there is still some value in the CPU/IO trade-off or in already having the shape of the data and a smaller data set when querying and evaluating the results.

Ayende Rahien
02/13/2014 02:56 AM by
Ayende Rahien

Drazen, Apples & oranges. Note that this is just an in-memory messaging library. As such, it isn't really that interesting. A simple lock + queue would give you several million ops/sec in memory, with zero config cost. What we are talking about here is persisted data, as in, actually flushing it to disk safely.

Ayende Rahien
02/13/2014 02:56 AM by
Ayende Rahien

Karhgath, We are planning on doing dedicated data solutions, yes.

Ayende Rahien
02/13/2014 02:58 AM by
Ayende Rahien

Eli, That is a very good point, yes. The problem here is that I'm not in the "make business decisions about data" game, I'm just in the "gimme your data and I'll keep it safe and healthy" game :-) But it is a good idea to be able to decide whether to accept a write or not.

Drazen
02/13/2014 12:07 PM by
Drazen

Ayende,

I know the comparison isn't necessarily "fair".

That's why I said that users "might want to eventually persist this data" (emphasis on eventually).

It all depends on the context. But I've seen Disruptor used for stock trading and operations there might very well need to be persisted.

OTOH, how one stores that much information reasonably fast is beyond me. The best I've achieved was about 2000 ops/sec (4KB each), actually stored to disk (so no caching on the OS level or the SSD*).

  * where we can't really be sure that SSD didn't just store data in its own buffers, hence the need for SSD with capacitors so that buffers can be flushed in case of power failure.

Ayende Rahien
02/13/2014 12:49 PM by
Ayende Rahien

Drazen, You aren't doing this right. We can do 1 mil / sec on a standard disk, including fully flushing all buffers. It isn't easy, and requires a lot of dancing around, but it works.

And if you don't care how long it takes to read the information, it can be made even faster.

Drazen
02/13/2014 08:14 PM by
Drazen

Ayende,

let's do the math: 1 mil/s, say 2KB each write (roughly what we had), equals 1.9GB/s. If we're talking 4KB per write (usually the case, as file systems force you to write that much) then it's 3.8GB/s.

I triple dare you to show me the SSD which can sustain this kind of throughput :) assuming we're talking about 1 mil independent writes, no coalescing, no caching, anywhere - assume for a moment that this is a requirement, because of some external API which requires ACKs over the network for each "message" received and stored (assume also that network is infinite speed and it does not slow down the whole thing).

Ayende Rahien
02/14/2014 02:24 PM by
Ayende Rahien

Drazen, Our writes tend to be smaller, IIRC, we tested with ~130 bytes per write. That gives you roughly 130MB per second.

However, there is absolutely no reason why you can't do more. The fact that your data is 2GB in size doesn't mean you have to write 2GB. You can (and we do) compress the whole thing pretty efficiently.

And yes, you would need to do things like coalescing writes, otherwise you would be trying to hit the disk a million times a second, and that ain't going to work. But doing tx merging gives you a big perf boost.

James Farrer
02/14/2014 04:20 PM by
James Farrer

What about storing time series of strings?

Of course this could be converted to a number but it would be nice if the API just handled it...

(I'm thinking here from prior experience in manufacturing systems where you want to track the batch and when it changes and that could well be a string value)

Ayende Rahien
02/14/2014 05:01 PM by
Ayende Rahien

James, What is the scenario for a time series of strings?