The importance of a data format: Part III – The solution


In my previous two posts I explained the problem we ran into and discussed the requirements for the solution. To quickly reiterate, we don't want to use JSON as the internal storage format in RavenDB because it requires parsing whenever we need to do anything with the document. We want a format that is ready to be used without any parsing.

In order to do that, we split the behavior into two distinct responsibilities. We'll have a writer that takes JSON as input and produces a block of memory that represents this JSON, and a reader that can read from this block of memory and, if needed, restore the original JSON.

For now, I'm going to focus primarily on the write side of things. Because we want to reduce the amount of allocated and retained memory, we cannot use any format that requires reading the whole document before writing it. So we need to use a forward-only mode for writing. This plays nicely with memory allocation strategies, because we never need to move data once we have written it. In other words, whenever we read a value from the JSON stream, we write it to our own format. That doesn't seem very useful, until you realize that we can delay some decisions.

It will probably be easier if we used a real example:

{
   "Name":"Oren",
   "Dogs":[
      "Arava",
      "Oscar",
      "Phoebe"
   ],
   "Age":34,
   "Office":{
      "Name":"Hibernating Rhinos",
      "Street":"Hanashi 21",
      "City":"Hadera"
   }
}

This document shows quite a lot of features from JSON, so it is very suitable for demonstration. We read a token at a time from the document, and write it in the new format.

Let us look at how we'll deal with the Dogs array value (just this: ["Arava","Oscar","Phoebe"]). Loosely speaking, this will look like this:

[Figure: the raw byte layout of the Dogs array]

And here it is with explanations:

  • Len 5 – Arava
  • Len 5 – Oscar
  • Len 6 – Phoebe
  • Array size – 3
  • Offsets from metadata start – [19,13,7]
  • Types – [5,5,5]

First we have the values, as length-prefixed strings, then we have the number of elements in the array, and then we have a run of offsets. So far, this is obvious, but what are the offsets? They indicate how far back (from the array metadata start position) we need to go to get to a particular value. In other words, in order to read this array we actually need a pointer to the array metadata start; from there we can tell how many elements the array has, and then we can jump directly to the relevant value.
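To make the layout concrete, here is a minimal sketch in Python (illustrative only, not the actual RavenDB implementation) that builds this exact byte sequence, assuming all values are short strings and byte-sized offsets are enough:

# Build the array layout: length-prefixed values, element count,
# backward offsets, then type tags. Type 5 is a string, per the post.
STRING = 5

def write_array(strings):
    buf = bytearray()
    positions = []
    # Forward-only writing: each value is appended as soon as we see it.
    for s in strings:
        data = s.encode("utf-8")
        positions.append(len(buf))        # remember where this value starts
        buf.append(len(data))             # length prefix (assumes < 256 bytes)
        buf.extend(data)
    metadata_start = len(buf)
    buf.append(len(strings))              # number of elements in the array
    for pos in positions:
        buf.append(metadata_start - pos)  # how far back the value lives
    for _ in strings:
        buf.append(STRING)                # type of each value
    return buf, metadata_start

buf, meta = write_array(["Arava", "Oscar", "Phoebe"])
print(len(buf))                                       # 26 bytes in total
print([buf[meta + 1 + i] for i in range(buf[meta])])  # offsets: [19, 13, 7]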

For example, going to the second element will mean going 13 bytes backward from the metadata beginning, which will give us a byte whose value is 5 (the length of the string), followed by the actual string bytes.
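Continuing the sketch above, reading an element by index is just a matter of following the backward offset (a hypothetical helper, matching the layout we built):

def read_array_element(buf, metadata_start, index):
    count = buf[metadata_start]             # number of elements
    assert 0 <= index < count
    back = buf[metadata_start + 1 + index]  # backward offset for this index
    pos = metadata_start - back             # jump back to the value itself
    length = buf[pos]                       # length prefix
    return buf[pos + 1 : pos + 1 + length].decode("utf-8")

print(read_array_element(buf, meta, 1))     # -> "Oscar", 13 bytes back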

The array after the offsets contains the types of each value. In this case, this is an array of 3 strings, so we have [5,5,5]. Splitting the offsets and the types in this manner means that we have better alignment for the values, and it is simpler to deal with in general. By the way, type 5 is a string.

Why are we using backward references in this manner? Why not hold the absolute position in the document? That would certainly be much simpler, no? We use relative offsets because a large JSON document is typically composed of many small objects and arrays. By using an offset from the metadata start, we can ensure that most of the time this value will be small enough to fit in a single byte. If we can do that, we save quite a significant amount of memory. Of course, we also handle larger objects, including those that need offsets spanning the full 32-bit range. We have scoped the decision about the offset size to the level of a single array / object, because even in big documents, we have a lot of places where a byte or short offset is more than good enough.
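A sketch of how that per-container decision might look (my own illustration of the idea, not the real code):

def offset_width(max_offset):
    # The width is chosen per array / object, from the largest backward
    # offset that particular container needs.
    if max_offset <= 0xFF:
        return 1    # a single byte is enough
    if max_offset <= 0xFFFF:
        return 2    # a short
    return 4        # fall back to the full 32-bit range

print(offset_width(19))       # 1 - the Dogs array gets byte offsets
print(offset_width(100_000))  # 4 - a huge container pays the full price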

One thing to note about this: in JSON format, the array takes 27 bytes; in this format, it takes 26. We saved a whole whopping byte, cue the party noise.

But that isn't really interesting yet. Let us see how we deal with objects, in particular the Office value (just this: {"Name":"Hibernating Rhinos","Street":"Hanashi 21","City":"Hadera"}). This looks like the following:

[Figure: the raw byte layout of the Office object]

Again, changing to textual format gives:

  • Len 18 – Hibernating Rhinos
  • Len 10 – Hanashi 21
  • Len 6 – Hadera
  • Properties count – 3
    • Offset – 7, PropertyId – 5, Type – 5
    • Offset – 37, PropertyId – 0, Type – 5
    • Offset – 18, PropertyId – 4, Type – 5

(To the sharp-eyed among you: the last byte (00) in the figure is not really there and shouldn't be accounted for; it is an artifact of how I printed those values.)

In JSON, this object is 67 bytes long; in this format, it is merely 48 bytes. Yep, maidens swoon when they hear about how much space this saved.

But we are still missing things. To start with, where are the actual property names, and what are those property ids that we have been seeing?

One piece of data that we aren't writing as we read the source document is the property names. Indeed, so far, the biggest saving we have seen comes from not writing the property names. So the question is, where are they? The answer is simple: we wait until we finish with the whole document before we write the property names. Instead of writing a property name, we write only the property id. A property id is just the index of the property name in the list of property names we have seen so far.

In other words, we have a small benefit from not having to repeat property names. In the document above, we saved the need to write "Name" twice, because of this feature. At the end of the document, we write the property names just like we write values, but because we need to reference them by id, we also write their offsets into an array.
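The idea is simple enough to sketch (again, an illustration of the scheme rather than the actual code): each distinct name gets an id on first sight, and the names themselves are written once, at the end, with their own offset array so they can be found by id.

class PropertyTable:
    def __init__(self):
        self.ids = {}     # property name -> property id
        self.names = []   # property id -> property name

    def id_for(self, name):
        # First sighting assigns the next id; later sightings reuse it.
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]

    def write(self, buf):
        # Write the names as length-prefixed strings, then an offset
        # array so a property id can be resolved back to its name.
        positions = []
        for name in self.names:
            data = name.encode("utf-8")
            positions.append(len(buf))
            buf.append(len(data))
            buf.extend(data)
        table_start = len(buf)
        for pos in positions:
            buf.append(table_start - pos)  # assumes byte-sized offsets
        return table_start

props = PropertyTable()
for name in ["Name", "Dogs", "Age", "Office", "Name", "Street", "City"]:
    props.id_for(name)
print(props.ids)  # "Name" is written once, with id 0; Street is 4, City is 5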

Finally, we store the root metadata offset, the offset of the property names table, and the size of the offsets that we'll encounter in this document, and we are done.

Almost. There is another subtle issue to deal with. Look at the properties of the Office object that we have above. Notice that they don't look like they are stored in order of arrival. Instead, the property entries of the object are actually stored sorted by the value of the property name.

In other words, given that I want to access a particular property in an object, what I'll do is run a binary search over the properties of this object. I'll look up the property name for each property id and compare it to the property name I want to find. So searching for a property is O(log N), where N is the number of properties in a particular object.
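A sketch of that lookup (the shapes here are hypothetical: entries is the object's (offset, property id, type) triplets, sorted by name; names resolves an id back to its name via the names table):

def find_property(entries, names, wanted):
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        name = names[entries[mid][1]]   # property id -> property name
        if name < wanted:
            lo = mid + 1
        elif name > wanted:
            hi = mid
        else:
            return entries[mid]         # (offset, property_id, type)
    return None

names = ["Name", "Dogs", "Age", "Office", "Street", "City"]
office = [(7, 5, 5), (37, 0, 5), (18, 4, 5)]  # sorted: City, Name, Street
print(find_property(office, names, "City"))   # -> (7, 5, 5)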

We have taken things a bit further: when we encounter a large string value (currently, one above 128 bytes), we'll also try to compress it. We're using LZ4 to compress the data, and for large values, it generally represents quite a nice saving. But if the saving isn't good enough (it doesn't save at least 10% of the value size), we'll just emit the value as is.
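The rule is easy to express in code. Here is a sketch where zlib stands in for LZ4 so the example runs with only the standard library (the thresholds are the ones from the post):

import zlib

COMPRESSION_THRESHOLD = 128  # only consider values above 128 bytes
MIN_SAVING = 0.10            # must save at least 10% of the value size

def maybe_compress(value):
    if len(value) <= COMPRESSION_THRESHOLD:
        return value, False
    compressed = zlib.compress(value)   # the real format uses LZ4
    if len(compressed) <= len(value) * (1 - MIN_SAVING):
        return compressed, True
    return value, False                 # not worth it, emit as is

data, compressed = maybe_compress(b"a" * 1000)
print(compressed, len(data))            # True, and far fewer than 1000 bytes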

So overall, we take the document, store the property names only once, arrange things so accessing any property is very cheap, and compress large values. We get some space saving out of that (more on this in my next post), but while that is nice, and we certainly paid attention to it, it isn't our main goal.

Let us see what sequence of operations we'll need to access the Office.City property, shall we? (A sketch in code follows the list.)

  1. O(log 4) search in the properties of the root object:
    • This search is done on the four property ids that are stored in the root object metadata, sorted so we can search them.
    • The actual property names are stored elsewhere, and we have to go through an array specifying their offsets to get to the actual values.
  2. The search results in the offset of the metadata of the Office object.
  3. O(log 3) search in the properties of the Office object.
  4. The search results in the offset of the City property.
  5. We read the City property and do something with it.
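Put together, the whole lookup is a couple of binary searches and pointer jumps, never a full parse. A rough sketch of its shape, reusing find_property from above (doc and its helpers here are hypothetical stand-ins for the reader API, not the real one):

OBJECT = 1  # hypothetical type tag for a nested object

def get_path(doc, names, path):
    entries = doc.root_entries                       # root object metadata
    for part in path:
        entry = find_property(entries, names, part)  # O(log N) per level
        if entry is None:
            return None
        offset, _prop_id, entry_type = entry
        if entry_type == OBJECT:
            entries = doc.entries_at(offset)         # descend into the object
        else:
            return doc.read_value_at(offset)         # leaf value, e.g. "Hadera"

# city = get_path(doc, names, ["Office", "City"])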

Compare that to "parse the entire document and then access the relevant value", or "parse the entire document in a streaming fashion and extract the value you need". There is also another issue, which I almost forgot to mention, but which is quite important.

The blittable format (and please help us find another name) actually writes all the data into unmanaged memory. This means that it is entirely invisible to the GC, which means that we won't have to wait for the GC to clean it all up. Instead, we can allocate a pool of unmanaged memory and just use that. It makes it drastically simpler to manage memory usage and growth in our systems.

Because we use memory mapped files for our storage, actually moving around in the document using this format is trivial, and quite fast. But I'm getting ahead of myself. Benchmarks and actual results are for the next post…

More posts in "The importance of a data format" series:

  1. (25 Jan 2016) Part VII – Final benchmarks
  2. (15 Jan 2016) Part VI – When two orders of magnitude aren't enough
  3. (13 Jan 2016) Part V – The end result
  4. (12 Jan 2016) Part IV – Benchmarking the solution
  5. (11 Jan 2016) Part III – The solution
  6. (08 Jan 2016) Part II – The environment matters
  7. (07 Jan 2016) Part I – Current state problems