The importance of a data format: Part IV – Benchmarking the solution


Based on my previous posts, we have written an implementation of this blittable format. And after making sure that it is correct, it was time to actually test it out on real world data and see what we came up with.

We got a whole bunch of relatively large real-world data sets and ran them through a comparison of JSON.Net vs. the Blittable format. Note that for reading the JSON, we still use JSON.Net, but instead of generating a JObject instance, we generate our own format.
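To make that concrete, here is a minimal sketch of the two parsing paths, assuming the standard JSON.Net APIs (JsonTextReader, JObject.Load). The blittable builder's actual API isn't shown in this post, so the token callback below is just a stand-in for it.

```csharp
// Both paths consume the same JsonTextReader token stream; they only differ in what
// they build from it. The writeToken callback is a hypothetical stand-in for the
// blittable builder.
using System;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

static class ParsePaths
{
    // JSON.Net path: materialize the whole document as a managed JObject tree.
    public static JObject ParseAsJObject(TextReader input)
    {
        using (var reader = new JsonTextReader(input))
            return JObject.Load(reader);
    }

    // Blittable path (sketch): walk the same token stream, but hand every token to a
    // builder that writes the binary representation instead of allocating objects.
    public static void ParseAsBlittable(TextReader input, Action<JsonToken, object> writeToken)
    {
        using (var reader = new JsonTextReader(input))
        {
            while (reader.Read())
                writeToken(reader.TokenType, reader.Value);
        }
    }
}
```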

Here are the results of running this on a set of JSON files.

[Image: parse time comparison, JSON.Net vs. Blittable, for the JSON file set]

As you can see, there doesn't appear to be any major benefit according to those numbers. In multiple runs of those files, the numbers are roughly the same, with a difference of a few ms either way. So it is a wash, right?

Except… that those aren't the complete truth. Let us look at some other aspects of the problem. How about the persisted size?

[Image: persisted size comparison, JSON vs. Blittable]

Here we see a major difference. The Blit format shows a rather significant reduction in space in most cases. In fact, in some cases the result is 50% – 60% of the original JSON size. And remember, those numbers are in kilobytes.

In other words, that orders document that was cut by half represents 10KB that we won't have to write to or read from the disk. If we need to read the last 50 orders, that is 0.5MB that won't have to travel anywhere, because it just isn't there. And what about the blog post? That one was reduced by almost half as well, and we are looking at almost 200KB that aren't there. When we start accounting for all of the I/O that was saved, the computation costs above more than justify themselves.

A lot of that space saving comes from compressing long values, but a significant amount comes from not having to repeat property names (in particular, that is where most of the space saving in the comments document came from).
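The property-name saving boils down to storing each distinct name once and referencing it by a small index everywhere else. The sketch below only illustrates that idea; it is not the actual table layout the format uses.

```csharp
// A sketch of property-name deduplication: each distinct name is stored once and
// every occurrence afterwards costs only an integer index, which is why documents
// full of repeated property names (like the comments one) shrink so much.
using System.Collections.Generic;

class PropertyNameTable
{
    private readonly List<string> _names = new List<string>();
    private readonly Dictionary<string, int> _index = new Dictionary<string, int>();

    // Returns the index for a property name, adding it on first sight.
    public int GetOrAdd(string name)
    {
        if (_index.TryGetValue(name, out var id))
            return id;
        id = _names.Count;
        _names.Add(name);
        _index[name] = id;
        return id;
    }

    public string NameAt(int id) => _names[id];
}
```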

However, there are a bunch of cases where the reverse is true. The logs.json file is an empty document, which can be represented in just 2 bytes in JSON but requires 11 bytes in the Blit format. That isn't an interesting case for us. The other problem child is the attributes document, which actually grew by about 7% after converting from JSON to the Blit format.

The reason is its internal structure. It is a flat list of about 2,000 properties with short string values. The fact that it is a single big object means that we have to use 32-bit offsets, so about 8KB are taken up just by the offsets. But we consider this a pathological and non-representative case. This type of document is typically accessed for specific properties, and here we shine, because we don't need to parse 120KB of text to get to a specific configuration value.
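To put numbers on that: with 32-bit offsets, roughly 2,000 properties cost 2,000 × 4 bytes, which is about 8KB of offsets alone. The sketch below shows the kind of per-object offset-width selection this implies; the exact thresholds are an assumption for illustration, not the format's actual rules.

```csharp
// Offset width is chosen per object, based on how far back its offsets must reach.
// The thresholds below are assumed for illustration only.
static class OffsetWidth
{
    public static int InBytes(long objectSizeInBytes)
    {
        if (objectSizeInBytes <= byte.MaxValue) return 1;   // tiny object: 1-byte offsets
        if (objectSizeInBytes <= ushort.MaxValue) return 2; // medium object: 2-byte offsets
        return 4;                                           // large object: 4-byte offsets
    }
}

// The attributes document is one big object, so it pays for 4-byte offsets:
// ~2,000 properties * 4 bytes per offset = ~8,000 bytes, i.e. roughly 8KB of offsets.
```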

Now, let us see what it costs to write the documents back out in JSON format. Remember, the Blit format (and someone please find me a good alternative for this name, I am really getting tired of it) is internal only. Externally, we consume and output JSON strings.

[Image: time to write the documents back out as JSON (small documents)]

Well, there is a real problem here: because we are dealing with small documents, the timings are too short to really tell us anything.

So let us see what happens with larger ones… I selected documents in the 2MB to 70MB range, and here are the results:

[Image: parse times for the 2MB to 70MB documents]

There is one outlier, and that is the Enron dataset. This dataset is composed of a relatively large number of large fields (those are emails, effectively). Because the Blit format also compresses large fields, we spend more time on parsing and building the Blit format than we do when parsing JSON. Even in this case, however, we are only seeing a 17% increase in the time.
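That extra cost is the large-value compression pass. Here is a hedged sketch of the idea, using GZip and an arbitrary 128-byte threshold as stand-ins for whatever the real format actually does:

```csharp
// Sketch: long string values get an extra compression pass that short ones skip.
// The threshold and the use of GZip are assumptions for illustration; a real format
// would also need to record whether the stored bytes are compressed or raw.
using System.IO;
using System.IO.Compression;
using System.Text;

static class LargeValueCompression
{
    private const int CompressionThreshold = 128; // assumed cut-off, in bytes

    public static byte[] EncodeStringValue(string value)
    {
        var raw = Encoding.UTF8.GetBytes(value);
        if (raw.Length < CompressionThreshold)
            return raw; // short values are stored as-is

        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionLevel.Fastest, leaveOpen: true))
                gzip.Write(raw, 0, raw.Length);

            var compressed = output.ToArray();
            // Keep the compressed form only if it actually saved space.
            return compressed.Length < raw.Length ? compressed : raw;
        }
    }
}
```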

Remember that both the JSON and Blit formats actually read the data using the same JsonTextReader; the only difference is what they do with it. The JSON path creates a JObject instance, which represents the entire document that was read.

Blit will create the same thing, but it generates it in unmanaged memory. That turns out to have a pretty big impact on performance. We don't do any managed allocations, so the GC doesn't have to do any significant amount of work during the process. It also turns out that with large documents, the cost of inserting so many items into so many dictionaries is decidedly non-trivial. In contrast, storing the data in the Blit format only requires writing the data and incrementing a pointer, along with some bookkeeping.
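A minimal sketch of that pointer-bump style of writing, using Marshal.AllocHGlobal for the unmanaged buffer; buffer growth and the real format's bookkeeping are left out, since the point is only that nothing here is visible to the GC.

```csharp
// Sketch: one unmanaged buffer, data copied in at the current position, and the
// position advanced. No managed allocations, so the GC has nothing to track.
using System;
using System.Runtime.InteropServices;

unsafe class UnmanagedWriteBuffer : IDisposable
{
    private readonly byte* _buffer;
    private readonly int _capacity;
    private int _used;

    public UnmanagedWriteBuffer(int capacity)
    {
        _capacity = capacity;
        _buffer = (byte*)Marshal.AllocHGlobal(capacity);
    }

    // Write raw bytes at the current position and bump the pointer.
    public int Write(byte[] data)
    {
        if (_used + data.Length > _capacity)
            throw new InvalidOperationException("Buffer full (growth omitted in this sketch)");
        Marshal.Copy(data, 0, (IntPtr)(_buffer + _used), data.Length);
        var start = _used;
        _used += data.Length;
        return start; // offset of what was just written, for the offset tables
    }

    public void Dispose() => Marshal.FreeHGlobal((IntPtr)_buffer);
}
```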

As it turns out, that has some major ramifications. But that is only one side of the story; what about the final sizes?

[Image: final document sizes for the large documents]

Here the results are pretty unambiguous: the moment you get to real sizes (typically starting from a few dozen KB), you start to see some pretty major differences in the final document size. And now for the cost of writing those documents back out as JSON:

[Image: time to write the large documents back out as JSON]

Here, too, we are much faster, for several reasons. Profiling has shown us that quite a bit of time is spent making sure that the values we write are properly escaped. With the Blit format, our source data was already valid JSON, so we can skip that. In the graph above, we also don't return the exact same JSON. To be more precise, we return the same document, but the order of the fields is different (they are returned in lexical order, instead of the order in which they were defined).
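Going back to the escaping point for a moment, the shortcut looks roughly like this. WriteRawValue is a real JSON.Net API; the alreadyEscaped flag is an assumption about how the blittable reader would indicate that a value came straight out of valid JSON.

```csharp
// Sketch: values read from an already-valid JSON document can be emitted verbatim,
// skipping the character-by-character escaping pass that profiling flagged.
using Newtonsoft.Json;

static class JsonWriteBack
{
    public static void WriteStringValue(JsonWriter writer, string value, bool alreadyEscaped)
    {
        if (alreadyEscaped)
            writer.WriteRawValue("\"" + value + "\""); // already valid JSON text, write as-is
        else
            writer.WriteValue(value);                  // normal path: scan and escape
    }
}
```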

We also have a mode that keeps the same order of properties, but it comes with a cost. You can see that in some of those cases, the time to write the document out went up significantly.

[Image: write-back times when preserving property order]

Now, so far I have only dealt with the cost of parsing a single document. What about dealing with multiple small documents? The large datasets from before were effectively a lot of small documents that were grouped into a single large document. For example, companies.json contains just over 18,000 small documents, zips has 25,000, etc. Here are the perf results:

[Image: parse times for the multi-document datasets]

As you can see, here, too, we are doing much better.

This is because the Blit code is actually smart enough to extract the structure of the documents from the incoming stream, and is able to use that to do a whole bunch of optimizations.
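As a rough illustration of the kind of structure reuse involved, here is a sketch of recognizing that many small documents share the same shape (the same property names) and caching that shape instead of rediscovering it per document; what the real code then does with a cached shape is beyond this sketch.

```csharp
// Sketch: map a document's sequence of property names to a stable shape id, so that
// thousands of identically-shaped small documents can share one description.
using System.Collections.Generic;

class DocumentShapeCache
{
    private readonly Dictionary<string, int> _shapes = new Dictionary<string, int>();

    // Returns a stable id for this sequence of property names, creating one if needed.
    public int GetShapeId(IReadOnlyList<string> propertyNames)
    {
        var key = string.Join("\u0001", propertyNames); // cheap composite key for the sketch
        if (!_shapes.TryGetValue(key, out var id))
        {
            id = _shapes.Count;
            _shapes[key] = id;
        }
        return id;
    }
}
```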

Overall, I'm pretty happy with how this turned out. Next challenge: plugging this into the heart of RavenDB…

More posts in "The importance of a data format" series:

  1. (25 Jan 2016) Part VII–Final benchmarks
  2. (15 Jan 2016) Part VI – When two orders of magnitude aren't enough
  3. (13 Jan 2016) Part V – The end result
  4. (12 Jan 2016) Part IV – Benchmarking the solution
  5. (11 Jan 2016) Part III – The solution
  6. (08 Jan 2016) Part II–The environment matters
  7. (07 Jan 2016) Part I – Current state problems