Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 800 words

After quite a bit of work, we have finished (at least for now) the Blittable format. As you can probably guess from the previous posts, we have decided to also implement small string compression in the Blittable format, based on the Smaz algorithm.

We also noticed that there are two distinct use cases for the Blittable format. The first is when we actually want to end up with the document on disk. In this case, we would like to trade CPU time for I/O time (especially since usually we write once, but read a lot, so we'll keep paying that I/O cost). But there are also many cases in which we read JSON from the network jus to do something with it. For example, if we are going to read an index definition (which is sent as a JSON document), there is no need to compress it, we'll just have to immediately uncompress it again.

So we added the option to choose what mode you'll use the data. Here are the benchmarks for speed and size for Json, Blittable (for disk) and Blittable (for memory)

image image

As expected, we see a major difference in performance between Blit to disk and Blit to mem. Blit to mem is comparable or better from Json in nearly all scenarios, while Blit to disk is significantly smaller than Json while having comparable or better performance at parsing time, even though it is compressing a lot of the data.

For large documents, this is even more impressive:

image image

Blit to mem is 40% – 75% of the speed of Json at parsing, while Blit to disk can be much more expensive (us to x2.5 times more than Json), depending on how compressible the data is. On the other hand, Blit to mem is already smaller than Json in 33% – 20% for most datasets, but Blit to disk improves on that significantly, often by as much as 60% of the original JSON. In the companies.json case, the original file size is 67.25 MB, using Blit to mem we have a 45.29 MB file, and using Blit to disk gives us a 40.97MB.

This is a huge reduction in size (compared to appreciable fraction of a second on most I/O systems).

So far, we look at the big boys, but what happens if we try this on a large number of small documents? For example, as would commonly happen during indexing?

In this case, we see the following:

image

Note that Blit to disk is very expensive in this mode, this is because we actually do a lot. See the next chart:

image

Blit to disk is actually compressing the data by close to 20MB, and it is 5 MB smaller than Blit to memory.

The question on what exactly is the difference between Blit to memory and Blit to disk came up in our internal mailing list. The actual format is identical, there are currently only two differences:

  • Blit to disk will try to compress all strings as much as possible, to reduce I/O. This is often at the cost of parsing time. Blit to mem just store the strings as is, which is much faster, obviously. It takes more memory, but we generally have a lot of that available, and when using Blit to memory, we typically work with small documents, anyway.
  • Blit to disk also does additional validations (for example, that a double value is a valid double), while Blit to memory skip those. The assumption is that if the double value isn’t valid, it will fail when you actually access the double’s value, whereas we want to have such issues happen when we persist the data if we are actually going to access it later when using Blit to disk.

Overall, a pretty good job all around.

time to read 2 min | 396 words

So our current version can index 18,000 documents in 22 ms vs. the JSON approach which takes 1.8 seconds. That is an amazing improvement, but I got to tell you, I got a good feeling about it.

Without running a profiler, I'm guessing that a major part of the 22 ms cost is going to be comparisons during a binary search process. We store the property names as a sorted array, and when you need to get a property by name, we need to do a binary search in the properties name to match it.

Our test case get 3 properties from a total of 18,801 documents. We do this 30 times, to reduce the chance of jitter getting in the way. Here are the profiler results:

image

As you can see, the cost is indeed where I expected it to be. Basically, the ReadNumber and GetComparerFor are the expressions of the number of binary searches.

But here we can take advantage on the fact that we know that most of those documents are going to have the same structure. In other words, once we have found the position of  a particular property, there is a strong chance that the next time we'll look a property with the same name, we'll find it in the same position. Caching that and using this as the initial position for doing the binary search means that most of the time, we hit the right property right on the nose, and in the worst case, we have a single extra comparison to make.

This single change gives us a really nice impact:

image

That is about 40% improvement in speed, from this single change.

We have been following similar patterns throughout the development of this feature. Write the feature, make sure it passes all its tests, then profiler it. Usually that indicate a problem that we need to solve. Either by optimizing code, or by changing the way we do things in the code. Sometimes radically.  But that is a topic for the next blog post.

time to read 2 min | 226 words

So far I have written about the problem we had, the requirement for the solution, then did a deep dive into the actual implementation and finally I talked about the improvement in performance, which was a nice double digits improvements in percentage. There are a couple of tricks there that I still want to talk about, but that is pretty much it.

Except that this actually misses the entire point of this exercise. What the blittable format gives us is immediate access to any property without the need to parse the whole thing. How important is that for us?

Well, for one specific scenario, that is actually quite important. Let us imagine that we have the following RavenDB index:

from c in docs.Companies
select new
{
	c.Name,
	c.Overview
}

Here are the results, counting purely the time to load about 18,000 documents and get the relevant data:

image

I removed all I/O from the benchmark, and I'm testing only the cost of loading documents already saved in RavenDB and getting those properties from them.

And that is what I'm talking about Smile.

time to read 7 min | 1257 words

Based on my previous posts, we have written an implementation of this blittable format. And after making sure that it is correct, it was time to actually test it out on real world data and see what we came up with.

We got a whole bunch of relatively large real world data set and run them through a comparison of JSON.Net vs. the Blittable format. Note that for reading the JSON, we still use JSON.Net, but instead of generate a JObject instance, we generate our own format.

Here are the results of running this on a set of JSON files.

image

As you can see, there doesn't appear to be any major benefit according to those numbers. In multiple runs of those files, the numbers are roughly the same, with a difference of a few ms either way. So it is a wash, right?

Except… that those aren't the complete truth. Let us look at some other aspects of the problem. How about the persisted size?

image

Here we see a major difference. The Blit format shows a rather significant reduction in space in most cases. In fact, in some cases this is in the 50% – 60% of the original JSON size.And remember, this is in kilobytes.

In other words, that orders document that was cut by half, that is a 10KB that we won't have to write to the disk or read from the disk. If we need to read the last 50 orders, that is 0.5 MB that won't have to travel anywhere, because it just isn't there. And what about the blog post? That one was reduced by almost half, and we are looking at almost 200KB that aren't there. When we start accounting for all of this I/O that was saved, the computation costs above more than justify themselves.

A lot of that space saving is done by compressing long values, but a significant amount of that comes from not having to repeat property names (in particular, that is where most of the space saving in the comments document came from).

However, there are a bunch of cases where the reverse is true. the logs.json is an empty document. Which can represented in just 2 bytes in JSON, and requires 11 bytes in the Blit format. That isn't an interesting case for us. The other problem child is the attributes document. This document actually grew by about 7% after converting from JSON to Blit format.

The reason is its internal structure. It is a flat list of about 2,00 properties and short string values. The fact that it is a single big object means that we have to use 32 bits offsets, which means that about 8Kb are actually taken just by the offsets. But we consider this a pathological and non representative case. This type of document is typically accessed for specific properties, and here we shine, because we don't need to parse 120Kb of text to get to a specific configuration value.

Now, let us see what it cost to write the documents back in JSON format. Remember the Blit format (and someone please find me a good alternative for this name, I really get tired of it) is internal only. Externally, we consume and output JSON strings.

image

Well, there is a real problem here, because we are dealing with small documents, the timing is too short to really tell.

So let us try what happens with larger ones… I selected documents in the 2MB to 70MB range. And here are the results:

image

There is one outlier, and that is the Enron dataset. This dataset is composed of relatively large number of large fields (those are emails, effectively). Because the Blit format also compresses large fields, we spent more time on parsing and building the Blit format than we do when parsing JSON. Even in this case, however, we are only seeing 17% increase in the time.

Remember that both JSON and Blit format are actually reading the data using the same JsonTextReader, the only difference is what they do about it. JSON creates a JObject instance, which represent the entire document read.

Blit will create the same thing, but it will actually generate that in unmanaged memory. That turn out to have a pretty big impact on performance. We don't do any managed allocations, so the GC doesn't have to do any major amount of work during the process. It also turns out that with large documents, the cost of inserting so many items to so many dictionaries is decidedly non trivial. In contrast, storing the data in Blit format requires writing the data and incrementing a pointer, along with some book keeping data.

As it turns out, that has some major ramifications. But that is one side of that, what about looking at the final sizes?

image

Here we are pretty unambiguous, the moment you get to real size (typically starting from a few dozen KB) you start to see some pretty major difference in the final document size.  And now for the cost of writing those documents again as JSON:

image

Here, too, we are much faster. This has several reasons. Profiling has shown us that quite a bit of time is spent in making sure that the values that we write are properly escaped. With the Blit format, our source data was already valid JSON, so we can skip that. In the graph above, we also don't return the same exact JSON. To be rather more exact, we return the same document, but the order of the fields is different (they are returned in lexical order, instead of the order in which they were defined).

We also have a mode that keep the same order of properties, but it comes with a cost. You can see that in some of those cases, the time to write the document out went up significantly.

image

Now, so far I have only dealt with the cost of parsing a single document. When dealing with multiple small documents? The large datasets from before was effectively a lot of small documents that were group into a single large document. For example, companies.json contains just over 18,000 small documents, and zips has 25,000, etc. Here are the perf results:

image 

As you can see, here, too, we are doing much better.

This is because the Blit code is actually smart enough to extract the structure of the documents from the incoming stream, and is able to use that to do a whole bunch of optimizations.

Overall, I'm pretty happy with how this turned out to be. Next challenge, plugging this into the heart of RavenDB…

time to read 9 min | 1683 words

In my previous two posts I explained the problem we run into, and discuss the requirement of the solution. To quickly reiterate, we don't want to use JSON as the internal storage format in RavenDB because it requires parsing whenever we need to do anything to the document. We want to have a format that is ready to be used without any parsing.

In order to do that, we split the behavior into two distinct responsibilities. We'll have writer that will take JSON as an input and produce a block of memory that represent this JSON, and we have a reader that can read from this block of memory and if needed restore the original JSON.

For now, I'm going to focus primarily on the write side of things. Because we want to reduce the amount of allocation and retained memory, we cannot use any format that requires reading the whole document before writing it. So we need to use a forward only mode for writing. This plats nice with memory allocation strategies, because we don't have the need to move data once we written it. In other words, whenever we read a value from the JSON stream, we need to write it to our own format. That doesn't seems to be so useful, until you realize that we can delay some decisions.

It will probably be easier if I we used a real example:

{
   "Name":"Oren",
   "Dogs":[
      "Arava",
      "Oscar",
      "Phoebe"
   ],
   "Age":34,
   "Office":{
      "Name":"Hibernating Rhinos",
      "Street":"Hanashi 21",
      "City":"Hadera"
   }
}

This document shows quite a lot of features from JSON, so it is very suitable for demonstration. We read a token at a time from the document, and write it in the new format.

Let us look at how we'll deal with the Dogs array value (just this: ["Arava","Oscar","Phoebe"] ). Loosely speaking, this will look like this:

image

And here it is with explanations:

  • Len 5 – Arava
  • Len 5 – Oscar
  • Len 6 – Phoebe
  • Array size – 3
  • Offsets from metadata start – [19,13,7]
  • Types – [5,5,5]

First we have the values, as length prefixed strings, then we have the number of elements in the array, and then we have a run of offsets. So far, it is obvious, but what are the offsets? They indicate how far back (from the array metadata start position), we need to go to get to a particular value. In other words, in order to read this array we actually need to get a pointer to the array metadata start, and from there we can tell how many elements we have in the array, and then we can just directly to the relevant value.

For example, going to the second element will mean going 13 bytes backward from the metadata beginning, which will give us a byte whose value is 5 (length of the string) and then the actual string bytes.

The array after the offset contains the types of each value. In this case, this is an array of 3 strings, so we have [5,5,5]. Splitting the offsets and the types in this manner means that we have better alignment for the values, and it is simpler to deal with in general. By the way, type 5 is a string.

Why are we using backward references in this manner? Why not hold the absolute position in the document? That would certainly be much simpler, no? We use offsets because a large JSON document is typically composed of many small objects and arrays. By using an offset from the metadata start, we can ensure that most of the time this data will be small enough to fit in a single byte. If we can do, we are able to save quite a significant amount of memory. Of course, we also handle larger objects, including those that need offsets spanning the full 32 bits range. We have scoped the decision of the offset size to the level of a single array / object, because even in big documents, we have a lot of places where byte or short offset would be more than good enough.

One thing to note about this, in JSON format, the array takes 27 bytes, in this format, it takes 26. We saved a whole whopping byte, cue the party noise.

But that isn't really interesting yet, let us see how we deal with objects, in particular, the Office value (just this: {"Name":"Hibernating Rhinos","Street":"Hanashi 21","City":"Hadera"} ). This looks like the following:

image

Again, changing to textual format gives:

  • Len 18 – Hibernating Rhinos
  • Len 10 – Hanashi 21
  • Len 6 – Hadera
  • Properties count – 3
    • Offset – 7, PropertyId – 5, Type – 5
    • Offset –37, PropertyId – 0, Type – 5
    • Offset – 18, PropertyId – 4, Type – 5

(To the sharp eyed among you, the last byte (00), is not really there, and shouldn't be accounted for, it is an artifact of how I printed those values.)

In JSON, this object is 67 bytes long, in this format, it is merely 48 bytes. Yep, maidens swoon when they hear about how much space this saved.

But we are still missing things. To start with, where are the actual property names, and what are those property ids that we have been seeing?

One piece of data that we aren't writing as we read the source document are the properties. Indeed, so far, the biggest saving we have seen is from not writing the properties. So the question is, where are they? The answer is simple, we wait until we finish with the whole document before we write the property names. Instead of writing a property name, we write only the property id. A property id is just the index of the property into the properties we have seen so far.

In other words, we have a small benefit from not having to repeat property names. In the document above, we saved the need to write "Name" twice, because of this feature. At the end of the document, we write the property names just like we write values, but because we need to reference them by id, we also write their offsets into an array.

Finally, we store the root metadata offset, the offset of the property names offset and the size of offsets that we'll encounter in this document, and we are done.

Almost, there is another subtle issue to deal with. Look at the properties of the Office object that we have above. Notice that they don't look like they are stored in order of arrival. Instead, the offset into the properties of the object are actually stored sorted based on the value of the property name.

In other words, given that I want to access a particular property in an object, what I'll do is run a binary search over the properties of this object. I'll lookup the property for each property id and compare it to the property name I want to find. So searching a property in O(logN), where N is the number of properties in a particular object.

We have taken things a bit further, and when we encounter a large string value. Currently, above 128 bytes, we'll also try to compress it. We're using LZ4 to compress the data, and for large values, it generally represent quite a nice saving. But if the savings isn't good enough (doesn't save at least 10% of the value size), we'll just emit the value as is.

So overall we take the document, only store the property names once, arrange things so accessing any property would be very cheap, and compress large values. We get some space saving out of that (more on this in my next post), but while that is nice, and we certainly paid attention to this, it isn't our main goal.

Let us see what sequence of operation we'll need to access the Office.City property, shall we?

  1. O(log 4) search in the properties of the root object:
    • This search is done on the four properties id that are stored in the root object metadata, sorted so we can search them.
    • The actual property names are stored elsewhere, and we have to go through an array specifying their offsets to get to the actual values.
  2. The search result in the offset of the metadata of the Office object.
  3. O(log 3) search in the properties of the Office object.
  4. The search result in the offset of the City property.
  5. We read the city property and do something with it.

Compare that to "parse the entire document" and then access he relevant value or just parse in a streaming fashion the entire document and extract the value you need. There is also another issue, which I almost forgot to mention, but is quite important.

The blittable format (and please help us find another name) is actually writing all the data into unmanaged memory. This means that it is entirely invisible to the GC, which means that we won't have to wait for the GC to clean it all up. Instead, we can allocae a pool of unmanaged memory and just use that. It makes it drastically simpler to manage how we are actually managing memory usage and growth in our systems.

Because we use memory mapped files for our storage, actually moving around in the document using this format is trivial, and quite fast. But I'm getting ahead of myself. Benchmarks and actual results are for the next post…

time to read 4 min | 656 words

When designing a new data format, it is important to remember in what environment we’ll operate in, what are the requirements and what type of scenarios we’ll face.

With RavenDB, we are talking about the internal storage format, so it isn’t something that is available externally. That means that we don’t have to worry about interchange with anything, that frees up the design by quite a bit. We want to reduce parsing costs, we want to reduce size on disk and we want to reduce managed allocations.

That leads us to a packed binary format, but not all binary formats are born equal. In particular, we need to consider whatever we’ll have a streaming format or a complete format. What is the meaning of that?

A streaming format means that you read it one byte at a time to construct the full details. JSON is a streaming format, for example. That is not something that we want to do, because a streaming format requires us to have an in memory representation to deal with the object. And even if we wanted a known particular value from the document, we would still need to parse through all the document to get all the relevant fields.

So we want a complete format. A complete format means that we don’t need to parse the document to get to a particular value.  Internally, we refer to such a format as Blittable. I’m not fond of this term, and I would appreciate suggestions to replace it.

I’ll get to the details about how this is actually laid out in my next post, in this post, I want to outline the overall design for it.

We want a format that can be read in a single call (or, more commonly for us, mmap in its entirety), and once that is done, we can start working with it without additional work. Traversing through this format should be a cheap operation, so this code:

foreach(var child in doc.children)
{ Console.WriteLine(child.firstName); }

Should only materialize the strings for the children’s names (which we accessed), but will have no further costs regarding the rest of the document.

Because we assume that the full document will reside in memory (either by loading it all from disk or by mmaping it), we don’t need to worry about costs of traversing through the document. We can simply and cheaply jump around inside the document.

In other words, we don’t have to put related information close, if we have reason to place it elsewhere. In order to reduce memory consumption during the write phase, we need to make sure that we are mostly forward only writers. That is, the process of writing the document in the new format should not require us to hold the entire document in memory. We should also take the time to reduce the size of the document as much as possible. At the same time, just compressing the whole thing isn’t going to be good for us, we’ll lose the ability to just go to any location on the document cheaply.

Note that for the purpose of this work, we are interested in reducing work only for a single document. There are additional optimizations that we can apply across multiple documents, but they are complex to manage in a dynamic system.

So this is the setup, the previous post talked about the problem we have with JSON, and this one about what kind of a solution we want to have. Next post will discuss the actual format.

time to read 3 min | 459 words

JSON is a really simple format. It make it very easy to work with it, interchange it, read it, etc. Here is the full JSON format definition:

  • object = {} | { members }
  • members = pair | pair , members
  • pair = string : value
  • array = [] | [ elements ]
  • elements = value | value , elements
  • value = string | number | object | array | true | false | null

So far, so good. But JSON also has a few major issues. In particular, JSON require that you’ll read and parse the entire document (at least until the part you actually care about) before you can do something with it. Reading JSON documents into memory and actually working with them means loading and parsing the whole thing, and typically require the use of dictionaries to get fast access to the data. Let us look at this typical document:

{
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "children": [{"firstName": "Alice"}]
}

How would this look in memory after parsing?

  • Dictionary (root)
    • firstName –> John
    • lastName –> Smith
    • address –> Dictionary
      • state –> NY
      • postalCode –> 10021-3100
    • children –> array
      • [0] –> Dictionary
        • firstName –> Alice

So that is three dictionaries and an array (even assuming we ignore all the strings). Using Netwonsoft.Json, the above document takes 3,840 bytes in managed memory (measured using objsize in WinDBG). The size of the document is 126 bytes as text. The reason for the different sizes is dictionaries. Here is 320 bytes allocation:

new Dictionary<string,Object>{ {“test”, “tube”} };

And as you can see, this adds up fast. For a database that mostly deals with JSON data, this is a pretty important factor. Controlling memory is a very important aspect of the work of a database. And the JSON is really inefficient in this regard. For example, imagine that we want to index documents by the names of the children. That is going to force us to parse the entire document, incurring a high penalty in both CPU and memory. We need a better internal format for the data.

In my next post, I’ll go into details on this format and what constraints we are working under.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}