Ayende @ Rahien

Refunds available at head office

RavenDB: Let us write our own JSON Parser, NOT

One accusation that has been leveled at me often is that I keep writing my own implementation of Xyz (where Xyz is just about anything). The main problem is that I can go overboard with that, but for the most part, I think that I have managed to strike the right balance. Wherever possible, I reuse existing code, but when I run into problems that are easier to solve by creating my own solution, I go with that.

A case in point is the JSON parser inside RavenDB. From the get go, I used Newtonsoft.Json.dll. There wasn’t much to think of, this is the default implementation from my point of view. And indeed, it has been an extremely fine choice. It is a rich library, it is available for .NET 3.5, 4.0 & Silverlight and it meant that I had opened up a lot of extensibility for RavenDB users.

Overall, I am very happy. Except… there was just one problem: with large JSON documents, the library showed some performance issues. In particular, a 3 MB JSON file took almost half a second to parse. That was… annoying. Admittedly, most documents tend to be smaller than that, but it also reflected on overall performance when batching, querying, etc. When you are querying, you are also building large JSON documents (a single document that contains a list of results, for example), so that problem was quite pervasive for us.

I set out to profile things, and discovered that the actual cost wasn’t in the JSON parsing itself, that part was quite efficient. The costly part was actually in building the JSON DOM (JObject, JArray, etc). When people usually think about JSON serialization performance, they generally think about the perf from and to .NET objects. The overriding cost in that sort of serialization is actually how fast you can call the setters on the objects. Indeed, when looking at perf metrics on the subject, most of the comparisons were concentrated on that aspect almost exclusively.

That makes sense, since for the most part, that is how people use it. But for RavenDB, we are using the JSON DOM for pretty much everything. This is how we are representing a document, after all, and that idea is pretty central to a document database.

Before setting out to write our own, I looked at other options.

ServiceStack.Json - that one was problematic for three reasons:

  • It wasn’t really nearly as rich in terms of API and functionality.
  • It was focused purely on reading to and from .NET objects, with no JSON DOM supported.
  • The only input format it had was a string.

The last one deserves a bit of explanation. We cannot afford to use a JSON implementation that accepts a string as input, because that JSON object we are reading may be arbitrarily large. Using a string means that we have to allocate all of that information up front. Using a stream, however, means that we can allocate far less information and reduce our overall memory consumption.
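To make the string vs. stream point concrete, here is a minimal sketch of processing JSON from a TextReader in fixed-size chunks. JsonDepthScanner and MaxDepth are hypothetical names invented for this illustration; they are not part of RavenDB, Newtonsoft.Json or ServiceStack, and this is not a full parser. It only shows that the working memory is the size of the buffer, not the size of the document.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not RavenDB or Json.NET code.
public static class JsonDepthScanner
{
    // Returns the deepest object/array nesting level found in the input.
    // The input is read in fixed-size chunks, so the whole document is
    // never materialized as a single large string.
    public static int MaxDepth(TextReader reader, int bufferSize = 4096)
    {
        var buffer = new char[bufferSize];
        int depth = 0, maxDepth = 0;
        bool inString = false, escaped = false;
        int read;

        // The state above lives outside the loop, so a string literal or
        // escape sequence that straddles two chunks is still handled.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (inString)
                {
                    if (escaped) escaped = false;          // character after a backslash
                    else if (c == '\\') escaped = true;    // start of an escape sequence
                    else if (c == '"') inString = false;   // end of the string literal
                    continue;                              // braces inside strings do not count
                }

                switch (c)
                {
                    case '"':
                        inString = true;
                        break;
                    case '{':
                    case '[':
                        depth++;
                        if (depth > maxDepth) maxDepth = depth;
                        break;
                    case '}':
                    case ']':
                        depth--;
                        break;
                }
            }
        }
        return maxDepth;
    }
}
```

In real code the reader would typically wrap a network or file stream (for example a StreamReader), and only the small char buffer is allocated regardless of how large the document is.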

System.Json – that one had one major problem:

  • Only available on Silverlight, FAIL!

I literally didn’t even bother to try anything else with it. The other options we looked at had those issues or similar ones as well; mostly, the problem was that no JSON DOM was available.

That sucked. I was seriously not looking forward to writing my own JSON Parser, especially since I would have to add all the bells & whistles of Newtonsoft.Json. :-(

Wait, I can hear you say, the project is open source, why not just fix the performance problem? Well, we have looked into that as well.

The actual problem is pretty much at the core of how the JSON DOM is implemented in the library. All of the JSON DOM types are basically linked lists, and all operations on the DOM are O(N). With large documents, that really starts to hurt. We looked into what it would take to modify that, but it turned out that it would have to be either a breaking change (which pretty much killed the notion that it would be accepted by the project) or a very expensive change. That is especially true since the JSON DOM is quite rich in functionality (from dynamic support to INotifyPropertyChanged to serialization to… well, you get the point).

Then I thought about something else: can we create our own JSON DOM, but rely on Newtonsoft.Json to fill it up for us? As it turned out, we could! So we basically took the existing JSON DOM and stripped it of everything that we weren’t using. Then we changed the linked list support to a List and Dictionary, wrote a few adapters (RavenJTokenReader, etc) and we were off to the races. We were able to utilize quite a large fraction of the things that Newtonsoft.Json already did, we resolved the performance problem and didn’t have to implement nearly as much as I feared we would.
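The difference between a linked-list-backed object and a Dictionary-backed one can be sketched roughly like this. LinkedProperties and HashedProperties are hypothetical stand-ins written for this illustration; they are not the actual Json.NET or Raven.Json types, and the Comparisons counter exists only to make the cost visible.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a linked-list-backed JSON object (illustration only).
public sealed class LinkedProperties
{
    private readonly LinkedList<KeyValuePair<string, object>> items =
        new LinkedList<KeyValuePair<string, object>>();

    // Counts key comparisons so the O(N) cost of a lookup can be observed.
    public int Comparisons { get; private set; }

    public void Add(string key, object value)
    {
        items.AddLast(new KeyValuePair<string, object>(key, value));
    }

    public bool TryGetValue(string key, out object value)
    {
        // Walks the list from the head: the last property costs N comparisons.
        foreach (var pair in items)
        {
            Comparisons++;
            if (pair.Key == key)
            {
                value = pair.Value;
                return true;
            }
        }
        value = null;
        return false;
    }
}

// Hypothetical stand-in for a Dictionary-backed JSON object (illustration only).
public sealed class HashedProperties
{
    private readonly Dictionary<string, object> items = new Dictionary<string, object>();

    public void Add(string key, object value)
    {
        items.Add(key, value);
    }

    public bool TryGetValue(string key, out object value)
    {
        // Hash lookup: expected constant time regardless of property count.
        return items.TryGetValue(key, out value);
    }
}
```

With 1,000 properties, looking up the last one in LinkedProperties performs 1,000 key comparisons, while HashedProperties does a single hash lookup. Repeat that for every property access on a large document and the difference adds up quickly.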

Phew!

Now, let us look at the actual performance results. This is using a 3 MB JSON file:

  • Newtonsoft Json.NET - Reading took 413 ms
  • Using Raven.Json - Reading took 140 ms

That is quite an improvement, even if I say so myself :-)

The next stage was actually quite interesting, because it was unique to how we are using the JSON DOM in RavenDB. In order to save the parsing cost (which, even when optimized, is still significant), we are caching the parsed DOM in memory. The problem with caching mutable information is that you have to return a clone of the information, and not the actual information (because then it would be mutated by the caller, corrupting the cached copy).

Newtonsoft.Json supports object cloning, which is excellent. Except for one problem. Cloning is also an O(N) operation. With Raven.Json, the cost is somewhat lower. But the main problem is that we still need to copy the entire large object.

In order to resolve this exact issue, we introduced a feature called snapshots to the mix. Any object can be turned into a snapshot. A snapshot is basically a read only version of the object, which we then wrap in another object that provides local mutability while preserving the immutable state of the parent object.

It is much easier to explain in code, actually:

public void Add(string key, RavenJToken value)
{
    // A snapshot is read only; changes must go to a child created from it.
    if (isSnapshot)
        throw new InvalidOperationException("Cannot modify a snapshot, this is probably a bug");

    // Same duplicate-key behavior as Dictionary.Add
    if (ContainsKey(key))
        throw new ArgumentException("An item with the same key has already been added: " + key);

    LocalChanges[key] = value; // we can't use Add, because LocalChanges may contain a DeletedMarker
}

public bool TryGetValue(string key, out RavenJToken value)
{
    value = null;
    RavenJToken unsafeVal;

    // Local changes take precedence over whatever the parent snapshot holds.
    if (LocalChanges != null && LocalChanges.TryGetValue(key, out unsafeVal))
    {
        // A DeletedMarker means the key was removed locally, even if the parent still has it.
        if (unsafeVal == DeletedMarker)
            return false;

        value = unsafeVal;
        return true;
    }

    // Otherwise, fall back to the read only parent snapshot (if there is one).
    if (parentSnapshot == null || !parentSnapshot.TryGetValue(key, out unsafeVal) || unsafeVal == DeletedMarker)
        return false;

    value = unsafeVal;

    return true;
}

If the value is on the local changes, we use that, otherwise if the value is in the parent snapshot, we use that. We have the notion of local deletes, but that is about it. All changes happen to the LocalChanges.

What this means, in turn, is that for caching scenarios, we can very easily and effectively create a cheap copy of the item without having to copy all of the items. Whereas cloning the 3 MB JSON object in Newtonsoft.Json can take over 100 ms, we can create a snapshot (it involves a clone, so the first time it is actually expensive, around the same cost as in Newtonsoft.Json) and from the moment we have a snapshot, we can generate children for the snapshot at virtually no cost.
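Here is a self-contained sketch of how the snapshot idea fits together from the caller's side. SnapshotDictionary, Set, Remove, EnsureSnapshot and CreateSnapshot are names chosen for this illustration and are assumptions; the real Raven.Json types work with RavenJToken values and have a richer API. The sketch also skips the initial clone mentioned above and simply freezes the object in place.

```csharp
using System;
using System.Collections.Generic;

// Simplified, hypothetical model of the snapshot idea (illustration only).
public sealed class SnapshotDictionary
{
    // Sentinel stored locally to hide a key that still exists in the parent.
    private static readonly object DeletedMarker = new object();

    private readonly SnapshotDictionary parentSnapshot;
    private readonly Dictionary<string, object> localChanges = new Dictionary<string, object>();
    private bool isSnapshot;

    public SnapshotDictionary() { }

    private SnapshotDictionary(SnapshotDictionary parent)
    {
        parentSnapshot = parent;
    }

    public void Set(string key, object value)
    {
        EnsureMutable();
        localChanges[key] = value; // never touches the parent
    }

    public bool Remove(string key)
    {
        EnsureMutable();
        object ignored;
        if (!TryGetValue(key, out ignored))
            return false;

        if (parentSnapshot != null)
            localChanges[key] = DeletedMarker; // local delete, parent is unchanged
        else
            localChanges.Remove(key);
        return true;
    }

    public bool TryGetValue(string key, out object value)
    {
        object found;
        if (localChanges.TryGetValue(key, out found))
        {
            bool deleted = ReferenceEquals(found, DeletedMarker);
            value = deleted ? null : found;
            return !deleted;
        }

        if (parentSnapshot != null)
            return parentSnapshot.TryGetValue(key, out value);

        value = null;
        return false;
    }

    // Freezes this instance; from now on it can only be read.
    public void EnsureSnapshot()
    {
        isSnapshot = true;
    }

    // Creates a mutable child in O(1): nothing is copied from the parent.
    public SnapshotDictionary CreateSnapshot()
    {
        if (!isSnapshot)
            throw new InvalidOperationException("Call EnsureSnapshot() before creating children");
        return new SnapshotDictionary(this);
    }

    private void EnsureMutable()
    {
        if (isSnapshot)
            throw new InvalidOperationException("Cannot modify a snapshot, this is probably a bug");
    }
}
```

In a cache, the parsed document would be frozen once with EnsureSnapshot(), and every request would get its own CreateSnapshot() child. A child can Set or Remove keys freely; those changes land in its own local dictionary, so the cached parent never sees them, and handing out a new child costs one small allocation instead of a full clone.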

Overall, I am quite satisfied with it.

Oh, and in our test runs, for large documents, we got a 100% performance improvement from this single change.

Comments

04/24/2011 06:46 PM by mattmc3

The simplest way to draw a distinction between whether you write it yourself or use a 3rd party library is to ask yourself whether it's your organization's or your application's core competency. If it isn't, then you're wasting time and resources creating something that's probably available in a "good enough" form elsewhere. An example of this might be Mozilla's XUL... neat, but totally tangential to creating a nice browser. If it is within your app's core competency, then "good enough" might prove not to be, well, good enough. If your application stores documents as JSON like RavenDB does, you better believe that if you're successful enough you're going to end up highly customizing something or writing your own JSON parser entirely. It's the classic build vs. buy conversation, and it's always surprised me how often people get it wrong.

04/24/2011 10:28 PM by Rafiki

Great work! Will your DOM be available under the same license as Newtonsoft.Json (CC as far as I know)?

04/25/2011 06:54 AM by Ayende Rahien

Rafiki,

It is part of the RavenDB source code, and available under the same license as RavenDB.

04/25/2011 03:01 PM by Towa

You wrote: "and all operations on the DOM are O(N)".

But O(N) is actually pretty good. Did you mean O(N^2)?

04/25/2011 04:18 PM by Chris Wright

Towa, O(N) key lookup time is bad. That should be O(1) ideally, though O(log N) might also be acceptable.

I can't think of a data structure that would have O(N**2) lookup times.

04/30/2011 09:29 PM by Demis Bellot

Hi Ayende,

Good to see you're tackling the JSON in .NET problem - IMO the perf of JSON parsers in .NET was pretty lame (i.e. MS's JSON parser is slower than their XML one, etc).

Since JSON is an increasingly important serialization format it was frustrating to see the poor options we had shipped with the .NET framework.

Even today I'm still seeing tutorials recommending the use of JavaScriptSerializer which I've found over 100x slower than protobuf-net in my benchmarks ( http://www.servicestack.net/mythz_blog/?p=344) which I've had to exclude because it was infeasible to run benchmarks with a high N.

Anyway since you've evaluated ServiceStack.JsonSerializer I would like to provide my own feedback of the limitations you've listed:

  • It wasn’t really nearly as rich in terms of API and functionality.

Well it's a JSON C# POCO Serializer which can serialize basically any .NET collection, C# POCO types, anonymous and late-bound types, etc - basically any clean .NET DTO/domain object graph. I hear about Json Serializers with DataSet, XML > JSON support, etc but am really not sure/convinced of the use-cases requiring this. IMO these are features/solutions to design problems you shouldn't have to begin with.

  • It was focused purely on reading to and from .NET objects, with no JSON DOM supported.

Another area where I don't see the value is a JSON DOM compared to just using a normal Generic Dictionary / List to build and maintain a dynamic data payload. C# has intrinsic support for populating Generic collection types (i.e. collection initializers), making it much easier and more natural to populate from C#. Also any C# POCO type that is serialized can be deserialized as a Dictionary<string,string> and vice-versa, allowing you to still parse a JSON payload without the C# POCO type that created it. For examples of dynamic JSON parsing with ServiceStack.JsonSerializer see:

http://goo.gl/G8CNI (parsing GitHub pull request) and http://goo.gl/k8ayt - these examples show how you can trivially parse an adhoc JSON payload, populating your own strong-typed model. It would be nice to know of scenarios where a JSON DOM would prove beneficial.

  • The only input format it had was a string.

That's a little misleading which I hope you will clarify. The JsonSerializer exposes APIs to parse JSON via a string, TextReader or Stream (see the src for JsonSerializer: http://goo.gl/IH5hN). What you're referring to is that behind the scenes I read that into a buffer which is what I use to deserialize. This is done purely for perf reasons in the light that CPU efficiency is better than Memory efficiency (which is becoming more plentiful). I've decided to do this because there is a small fraction of use-cases that would benefit from a streaming JSON api, i.e. who benefits/uses a partially populated domain model? Is RavenDB doing any processing on a partial JSON document/dataset? So with the 80/20 rule in mind I've discarded the calling overhead from reading from a stream for better perf.

Note for serialization I still write to a stream since it's important to not buffer the output where writing to a stream will yield perf benefits pushing the serialized output to the response stream as soon as you can.

Also do you have your benchmarks available? I'm personally curious on how ServiceStack.JsonSerializer stacks up against your latest efforts :)

Anyway great to see you're still blogging and focused on perf, it's a feature that is continually unconsidered in the monolithic frameworks being produced in the .NET space today. BTW I'm trying to start a body of knowledge around perf/scaling in .NET so if you ever want to contribute a piece on either, maybe a piece on where RavenDB is faster or scales better than the traditional .NET persistence options, it'd be a very welcome addition to: https://github.com/mythz/ScalingDotNET - Once I get enough info on the subject I'll make a website dedicated to it.

05/01/2011 07:56 AM by Ayende Rahien

Demis,

1) For the most part, most people would find whatever they are using to be more than enough. Performance only matters if you run into a perf issue.

2) Regarding the API. I am talking about support for things like converters, selecting which properties/fields will be serialized and which will not be, modifying the way we are reading/saving the data, etc.

3) I need to have access to the DOM in RavenDB. Dictionary<string,string> loses the minimal amount of type information that is already there in JSON, and is not acceptable.

4) Having only string as an input (and if you have a Stream and read that to a string it is the same) is a big problem when you are dealing with large documents, because you are going to create a single contiguous (and very large) string. That results in a lot of fragmentation in the LOH (Large Object Heap), which can result in Out Of Memory Exceptions in a server application.

That is not acceptable for us in RavenDB.

I understand CPU vs. Mem tradeoffs, but not considering the LOH and the implications on fragmentation means that you are leaving yourself open to some very bad issues down the road.

5) Regarding my benchmark, take a look at Raven.Tryouts, the PerfTest class.

It isn't a really interesting case, but we are using a 3 MB document as our source information here. And we saw about 100% improvement between the two options.

05/01/2011 08:47 AM by Demis Bellot

1) For the most part, most people would find whatever they are using to be more than enough. Performance only matters if you run into a perf issue.

Depends if you consider performance of primary importance or not. From what I can see sitting on various mailing lists / NoSQL groups, perf seems to matter a lot more outside .NET culture, where perceived perf / response times are deeply linked to end user UX / satisfaction. This is understood a lot better in alt lang fx and dev platforms where there is less enterprise/SC teachings and heavy weight fx's to cloud any focus on perf/scalability. However most top internet properties put performance of paramount importance, but I guess it is something that's disregarded in the areas where .NET is positioned in the enterprise.

2) Regarding the API. I am talking about support for things like converters, selecting which properties/fields will be serialized and which will not be, modifying the way we are reading/saving the data, etc.

3) I need to have access to the DOM in RavenDB. Dictionary<string,string> loses the minimal amount of type information that is already there in JSON, and is not acceptable.

Sounds like you want access to a quasi-strong typed api that's not the POCO that's created it? I guess that's fair enough I suspect as a Document DB server you have some unique requirements. For most other use-cases where the mapping takes place at the data/dto models, I can't see why this is needed.

4) Having only string as an input (and if you have a Stream and read that to a string it is the same) is a big problem when you are dealing with large documents, because you are going to create a single contiguous (and very large) string. That results in a lot of fragmentation in the LOH (Large Object Heap), which can result in Out Of Memory Exceptions in a server application.

That is not acceptable for us in RavenDB.

I understand CPU vs. Mem tradeoffs, but not considering the LOH and the implications on fragmentation means that you are leaving yourself open to some very bad issues down the road.

Yeah I don't buy this. You may be referring to large asset files, which have the potential to be upward of 1GB in size - those, I agree, you should always stream. However I very much doubt this fragmentation is a real-world concern for data documents, which are unlikely to be more than a few MB in size. The buffer only lasts a short time, the few ms it takes to deserialize it into an object graph (which generally takes up a similar amount of space as the buffer), until it's reclaimed by the GC. The .NET GC as you know is self-compacting, which as a result has no longterm fragmentation problems. I'd be very interested in any links contrary to this where adding and reclaiming a few MB periodically in .NET causes 'Out Of Memory Exceptions' - as this is news to me.

5) Regarding my benchmark, take a look at Raven.Tryouts, the PerfTest class.

It isn't a really interesting case, but we are using a 3 MB document as our source information here. And we saw about 100% improvement between the two options.

ok kool, I'll give it a look when I run into some free time, thx.

05/01/2011 08:52 AM by Ayende Rahien

Demis,

1) Performance is important only as much as it affects the perceived perf. Beyond a certain point, it is no longer relevant.

A user can't tell if a page rendered in 50 ms or 100 ms, for example.

2 & 3) As I said, I am not building a standard app, I am building something that requires a lot of details about the actual document and modifying how to work with it.

4) This is important: objects of size > 85 KB are NOT COMPACTED. You might want to read up on the Large Object Heap and the implications of such a thing for building server products.

You might want to read this article: msdn.microsoft.com/en-us/magazine/cc534993.aspx

05/12/2011 11:04 PM by James Newton-King

My perspective with Json.NET is that ease of use and flexibility far outweigh performance in importance.

Everyone cares about getting stuff done quickly and well while only 5% (if that) are writing code where the performance critical code is JSON serialization.

05/13/2011 06:38 AM by Ayende Rahien

James, I absolutely agree with you here. And I think that you have done a tremendous job in making it very easy for us to use JSON. RavenDB has benefited greatly from being able to make use of all the good stuff that is in JSON.Net. For that matter, the mere fact that we could spend a few days and customize just the parts that were problematic for us is another testament to a well written piece of code.
