Ayende @ Rahien


The cost of async I/O, false assumptions and pride

time to read 8 min | 1522 words

As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.

The result was decidedly not what we expected. On average we saw about a 10% – 30% reduction in speed, just from the use of async operations. So we decided to test this.

The test is simple: read a large file (1.4 GB) from a network drive without buffering. The synchronous code is:

 private static void SyncWork(int pos)
 {
     var sp = Stopwatch.StartNew();
     var buffer = new byte[1024 * 4];
     long size = 0;
     using (var sha = SHA1.Create())
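     // Note: FILE_FLAG_NO_BUFFERING and ReportSize are assumed to be defined elsewhere in the
     // benchmark, e.g. FILE_FLAG_NO_BUFFERING = (FileOptions)0x20000000 (the Win32 flag) and
     // ReportSize = 32 * 1024 * 1024 (we stop after reading 32 MB, matching the output below).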
     using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
         FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
     {
         stream.Seek(pos * ReportSize, SeekOrigin.Begin);
         int read;
         while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
         {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
         }
     }
 }

To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.

for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}

The basic idea was to simulate work coming in, doing different things, and to simulate slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we’ll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).

The output of this code is:

Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243

Now, let us look at the equivalent async code:

private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
        }
    }
}

Note that here I’m opening the file with an async handle, to allow for better concurrency. My expectation was that this code would interleave I/O and CPU work, resulting in fewer context switches, higher CPU utilization and overall faster responses.

Here is the network utilization during the async test:

[Image: network utilization during the async test]

And here is the network utilization during the sync test:

[Image: network utilization during the sync test]

Trying the async version with 64KB buffers gives better results:

[Image: network utilization, async with 64KB buffers]

And the output is:

Read 32 mb async  8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129

So it is significantly worse than the sync version using 4KB buffers. The really bad news is that when using a 64KB buffer in the sync version, we have:

[Image: network utilization, sync with 64KB buffers]

And the whole process completed in about 2 seconds.

I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.

Short summary: the solution is to throw all of the async code away in favor of pure sync code, because it is so much faster. Banish async, all hail the synchronous overload.

However, the plot thickens!

But before declaring death to asynchronicity, to thunderous applause, I decided to look further into things and pulled out my trusty profiler.

Here is the sync version:

[Image: profiler output for the sync version]

As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:

[Image: profiler output for the async version]

This is interesting, because no I/O actually occurs here. At first I thought that this is because we are only using async I/O, so all of the missing time (notice that this is just 625 ms) is lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.

Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:

[Image: profiler output showing the missing costs]

And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I had accidentally opened the file in sync mode, turning the async call into a sync call. That didn’t explain the difference in costs from the sync version, though, and I verified that the calls are actually async.

Then I looked deeper:

[Image: profiler output showing the Seek calls inside ReadAsync]

Why do we have so many seeks?

The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.

Calling SetFilePointer is typically very fast, since the OS just needs to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it needs to communicate with the remote share to update it with the new position. The sad thing is that this is all wasted work anyway, since the file position isn’t used in async calls; each actual call to ReadFileNative is given the offset to read from.
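To make that last point concrete, here is a hedged sketch (not the fix from the linked issue; the class and method names are illustrative) of how a read can carry its own offset on Windows: ReadFile accepts the position in the OVERLAPPED structure, so no separate SetFilePointer/Seek round trip is needed per call.

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Threading;
using Microsoft.Win32.SafeHandles;

static class OffsetReads
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern unsafe bool ReadFile(SafeFileHandle handle, byte* buffer,
        uint bytesToRead, out uint bytesRead, NativeOverlapped* overlapped);

    // Read buffer.Length bytes starting at 'offset'; the position is supplied per call
    // instead of via a separate Seek/SetFilePointer round trip.
    public static unsafe int ReadAt(SafeFileHandle handle, byte[] buffer, long offset)
    {
        fixed (byte* p = buffer)
        {
            var overlapped = new NativeOverlapped
            {
                OffsetLow = (int)(offset & 0xFFFFFFFF), // the offset travels with the call itself
                OffsetHigh = (int)(offset >> 32)
            };
            uint read;
            if (ReadFile(handle, p, (uint)buffer.Length, out read, &overlapped) == false)
                throw new Win32Exception(Marshal.GetLastWin32Error()); // error handling kept minimal
            return (int)read;
        }
    }
}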

Trie based routing

time to read 4 min | 764 words

In my previous post, I discussed the cost of routing in MVC vs. our own implementation. The MVC routing / infrastructure consumed about 90% of the time, while the code that actually does stuff is left with less than 10%. In our implementation, the actual request handling code gets about 85% of the time, while all the infrastructure is handled in the other 15%.

A large portion of that time is actually spent on routing. In our tests, we saw something like 40% of the time spent in selecting the right controller action to execute, and that is when there is just one such action that can be run. Now, to be fair, this is a scenario where we don’t actually have any meaningful logic. We are just saying “hello world” over and over again, so the infrastructure costs are really noticeable in this benchmark. But that is the point: we want to see what the fixed costs of the infrastructure we use are.

Now, in RavenDB we have several hundred routes (the last count was somewhere upward of 400), and we noticed that routing takes a lot of time just to select the right action to run. We optimized the hell out of that, to the point of specializing it to our own situation, based on our own knowledge. But when we started to look at moving our code to Kestrel, we took the time to move to a system that is a better fit for us.

The major reason routing is such an expensive issue in MVC is that it does quite a lot. It needs to handle route selection, capturing values, coercing data to the right types and a lot more. All of that takes time, memory and CPU cycles. But as it turns out, we don’t really need all of that. We have relatively simple routing needs.

Here are a few examples of RavenDB routes:

  • /admin/stats
  • /databases/{databaseName}/docs
  • /databases/{databaseName}/indexes/{*id}

So we have the fixed routes, like /admin/stats. We have routes that contain a database name in the middle, and we have routes with a suffix. We can take advantage of that to create an optimal solution.

In our case, we used a trie:

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.

The cost of querying a trie is O(L), where L is the length of the key that we are querying. Here is how a very small trie looks in our tests.

[Image: a small route trie from our tests]

We start from the route, matching each character against the trie node key. Once we have consumed the full key, we check if the node has any children starting with the next character in the URL. Note that the Children array here is of size 127. This allows us to index directly into the array (at a cost of O(1)) to find the next trie node. We didn’t want to use a dictionary here because of its size and its cost relative to an array access.

This has some interesting implications:

  • The route definition is limited to ASCII characters (but the actual urls are not, mind).
  • Regardless of the number of routes, we are ensured of reliable performance.

But that still doesn’t let us handle routes like the ones above. In order to do that, we had to extend the syntax a bit.

/admin/stats would match exactly. That is easiest.

/databases/*/docs is a route that has a capture. The * symbol says “matches anything but /”, so we can get the database name captured without complex matching. Note that by capture, I mean we just record the start position and length; we don’t want to pull a substring out of that and pay for an extra allocation.

/databases/*/indexes/$ is another route that has special meaning. The $ symbol says “the match is done, even if the URL has additional characters”. That way we can capture embedded ids easily (again, no allocations).
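To make the matching rules concrete, here is a hedged sketch of a route trie along those lines. The node layout, the names and the single capture slot are illustrative assumptions, not RavenDB’s actual implementation.

public class TrieNode
{
    public string Key;                                  // the route fragment this node matches
    public TrieNode[] Children = new TrieNode[127];     // indexed directly by (ASCII) character
    public object Handler;                              // what to run when the route matches
}

public struct RouteMatch
{
    public object Handler;
    public int CaptureStart, CaptureLength;             // a capture is just a position and a length
}

public static class RouteTrie
{
    public static bool TryMatch(TrieNode root, string url, out RouteMatch match)
    {
        match = new RouteMatch();
        var node = root;
        int pos = 0;
        while (node != null)
        {
            foreach (char k in node.Key)
            {
                if (k == '$')                           // match is done, the rest is an embedded id
                {
                    match.CaptureStart = pos;
                    match.CaptureLength = url.Length - pos;
                    match.Handler = node.Handler;
                    return true;
                }
                if (k == '*')                           // capture anything up to the next '/'
                {
                    match.CaptureStart = pos;
                    while (pos < url.Length && url[pos] != '/')
                        pos++;
                    match.CaptureLength = pos - match.CaptureStart;
                    continue;
                }
                if (pos >= url.Length || url[pos] != k)
                    return false;
                pos++;
            }
            if (pos == url.Length)                      // consumed the whole url on an exact route
            {
                match.Handler = node.Handler;
                return match.Handler != null;
            }
            node = node.Children[url[pos] % 127];       // O(1) hop to the next node (non-ASCII urls would need a guard here)
        }
        return false;
    }
}

A real implementation would also record more than one capture per route and reject non-ASCII characters before indexing into Children, but the shape of the lookup is the same: one pass over the URL, no allocations.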

The end result?

[Image: profiler output for trie route matching]

Or roughly a route match in 0.9 μs. Pretty good, even if I say so myself.

The cost of routing

time to read 8 min | 1565 words

I was asked to see what the cost of WebAPI on Kestrel is, and it lines up nicely with a topic that I wanted to talk about, routing. So I thought I’d start with a small benchmark.

Here is a small Kestrel based server, using MVC / WebAPI:

public class Program
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app, ILoggerFactory loggerfactory)
    {
        app.UseMvc();
    }

    public static void Main(string[] args)
    {
        var configBuilder = new ConfigurationBuilder()
            .Add(new MemoryConfigurationProvider(new Dictionary<string, string>
            {
                ["server.urls"] = "http://localhost:8080"
            }));

        IHostingEngine application = new WebHostBuilder(configBuilder.Build())
            .UseStartup<Program>()
            .UseServer("Microsoft.AspNet.Server.Kestrel")
            .Build();


        using (application.Start())
        {
            Console.WriteLine("ready");
            Console.ReadLine();
        }
    }
}

And here is a simple controller:

[Route("api/[controller]")]
public class ValuesController : Controller
{
    // GET api/values
    [HttpGet()]
    public string Get()
    {
        return "Hello World";
    }
}

As you can see, this couldn’t really be any more trivial. I have tested this in release mode, using the following gobench command:

.\gobench.exe -c 100 -r 25000 -u http://localhost:8080/api/values

This was run after a couple of thousand warm-up requests. In other words, we are going to do 2.5 million requests. I intentionally chose that value because it forces the server to do enough work to be meaningful, and the test run is long enough to see other factors creeping in. Note that this isn’t a “real” benchmark; both gobench and this code are actually running on the same machine. But it is a good baseline.

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             56,818 hits/sec
Read throughput:                   9,261,363 bytes/sec
Write throughput:                  5,454,545 bytes/sec
Test time:                                44 sec

Those are good numbers. Now let us see that with my code:

[RavenAction("/api/values", "GET")]
public Task ValuesGet()
{
    return HttpContext.Response.WriteAsync("Hello world");
}

There is no funny business here. We just route the request using my code, and then write the response to the client. The results:

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             86,206 hits/sec
Read throughput:                  10,517,241 bytes/sec
Write throughput:                  8,275,862 bytes/sec
Test time:                                29 sec

That is over 50% more requests per second (86,206 vs. 56,818), with the total test time dropping from 44 to 29 seconds.

Now, there are a couple of things that work out better for my code. To start with, it is calling Response.WriteAsync directly, while the standard WebAPI code needs to get the result, figure out how to write it, etc.

In the profiler, this looks really interesting:

[Image: profiler output for the MVC request]

So the time for actually executing the action is less than 10% of the request time. Everything else is some sort of setup.

In fact, let us dig deeper: the time to run our action is 0.2% of the 10% that this method takes:

[Image: profiler output, drilling into the action invocation]

To be fair, the method just returns a constant string. I would be shocked if it wasn’t really fast.

Most of the time is spent in CreateActionResult, in which we select how to write the result. Let us look at this a bit differently:

[Image: profiler output, filter pipeline breakdown]

The InvokeExceptionFilterAsync is the part responsible for actually calling our code and getting the result.

The InvokeAllResultFiltersAsync is responsible for figuring out how to write the output to the client. And this is the part where we actually write to the network:

[Image: profiler output, writing the response to the network]

And here is how we do in the same scenario:

[Image: profiler output for our implementation]

You can see that 85% of the time is spent servicing the request, with just 15% dedicated to managing the infrastructure.

I’ll discuss the details of that in my next post, though.

Profiling mysteries

time to read 3 min | 473 words

In our profiling, we ran into an expensive mystery call to string.Join. Nothing in our code called it directly, and yet it is expensive.

[Image: profiler output showing the string.Join call]

Looking at the code didn’t get us anywhere, so I profiled things again, this time using the debug configuration. The nice thing about the debug configuration is that it doesn’t inline methods (among other things), so what gets executed is much closer to the actual code.

Here are the results:

[Image: profiler output, debug configuration]

And now it is clear what is going on.

Here is the offending line:

var tryMatch = _trie.TryMatch(method, context.Request.Path);

With context.Request.Path being a PathString, and TryMatch accepting a string, we are silently calling the implicit string operator, which just calls PathString.ToString(), which just calls ToUriComponent(), which looks like:

public string ToUriComponent()
{
    if (HasValue)
    {
        if (RequiresEscaping(_value))
        {
            // TODO: Measure the cost of this escaping and consider optimizing.
            return String.Join("/", _value.Split('/').Select(EscapeDataString));
        }
        return _value;
    }
    return String.Empty;
}

Looking at the current code, it has already been optimized by removing the LINQ call and its associated costs, but that is still an expensive call to make.

The fix was to get the string directly:

var tryMatch = _trie.TryMatch(method, context.Request.Path.Value);

And that is a 10% reduction in costs.

Dynamic code generation in C#

time to read 4 min | 666 words

Let us assume that we have the following simple task:

Given a Dictionary<string, string>, convert that dictionary into a type in as performant a manner as possible. The conversion will happen many times, and first-time costs are acceptable.

The answer we came up with is to dynamically generate the code based on the type. Basically, here is how it looks:

public static Func<Dictionary<string, string>, T> Generate<T>()
    where T : new()
{
    var dic = Expression.Parameter(typeof (Dictionary<string, string>), "dic");
    var tmpVal = Expression.Parameter(typeof (string), "tmp");
    var args = new List<MemberAssignment>();
    foreach (var propertyInfo in typeof(T).GetProperties())
    {
        var tryGet = Expression.Call(dic, "TryGetValue", new Type[0], 
            Expression.Constant(propertyInfo.Name),
            tmpVal);

        Expression value = tmpVal;
        if (propertyInfo.PropertyType != typeof (string))
        {
            var convertCall = Expression.Call(typeof(Convert).GetMethod("ChangeType", 
                new Type[] { typeof(object), typeof(Type) }), tmpVal,
                Expression.Constant(propertyInfo.PropertyType));
            value = Expression.Convert(convertCall, propertyInfo.PropertyType);
        }

        var conditional = Expression.Condition(tryGet, value, 
            Expression.Default(propertyInfo.PropertyType));

        args.Add(Expression.Bind(propertyInfo, conditional));
        
    }
    var newExpression = Expression.New(typeof(T).GetConstructor(new Type[0]));

    var expression = Expression.Lambda<Func<Dictionary<string, string>, T>>(
        Expression.Block(new[] { tmpVal },Expression.MemberInit(newExpression,args)),
        dic);

    return expression.Compile();
}

As an aside, this is an intimidating piece of code, but it is about a bazillion times better than having to do this manually using IL manipulation.

What this code does is dynamically generate the following method:

(Dictionary<string,string> dic) => {
    string tmp;
    return new User
    {
        Name = dic.TryGetValue("Name", out tmp) ? tmp : default(string),
        Age = dic.TryGetValue("Age", out tmp) ? (int)Convert.ChangeType(tmp, typeof(int)) : default(int)
    };
}

This is pretty much the fastest way to do this, because this is how you would write it manually. And compiling the expression dynamically means that we don’t have to write this by hand for each and every type we run into.
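To make the first-time cost pay off, the compiled delegate would typically be cached per type. Here is a hedged usage sketch, assuming the Generate<T> method above is in scope; the User type and the DictionaryConverter<T> cache are illustrative, not part of the post.

public class User
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class DictionaryConverter<T> where T : new()
{
    // compiled once per closed generic type, then reused for every conversion
    public static readonly Func<Dictionary<string, string>, T> Convert = Generate<T>();
}

public static void Example()
{
    var user = DictionaryConverter<User>.Convert(new Dictionary<string, string>
    {
        ["Name"] = "Oren",
        ["Age"] = "42"
    });
    Console.WriteLine($"{user.Name} is {user.Age}");
}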

Measuring baseline costs

time to read 8 min | 1588 words

You might have noticed the low-level work I have been posting about lately. This is part of a larger effort to gain better control over our environment, and hopefully more than merely incremental performance improvements.

As part of that, we decided to restructure a lot of our core dependencies. The data format change is one such example, but there are others. Of particular relevance to this post are the web server and web stack that we use, as well as the actual programming model.

In order to reduce, as much as possible, dependencies on everything else, I decided that I wanted to benchmark a truly complex part of RavenDB, the “/build/version” endpoint. As you can imagine, this endpoint simply reports the RavenDB version.

Here is how it looks in RavenDB 3.0:

[Image: the RavenDB 3.0 /build/version controller code]

This is a WebAPI controller, running on OWIN using HttpListener. The method calls you see here are to static (cached) properties, which generate an anonymous object that gets serialized to JSON.

In order to test this, I decided to use gobench to see what kind of performance I should expect.

Here is what things looked like when I ran it:

[Image: CPU usage during the benchmark]

The CPU usage was roughly 70% – 80% by RavenDB, and the rest was taken by gobench. Here are the results:

Requests:                          500,000 hits
Successful requests:               500,000 hits
Network failed:                          0 hits
Bad requests failed (!2xx):              0 hits
Successful requests rate:           15,151 hits/sec
Read throughput:                 4,030,304 bytes/sec
Write throughput:                1,500,000 bytes/sec
Test time:                              33 sec

Remember, we don’t do any real work here, no I/O, no computation, nothing much at all.

Nevertheless, this is quite a nice number, even if it is effectively a hello world. This is with 100 clients doing 5,000 requests each. Admittedly, this is on the local machine (we’ll do remote testing later), but quite nice.

How about our new code? How does it compare?

Requests:                          500,000 hits
Successful requests:               500,000 hits
Network failed:                          0 hits
Bad requests failed (!2xx):              0 hits
Successful requests rate:           12,500 hits/sec
Read throughput:                 2,312,500 bytes/sec
Write throughput:                1,237,500 bytes/sec
Test time:                              40 sec

So we are a bit slower, although not by much. That is nice, to start with. And here are the process stats while this is running.

[Image: process stats while the benchmark is running]

You might think that I’m disappointed, but quite the reverse is true. You see, these benchmark results that you are seeing?

They were taken while running under a profiler.

In other words, we saddled ourselves with a huge performance hit, and we are still within touching distance.

You might have noticed that the process shown here is dnx.exe. This is us running on CoreCLR with the Kestrel web server, and without any web framework (such as WebAPI). The code we are running here is:

[Image: the new /build/version endpoint code]

As you can see, there is quite a bit more code here than in the previous code snippet. This is mostly because we are still playing with the API. We’ll probably have something like the GetObject method once we decide exactly how to do it. For now, we want to have everything in our face, because it simplifies how we are doing things.

Anyway, I mentioned that I was running this under a profiler, right? Actually, I ran this under a profiler while running the previous command 3 times. Here are the relevant results:

[Image: profiler results for the new code]

I highlighted a couple of interesting results. The call to BuildVersion.Get took 0.2ms (while under the profiler) and most of the time was spent in serializing the result into JSON (in the profiler output, ReadObjectInternal and Write/WriteTo methods).

Another interesting thing to note is the TryMatch call; this is how we handle routing. I’ll have a separate post on that, but for now, those are pretty encouraging results.

One thing that I do have to say is that the new code is not doing a lot of things that the existing codebase is doing. For example, we keep track of a lot of stuff (requests, durations, etc.) that can affect request timing. So it isn’t fair to just compare the numbers. Let us inspect how our existing codebase handles the load under the same profiling scenario.

This time I ran it only once, so we have only 500,000 requests to look at:

[Image: profiler results for the existing codebase]

The most obvious thing to see here is the request duration. A request takes roughly 6ms under the profiler, while it takes 0.2ms in the new code (still under the profiler). The routing code is expensive, but I’ll talk about that in another post. And you can see some of the other stuff that we do that the new code doesn’t (authorization, etc.).

Oh, and here are the actual results:

Requests:                          500,000 hits
Successful requests:               500,000 hits
Network failed:                          0 hits
Bad requests failed (!2xx):              0 hits
Successful requests rate:            1,577 hits/sec
Read throughput:                   419,563 bytes/sec
Write throughput:                  156,151 bytes/sec
Test time:                             317 sec

That is quite a change.

Oh, and what about running the new code without a profiler? I have done so, and here is what we see:

[Image: process stats, new code without a profiler]

And the actual numbers are:

Requests:                        1,500,000 hits
Successful requests:             1,500,000 hits
Network failed:                          0 hits
Bad requests failed (!2xx):              0 hits
Successful requests rate:           75,000 hits/sec
Read throughput:                13,875,000 bytes/sec
Write throughput:                7,425,000 bytes/sec
Test time:                              20 sec

So we actually have capacity to spare here. I’m pretty happy with the results for now.

Fastest code is the one not run, Part II – Memory management

time to read 4 min | 772 words

One of the most important things that you need to do for high performance is to control your allocations. Indeed, the blittable format is almost entirely implemented in unmanaged memory. And we get a great deal of performance from not having the GC poke its annoying little nose into our data structures. That said, it means that we take the onus of managing the memory ourselves.

This post is about a couple of changes that we made to the memory management system in the blittable format, and some future directions we intend to explore.

The first thing that we did was to implement an unmanaged memory pool based on a ConcurrentDictionary<int, ConcurrentStack<IntPtr>>. The int key is the size (a power of two) of the allocated blocks. We would allocate from the heap, and then store the buffers internally for reuse. That worked, but profiling showed that quite a bit of work went into just leasing and returning the buffers to the thread safe pool.
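A hedged sketch of that kind of global pool (illustrative, not the actual RavenDB code): buffers are grouped by their power-of-two size, leased from a ConcurrentStack and pushed back when returned.

// requires System, System.Collections.Concurrent and System.Runtime.InteropServices
public static class UnmanagedBufferPool
{
    private static readonly ConcurrentDictionary<int, ConcurrentStack<IntPtr>> _pool =
        new ConcurrentDictionary<int, ConcurrentStack<IntPtr>>();

    public static IntPtr Lease(int size, out int actualSize)
    {
        actualSize = 1;
        while (actualSize < size)                      // round up to the next power of two
            actualSize <<= 1;

        var stack = _pool.GetOrAdd(actualSize, _ => new ConcurrentStack<IntPtr>());
        IntPtr buffer;
        if (stack.TryPop(out buffer) == false)
            buffer = Marshal.AllocHGlobal(actualSize); // nothing cached for this size, hit the heap
        return buffer;
    }

    public static void Return(IntPtr buffer, int actualSize)
    {
        // keep the buffer around for the next lease of the same size
        _pool.GetOrAdd(actualSize, _ => new ConcurrentStack<IntPtr>()).Push(buffer);
    }
}

Every Lease and Return here touches shared, thread safe state, which is exactly the kind of cost that showed up in the profile below.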

Note that this was when we were only testing with a single thread; we expect that the cost of working with it would only increase when using multiple threads.

That sucked.

Luckily, we use the context pattern for pretty much everything in the blittable format (and more generally in RavenDB), so we can take advantage of that. We have a three-stage process.

First, we allocate the memory when we first request it. Then, when we are done using it, we return it to the context, not the global pool. This allows us to avoid cross-thread coordination costs entirely. And it is quite common for a thread to need the same buffers multiple times, so we save the “check in the buffer” just to “lease me this buffer again, please” style of work.

Here are the costs when using a single shared, thread safe pool:

[Image: profiler output, costs of the shared thread safe pool]

As you can see, we spent quite a bit of time just managing the thread safe collections. You don’t see it in this profile, but other profiling sessions show the concurrent dictionary under load as well. And this is in a single threaded, no-contention benchmark.

We moved the memory allocations to the context. Because a context is single threaded, we can rely on much simpler (and cheaper) collections, and we can also reuse contexts. A request checks out a context, which has its own cached buffers. It runs through the request, then returns the context as a whole to the pool. Each of those contexts can then manage its own memory and only rarely needs to go to the global pool. There is some complexity around making sure that overly fat contexts don’t hang around, but that is basically it.
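A hedged sketch of what the per-context side of this might look like (again illustrative, reusing the UnmanagedBufferPool sketch from above):

// requires System and System.Collections.Generic
public class OperationContext : IDisposable
{
    // buffers cached by this context, keyed by their (power of two) size;
    // the context is used by a single thread, so plain collections are enough
    private readonly Dictionary<int, Stack<IntPtr>> _cache = new Dictionary<int, Stack<IntPtr>>();

    public IntPtr Allocate(int size, out int actualSize)
    {
        actualSize = 1;
        while (actualSize < size)
            actualSize <<= 1;

        Stack<IntPtr> stack;
        if (_cache.TryGetValue(actualSize, out stack) && stack.Count > 0)
            return stack.Pop();                                       // no locks, no interlocked ops

        return UnmanagedBufferPool.Lease(actualSize, out actualSize); // rare: fall back to the global pool
    }

    public void Return(IntPtr buffer, int actualSize)
    {
        Stack<IntPtr> stack;
        if (_cache.TryGetValue(actualSize, out stack) == false)
            _cache[actualSize] = stack = new Stack<IntPtr>();
        stack.Push(buffer);                                           // stays with this context for reuse
    }

    public void Dispose()
    {
        // releasing everything the context holds is a single bulk operation
        foreach (var kvp in _cache)
            foreach (var buffer in kvp.Value)
                UnmanagedBufferPool.Return(buffer, kvp.Key);
        _cache.Clear();
    }
}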

And when we need to release the data from the context, we can do all of that in a single bulk operation. Here are the profiler results after:

[Image: profiler output after moving allocations to the context]

I admit, it is a bit disappointing to see only a 100% improvement, but I can comfort myself with knowing that the savings in multi threaded scenarios are much higher.

I mentioned that we also have some ideas on improving this further. One such idea (which we have deferred for now, because there are other more important things to do) is to just allocate a big buffer (128MB in size) per context. We won’t commit all of it, merely reserve the address space, and start allocating from it directly. Basically, each allocation would simply be a pointer increment (with occasional calls to commit more of the reserved address space).

Now, in order to free the memory, all we need to do is reset the position to zero, and we are set to reuse the context again at effectively zero cost. If this sounds familiar to you, it is because this is primarily how allocations actually work in .NET, except that we explicitly control the size and the lifetime of all the objects in there.

It also has no cost in GC, which is quite nice. As I said, we are thinking about this, but at this point, I think that we are fast enough that we don’t need to go there yet. We’ll have to see what this looks like under concurrent usage.
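For what it’s worth, here is a hedged sketch of that reserve-then-commit idea (an exploration only; as noted above, this was deferred). VirtualAlloc reserves the 128MB address range up front, commits pieces of it on demand, and freeing is just resetting the position.

// requires System and System.Runtime.InteropServices
public unsafe class ArenaAllocator : IDisposable
{
    private const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, MEM_RELEASE = 0x8000;
    private const uint PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr VirtualAlloc(IntPtr address, UIntPtr size, uint type, uint protect);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool VirtualFree(IntPtr address, UIntPtr size, uint type);

    private const long ReservedSize = 128 * 1024 * 1024;   // per context, as described above
    private const long CommitChunk = 1024 * 1024;

    private readonly byte* _base;
    private long _committed;
    private long _position;

    public ArenaAllocator()
    {
        // reserve address space only; no physical memory is committed yet
        _base = (byte*)VirtualAlloc(IntPtr.Zero, (UIntPtr)(ulong)ReservedSize, MEM_RESERVE, PAGE_READWRITE);
    }

    public byte* Allocate(int size)
    {
        while (_position + size > _committed)               // occasionally commit another chunk
        {
            VirtualAlloc((IntPtr)(_base + _committed), (UIntPtr)(ulong)CommitChunk, MEM_COMMIT, PAGE_READWRITE);
            _committed += CommitChunk;                       // (bounds and error checks omitted in this sketch)
        }
        var result = _base + _position;
        _position += size;                                   // an allocation is just a pointer bump
        return result;
    }

    public void Reset()
    {
        _position = 0;                                       // "free" everything; reuse at ~zero cost
    }

    public void Dispose()
    {
        VirtualFree((IntPtr)_base, UIntPtr.Zero, MEM_RELEASE);
    }
}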

Fastest code is the one not run, Part I – The cost of escaping strings

time to read 6 min | 1066 words

This series of posts is continuing the work I have outlined in this series. While in the previous series I focused on the overall picture, here I want to talk about the small things we did to speed up pretty much every aspect of the system. In some cases, those are order of magnitude differences.

In this post, I want to talk about the following:

“Oren\r\nEini”

Such a simple thing, if you just look at the surface, and yet if you profile the Json.NET code, you’ll see an interesting issue. I took an existing JSON file (about 67MB in size) and continuously wrote it to /dev/null under a profiler. The results are actually quite surprising.

[Image: profiler results for writing the JSON file]

In fact, this is a bit worse.

[Image: profiler results, expanded]

dotTrace has a great feature that allows you to remove “noise” from a trace. Here is what happens if we eliminate this call:

[Image: profiler results with the call filtered out]

So, pretty obviously, this is a really expensive call. But how expensive is it, really? Note that this profiling run used a null Stream, so there isn’t any I/O cost to consider. This is the pure computational cost of calling this method.

Let us look at a deeper profile of this method:

[Image: profiler breakdown of WriteEscapedString]

As you can see, WriteEscapedString does about 9% of the work, then calls WriteEscapedJavaScriptString, which makes some calls to the StreamWriter. But about 33% of the time is spent doing its own thing. Looking at the code, it is clear what is going on:

[Image: the Json.NET string escaping code]

Basically, before it can write anything, this code needs to go over the string and find out if it has any escape characters. If it does, it needs to write the escape sequence, then resume writing the string. This means that this code has to scan the string that it is writing several times, and much worse, it means that you can’t really batch those operations. Consider the string above: writing it out actually takes several different calls to the writer. In fact, we have:

  • Oren
  • \r
  • \n
  • Eini

The problem here is that we lose a very powerful feature, the ability to batch multiple operations into a single I/O call. Sure, the TextWriter is going to be doing a bit of buffering for us, but that doesn’t help us really. Let us consider one of the most important calls we have in this code:

[Image: the buffered write call]

For the string above, we call this method twice, once for each portion of the string that isn’t escaped. That means that argument validation needs to run each and every time. And we also have to copy the memory in itsy bitsy pieces. Memory copy routines can get far greater speed if they can copy a large bunch of memory all at once, rather than being called on small sections of it.

In short, I hope that this convinced you that escaping strings is expensive. Stupidly so, when you come right down to it.

But what can you do, that is the cost of doing business, right?

Indeed, the Json.NET code is heavily optimized, and there isn’t much you can do about improving things at this point. You are stuck with the JSON format and its implications.

This is already a long post, but bear with me while I explain the reasoning.

The blittable format isn’t JSON. We read JSON and output JSON, but we aren’t limited to the same constraints as someone who is just consuming JSON. There are two separate use cases for documents using the blittable format. The first is when we need to work with them on the server side. This includes such things as indexing, transformers, patching, etc. For those cases, if there is an escaped value, we need to get the unescaped version. In other words, I need the string with the line break in it, not the \r\n escape sequences. The other common use case, unfortunately, is when we need to write the document to the user, in which case we do need to write it out with all of the escape sequences.

What is left for us, apparently, is to choose our poison. We can optimize for either scenario, but not for both.

Except, that maybe we can. And that is what we actually did.

When we write a string value into the blittable format, we read the escape sequences and translate them into their unescaped values. In other words, we take a “\n” and store just the byte 10. But we also remember its position. The string we keep talking about? It would be stored as:

[10][O][r][e][n][13][10][E][i][n][i][2][4][0]

The first byte (10) is the length of the string. Then we have the actual unescaped version. If the server side code wants to read this string, we can just serve it directly from this memory, without any additional work. But what about when we need to write it out and maintain the escape sequences? That is where the numbers at the end come into play. We need to skip 10 bytes past the length, where we’ll encounter 3 bytes: 2, 4, 0.

The first byte is the number of escape sequences in the string. The second is the number of bytes that we can copy until we get to the first escape sequence, and each following number is the number of bytes until the next one (here: copy 4 bytes, “Oren”, escape, copy 0 bytes, escape, then the rest). Which means that our code for writing a string has narrowed down to:

[Image: the blittable string write code]

And that, in turn, means that if you don’t have any escape sequences, we can just call memcpy directly to write to the output buffer. And if you do have escape sequences, we only need to handle the escaped characters themselves; we don’t need to scan the entire string.
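Here is a hedged sketch of what a write loop driven by such a trailer looks like; the names, the Stream-based output and the single-byte counts are illustrative simplifications, not the actual blittable code.

// requires System, System.IO and System.Runtime.InteropServices
static unsafe class EscapedStringWriter
{
    // 'data' points at the unescaped string bytes, 'escapes' at the trailer:
    // [number of escapes][bytes to copy before the next escape]...
    public static void Write(Stream output, byte* data, int length, byte* escapes, byte[] scratch)
    {
        int escapeCount = *escapes++;
        int pos = 0;
        for (int i = 0; i < escapeCount; i++)
        {
            int copy = *escapes++;                         // bytes we can copy verbatim
            CopyTo(output, data + pos, copy, scratch);     // one bulk write, no per-character scanning
            pos += copy;
            WriteEscape(output, data[pos]);                // e.g. byte 13 -> "\r", byte 10 -> "\n"
            pos++;
        }
        CopyTo(output, data + pos, length - pos, scratch); // the unescaped tail, again in one call
    }

    private static void CopyTo(Stream output, byte* data, int count, byte[] scratch)
    {
        // scratch is assumed to be large enough; a real implementation writes straight from the buffer
        Marshal.Copy((IntPtr)data, scratch, 0, count);
        output.Write(scratch, 0, count);
    }

    private static void WriteEscape(Stream output, byte value)
    {
        output.WriteByte((byte)'\\');
        output.WriteByte(value == 13 ? (byte)'r' : value == 10 ? (byte)'n' : value); // minimal escaping only
    }
}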

There is so much saving in the air, it feels like Black Friday.

The importance of a data format, Part VII – Final benchmarks

time to read 4 min | 800 words

After quite a bit of work, we have finished (at least for now) the Blittable format. As you can probably guess from the previous posts, we have decided to also implement small string compression in the Blittable format, based on the Smaz algorithm.

We also noticed that there are two distinct use cases for the Blittable format. The first is when we actually want to end up with the document on disk. In this case, we would like to trade CPU time for I/O time (especially since we usually write once, but read a lot, so we’ll keep paying that I/O cost). But there are also many cases in which we read JSON from the network just to do something with it. For example, if we are going to read an index definition (which is sent as a JSON document), there is no need to compress it; we’ll just have to immediately uncompress it again.

So we added the option to choose which mode you’ll use the data in. Here are the benchmarks for speed and size for JSON, Blittable (for disk) and Blittable (for memory):

[Image: speed and size benchmarks for JSON, Blittable (disk) and Blittable (memory)]

As expected, we see a major difference in performance between Blit to disk and Blit to mem. Blit to mem is comparable to or better than JSON in nearly all scenarios, while Blit to disk is significantly smaller than JSON while having comparable or better performance at parsing time, even though it is compressing a lot of the data.

For large documents, this is even more impressive:

[Image: speed and size benchmarks for large documents]

Blit to mem is 40% – 75% of the speed of JSON at parsing, while Blit to disk can be much more expensive (up to 2.5 times more than JSON), depending on how compressible the data is. On the other hand, Blit to mem is already 20% – 33% smaller than JSON for most datasets, and Blit to disk improves on that significantly, often going down to as little as 60% of the original JSON size. In the companies.json case, the original file size is 67.25 MB; using Blit to mem we have a 45.29 MB file, and using Blit to disk gives us 40.97 MB.

This is a huge reduction in size (which translates to an appreciable fraction of a second on most I/O systems).

So far, we have looked at the big boys, but what happens if we try this on a large number of small documents? For example, as would commonly happen during indexing?

In this case, we see the following:

[Image: benchmark results for a large number of small documents]

Note that Blit to disk is very expensive in this mode; this is because we actually do a lot of work. See the next chart:

[Image: size comparison for a large number of small documents]

Blit to disk is actually compressing the data by close to 20MB, and it is 5 MB smaller than Blit to memory.

The question of what exactly the difference is between Blit to memory and Blit to disk came up on our internal mailing list. The actual format is identical; there are currently only two differences:

  • Blit to disk will try to compress all strings as much as possible, to reduce I/O. This is often at the cost of parsing time. Blit to mem just stores the strings as is, which is much faster, obviously. It takes more memory, but we generally have a lot of that available, and when using Blit to memory, we typically work with small documents anyway.
  • Blit to disk also does additional validations (for example, that a double value is a valid double), while Blit to memory skips those. The assumption is that if the double value isn’t valid, it will fail when you actually access the double’s value; with Blit to disk, since we are actually going to access the data later, we want such issues to surface when we persist it.

Overall, a pretty good job all around.

The importance of a data format, Part VI – When two orders of magnitude aren't enough

time to read 2 min | 396 words

So our current version can index 18,000 documents in 22 ms, vs. the JSON approach, which takes 1.8 seconds. That is an amazing improvement, but I’ve got to tell you, I had a good feeling about it.

Without running a profiler, I’m guessing that a major part of the 22 ms cost is going to be comparisons during the binary search process. We store the property names as a sorted array, and when you need to get a property by name, we need to do a binary search in the property names to match it.

Our test case gets 3 properties from a total of 18,801 documents. We do this 30 times, to reduce the chance of jitter getting in the way. Here are the profiler results:

[Image: profiler results for the property lookups]

As you can see, the cost is indeed where I expected it to be. Basically, the ReadNumber and GetComparerFor calls are where the cost of all those binary searches shows up.

But here we can take advantage of the fact that we know most of those documents are going to have the same structure. In other words, once we have found the position of a particular property, there is a strong chance that the next time we look for a property with the same name, we’ll find it in the same position. Caching that position and using it as the starting point for the binary search means that most of the time, we hit the right property right on the nose, and in the worst case, we have a single extra comparison to make.
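A hedged sketch of that optimization (illustrative, not the actual RavenDB code): the caller keeps the last matching index per property name, and we only pay for a full binary search when the document shape differs.

static int FindProperty(string[] sortedPropertyNames, string name, ref int cachedIndex)
{
    // most documents share the same structure, so the cached position usually hits on the nose
    if (cachedIndex >= 0 && cachedIndex < sortedPropertyNames.Length &&
        string.CompareOrdinal(sortedPropertyNames[cachedIndex], name) == 0)
        return cachedIndex;

    // shape changed: at most one extra comparison was wasted, now do the regular binary search
    int index = Array.BinarySearch(sortedPropertyNames, name, StringComparer.Ordinal);
    if (index >= 0)
        cachedIndex = index;      // remember it for the next document
    return index;
}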

This single change gives us a really nice impact:

[Image: profiler results after caching the property position]

That is about a 40% improvement in speed, from this single change.

We have been following similar patterns throughout the development of this feature: write the feature, make sure it passes all its tests, then profile it. Usually that indicates a problem that we need to solve, either by optimizing the code or by changing the way we do things in the code. Sometimes radically. But that is a topic for the next blog post.
