Ayende @ Rahien

The cost of the async state machine

time to read 3 min | 483 words

In my previous post, I talked about the high performance cost of the async state machine in certain use cases. I decided to test this out using the following benchmark code.

Here we have two identical pieces of code, sending 16 MB over the network. (Actually the loopback device, which is important, and we’ll look into exactly why later)

One of those is using the synchronous, blocking, API, and the second is using the async API. Besides the API differences, they both do the exact same thing, send 16MB over the network.
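Roughly, the shape of the two versions is something like this (a sketch under my own assumptions: something is already listening on the loopback address, and we write in 4KB chunks; this is not the original benchmark code):

private static void SendSync(NetworkStream stream, int megabytes)
{
    // Blocking writes: each call returns once the data has been handed off.
    var buffer = new byte[4096];
    for (int i = 0; i < megabytes * 256; i++) // 256 * 4KB = 1MB
        stream.Write(buffer, 0, buffer.Length);
}

private static async Task SendAsync(NetworkStream stream, int megabytes)
{
    // The same work, but every write goes through the async state machine.
    var buffer = new byte[4096];
    for (int i = 0; i < megabytes * 256; i++)
        await stream.WriteAsync(buffer, 0, buffer.Length);
}

The results for 16MB: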

  • Sync - 0.055 seconds
  • Async - 0.380 seconds

So the async version is about 7 times worse than the sync version. But when I tried this out on a 256MB test, the results were roughly the same.

  • Sync - 0.821 seconds
  • Async - 5.669 seconds

The reason this is the case is likely related to the fact that we are using the loopback device. In this case, we are probably almost never blocking in the sync case; we almost always have the data available for us to read. In the async case, we have to pay for the async machinery, but we never actually get to yield the thread.

So I decided to test what would happen when we introduce some additional I/O latency. It was simplest to write the following:
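Something along these lines (a sketch of a latency-injecting TCP proxy; the ports, the 10 ms delay, and forwarding a single direction only are simplifications of mine, not the original code):

private static async Task RunProxy()
{
    var listener = new TcpListener(IPAddress.Loopback, 9090);
    listener.Start();
    while (true)
    {
        var client = await listener.AcceptTcpClientAsync();
        var server = new TcpClient("127.0.0.1", 8080);
        var forward = Task.Run(async () =>
        {
            var buffer = new byte[4096];
            var from = client.GetStream();
            var to = server.GetStream();
            int read;
            while ((read = await from.ReadAsync(buffer, 0, buffer.Length)) != 0)
            {
                await Task.Delay(10); // the artificial I/O latency
                await to.WriteAsync(buffer, 0, read);
            }
        });
    }
}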

I then piped both versions through the same proxy, which gave the following results for 16MB.

  • Sync - 29.849 seconds
  • Async - 29.900 seconds

In this case, we can see that the sync benefit has effectively been eliminated. The cost of the async state machine is swallowed in the cost of the blocking waits. But unlike the sync system, when the async state machine is yielding, something else can run on this thread.

So it is all good, right?

Not quite so. As it turns out, there are certain specific scenarios where you actually want to run a single, long, important operation, and you can set things up so the TCP connection will (almost) always have buffered data in it. This is really common when you send a lot of data from one machine to the other. Not a short request, but something where you can get big buffers, and assume that the common setup is a steady state of consuming the data as fast as it comes. In RavenDB’s case, for example, one such very common scenario is the bulk insert mode. The client is able to send the data to the server fast enough in almost all cases that by the time the server processes a batch, the next batch is already buffered and ready.

And in that specific case, I don’t want to be giving up my thread for something else to run on. I want this particular process to be as fast as possible, potentially at the expense of much else.

This is likely to have some impact on our design, once we have the chance to run more tests.

The HashSet in the Hot Spot

time to read 1 min | 145 words

The title of this post reminds me of a Bones episode. This blog post is actually from an internal mail by Federico Lois, sent as part of bigger perf work.

[image: profiler output showing the hot spot]

Here is the calling code; note that this is typically called many times, and that _pageStates is a HashSet<PagerState>.

[image: the calling code]

The following change was made:

[image: the change that was made]

And here are the results:

[image: the benchmark results]

That is a 230% improvement! I can identify with the sentiment.

The cost of the authentication method

time to read 3 min | 565 words

You might have noticed that we are doing a lot of work around performance. Some of this work can be done by just optimizing the way we do certain operations, but for some things optimization isn’t enough, and a behavior change is also required.

In this case, we are talking about the authentication method for RavenDB. The current way it works goes something like this.

  • The user is making a request to a RavenDB server.
  • The server requires security, and asks the user to authenticate. For this case, we’ll use API Keys.
  • The user & server will have a side channel discussion for authentication via API Keys, which will result in a token.
  • This token is sent as a header in all future requests.
  • The token is cryptographically signed, so the server can validate that it is valid.

So far, so good, but this does pose some issues.

To start with, we took a lot from OAuth, so that means that we assume that there are multiple entities in this. The user, the server and the authenticator, and the use of cryptographic signature is meant to ensure that the server can trust the token generated by the authenticator and served by the user.

However, in pretty much all cases, the server and the authenticator are the same. There are some special cases relating to replication and load balancing, but they aren’t relevant at this point, and we can work around them if need be.

And we’ll want to do that. The reason that this is problematic is very simple: right now, we need to cryptographically validate the token on every single request. And that is expensive. Of course, in some cases it is actually meant to be expensive; that is why it is secure.

So we need to reduce this cost, and we can reduce that by saying that we can just keep a simple token. Conceptually, this now becomes:

  • The user is making a request to a RavenDB server.
  • The server requires security, and asks the user to authenticate. For this case, we’ll use API Keys.
  • The user & server will have a side channel discussion for authentication via API Keys, which will result in a token.
  • This token is sent as a header in all future requests.
  • The token is just a unique id (guid), which is stored in the server’s memory.

Because the token is unique, and per server, we don’t need to do any crypto validation on the value. We can just check if the value is in our memory, and that would be it.
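Conceptually, the server-side check becomes something like this (a sketch of the idea only; the Principal type and the names here are hypothetical, not RavenDB's actual code):

// Hypothetical sketch: a thread-safe map from the token guid to the
// authenticated principal, populated during the side channel handshake.
private static readonly ConcurrentDictionary<string, Principal> _activeTokens =
    new ConcurrentDictionary<string, Principal>();

public static Principal TryAuthenticate(string token)
{
    Principal principal;
    // No crypto involved, just a dictionary lookup.
    _activeTokens.TryGetValue(token, out principal);
    return principal; // null means the token is unknown
}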

The first reaction we typically get is “but what about security? Someone can re-use that guid to authenticate as someone else”. That assumes that you can sniff the conversation between client and server. If you can do that, then you probably already don’t care about security, since you aren’t using HTTPS. But note that the same behavior occurs with the crypto token as well. If you manage to grab it, you can present it to the server as your own token, and the server can’t tell the difference between a valid client and a malicious one.

The good part of this is that now the authentication part of the request is a dictionary lookup, instead of cryptographic signature validation. And the performance of authenticated requests is much higher.

Fun async tricks for getting better performance

time to read 8 min | 1552 words

I got into a discussion with Mark about the usefulness of async. In particular, Mark said:

Sync will always have less CPU usage because of less kernel transitions, less allocation, less synchronization.

This effect only comes into play when after issuing the IO the thread pool can immediately process a queued task. This is only the case when the CPU is highly saturated. At 90% saturation you have a good chance that this is happening. But nobody runs their production systems that way.

And while this is sort of correct, in the sense that a major benefit of async is that you free the working thread for someone else to work on, and that this is typically mostly useful under very high load, async is most certainly not useful just for high throughput situations.

The fun part about having async I/O is the notion of interleaving both I/O and computation together. Mark assumes that this is only relevant if you have high rate of work, because if you are starting async I/O, you have to wait for it to complete before you can do something interesting, and if there isn't any other task waiting, you are effectively blocked.

But that doesn't have to be the case. Let us take the following simple code. It isn't doing anything amazing, it is just filtering a text file:

public void FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }

            if(hasBadWord == false)
                writer.WriteLine(line);
        }
    }
}

Here is the async version of the same code:

public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (line.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }

            if(hasBadWord == false)
                await writer.WriteLineAsync(line);
        }
    }
}

If we assume that we are running on a slow I/O system (maybe a large remote file), then in both versions of the code we'll see an execution pattern like so:

[image: execution timeline, I/O and CPU running one after the other]

In the sync case, the I/O is done in a blocking fashion; in the async case, we aren't holding up a thread, but the async version needs to do a more complex setup, so it is likely to be somewhat slower.

But the key is, we don't have to write the async version in this manner. Consider the following code:

public async Task FilterBadWords(string inputFile, string outputFile)
{
    var badWords = new[] { "opps", "blah", "dang" };

    using (var reader = File.OpenText(inputFile))
    using (var writer = File.AppendText(outputFile))
    {
        // Start the first read before we enter the loop.
        var lineTask = reader.ReadLineAsync();
        Task writeTask = Task.CompletedTask;
        while (true)
        {
            // Wait for the line requested on the previous iteration,
            // and for the previous write to finish.
            var currentLine = await lineTask;
            await writeTask;
            if (currentLine == null)
                break;
            // Immediately request the next line; it is fetched while we
            // process the current one.
            lineTask = reader.ReadLineAsync();

            bool hasBadWord = false;
            foreach (var word in badWords)
            {
                if (currentLine.IndexOf(word, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    hasBadWord = true;
                    break;
                }
            }

            if (hasBadWord == false)
                writeTask = writer.WriteLineAsync(currentLine);
        }
    }
}

The execution pattern of this code is going to be:

[image: execution timeline, I/O and CPU interleaved]

The key point is that we start the async I/O, but we aren't going to await it immediately. Instead, we are going to do some other work first (processing the current line while we fetch the next one and flush the previous write).

In other words, when we schedule the next bit of I/O to be done, we aren't going to ask the system to find us some other piece of work to execute; we are the next piece of work to execute.

Nitpicker corner: This code isn't actually likely to have this usage pattern, this code is meant to illustrate a point.

The cost of routing in NancyFX

time to read 7 min | 1393 words

I got the following feedback on this post on Twitter.

[image: tweet screenshot]

Using trie for routing isn't new by a long shot. It is a fairly obvious optimization. But not all tries are equal to one another. I decided to see what it would look like using Nancy.

Here is the code I run:

 class Program
 {
     static void Main(string[] args)
     {
         using (var host = new NancyHost(new Uri("http://localhost:8080")))
         {
             host.Start();
             Console.ReadLine();
         }
     }
 }

 public class SampleModule : Nancy.NancyModule
 {
     public SampleModule()
     {
         Get["/"] = _ => "Hello World!";
     }
 }

And like before, we tested this by running it in Release mode using the following gobench command:

.\gobench.exe -c 100 -r 25000 -u http://localhost:8080/

The results weren't good.

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             23,364 hits/sec
Read throughput:                   3,785,046 bytes/sec
Write throughput:                  2,009,345 bytes/sec
Test time:                               107 sec

On the same machine, similar tests using WebAPI gave me 56,818 hits/sec, and using my routing trie we get 86,206 hits/sec. Nancy also seems to be using a lot more CPU than the other two solutions.

Now, we could leave it like that, but that wouldn't be fair. Let us dig a little bit further and see if we can figure out what is costing us there.

I ran the same scenario under the profiler, and we got some interesting results.

[image: profiler results for Nancy]

In terms of percentage, it compares favorably to my trie's results: 1.91% compared to 1.02%. But in terms of absolute time (same load, both running under the profiler), we see 64.7 seconds for 2.5 million routes in the Nancy implementation vs. 2.2 seconds for 2.5 million routes in my implementation. We'll ignore the rest of the costs of Nancy, because there is quite a lot here that doesn't matter. Let us focus specifically on the routing trie, shall we?

[image: profiler view of the Nancy routing trie]

Now, here is the really sad thing. This is running under a profiler, which slows things down tremendously.

Under those conditions, this code still manages to process a route in just 25 μs. That is pretty awesome, regardless of how you look at it, and for the vast majority of applications it is more than fast enough; the moment you start actually doing stuff, the routing cost isn't even in the same territory. But one of the things that we have been focusing on is the need to speed RavenDB up by [not ready to divulge yet, but let us say… a lot]. So we pay attention to every little bit.

Now, let us look in detail on this code in GetMatches, which at the time of this writing looked something like this:

return this.routeTries[method].GetMatches(path.Split(splitSeparators, StringSplitOptions.RemoveEmptyEntries), context)
                              .ToArray();

I'm actually going to ignore the code we see here and just talk about the profiler output, because in this case, the code is misleading.

We can see a whole bunch of things that would fail our current code review. Here is a commonly called method that does a whole bunch of expensive allocations.

On each call, we:

  • Split the string – which is both computationally expensive and creates multiple substrings, plus the array to contain them.
  • Create two dictionaries:
    • one inside the TrieNode.GetMatches call:
      return this.GetMatches(segments, 0, new Dictionary<string, object>(this.AdditionalParameters), context);
    • one inside the TrieNode.BuildResults call:
      var parameters = new Dictionary<string, object>(capturedParameters);
  • Call ToArray – which creates another allocation for the results.
  • Use Any & Select – which often require the allocation of a lambda instance.

There is also the dictionary lookup (which is actually done twice, once using ContainsKey, then using the indexer):

if (!this.routeTries.ContainsKey(method))
{
    return MatchResult.NoMatches;
}

return this.routeTries[method] /* removed for clarity */;

Note that just replacing this with a TryGetValue is likely to give us about 8% improvement in this method.
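Roughly, that change would look like this (the exact type names here are my assumption, not the actual Nancy declarations):

TrieNode trie;
if (this.routeTries.TryGetValue(method, out trie) == false)
{
    return MatchResult.NoMatches;
}

// One lookup instead of two.
return trie.GetMatches(path.Split(splitSeparators, StringSplitOptions.RemoveEmptyEntries), context)
           .ToArray();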

It actually gets better. Looking at the code, it seems we hit the code path for the root route, while other routes need to do more work. I changed the route to be Get["/values"], and ran the benchmark under the profiler again.

The only change I made is that I now have a non-root route, and the cost of routing is up by about 75%:

[image: profiler results for the non-root route]

Now, I'm being utterly unfair here. NancyFX actually has a really sophisticated routing system, it does a lot. I mean, really a lot.

My routing trie, on the other hand, is intentionally dumb and requires the request handling code to participate in some decisions. This is by design, because it gives me the chance to do a whole bunch of optimizations cheaply. To start with, I don't need to do any allocations whatsoever throughout the call, and the request handling code can also skip string allocations in most cases.

This isn't an issue of algorithms. It is an issue of what you are trying to do.

For comparison, doing the same amount of work in my current version (which was even further optimized) gives us this:

[image: profiler results for my routing trie]

You can see what the cost of routing is for us. While the optimized root route cost for Nancy is 25 μs, we are able to route the request in 0.23 μs; that is two orders of magnitude faster. My entire routing cost can fit 10 times into the cost of a single dictionary lookup in the NancyFX case.

So, to answer the original question, no, I really couldn't just use NancyFX code to get "this optimization". The algorithm is important, but the actual implementation details are far more interesting and relevant.

The cost of async I/O, false assumptions and pride

time to read 8 min | 1522 words

As you might have noticed, we are doing a lot of performance work. We recently moved some of our code to use async I/O in the hope of getting even more performance from the system.

The result was decidedly not what we expected. On average we saw about a 10% – 30% reduction in speed, just from the use of async operations. So we decided to test this.

The test is simple, make a read of a large file (1.4GB) from a network drive without buffering. The synchronous code is:

 private static void SyncWork(int pos)
 {
     var sp = Stopwatch.StartNew();
     var buffer = new byte[1024 * 4];
     long size = 0;
     using (var sha = SHA1.Create())
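     // Note: ReportSize is 32 MB (the amount each thread reads, per the text), and
     // FILE_FLAG_NO_BUFFERING is the Win32 flag (0x20000000) cast to FileOptions.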
     using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
         FileOptions.SequentialScan | FILE_FLAG_NO_BUFFERING))
     {
         stream.Seek(pos * ReportSize, SeekOrigin.Begin);
         int read;
         while ((read = stream.Read(buffer, 0, buffer.Length)) != 0)
         {
             sha.ComputeHash(buffer, 0, read);
             size += read;
             if (size >= ReportSize)
             {
                 Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb sync {sp.ElapsedMilliseconds:#,#}");
                 return;
             }
         }
     }
 }

To make things interesting, we are reading 32 MB in 4KB chunks and computing their SHA1 hash. The idea is that this is a mix of both I/O and CPU operations. The machine I’m testing this on has 8 cores, so I run 16 copies of this code, with different start positions.

for (int i = 0; i < 16; i++)
{
    var copy = i;
    new Thread(state =>
    {
        SyncWork(copy);
    }).Start();
    Thread.Sleep(250);
}

The basic idea was to simulate work coming in, doing different things, and to simulate slow I/O and computation. 16 threads means that I have more threads than CPU cores, so we’ll have some context switches. Note that the use of unbuffered I/O means that we have to go over the network (slow).

The output of this code is:

Read 32 mb sync 8,666
Read 32 mb sync 8,794
Read 32 mb sync 8,995
Read 32 mb sync 9,080
Read 32 mb sync 9,123
Read 32 mb sync 9,299
Read 32 mb sync 9,359
Read 32 mb sync 9,593
Read 32 mb sync 9,376
Read 32 mb sync 9,399
Read 32 mb sync 9,381
Read 32 mb sync 9,337
Read 32 mb sync 9,254
Read 32 mb sync 9,207
Read 32 mb sync 9,218
Read 32 mb sync 9,243

Now, let us look at the equivalent async code:

private static async Task AsyncWork(int pos)
{
    var sp = Stopwatch.StartNew();
    var buffer = new byte[1024 * 4];
    using (var sha = SHA1.Create())
    using (var stream = new FileStream(@"p:\dumps\dump-raven.rar", FileMode.Open, FileAccess.Read, FileShare.Read, 4 * 1024,
        FileOptions.SequentialScan | FileOptions.Asynchronous | FILE_FLAG_NO_BUFFERING))
    {
        stream.Seek(pos * ReportSize, SeekOrigin.Begin);
        long size = 0;
        
        int read;
        while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            sha.ComputeHash(buffer, 0, read);
            size += read;
            if (size >= ReportSize)
            {
                Console.WriteLine($"Read {size / 1024 / 1024:#,#} mb async {sp.ElapsedMilliseconds:#,#}");
                return;
            }
        }
    }
}

Note that here I’m using an async handle, to allow for better concurrency. My expectation was that this code would interleave I/O and CPU together and result in fewer context switches, more CPU utilization and overall faster responses.

Here is the network utilization during the async test:

[image: network utilization during the async test]

And here is the network utilization during the sync test:

[image: network utilization during the sync test]

Trying the async version with 64KB buffers gives better results:

[image: network utilization, async with 64KB buffers]

And the output:

Read 32 mb async  8,290
Read 32 mb async 11,445
Read 32 mb async 13,327
Read 32 mb async 14,088
Read 32 mb async 14,569
Read 32 mb async 14,922
Read 32 mb async 15,053
Read 32 mb async 15,165
Read 32 mb async 15,188
Read 32 mb async 15,148
Read 32 mb async 15,040
Read 32 mb async 14,889
Read 32 mb async 14,764
Read 32 mb async 14,555
Read 32 mb async 14,365
Read 32 mb async 14,129

So it is significantly worse than the sync version when using 4KB buffers. The bad thing is that when using a 64KB buffer in the sync version, we have:

[image: network utilization, sync with 64KB buffers]

And the whole process completed in about 2 seconds.

I’m pretty sure that I’m doing everything properly, but it seems like the sync version is significantly cheaper.

Short summary: the solution is to throw all of the async code away in favor of pure sync code, because it is so much faster. Banish async, all hail the synchronous overload.

However, the plot thickens!

Before declaring death to asynchronicity, with thunderous applause, I decided to look further into things, and pulled out my trusty profiler.

Here is the sync version:

[image: profiler output for the sync version]

As expected, most of the time is spent in actually doing I/O. The async version is a bit harder to look at:

[image: profiler output for the async version]

This is interesting, because no I/O actually occurs here. At first I thought that this was because we are only using async I/O, so all of the missing time (notice that this is just 625 ms) is lost to the I/O system. But then I realized that we are also missing the ComputeHash costs.

Profiling async code is a bit harder, because you can’t just track the method calls. We found the missing costs here:

[image: profiler output showing the missing async costs]

And this is really interesting. As you can see, most of the cost is in the ReadAsync method. My first thought was that I accidentally opened the file in sync mode, turning the async call into a sync call. That didn’t explain the difference in costs from the sync version, though, and I verified that the calls are actually async.

Then I looked deeper:

[image: profiler output showing the Seek calls]

Why do we have so many seeks?

The answer lies in this code. And that explained it, including a big comment on why this happens. I created an issue to discuss this.

Calling SetFilePointer is typically very fast, since the OS just needs to update an internal structure. For some reason, it seems much more expensive on a remote share. I assume it needs to communicate with the remote share to update it on the position. The sad thing is that this is all wasted anyway, since the file position isn’t used in async calls; each actual call to ReadFileNative is given the offset to read at.

Trie based routing

time to read 4 min | 764 words

In my previous post, I discussed the cost of routing in MVC vs. our own implementation. The MVC routing / infrastructure consumed about 90% of the time, while the code that actually does stuff is left with less than 10%. In our implementation, the actual request handling code gets about 85% of the time, while all the infrastructure is handled in the other 15%.

Of that, a large portion of the time is actually spent on routing. In our tests, we saw something like 40% of the time spent in selecting the right controller action to execute, and that is when there is just one such action that can be run. Now, to be fair, this is a scenario where we don’t actually have any meaningful logic. We are just saying “hello world” over and over again, so the infrastructure costs are really noticeable in this benchmark. But that is the point, we want to see what the fixed costs of the infrastructure we use are.

Now, in RavenDB we have several hundred routes (last count was something upward of 400), and we noticed that routing takes a lot of time. We optimized the hell out of it, to the point of specializing it to our own situation, based on our own knowledge. But when we started to look at moving our code to Kestrel, we took the time to move to a system that is a better fit for us.

The major reason routing is such an expensive issue in MVC is that it does quite a lot. It needs to handle route selection, capturing values, coercing data to the right types and a lot more. All of that takes time, memory and CPU cycles. But as it turns out, we don’t really need all of that. We have relatively simple routing needs.

Here are a few example of RavenDB routes:

  • /admin/stats
  • /databases/{databaseName}/docs
  • /databases/{databaseName}/indexes/{*id}

So we have the fixed routes, like /admin/stats. We have routes that contain a database name in the middle, and we have routes with a suffix. We can take advantage of that to create an optimal solution.

In our case, we used a trie:

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings.

The cost of querying a trie is O(L), where L is the length of the key that we are querying. Here is how a very small trie looks in our tests.

[image: a small routing trie from our tests]

We start from the root of the trie, matching each character against the trie node key. Once we have consumed the full key, we check if the node has any children starting with the next character in the url. Note that the Children array here is of size 127. This allows us to index directly into the array (at a cost of O(1)) to find the next trie node. We didn’t want to use a dictionary here because of its size and its cost relative to an array access.
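Sketched in code, the node shape might look something like this (the field names and the handler type are my assumptions about the shape described, not the actual RavenDB implementation):

public class TrieNode
{
    public string Key;            // the run of characters this node matches
    public TrieNode[] Children;   // length 127, indexed directly by ASCII value
    public object Handler;        // non-null when a route terminates at this node
}

public static TrieNode Match(TrieNode root, string url)
{
    var current = root;
    var pos = 0;
    while (current != null)
    {
        for (int i = 0; i < current.Key.Length; i++, pos++) // consume the node key
        {
            if (pos >= url.Length || url[pos] != current.Key[i])
                return null;
        }
        if (pos == url.Length)
            return current.Handler != null ? current : null;
        if (url[pos] >= 127) // route definitions are ASCII only
            return null;
        current = current.Children[url[pos]]; // O(1) jump to the next node
    }
    return null;
}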

This has some interesting implications:

  • The route definition is limited to ASCII characters (but the actual urls are not, mind).
  • Regardless of the number of routes, we are ensured of reliable performance.

But that still doesn’t let us handle routes like the ones above. In order to do that, we had to extend the syntax a bit.

/admin/stats would match exactly. That is easiest.

/databases/*/docs is a route that has a capture. The * symbol says “matches anything but /”, so we can get the database name captured without complex matching. Note that by capture, I mean just record the start position and length, we don’t want to pull a substring out of that and pay for an extra allocation.

/databases/*/indexes/$ is another route that has special meaning. The $ symbol says “the match is done, even if the url has additional characters”. That way we can capture embedded ids easily (again, no allocations).
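As a hypothetical illustration of what a capture records (this is just the arithmetic, not the router's code): matching /databases/Northwind/docs against /databases/*/docs records only an offset and a length, with no substring allocation.

var url = "/databases/Northwind/docs";
int start = "/databases/".Length;               // 11
int length = url.IndexOf('/', start) - start;   // 9, i.e. "Northwind"
// The handler can slice the original string later, only if it really needs to:
var databaseName = url.Substring(start, length); // allocated on demand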

The end result?

[image: profiler results for the trie router]

Or roughly a route match in 0.9 μs. Pretty good, even if I say so myself.

The cost of routing

time to read 8 min | 1565 words

I was asked to see what the cost of WebAPI on Kestrel is, and it lines up nicely with a topic that I wanted to talk about: routing. So I thought I’d start with a small benchmark.

Here is a small Kestrel based server, using MVC / WebAPI:

public class Program
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();
    }

    public void Configure(IApplicationBuilder app, ILoggerFactory loggerfactory)
    {
        app.UseMvc();
    }

    public static void Main(string[] args)
    {
        var configBuilder = new ConfigurationBuilder()
            .Add(new MemoryConfigurationProvider(new Dictionary<string, string>
            {
                ["server.urls"] = "http://localhost:8080"
            }));

        IHostingEngine application = new WebHostBuilder(configBuilder.Build())
            .UseStartup<Program>()
            .UseServer("Microsoft.AspNet.Server.Kestrel")
            .Build();


        using (application.Start())
        {
            Console.WriteLine("ready");
            Console.ReadLine();
        }
    }
}

And here is a simple controller:

[Route("api/[controller]")]
public class ValuesController : Controller
{
    // GET api/values
    [HttpGet()]
    public string Get()
    {
        return "Hello World";
    }
}

As you can see, this couldn’t really be any more trivial. I tested this in Release mode, using the following gobench command:

.\gobench.exe -c 100 -r 25000 -u http://localhost:8080/api/values

This was run after a couple thousand warm up requests. In other words, we are going to do 2.5 million requests. I intentionally chose that value, because it forces the server to do enough work to be meaningful, and the test run is long enough to see other factors creeping in. Note that this isn’t a “real” benchmark, since both gobench and this code are running on the same machine. But it is a good baseline.

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             56,818 hits/sec
Read throughput:                   9,261,363 bytes/sec
Write throughput:                  5,454,545 bytes/sec
Test time:                                44 sec

Those are good numbers. Now let us see that with my code:

[RavenAction("/api/values", "GET")]
public Task ValuesGet()
{
    return HttpContext.Response.WriteAsync("Hello world");
}

There is no funny business here. We just route the request using my code, and then write the response to the client. The results:

Requests:                          2,500,000 hits
Successful requests:               2,500,000 hits
Network failed:                            0 hits
Bad requests failed (!2xx):                0 hits
Successful requests rate:             86,206 hits/sec
Read throughput:                  10,517,241 bytes/sec
Write throughput:                  8,275,862 bytes/sec
Test time:                                29 sec

That is over a 50% improvement in throughput (86,206 vs. 56,818 hits/sec).

Now, there are a couple of things that work out better for my code. To start with, it is calling Response.WriteAsync directly, while the standard WebAPI code needs to get the result, figure out how to write it, etc.

In the profiler, this looks really interesting:

[image: profiler output for the WebAPI version]

So the time for actually executing the action is less than 10% of the request time. Everything else is some sort of setup.

In fact, let us dig deeper: the time to run our action is just 0.2% of the 10% that this method takes:

[image: profiler detail of the action invocation]

To be fair, the method just returns a constant string. I would be shocked if it wasn’t really fast.

Most of the time happens in the CreateActionResult, in which we select how we write. Let us look at this a bit differently:

[image: profiler view of the MVC filter pipeline]

The InvokeExceptionFilterAsync is the part responsible for actually calling our code and getting the result.

The InvokeAllResultFiltersAsync is responsible for figuring out how to write the output to the client. And this is the part where we actually write to the network:

[image: profiler view of the network write]

And here is how we do in the same scenario:

[image: profiler output for our implementation]

You can see that 85% of the time is spent servicing the request, with just 15% dedicated to managing the infrastructure.

I’ll discuss the details of that in my next post, though.

Profiling mysteries

time to read 3 min | 473 words

In our profiling, we ran into an expensive mystery call to string.Join. Nothing in our code called it directly, and yet it was expensive.

[image: profiler output showing the string.Join cost]

Looking at the code didn’t get us anywhere, so I profiled things again, this time using the debug configuration. The nice thing about the debug configuration is that it doesn’t inline methods (among other things), so what gets executed is much closer to the actual code.

Here are the results:

[image: profiler output, debug configuration]

And now it is clear what is going on.

Here is the offending line:

var tryMatch = _trie.TryMatch(method, context.Request.Path);

With context.Request.Path being a PathString, and TryMatch accepting a string, we are silently calling the implicit string operator, which just calls PathString.ToString(), which just calls ToUriComponent(), which looks like:

public string ToUriComponent()
{
    if (HasValue)
    {
        if (RequiresEscaping(_value))
        {
            // TODO: Measure the cost of this escaping and consider optimizing.
            return String.Join("/", _value.Split('/').Select(EscapeDataString));
        }
        return _value;
    }
    return String.Empty;
}

Looking into the current code, it has already been optimized by removing the LINQ call and its associated costs, but that is still an expensive call to make.

The fix was to get the string directly:

var tryMatch = _trie.TryMatch(method, context.Request.Path.Value);

And here is a 10% reduction in costs. :-)

Dynamic code generation in C#

time to read 4 min | 666 words

Let us assume that we have the following simple task:

Given a Dictionary<string, string>, convert that dictionary into a type in as performant a manner as possible. The conversion will happen many times, and first time costs are acceptable.

The answer we came up with is to dynamically generate the code based on the type. Basically, here is how it looks:

public static Func<Dictionary<string, string>, T> Generate<T>()
    where T : new()
{
    var dic = Expression.Parameter(typeof (Dictionary<string, string>), "dic");
    var tmpVal = Expression.Parameter(typeof (string), "tmp");
    var args = new List<MemberAssignment>();
    foreach (var propertyInfo in typeof(T).GetProperties())
    {
        var tryGet = Expression.Call(dic, "TryGetValue", new Type[0], 
            Expression.Constant(propertyInfo.Name),
            tmpVal);

        Expression value = tmpVal;
        if (propertyInfo.PropertyType != typeof (string))
        {
            var convertCall = Expression.Call(typeof(Convert).GetMethod("ChangeType", 
                new Type[] { typeof(object), typeof(Type) }), tmpVal,
                Expression.Constant(propertyInfo.PropertyType));
            value = Expression.Convert(convertCall, propertyInfo.PropertyType);
        }

        var conditional = Expression.Condition(tryGet, value, 
            Expression.Default(propertyInfo.PropertyType));

        args.Add(Expression.Bind(propertyInfo, conditional));
        
    }
    var newExpression = Expression.New(typeof(T).GetConstructor(new Type[0]));

    var expression = Expression.Lambda<Func<Dictionary<string, string>, T>>(
        Expression.Block(new[] { tmpVal },Expression.MemberInit(newExpression,args)),
        dic);

    return expression.Compile();
}

As an aside, this is an intimidating piece of code, but it is about a bazillion times better than having to do this manually using IL manipulation.

What this code does is dynamically generate the following method:

(Dictionary<string,string> dic) => {
    string tmp;
    return new User
    {
        Name = dic.TryGetValue("Name", out tmp) ? tmp : default(string),
        Age = dic.TryGetValue("Age", out tmp) ? (int)Convert.ChangeType(tmp, typeof(int)) : default(int)
    };
}

This is pretty much the fastest way to do this, because this is how you would write it manually. And compiling the expression dynamically means that we don’t have to write this conversion by hand for each and every type we run into.
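For example, given a hypothetical User type matching the generated code above:

public class User
{
    public string Name { get; set; }
    public int Age { get; set; }
}

var toUser = Generate<User>(); // pay the compilation cost once
var user = toUser(new Dictionary<string, string>
{
    ["Name"] = "Oren",
    ["Age"] = "42"
});
// user.Name == "Oren", user.Age == 42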
