Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:


+972 52-548-6969

Posts: 7,357 | Comments: 50,734

Privacy Policy Terms
filter by tags archive
time to read 4 min | 629 words

A customer was experiencing large memory spikes in some cases, and we were looking into the allocation patterns of some of the queries that were involved. One of the things that popped up was a query that allocated just under 30GB of managed memory during its processing.

Let me repeat that, because I bears repeating. That query allocated 30(!) GB(!) during its execution. Now, that doesn’t mean that it was consuming 30 GB, it was just the allocations involved. Most of that memory was immediately discarded during the operation. But 30 GB of garbage to cleanup puts a lot of pressure on the system. We took a closer look at the offensive query. It looked something like this:

from index “Notifications/RoutingAndPriority”
where startsWith(Route, $routeKeyPrefix)
order by
Priority desc

That does not seam like a query that should be all that expensive. But details matter, so we dove into this. For this particular query, the routes are hierarchical structure that are unique for each message. Something like:

  • notifications/traffic/new-york-city/67a81019-941b-4d04-a0db-0559ed45343c
  • notifications/emergency/las-vegas/0a8e18fb-563b-4b6a-8e93-e10e08239656

And the queries that were generated were on using the city & topic to filter the information that they were interested in.

The customer in question had a lot of notifications going on at all times. And each one of their Route was unique. Internally, RavenDB uses Lucene (currently Smile ) to handle searches, and Lucene is using an inverse index to execute queries.

The usual way to think about is like this:


We have a list of terms (Brown, Green & Purple) and each of them has  list of the match documents that contain the particular term.

The process of issuing a prefix query then is easy, scan all entries that match the prefix and return their results. This is indeed what Lucene is doing. However… while it is doing that, it will do something like this:

Pay close attention to what is actually happening here. There are two enumerators that we work with. One for the terms for the field and one for the documents for a specific term.

All of this is perfectly reasonable, but thee is an issue. What happens when you have a lot of unique values? Well, then Lucene will have a lot of iterations of the loop. In this case, each term has just a single match, and Lucene is pretty good in optimizing search by specific term.

The actual problem is that Lucene allocates a string instance for each term. If we have 30 million notifications for New York’s traffic, that means that we’ll allocate 30 million strings during the processing of the query. We aren’t retaining these strings mind. They’ll be cleaned up by the GC quickly enough, but that is an additional costs that we don’t actually want.

Luckily, in this case, there is a much simple solution. Given that the pattern of route is known, we can skip the unique portion of the route. That means that in our index, we’ll do something similar to:

Route = doc.Route.Substring(0, doc.Route.LastIndexOf('/') + 1)

Once that is done, the number of unique matches there would be negligible. There would be no more allocations galore to observe and overall system performance is much improved.

We looked into whatever there is something that we can do with Lucene to avoid this allocations issue, but it is endemic to the way the API works. The longer term plan is to fix that completely, of course. We are making great strides there already Smile.

In short, if you are doing startsWith() queries or similar, pay attention to the number of unique terms that you have to go through. A simple optimization on the index like the one above can bring quite a bit of dividends.

time to read 3 min | 488 words

I’m trying to compare indexing speed of Corax vs. Lucene. Here is an interesting result:


We have two copies of the same index, running in parallel on the same data. And we can clearly see that Lucene is faster. Not by a lot, but enough to warrant investigation.

Here is the core of the work for Lucene:


And here it is for Corax:


If you look at the results, you’ll see something really interesting.

For the Corax version, the MapItems.Execute() is almost 5% slower than the Lucene version.

And that really pisses me off. That is just flat out unreasonable to see.

And the reason for that is that the MapItems.Execute() is identical in both cases. The exact same code, and there isn’t any Corax or Lucene code there. But it is slower.

Let’s dig deeper, and we can see this interesting result. This is the Lucene version, and the highlighted portion is where we are reading documents for the indexing function to run:


And here is the Corax version:


And here it is two thirds more costly? Are you kidding me? That is the same freaking code and is utterly unrelated to the indexing.

Let’s dig deeper, shall we? Here is the costs breakdown for Lucene, I highlighted the important bits:


And here is the cost breakdown for Corax


I have to explain a bit about what is going on here. RavenDB doesn’t trust the disk and will validate the data it reads from it the first time it loads a page.

That is what the UnlikelyValidatePage is doing.

What we are seeing in the profiler results is that both Corax and Lucene are calling GetPageInternal() a total of 3.69 million times, but Corax is actually paying the cost of page validation for the vast majority of them.

Corax validated over 3 million pages while Lucene validated only 650 thousand pages. The question is why?

And the answer is that Corax is faster than Lucene, so it is able to race ahead. When it races ahead, it will encounter pages first, and validate them. When Lucene comes around and tries to index those documents, they were already validated.

Basically, Lucene is surfing all the way forward on the wavefront of Corax’s work, and ends up doing a lot less work as a result.

What this means, however, is that we need to test both scenarios separately, on cold boot. Because otherwise they will mess with each other results.

time to read 5 min | 943 words

I’m going to go back a few steps and try to see where I should be looking at next, to see where I should pay the most attention. So far in this series, I mostly focused on how we read and process the data. But I think that we ought to take a step or two back and see where we are at in general. I ran the version with Pipelines and string usage in the profiler, trying to see where we are at. For example, in a previous post, the ConcurrentDictionary that I was using had a big performance cost. Is that still the case now?

Here are the current hotspots in the codebase:


Looking at this with more detail, we have:


That is… interesting. Let’s look at the code for HandleConnection right now?

Looking at the code and the profiler results, I wonder if I can do better here. Here is a small change that gives me ~2% speed boost:

The idea is that we parallelize reading from and writing to the network. It is a small boost, but any little bit helps, especially once we get into the cascading impacts of repeated optimizations.

Looking into this, we have almost two billion calls to ReadAsync, let’s see what is costly there:


That is… wow.

Why is InternalTokenSource so expensive? I’m willing to bet that the issue is this one, it is taking a lock. In my use case, I know that there is a single thread running this, so it is worth seeing if I can skip it. Unfortunately, there isn’t an easy way to skip that check. Fortunately, I can copy the code from the framework and modify it locally, to see what the impact of that would be. So I did just that (initialized once in the constructor):


Of course, that is very much a brute force approach, and not one that I would recommend. Looking at the code, it looks like there is a reason for this usage (handling cancellation of operations), but I’m ignoring that for now. Let’s see what the profiler says now:


That means that we have about 40% improvements in per call costs. As I mentioned, that is not something that we can just do, but it is an interesting observation on the cost of reading using PipeReader.

Another aspect that is really interesting is the backend state we have, which is a ConcurrentDictionary. If we’ll look at its cost, we have:


You’ll note that I’m using the NonBlocking NuGet package, which provides a ConcurrentDictionary implementation that isn’t using locking. If we’ll use the default one from the framework, which does use locking, we’ll see:


You can see the difference costs better here:


Note that there is a significant cost difference between the two options (in favor of NonBlocking). But it doesn’t translate to much of a difference when we run a real world benchmark.

So what is next?

Looking at the profiler result, there isn’t really much that we can push forward. Most of the costs we have are in the network, not in the code we run.


Well, that isn’t quite true, is it? The bulk of our code is in ParseNetworkData call, which looks like this:


So the total time we spend actually executing the core functionality of our server is really negligible. A lot of time is actually spent parsing the commands from the buffer. Note that here, we don’t actually do any I/O, all operations are operating on buffers in memory.

The Redis protocol isn’t that friendly for machine parsing, requiring us to do a lot of lookups to find the delimiters (hence the IndexOf() calls). I don’t believe that you can significantly improve on this. This means that we have to consider other options for better performance.

We are spending 35% of our runtime in parsing the command streams from the client, and the code we execute is less than 1% of our runtime. I don’t think that there are significant optimization opportunities remaining for the stream parsing, so that leaves us with the I/O that we have left. Can we do better?

We are currently using async I/O and pipelines. Looking at the project that got me interested in this topic, it is using IO_Uring (via this API) on Linux for their needs. Their parsing is straightforward, as well, see here. Quite similar to the manner in which my code operates.

So to get to the next stage in performance (as a reminder, we are now at the 1.8 million req / sec) we’ll probably need to go to the ring based approach as well. There is a NuGet package to support it, but that moves this task from something that I can spend a few hours in an evening to a couple of days / full week of effort. I don’t think that I’ll pursue this in the near future.

time to read 3 min | 470 words

A customer opened a support call telling us that they reached the scaling limits of RavenDB. Given that they had a pretty big machine specifically to handle the load they were expecting, they were (rightly) upset about that.

A short back and forth caused us to realize that RavenDB started to fail shortly after they added a new customer to their system. And by fail I mean that it started throwing OutOfMemoryException in certain places. The system was not loaded and there weren’t any other indications of high load. The system had plenty of memory available, but critical functions inside RavenDB would fail because of out of memory errors.

We looked at the actual error and found this log message:

Raven.Client.Exceptions.Database.DatabaseLoadFailureException: Failed to start database orders-21
At /data/global/ravenData/Databases/orders-21
 ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName)
   at System.Threading.Thread.StartCore()
   at Raven.Server.Utils.PoolOfThreads.LongRunning(Action`1 action, Object state, String name) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Utils\PoolOfThreads.cs:line 91
   at Raven.Server.Documents.TransactionOperationsMerger.Start() in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\TransactionOperationsMerger.cs:line 76
   at Raven.Server.Documents.DocumentDatabase.Initialize(InitializeOptions options, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DocumentDatabase.cs:line 388
   at Raven.Server.Documents.DatabasesLandlord.CreateDocumentsStorage(StringSegment databaseName, RavenConfiguration config, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DatabasesLandlord.cs:line 826 

This is quite an interesting error. To start with, this is us failing to load a database, because we couldn’t spawn the relevant thread to handle transaction merging. That is bad, but why?

It turns out that .NET will only consider a single failure scenario for a thread failing to start. If it fails, it must be because the system is out of memory. However, we are running on Linux, and there are other reasons why that can happen. In particular, there are various limits that you can set on your environment that would limit the number of threads that you can set.

There are global knobs that you should look at first, such as those:

  • /proc/sys/kernel/threads-max
  • /proc/sys/kernel/pid_max
  • /proc/sys/vm/max_map_count

Any of those can serve as a limit. There are also ways to set those limits on a per process manner.

There is also a per user setting, which is controlled via:

/etc/systemd/logind.conf: UserTasksMax

The easiest way to figure out what is going on is to look at the kernel log at that time, here is what we got in the log:

a-orders-srv kernel: cgroup: fork rejected by pids controller in /system.slice/ravendb.service

That made it obvious where the problem was, in the ravendb.service file, we didn’t have TasksMax set, which meant that it was set to 4915 (probably automatically set by the system depending on some heuristic).

When the number of databases and operations on the database reached a particular size, we hit the limit and started failing. That is not a fun place to be in, but at least it is easy to fix.

I created this post specifically so it will be easy to Google that in the future. I also created an issue to get a better error message in this scenario.

time to read 5 min | 932 words

One of the high costs that we have right now in my Redis Clone is strings. That is actually a bit misleading, take a look here:


Strings take 12.57% of the runtime, but there is also the GC Wait, where we need to cleanup after them. That means that the manner in which we are working is pretty inefficient.

Our test scenario right now also involves solely GET and SET requests, there are no deletions, expirations, etc. I mention that because we need to consider what we’ll replace the strings with.

The simplest option is to replace that with a byte array, but that is still managed memory and incurs the costs associated with GC. We can pool those byte arrays, but then we have an important question to answer, how do we know when a buffer is no longer used?

Consider the following set of events:

Time Thread #1 Thread #2
1 SET abc  
2   GET abc
3 SET abc  
4   Use the buffer we got on #2

In this case, we have thread #2 accessing the value buffer, but we replaced that buffer. We need to let thread #2 keep using this buffer until it is done.

This little tidbit put us right back at concurrent manual memory management, which is scary. We can do things in a slightly different manner, however. We can take advantage of the GC to support us, like so:

The idea is pretty simple. We have a class that holds a buffer, and when the GC notices that it is no longer in use, it will add its buffer back to the pool. The idea is that we rely on the GC to resolve this (really hard) problem for us. The fact that this moves the cost to the finalizer means that we can not worry about this. Otherwise, you have to jump through a lot of hoops.

The ReusableBuffer class also implements GetHashCode() / Equals() which allow us to use it as a key in the dictionary.

Now that we have the backing store for keys and values, let’s see how we can read & write from the network. I’m going to go back to the ConcurrentDictionary implementation for now, so I’ll handle only a single concept at a time.

Before, we used StreamReader / StreamWriter to do the work, now we’ll use PipeReader / PipeWriter from System.IO.PIpelines. That will allow us to easily work with the raw bytes directly and it is meant for high performance scenarios.

I wrote the code twice, once using the reusable buffer model and once using PIpeReader / PipeWriter and allocating strings. I was surprised to see that my fancy reusable buffers were within 1% performance of the (much simpler) strings implementation. That is 1% in the wrong direction, by the way.

On my machine, the buffer based system was 165K ops/second while the strings based one was 166K ops/sec.

Here is the reusable buffer based approach complete source code. And to compare, here is the string based one. The string based one is about 50% shorter in terms of lines of code.

I’m guessing that the allocation pattern is really good for the kind of heuristics that the GC does. We either have long term objects (in the cache) or very short term ones.

It’s worth pointing out that the actual parsing of the commands from the network isn’t using strings. Only the actual keys and values are actually translated to strings. The rest I’m doing using raw bytes.

Here is what the code looks like for the string version under the profiler:


And here is the same thing using the reusable buffer:


There are a few interesting things to note here. The cost of ExecCommand is almost twice as high as the previous attempt. Digging deeper, I believe that the fault is here:

This is the piece of code that is responsible for setting an item in the dictionary. However, note that we are doing a read for every write? The idea here is that if we have a set on an existing item, we can avoid allocating the buffer for the key again, and reuse it.

However, that piece of code is in the critical path for this benchmark and it is quite costly. I changed it to do the allocations always, and we got a fairly consistent 1% – 3% faster than the string version. Here is what this looks like:


In other words, here is the current performance table (under the profiler):

  • 1.57 ms  - String based 
  • 1.79 ms - Reusable buffer based (reduce memory usage)
  • 1.04 ms - Reusable buffer (optimized lookup)

All of those numbers are under the profiler, and on my development machine. Let’s see what we get when I’m running them on the production instances, shall we?

  • String based – 1,602,728.75 ops/sec
  • Reusable buffer (with reducing memory code) – 1,866,676.53 ops/sec
  • Reusable buffer (optimized lookup) – 1,756,930.64

Those results do not match with what we see in my development machine. The likely reason is that the amount of operations is high enough and the load is sufficiently big that we are seeing a much bigger impact from the memory optimization at scale.

That is the only conclusion I can draw from the fact that the memory reduction code, which adds costs, is actually able to process more requests/seconds under such load.

time to read 4 min | 753 words

Now that I’m done with the low hanging fruits, I decided to shift the Redis implementation to use System.IO.Pipelines. That is a high performance I/O API that is meant specifically for servers that need to eke out all the performance out of the system.

The API is a bit different, but it follows a very logical pattern and makes a lot of sense. Here is the main loop of handling commands from a client:

The idea is that we get a buffer from the network, we read everything (including pipelined commands) and then we flush to the client. The more interesting things happen when we start processing the actual commands, because now we aren’t utilizing StreamReader but PipeReader. So we are working at the level of bytes, not strings.

Here is what this (roughly) looks like, I’m not showing the whole thing because I want to focus on the issue that I ran into:

The code is reading from the buffer, parsing the Redis format and then executing the commands. It supports multiple commands in the same buffer (pipelining) and it has absolutely atrocious performance.

Yes, the super speedy API that is significantly harder to get right (compared to the ease of working with strings) is far slower. And by far slower I mean the following, on my development machine:

  • The previous version clocks at around 126,017.72 operations per second.
  • This version clocks at less than 100 operations per second.

Yes, you read that right, less than one hundred operations per second compared to over hundred thousands for the unoptimized version.

That was… surprising, as you can imagine.

I actually wrote the implementation twice, using different approaches, trying to figure out what I was doing wrong. Surely, it can’t be that bad.

I took a look at the profiler output, to try to figure out what is going on:


It says, quite clearly, that the implementation is super bad, no? Except, that this is what you are supposed to be using. So what is going on?

The underlying problem is actually fairly simple and relates to how the Pipelines API achieves its performance. Instead of doing small calls, you are expected to get a buffer and process that. Once you are done processing the buffer you can indicate what amount of data you consumed, and then you can issue another call.

However, there is a difference between consumed data and examined data. Consider the following data:

xxxxxxxxxx ... xxxxxx

What you can see here is a pipelined command, with 335 bytes in the buffer.  We’ll process all of those commands in a single hit, except… look at the highlighted portion. What do we have there?

We have a partial command. In other words, we are expected to execute a GET with a key size of 15 bytes, but we only have the first 4 bytes here. That is actually expected and fine. We consumed all the bytes until the highlighted portion (thus letting the PipeReader know that we are done with them). The problem is that when we issue a call now, we’ll get the highlighted portion (which we didn’t consume), but we aren’t ready to process that. Data is missing. We indicate that to the PipeReader using the examined portion. So the PipeReader knows that it needs to read more from the network.

However… my code has a subtle bug. It will report that it examined the yellow highlight, not the green one. In other words, we tell the PipeReader that we consumed some portion of the buffer, and examined some more, but there are still bytes on the buffer that are neither consumed nor examined. That means that when we issue the read call, expecting to get data from the network, we’ll actually get the same buffer again, to do the exact same processing.

Eventually, we’ll have more data in the buffer from the other side, so the correctness of the solution isn’t impacted. But it will kill your performance.

The fix is really simple, we need to tell the PipeReader that we examined the entire buffer, so it will not do a busy wait and wait for more data from the network. Here is the bug fix:

With that change in place, we can hit 187,104.21 operations per second! That is 50% better, which is awesome. I haven’t profiled things yet properly, because I also want to address another issue, how are we going to deal with the data from the network. More on that in my next post.

time to read 2 min | 256 words

After achieving 1.25 million ops/sec, I decided to see what would happen if I would change the code to support pipelining. That ended up being quite involved, because I needed to both keep track of all the incoming work as well as send the work to multiple locations. The code itself is garbage, in my opinion. It is worth it only as far as it points me inthe right direction in terms of the overall architecture. You can read it below, but it is a bit complex. We read from the client as much as we are able, then we send it to each of the dedicated threads to run it.

In terms of performance, it is actually slower than the previous iteration (by about 20%!), but it serves a very important aspect, it makes it easy to tell where the costs are.

Take a look at the following profiler result:


You can see that we are spending a lot of time in I/O and in string processing. The GC time is also quite significant.

Conversely, when we actually process the commands from the clients, we are spending most of the time simply idling.


I want to tackle this in stages. The first part is to stop using strings all over the place. The next stage after that will likely be to change the I/O model.

For now, here is where we stand:


time to read 5 min | 974 words

My previous attempts to write a Redis clone were done in about as straightforward a way as possible. Open a socket to listen on, have a separate Task for each client that reads from the network, parse the command and execute it. There are some smarts around supporting pipelining, but that is pretty much it.

Let’s take a step back and build ourselves a Redis clone that matches the actual Redis architecture more closely. In order to do that, I’ll need to do everything in a single thread. That is… surprisingly hard to do in C#. There are no APIs for doing the kind of work that Redis is doing. To be rather more exact, there is the Socket.Select() method, but that requires building everything on top of that (meaning that we have to handle buffering, string handling, etc).

Given that this is a way station to the final proposed architecture, I decided to skip this entirely. Instead, I’m going to focus first on removing the major bottleneck in the system, the ConcurrentDictionary.

The profiler results show that the biggest cost we have here is the scalability of the concurrent dictionary. Even when we tried to shard it across 1024 locks, it still took almost 50% of our runtime. The question is, can we do better? One good option that we can try is to shard things directly. Instead of using a single concurrent dictionary, we will split it to separate dictionaries, each one of them would be accessed without concurrency.

The idea goes like this, we’ll have the usual read & write for the clients. But instead of processing the command inline, we’ll route it to a dedicated thread (with its own dictionary) to do the work. I set it so we’ll have 10 such threads (assuming they will reside on individual cores and that I’ll be able to process all I/O on the other 6 cores.

Here are the results after the change:

Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
Sets       113703.56          ---          ---         3.06261         0.95900        25.59900        39.93500     33743.38
Gets      1137015.79     19211.78   1117804.01         3.06109         0.95900        25.59900        39.93500     49150.52
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    1250719.35     19211.78   1117804.01         3.06122         0.95900        25.59900        39.93500     82893.90

Note that we are now at 1.25 million, almost 25% better than the previous run.

Here are some profiler results of running this code:


So in this case, we are spending a lot of time doing string processing of various kinds, waiting for GC (almost 30%). The costs for collections went down a lot (but we’ll see that it shifted somewhat).

There are some other things that pop to mind, take a look here:


That is a surprising cost for a “simple” property lookup. The substrings calls are also expensive, over 6% of the overall runtime.

When looking at other parts of the system, we have:


This is really interesting, because we spend a lot of time just waiting for items in the queue. We could probably do more things in there rather than just wait.

I also tried various other concurrency values. With a single ExecWorker running, we have 404,187 ops/sec and with two of them we are at 715,157 ops/sec. When running with four threads dedicated to processing the requests, we are at 1,060,622.24 ops/sec.

So it is obvious that we need to rethink this approach for concurrency. We aren’t able to properly scale to bigger values.

Note that this approach also does not take advantage of pipelining. We process each command separately from all else. My next move is to add support for pipelining with this approach and measure that impact.

On the one hand, we are still at around the million mark, but given that I spent very little time (and not a lot of complexity) getting an extra 250,000 ops/second from that level of change is encouraging. The profiler is also telling us that there are more things that we can do, but I want to focus on fixing the approach we take first.

Here is the current state of the code, so you can compare it to the original one.

time to read 5 min | 920 words

In the previous post, I wrote a small Redis clone using the most naïve manner. It was able to hit nearly 1M queries per second on our test instance (c6g.4xlarge, using 16 cores and 64 GB of memory). Before we get any deeper into optimization, it is worth understanding where the time is actually being spent. I run the server under a profiler, to see the various costs.

I like using dotTrace as a profiler, while using the Tracing mode, since that gives me execution time as well as the number of calls. Often enough I can reason a lot about the system performance just from those details.

Take a look at the following stats, this is the breakdown of costs in the actual processing of the connection:


And here it is when we break it up by


You can see that the cost of FlushAsync() dominates. I’m going to form a hypothesis here. When we call FlushAsync() on the StreamWriter, we’ll also flush to the underlying stream. Looking deeper into the call stack that looks like we’ll need a separate packet per command at the TCP level.

What will happen if we’ll change the StreamWriter’s AutoFlush to true, which will cause it to write immediately to the underlying stream, but won’t call the flush on the TCP stream. That will allow the TCP stream to buffer writes more efficiently.

The code change involved is removing the FlushAsync() calls and initializing the StreamWiter like so:

Let’s run the benchmark again, which will give us (on my development machine):

  • 138,979.57 QPS – using AutoFlush = true
  • 139,653.98 QPS – using FlushAsync

Either option is a wash, basically. But here is why:


Basically, AutoFlush set to true will flush not just the current stream, but also the underlying stream, putting us in the same position.

The problem is that we need to flush, otherwise we may buffer results in memory that won’t be sent to the client. Redis benchmarks rely heavily on pipelining (sending multiple commands at once), but it is entirely possible that you’ll get a bunch of commands, write them (to the buffer) and then not send anything to the client since the output buffer isn’t full. We can optimize this quite easily, using the following change:

What I’m doing here is writing to the StreamWriter directly, and I’ll only flush the buffer if there is no more input waiting. That should reduce the number of packets we send significantly, and it does. Running the benchmark again gives us:

  • 229,783.30 QPS – using delayed flushing

That is almost twice as fast, which is impressive, for such a small change. The idea is that we are able to buffer our writes far more, but not delay them too much. If we write enough to the StreamWriter buffer, it will flush itself automatically, and we’ll only actually flush the StreamWriter manually when we have nothing further to read, which we do in parallel with the reading itself.

Here is the new cost structure:


And the actual methods called:


If we’ll compare this to the first profiling results, we can find some really interesting numbers. Before, we have called FlushAsync per command (see the ExecuteCommand & FlushAsync), now we call this a lot less often).

You can see that most of the time is now in the “business logic” for this system, and from the subsystems breakdown, a lot of the cost is now in the collections.

The GC costs here also went down significantly (~5%). I’m fairly certain that this is because we flush to the TCP stream, but I didn’t check too much.

Note that string processing and GC take a lot of time, but the Collections / ExecuteCommand is taking the vast majority of the costs.

If we look into that, we’ll see:


And that is… interesting.

Mostly because the major costs are in TryAddInternal. We know that there is high contention in this scenario, but 92% of the time spent in the method directly? What is it doing? Looking at the code, it becomes obvious:


The ConcurrentDictionary is sharding the calls between the locks. And the number of locks is defined by the number of the cores we have by default. The more concurrency we have, the more we can benefit from increasing the amount. I tried setting this to 1024 and running it under the profiler, and this gave me a few percentage points improvements, but not much more. Valuable, but not at the level we are playing with.

Even so, we managed to get some interesting details from this exploration. We know that we’ll have to deal with the dictionary implementation, since it takes roughly 50% of our time. I also want to pay some attention to these numbers:


Right now, we need to figure out how to make it faster in terms of collections, but we also have to consider overall GC costs as well as the string processing details. More on that in the next post.

time to read 5 min | 849 words

I run into this project, which aims to be a Redis clone with better performance and ease of use. I found it interesting because one of the main selling points there was that it is able to run in a multi threaded mode (instead of Redis’ single thread per process model). They use memtier_benchmark (part of Redis) to test their performance. I got curious about how much performance I could get out of the system if I built my own Redis clone in C#.

The first version I built was done pretty naively. The idea is to write it in a high level manner, and see where that puts us. To make things interesting, here are the test scenarios:

  • The memtier_benchmark is going to run on c6g.2xlarge instance, using 8 cores and 32 GB of memory.
  • The tested instance is going to run on c6g.4xlarge, using 16 cores and 64 GB of memory.

Both of those instances are running on the same availability zone.

The command I’m going to run is:

memtier_benchmark –s $SERVER_IP -t 8 -c 16 --test-time=30 --distinct-client-seed -d 256 --pipeline=30

What this says is that we’ll use 8 threads (number of cores on the client instance) with 32 connections per thread, we’ll use 20% writes & 80% reads with data size that is 256 bytes in size. In total, we’ll have 256 clients and out tests are going to continuously push more data into the system.

The server is being run using:

dotnet run –c Release

Here is an example of the server while under this test:


I chose 30 seconds for the test duration to balance doing enough work to feel what is going on (multiple GC cycles, etc) while keeping the test duration short enough that I won’t get bored.

Here are the naïve version results:

Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
Sets        86300.19          ---          ---         8.14044         0.92700        99.83900       196.60700     25610.97
Gets       862870.15     36255.57    826614.58         8.10119         0.91900        99.32700       196.60700     42782.42
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     949170.34     36255.57    826614.58         8.10476         0.91900        99.32700       196.60700     68393.39

So the naïve version, using C#, doing almost nothing, is almost touching the 1 million queries / sec. The latency, on the other hand, isn’t that good. With the p99 at almost 100ms.

Now that I got your attention with the numbers and pretty graphs, let me show you the actual code that I'm running. This is a “Redis Clone” in under 100 lines of code.

Just a few notes on the implementation. I’m not actually doing much. Most of the code is there to parse the Redis protocol. And the code is full of allocations. Each command parsing is done using multiple string splits and concats. Replies to the client require even more concats. The “store” for the system is actually just a simple ConcurrentDictionary, without anything to avoid contention or high costs.

The manner in which we handle I/O is pretty horrible, and… I think you get where I’m going here, right? My goal is to see how I can use this (pretty simple) example to get more performance without having to deal with a lot of extra fluff.

Given my initial attempt is already at nearly 1M QPS, that is a pretty good start, even if I say so myself.

The next step that I want to take it to handle the allocations that are going on here. We can probably do better here, and I aim to try. But I’ll do that in the next post.


  1. My new interview task: Stop the flow - one day from now
  2. Architectural optimizations vs the profiler - about one day from now

There are posts all the way to Aug 12, 2022


  1. Production postmortem (43):
    05 Aug 2022 - The allocating query
  2. Webinar recording (14):
    26 Jul 2022 - RavenDB & Messaging Transactions
  3. Recording (5):
    25 Jul 2022 - Build your own database at Cloud Lunch & Learn
  4. High performance .NET (7):
    19 Jul 2022 - Building a Redis Clone–Analysis II
  5. Upcoming webinar (2):
    14 Jul 2022 - Involving RavenDB in your Messaging Transactions
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats