Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,357 | Comments: 50,734

Privacy Policy Terms
filter by tags archive
time to read 6 min | 1163 words

Today I had to look into the a customer whose RavenDB instance was burning through a lot of I/O. The process is somewhat ingrained in me by this point, but I thought that it would make for a good blog post so I’ll recall that next time.

Here is what this looks like from the point of view of the disk:

image

We are seeing a lot of reads in terms of MB/sec and a lot of write operations (but far less in terms of bandwidth). That is the external details, can we figure out more? Of course.

We start our investigation by running:

sudo iotop -ao

This command gives us the accumulative time for threads that do I/O. One of the important things that RavenDB is to tag its threads with the tasks that they are assigned. Here is a sample output:

  TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
 2012 be/4 ravendb    1748.00 K    143.81 M  0.00 %  0.96 % Raven.Server -c /ravendb/config/settings.json [Follower thread]
 9533 be/4 ravendb     189.92 M     86.07 M  0.00 %  0.60 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 1905 be/4 ravendb     162.73 M     72.23 M  0.00 %  0.39 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 1986 be/4 ravendb     154.52 M     71.71 M  0.00 %  0.38 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 9687 be/4 ravendb     185.57 M     70.34 M  0.00 %  0.59 % Raven.Server -c /ravendb/config/settings.json [Indexing of Car]
 1827 be/4 ravendb     172.60 M     65.25 M  0.00 %  0.69 % Raven.Server -c /ravendb/config/settings.json ['Southsand']

In this case, we see the top 6 threads in terms of I/O (for writes). We can see that we have a lot of of indexing and documents writes. That said, thread names in Linux are limited to 14 characters, so we probably need to give better names to them.

That is part of the task, let’s look at the cost in terms of reads:

  TID  PRIO  USER    DISK READ>  DISK WRITE  SWAPIN      IO    COMMAND
11191 be/4 ravendb       2.09 G     31.75 M  0.00 %  7.58 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11494 be/4 ravendb    1353.39 M     14.54 M  0.00 % 19.62 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11496 be/4 ravendb    1265.96 M      4.97 M  0.00 % 16.56 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11211 be/4 ravendb    1120.19 M     42.66 M  0.00 %  2.83 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11371 be/4 ravendb    1114.50 M     35.25 M  0.00 %  5.00 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11001 be/4 ravendb    1102.55 M     43.35 M  0.00 %  3.12 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11340 be/4 ravendb     825.43 M     26.77 M  0.00 %  4.85 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]

That is a lot more complicated, however. Now we don’t know what task this is running, only that something is reading a lot of data.

We have the thread id, so we can make use of that to see what it is doing:

sudo strace -p 11191 -c

This command will track the statistics on the systems calls that are issued by the specified thread. I’ll typically let it run for 10 – 30 seconds and then hit Ctrl+C, giving me:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.90    3.868694         681      5681        82 futex
  8.28    0.352247           9     41035           sched_yield
  0.79    0.033589        1292        26           pwrite64
  0.03    0.001246          52        24         1 recvfrom
  0.01    0.000285         285         1           restart_syscall
  0.00    0.000000           0         2           madvise
  0.00    0.000000           0         2           sendmsg
------ ----------- ----------- --------- --------- ----------------
100.00    4.256061                 46771        83 total

I’m mostly interested in the pwrite64 system call. RavenDB uses mmap() for most of its data access, so that is harder to read, but we can get a lot of information from the output. Now I’m going to run the following command:

sudo strace -p 11191 -e trace=pwrite64

This will give us a trace of all the pwrite64() system calls from that thread, looking like this:

pwrite64(315, "\365\275"..., 4113, 51080761896) = 4113
pwrite64(315, "\344\371"..., 4113, 51080893512) = 4113

There is an strace option (-y) that can be used to show the file paths for system calls, but I forgot to use it, no worries, I can do:

sudo ls -lh /proc/11191/fd/315

Which will give me the details on this file:

lrwx------ 1 root root 64 Aug  7 09:21 /proc/11783/fd/315 -> /ravendb/data/Databases/Southsand/PeriodicBackupTemp/2022-08-07-03-30-00.ravendb-encrypted-full-backup.in-progress

And that tells me everything that I need to know. The reason we have high I/O is that we are generating a backup file. That explains why we are seeing a lot of reads (since we need to read in order to generate the backup).

The entire process is mostly about figuring out exactly what is going on, and RavenDB is very careful about leaving as many breadcrumbs as possible to make it easy to follow.

time to read 1 min | 93 words

In RavenDB 5.4, we’re introducing new ELT features for Kafka and RabbitMQ. Now, instead of your documents just sitting there in your database, you can involve them in your messaging transactions. In this webinar, RavenDB CEO Oren Eini explains how these ETL tasks open up a whole new world of architectural patterns, and how they spare you from a lot of complexity when you want to involve your data in pub/sub or other messaging patterns.

time to read 1 min | 97 words

I spoke at Cloud Lunch & Learn about the basics of building a database from scratch. We took a storage engine and created a simple database within the span of an hour.

Covered in the talk are the details of how you can build the database, using indexes to speed up queries and the manner in which a database interact with its storage engine. I think it was a great talk, but let me know about your feedback:

time to read 4 min | 653 words

RavenDB is written in C#, and as such, uses managed memory. As a database, however, we need granular control of our memory, so we also do manual memory management.

One of the key optimizations that we utilize to reduce the amount of overhead we have on managing our memory is using an arena allocator. That is a piece of memory that we allocate in one shot from the operating system and operate on. Once a particular task is done, we can discard that whole segment in one shot, rather than try to work out exactly what is going on there. That gives us a proper scope for operations, which means that missing a free in some cases isn’t the end of the world.

It also makes the code for RavenDB memory allocation super simple. Here is what this looks like:

image

Whenever we need to allocate more memory, we’ll just bump the allocator up. Initially, we didn’t even implement freeing memory, but it turns out that there are a lot of long running processes inside of RavenDB, so we needed to reuse the memory inside the same operation, not just between operations.

The implementation of freeing memory is pretty simple, as well. If we return the last item that we allocated, we can just drop the next allocation position by how many bytes were allocated. For that matter, it also allows us to do incremental allocations. We can ask for some memory, then increase the allocation amount on the fly very easily.

Here is a (highly simplified) example of how this works:

As you can see, there isn’t much there. A key requirement here is that you need to return the memory back in the reverse order of how you allocated it. That is usually how it goes, but what if it doesn’t happen?

Well, then we can’t reuse the memory directly. Instead, we’ll place them in a free list. The actual allocations are done on powers of two, so that makes things easier. Here is what this actually looks like:

image

So if we free, but not from the top, we remember the location and can use it again. Note that for 2048 in the image above, we don’t have any free items.

I’m quite fond of this approach, since this is simple, easy to understand and has a great performance profile.  But I wouldn’t be writing this blog post if we didn’t run into issues, now would I?

A customer reported high memory usage (to the point of memory exhaustion) when doing a certain set of operations. That… didn’t make any sense, to be honest. That was a well traveled code path, any issue there should have been long found out.

They were able to send us a reproduction and the support team was able to figure out what is going on. The problem was that the code in question did a couple of things, which altogether led to an interesting issue.

  • It allocated and deallocated memory, but not always in the same order – this is fine, that is why we have the free list, after all.
  • It extended the memory allocation it used on the fly – perfectly fine and an important optimization for us.

Give it a moment to consider how could these two operations together result in a problem…

Here is the sequence of events:

  • Loop:
    • Allocate(1024) -> $1
    • Allocate(256) -> $2
    • Grow($1, 4096) -> Success
    • Allocate(128) -> $3
    • Free($1) (4096)
    • Free($3) (128)
    • Free($2) (256)

What is going on here?

Well, the issue is that we are allocating a 1KB buffer, but return a 4KB buffer. That means that we add the returned buffer to the 4KB free list, but we cannot pull from that free list on allocation.

Once found, it was an easy thing to do (detect this state and handle it), but until we figured it out, it was quite a mystery.

time to read 5 min | 943 words

I’m going to go back a few steps and try to see where I should be looking at next, to see where I should pay the most attention. So far in this series, I mostly focused on how we read and process the data. But I think that we ought to take a step or two back and see where we are at in general. I ran the version with Pipelines and string usage in the profiler, trying to see where we are at. For example, in a previous post, the ConcurrentDictionary that I was using had a big performance cost. Is that still the case now?

Here are the current hotspots in the codebase:

image

Looking at this with more detail, we have:

image

That is… interesting. Let’s look at the code for HandleConnection right now?

Looking at the code and the profiler results, I wonder if I can do better here. Here is a small change that gives me ~2% speed boost:

The idea is that we parallelize reading from and writing to the network. It is a small boost, but any little bit helps, especially once we get into the cascading impacts of repeated optimizations.

Looking into this, we have almost two billion calls to ReadAsync, let’s see what is costly there:

image

That is… wow.

Why is InternalTokenSource so expensive? I’m willing to bet that the issue is this one, it is taking a lock. In my use case, I know that there is a single thread running this, so it is worth seeing if I can skip it. Unfortunately, there isn’t an easy way to skip that check. Fortunately, I can copy the code from the framework and modify it locally, to see what the impact of that would be. So I did just that (initialized once in the constructor):

image

Of course, that is very much a brute force approach, and not one that I would recommend. Looking at the code, it looks like there is a reason for this usage (handling cancellation of operations), but I’m ignoring that for now. Let’s see what the profiler says now:

image

That means that we have about 40% improvements in per call costs. As I mentioned, that is not something that we can just do, but it is an interesting observation on the cost of reading using PipeReader.

Another aspect that is really interesting is the backend state we have, which is a ConcurrentDictionary. If we’ll look at its cost, we have:

image

You’ll note that I’m using the NonBlocking NuGet package, which provides a ConcurrentDictionary implementation that isn’t using locking. If we’ll use the default one from the framework, which does use locking, we’ll see:

image

You can see the difference costs better here:

image

Note that there is a significant cost difference between the two options (in favor of NonBlocking). But it doesn’t translate to much of a difference when we run a real world benchmark.

So what is next?

Looking at the profiler result, there isn’t really much that we can push forward. Most of the costs we have are in the network, not in the code we run.

image

Well, that isn’t quite true, is it? The bulk of our code is in ParseNetworkData call, which looks like this:

image

So the total time we spend actually executing the core functionality of our server is really negligible. A lot of time is actually spent parsing the commands from the buffer. Note that here, we don’t actually do any I/O, all operations are operating on buffers in memory.

The Redis protocol isn’t that friendly for machine parsing, requiring us to do a lot of lookups to find the delimiters (hence the IndexOf() calls). I don’t believe that you can significantly improve on this. This means that we have to consider other options for better performance.

We are spending 35% of our runtime in parsing the command streams from the client, and the code we execute is less than 1% of our runtime. I don’t think that there are significant optimization opportunities remaining for the stream parsing, so that leaves us with the I/O that we have left. Can we do better?

We are currently using async I/O and pipelines. Looking at the project that got me interested in this topic, it is using IO_Uring (via this API) on Linux for their needs. Their parsing is straightforward, as well, see here. Quite similar to the manner in which my code operates.

So to get to the next stage in performance (as a reminder, we are now at the 1.8 million req / sec) we’ll probably need to go to the ring based approach as well. There is a NuGet package to support it, but that moves this task from something that I can spend a few hours in an evening to a couple of days / full week of effort. I don’t think that I’ll pursue this in the near future.

time to read 5 min | 932 words

One of the high costs that we have right now in my Redis Clone is strings. That is actually a bit misleading, take a look here:

image

Strings take 12.57% of the runtime, but there is also the GC Wait, where we need to cleanup after them. That means that the manner in which we are working is pretty inefficient.

Our test scenario right now also involves solely GET and SET requests, there are no deletions, expirations, etc. I mention that because we need to consider what we’ll replace the strings with.

The simplest option is to replace that with a byte array, but that is still managed memory and incurs the costs associated with GC. We can pool those byte arrays, but then we have an important question to answer, how do we know when a buffer is no longer used?

Consider the following set of events:

Time Thread #1 Thread #2
1 SET abc  
2   GET abc
3 SET abc  
4   Use the buffer we got on #2

In this case, we have thread #2 accessing the value buffer, but we replaced that buffer. We need to let thread #2 keep using this buffer until it is done.

This little tidbit put us right back at concurrent manual memory management, which is scary. We can do things in a slightly different manner, however. We can take advantage of the GC to support us, like so:

The idea is pretty simple. We have a class that holds a buffer, and when the GC notices that it is no longer in use, it will add its buffer back to the pool. The idea is that we rely on the GC to resolve this (really hard) problem for us. The fact that this moves the cost to the finalizer means that we can not worry about this. Otherwise, you have to jump through a lot of hoops.

The ReusableBuffer class also implements GetHashCode() / Equals() which allow us to use it as a key in the dictionary.

Now that we have the backing store for keys and values, let’s see how we can read & write from the network. I’m going to go back to the ConcurrentDictionary implementation for now, so I’ll handle only a single concept at a time.

Before, we used StreamReader / StreamWriter to do the work, now we’ll use PipeReader / PipeWriter from System.IO.PIpelines. That will allow us to easily work with the raw bytes directly and it is meant for high performance scenarios.

I wrote the code twice, once using the reusable buffer model and once using PIpeReader / PipeWriter and allocating strings. I was surprised to see that my fancy reusable buffers were within 1% performance of the (much simpler) strings implementation. That is 1% in the wrong direction, by the way.

On my machine, the buffer based system was 165K ops/second while the strings based one was 166K ops/sec.

Here is the reusable buffer based approach complete source code. And to compare, here is the string based one. The string based one is about 50% shorter in terms of lines of code.

I’m guessing that the allocation pattern is really good for the kind of heuristics that the GC does. We either have long term objects (in the cache) or very short term ones.

It’s worth pointing out that the actual parsing of the commands from the network isn’t using strings. Only the actual keys and values are actually translated to strings. The rest I’m doing using raw bytes.

Here is what the code looks like for the string version under the profiler:

image

And here is the same thing using the reusable buffer:

image

There are a few interesting things to note here. The cost of ExecCommand is almost twice as high as the previous attempt. Digging deeper, I believe that the fault is here:

This is the piece of code that is responsible for setting an item in the dictionary. However, note that we are doing a read for every write? The idea here is that if we have a set on an existing item, we can avoid allocating the buffer for the key again, and reuse it.

However, that piece of code is in the critical path for this benchmark and it is quite costly. I changed it to do the allocations always, and we got a fairly consistent 1% – 3% faster than the string version. Here is what this looks like:

image

In other words, here is the current performance table (under the profiler):

  • 1.57 ms  - String based 
  • 1.79 ms - Reusable buffer based (reduce memory usage)
  • 1.04 ms - Reusable buffer (optimized lookup)

All of those numbers are under the profiler, and on my development machine. Let’s see what we get when I’m running them on the production instances, shall we?

  • String based – 1,602,728.75 ops/sec
  • Reusable buffer (with reducing memory code) – 1,866,676.53 ops/sec
  • Reusable buffer (optimized lookup) – 1,756,930.64

Those results do not match with what we see in my development machine. The likely reason is that the amount of operations is high enough and the load is sufficiently big that we are seeing a much bigger impact from the memory optimization at scale.

That is the only conclusion I can draw from the fact that the memory reduction code, which adds costs, is actually able to process more requests/seconds under such load.

time to read 4 min | 753 words

Now that I’m done with the low hanging fruits, I decided to shift the Redis implementation to use System.IO.Pipelines. That is a high performance I/O API that is meant specifically for servers that need to eke out all the performance out of the system.

The API is a bit different, but it follows a very logical pattern and makes a lot of sense. Here is the main loop of handling commands from a client:

The idea is that we get a buffer from the network, we read everything (including pipelined commands) and then we flush to the client. The more interesting things happen when we start processing the actual commands, because now we aren’t utilizing StreamReader but PipeReader. So we are working at the level of bytes, not strings.

Here is what this (roughly) looks like, I’m not showing the whole thing because I want to focus on the issue that I ran into:

The code is reading from the buffer, parsing the Redis format and then executing the commands. It supports multiple commands in the same buffer (pipelining) and it has absolutely atrocious performance.

Yes, the super speedy API that is significantly harder to get right (compared to the ease of working with strings) is far slower. And by far slower I mean the following, on my development machine:

  • The previous version clocks at around 126,017.72 operations per second.
  • This version clocks at less than 100 operations per second.

Yes, you read that right, less than one hundred operations per second compared to over hundred thousands for the unoptimized version.

That was… surprising, as you can imagine.

I actually wrote the implementation twice, using different approaches, trying to figure out what I was doing wrong. Surely, it can’t be that bad.

I took a look at the profiler output, to try to figure out what is going on:

image

It says, quite clearly, that the implementation is super bad, no? Except, that this is what you are supposed to be using. So what is going on?

The underlying problem is actually fairly simple and relates to how the Pipelines API achieves its performance. Instead of doing small calls, you are expected to get a buffer and process that. Once you are done processing the buffer you can indicate what amount of data you consumed, and then you can issue another call.

However, there is a difference between consumed data and examined data. Consider the following data:

*3
$3
SET
$15
memtier-2818567
$256
xxxxxxxxxx ... xxxxxx
*2
$3
GET
$15
memtier-7689405
*2
$3
GET
$15
memt

What you can see here is a pipelined command, with 335 bytes in the buffer.  We’ll process all of those commands in a single hit, except… look at the highlighted portion. What do we have there?

We have a partial command. In other words, we are expected to execute a GET with a key size of 15 bytes, but we only have the first 4 bytes here. That is actually expected and fine. We consumed all the bytes until the highlighted portion (thus letting the PipeReader know that we are done with them). The problem is that when we issue a call now, we’ll get the highlighted portion (which we didn’t consume), but we aren’t ready to process that. Data is missing. We indicate that to the PipeReader using the examined portion. So the PipeReader knows that it needs to read more from the network.

However… my code has a subtle bug. It will report that it examined the yellow highlight, not the green one. In other words, we tell the PipeReader that we consumed some portion of the buffer, and examined some more, but there are still bytes on the buffer that are neither consumed nor examined. That means that when we issue the read call, expecting to get data from the network, we’ll actually get the same buffer again, to do the exact same processing.

Eventually, we’ll have more data in the buffer from the other side, so the correctness of the solution isn’t impacted. But it will kill your performance.

The fix is really simple, we need to tell the PipeReader that we examined the entire buffer, so it will not do a busy wait and wait for more data from the network. Here is the bug fix:

With that change in place, we can hit 187,104.21 operations per second! That is 50% better, which is awesome. I haven’t profiled things yet properly, because I also want to address another issue, how are we going to deal with the data from the network. More on that in my next post.

time to read 3 min | 451 words

imageI’m inordinately fond of the Fallacies of Distributed Computing, these are a set of common (false) assumptions that people make when building distributed systems, to their sorrow.

Today I want to talk about one of those fallacies:

There is one administrator.

I like to add the term competent in there as well.

A pretty significant amount of time in the development of RavenDB was dedicated to addressing that issue. For example, RavenDB has a lot of code and behavior around externalizing metrics. Both its own and the underlying system.

That is a duplication of effort, surely. Let’s consider the simplest stuff, such as CPU, memory and I/O resource utilization. RavenDB makes sure to track those values, plot them in the user interface and expose that to external monitoring systems.

All of those have better metrics sources. You can ask the OS directly about those details, and it will likely give you far better answers (with more details) than RavenDB can.

There have been numerous times where detailed monitoring from the systems that RavenDB runs on was the thing that allowed us to figure out what is going on. Having the underlying hardware tell us in detail about its status is wonderful. Plug that into a monitoring system so you can see trends and I’m overjoyed.

So why did we bother investing all this effort to add support for this to RavenDB? We would rather have the source data, not whatever we expose outside. RavenDB runs on a wide variety of hardware and software systems. By necessity, whatever we can provide is only a partial view.

The answer to that is that we cannot assume that the administrator has set up such monitoring. Nor can we assume that they are able to.

For example, the system may be running on a container in an environment where the people we talk to have no actual access to the host machine to pull production details.

Having a significant investment in self-contained set of diagnostics means that we aren’t limited to whatever the admin has set up (and has the permissions to view) but have a consistent experience digging into issues.

And since we have our own self contained diagnostics, we can push them out to create a debug package for offline analysis or even take active actions in response to the state of the system.

If we were relying on external monitoring, we would need to integrate that, each and every time. The amount of work (and quality of the result) in such an endeavor is huge.

We build RavenDB to last in production, and part of that is that it needs to be able to survive even outside of the hothouse environment.

time to read 2 min | 256 words

After achieving 1.25 million ops/sec, I decided to see what would happen if I would change the code to support pipelining. That ended up being quite involved, because I needed to both keep track of all the incoming work as well as send the work to multiple locations. The code itself is garbage, in my opinion. It is worth it only as far as it points me inthe right direction in terms of the overall architecture. You can read it below, but it is a bit complex. We read from the client as much as we are able, then we send it to each of the dedicated threads to run it.

In terms of performance, it is actually slower than the previous iteration (by about 20%!), but it serves a very important aspect, it makes it easy to tell where the costs are.

Take a look at the following profiler result:

image

You can see that we are spending a lot of time in I/O and in string processing. The GC time is also quite significant.

Conversely, when we actually process the commands from the clients, we are spending most of the time simply idling.

image

I want to tackle this in stages. The first part is to stop using strings all over the place. The next stage after that will likely be to change the I/O model.

For now, here is where we stand:

 

FUTURE POSTS

  1. My new interview task: Stop the flow - one day from now
  2. Architectural optimizations vs the profiler - about one day from now

There are posts all the way to Aug 12, 2022

RECENT SERIES

  1. Production postmortem (43):
    05 Aug 2022 - The allocating query
  2. Webinar recording (14):
    26 Jul 2022 - RavenDB & Messaging Transactions
  3. Recording (5):
    25 Jul 2022 - Build your own database at Cloud Lunch & Learn
  4. High performance .NET (7):
    19 Jul 2022 - Building a Redis Clone–Analysis II
  5. Upcoming webinar (2):
    14 Jul 2022 - Involving RavenDB in your Messaging Transactions
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats