Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,357 | Comments: 50,734

Privacy Policy Terms
filter by tags archive
time to read 6 min | 1163 words

Today I had to look into the a customer whose RavenDB instance was burning through a lot of I/O. The process is somewhat ingrained in me by this point, but I thought that it would make for a good blog post so I’ll recall that next time.

Here is what this looks like from the point of view of the disk:

image

We are seeing a lot of reads in terms of MB/sec and a lot of write operations (but far less in terms of bandwidth). That is the external details, can we figure out more? Of course.

We start our investigation by running:

sudo iotop -ao

This command gives us the accumulative time for threads that do I/O. One of the important things that RavenDB is to tag its threads with the tasks that they are assigned. Here is a sample output:

  TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
 2012 be/4 ravendb    1748.00 K    143.81 M  0.00 %  0.96 % Raven.Server -c /ravendb/config/settings.json [Follower thread]
 9533 be/4 ravendb     189.92 M     86.07 M  0.00 %  0.60 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 1905 be/4 ravendb     162.73 M     72.23 M  0.00 %  0.39 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 1986 be/4 ravendb     154.52 M     71.71 M  0.00 %  0.38 % Raven.Server -c /ravendb/config/settings.json [Indexing of Use]
 9687 be/4 ravendb     185.57 M     70.34 M  0.00 %  0.59 % Raven.Server -c /ravendb/config/settings.json [Indexing of Car]
 1827 be/4 ravendb     172.60 M     65.25 M  0.00 %  0.69 % Raven.Server -c /ravendb/config/settings.json ['Southsand']

In this case, we see the top 6 threads in terms of I/O (for writes). We can see that we have a lot of of indexing and documents writes. That said, thread names in Linux are limited to 14 characters, so we probably need to give better names to them.

That is part of the task, let’s look at the cost in terms of reads:

  TID  PRIO  USER    DISK READ>  DISK WRITE  SWAPIN      IO    COMMAND
11191 be/4 ravendb       2.09 G     31.75 M  0.00 %  7.58 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11494 be/4 ravendb    1353.39 M     14.54 M  0.00 % 19.62 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11496 be/4 ravendb    1265.96 M      4.97 M  0.00 % 16.56 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11211 be/4 ravendb    1120.19 M     42.66 M  0.00 %  2.83 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11371 be/4 ravendb    1114.50 M     35.25 M  0.00 %  5.00 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11001 be/4 ravendb    1102.55 M     43.35 M  0.00 %  3.12 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]
11340 be/4 ravendb     825.43 M     26.77 M  0.00 %  4.85 % Raven.Server -c /ravendb/config/settings.json [.NET ThreadPool]

That is a lot more complicated, however. Now we don’t know what task this is running, only that something is reading a lot of data.

We have the thread id, so we can make use of that to see what it is doing:

sudo strace -p 11191 -c

This command will track the statistics on the systems calls that are issued by the specified thread. I’ll typically let it run for 10 – 30 seconds and then hit Ctrl+C, giving me:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 90.90    3.868694         681      5681        82 futex
  8.28    0.352247           9     41035           sched_yield
  0.79    0.033589        1292        26           pwrite64
  0.03    0.001246          52        24         1 recvfrom
  0.01    0.000285         285         1           restart_syscall
  0.00    0.000000           0         2           madvise
  0.00    0.000000           0         2           sendmsg
------ ----------- ----------- --------- --------- ----------------
100.00    4.256061                 46771        83 total

I’m mostly interested in the pwrite64 system call. RavenDB uses mmap() for most of its data access, so that is harder to read, but we can get a lot of information from the output. Now I’m going to run the following command:

sudo strace -p 11191 -e trace=pwrite64

This will give us a trace of all the pwrite64() system calls from that thread, looking like this:

pwrite64(315, "\365\275"..., 4113, 51080761896) = 4113
pwrite64(315, "\344\371"..., 4113, 51080893512) = 4113

There is an strace option (-y) that can be used to show the file paths for system calls, but I forgot to use it, no worries, I can do:

sudo ls -lh /proc/11191/fd/315

Which will give me the details on this file:

lrwx------ 1 root root 64 Aug  7 09:21 /proc/11783/fd/315 -> /ravendb/data/Databases/Southsand/PeriodicBackupTemp/2022-08-07-03-30-00.ravendb-encrypted-full-backup.in-progress

And that tells me everything that I need to know. The reason we have high I/O is that we are generating a backup file. That explains why we are seeing a lot of reads (since we need to read in order to generate the backup).

The entire process is mostly about figuring out exactly what is going on, and RavenDB is very careful about leaving as many breadcrumbs as possible to make it easy to follow.

time to read 1 min | 93 words

In RavenDB 5.4, we’re introducing new ELT features for Kafka and RabbitMQ. Now, instead of your documents just sitting there in your database, you can involve them in your messaging transactions. In this webinar, RavenDB CEO Oren Eini explains how these ETL tasks open up a whole new world of architectural patterns, and how they spare you from a lot of complexity when you want to involve your data in pub/sub or other messaging patterns.

time to read 4 min | 653 words

RavenDB is written in C#, and as such, uses managed memory. As a database, however, we need granular control of our memory, so we also do manual memory management.

One of the key optimizations that we utilize to reduce the amount of overhead we have on managing our memory is using an arena allocator. That is a piece of memory that we allocate in one shot from the operating system and operate on. Once a particular task is done, we can discard that whole segment in one shot, rather than try to work out exactly what is going on there. That gives us a proper scope for operations, which means that missing a free in some cases isn’t the end of the world.

It also makes the code for RavenDB memory allocation super simple. Here is what this looks like:

image

Whenever we need to allocate more memory, we’ll just bump the allocator up. Initially, we didn’t even implement freeing memory, but it turns out that there are a lot of long running processes inside of RavenDB, so we needed to reuse the memory inside the same operation, not just between operations.

The implementation of freeing memory is pretty simple, as well. If we return the last item that we allocated, we can just drop the next allocation position by how many bytes were allocated. For that matter, it also allows us to do incremental allocations. We can ask for some memory, then increase the allocation amount on the fly very easily.

Here is a (highly simplified) example of how this works:

As you can see, there isn’t much there. A key requirement here is that you need to return the memory back in the reverse order of how you allocated it. That is usually how it goes, but what if it doesn’t happen?

Well, then we can’t reuse the memory directly. Instead, we’ll place them in a free list. The actual allocations are done on powers of two, so that makes things easier. Here is what this actually looks like:

image

So if we free, but not from the top, we remember the location and can use it again. Note that for 2048 in the image above, we don’t have any free items.

I’m quite fond of this approach, since this is simple, easy to understand and has a great performance profile.  But I wouldn’t be writing this blog post if we didn’t run into issues, now would I?

A customer reported high memory usage (to the point of memory exhaustion) when doing a certain set of operations. That… didn’t make any sense, to be honest. That was a well traveled code path, any issue there should have been long found out.

They were able to send us a reproduction and the support team was able to figure out what is going on. The problem was that the code in question did a couple of things, which altogether led to an interesting issue.

  • It allocated and deallocated memory, but not always in the same order – this is fine, that is why we have the free list, after all.
  • It extended the memory allocation it used on the fly – perfectly fine and an important optimization for us.

Give it a moment to consider how could these two operations together result in a problem…

Here is the sequence of events:

  • Loop:
    • Allocate(1024) -> $1
    • Allocate(256) -> $2
    • Grow($1, 4096) -> Success
    • Allocate(128) -> $3
    • Free($1) (4096)
    • Free($3) (128)
    • Free($2) (256)

What is going on here?

Well, the issue is that we are allocating a 1KB buffer, but return a 4KB buffer. That means that we add the returned buffer to the 4KB free list, but we cannot pull from that free list on allocation.

Once found, it was an easy thing to do (detect this state and handle it), but until we figured it out, it was quite a mystery.

time to read 3 min | 488 words

I’m trying to compare indexing speed of Corax vs. Lucene. Here is an interesting result:

image

We have two copies of the same index, running in parallel on the same data. And we can clearly see that Lucene is faster. Not by a lot, but enough to warrant investigation.

Here is the core of the work for Lucene:

image

And here it is for Corax:

image

If you look at the results, you’ll see something really interesting.

For the Corax version, the MapItems.Execute() is almost 5% slower than the Lucene version.

And that really pisses me off. That is just flat out unreasonable to see.

And the reason for that is that the MapItems.Execute() is identical in both cases. The exact same code, and there isn’t any Corax or Lucene code there. But it is slower.

Let’s dig deeper, and we can see this interesting result. This is the Lucene version, and the highlighted portion is where we are reading documents for the indexing function to run:

image

And here is the Corax version:

image

And here it is two thirds more costly? Are you kidding me? That is the same freaking code and is utterly unrelated to the indexing.

Let’s dig deeper, shall we? Here is the costs breakdown for Lucene, I highlighted the important bits:

image

And here is the cost breakdown for Corax

image

I have to explain a bit about what is going on here. RavenDB doesn’t trust the disk and will validate the data it reads from it the first time it loads a page.

That is what the UnlikelyValidatePage is doing.

What we are seeing in the profiler results is that both Corax and Lucene are calling GetPageInternal() a total of 3.69 million times, but Corax is actually paying the cost of page validation for the vast majority of them.

Corax validated over 3 million pages while Lucene validated only 650 thousand pages. The question is why?

And the answer is that Corax is faster than Lucene, so it is able to race ahead. When it races ahead, it will encounter pages first, and validate them. When Lucene comes around and tries to index those documents, they were already validated.

Basically, Lucene is surfing all the way forward on the wavefront of Corax’s work, and ends up doing a lot less work as a result.

What this means, however, is that we need to test both scenarios separately, on cold boot. Because otherwise they will mess with each other results.

time to read 3 min | 595 words

RavenDB introduced a TCP compression feature in version 5.3. The idea is that all internal communication in the cluster (as well as subscriptions), will use the Zstd compression format to reduce the overall bandwidth utilization by RavenDB. We have always supported HTTP compression, and that closed the circle.

The fact hat we are using Zstd means that we have a higher compression ratio and less CPU usage, so everyone was happy.  Except… sometimes, they weren’t.

In some cases, we noticed that there would be network failures at a far higher rate than previous experienced. RavenDB is robust to network errors, so that was handled, but that is still a concern. We figured out that the problem was rooted in the compression code. If we enabled compression between the systems, it would have far higher rate of failures than otherwise. But only when running in secured mode, when the system is running without security, everything works.

My first suspicion is that something is in the network, monitoring it. But the whole point of secured mode is that no one can peek into the stream not interfere with the contents. Given that this is a self-healing issue, it took some time to dedicate the right amount of attention to it, but we managed to figure it out.

This is a confluence of three different features that all play together to get this to happen.

With compression, we typically do something like this:

That is pretty much how all compression stream will work. But we do have to consider the following issue, there may be no output.

When can that happen?

Let’s assume that I’m using the simplest compression algorithm (run length encoding).

In other words, it will take a buffer such as: aaaaaacccccccbbbb and turn that into a7c6b4.

Now, let’s consider what would be the output of such an algorithm if we pass it a buffer consisting of a single value?

It will only update its internal state, it will not output anything. That is fine, we need a call to Flush() to ensure that all the state is out.

That means that this will return an empty buffer, which we are then writing to the inner stream. And that is fine, right? Since writing a zero length buffer is a no-op.

Except that it isn’t a no-op. There is the concept of empty SSL records, mostly it seams to handle the BEAST attack. So when you pass an empty buffer to the SslStream, it will emit an empty record to the network.

Which is good, except that you may have a scenario where you emit a lot of those values. And it turns out that OpenSSL has a limit to how many consecutive empty records it will accept (under the assumption that it must move forward and produce output and not just loop).

So, in order to repeat this bug, we need:

  • Input that will result in zero output from the compressor (fully repeating previous values, usually). Resulting in a zero length buffer as the output of the compression.
  • Sending the empty SSL record over the stream.
  • Repeating this for 32 times.

When all three conditions are satisfied, we get an error on the receiving end and the connection is broken. That means that the next call will have a different compression state and likely won’t have a problem at the same location.

In short, this is fun exercise in seeing how three different design decisions, all of whom are eminently reasonable, result in a very hard to trace bug.

The good thing is that this is simplicity itself to solve. We just need to avoid writing zero length buffer to the stream.

time to read 1 min | 77 words

Next week I’ll be presenting a new major feature for the new RavenDB 5.4 release. Built-in ETL for Kafka and RabbitMQ. Instead of your documents just sitting there in your database, you can involve them in your messaging transactions.

You can register for the webinar here

Please register, I would love to hear your feedback on the topic.

time to read 4 min | 641 words

We got a call from a customer, a pretty serious one. RavenDB is used to compute billing charges for customers. The problem was that in one of their instances, the value for a particular customer was wrong. What was worse was that it was wrong on just one instance of the cluster. So the customer would see different values in different locations. We take such things very seriously, so we started an investigation.

Let me walk you through reproducing this issue, we have three collections (Users, Credits and Charges):

image

The user is performing actions in the system, which issue charges. This is balanced by the Credits in the system for the user (payment they made). There is no 1:1 mapping between charges and credits, usually.

Here is an example of the data:

image

And now, let’s look at the index in question:

This is a multi map-reduce index that aggregates data from all three collections. Now, let’s run a query:

image

This is… wrong. The charges & credits should be more or less aligned. What is going on?

RavenDB has a feature called Map Reduce Visualizer, to help work with such scenarios, let’s see what this tells us, shall we?

image

What do we see in this image?

You can see that we have two results for the index. Look at Page #854 (at the top), we have one result with –67,343 and another with +67,329. The second result also does not have an Id property or a Name property.

What is going on?

It is important to understand that the image that we have here represents the physical layout of the data on disk. We run the maps of the documents, and then we run the reduce on each page individually, and sum them up again. This approach allows us to handle even a vast amount of data with ease.

Look at what we have in Page #540. We have two types of documents there, the users/ayende document and the charges documents. Indeed, at the top of Page #540 we can see the result of reducing all the results in the page. The data looks correct.

However…

Look at Page #865, what is going on there? Looks like we have most of the credits there. Most importantly, we don’t have the users/ayende document there. Let’s take a look at the reduce definition we have:

What would happen when we execute it on the results in Page #865? Well, there is no entry with the Name property there. So there is no Name, but there is also no Id. But we project this out to the next stage.

When we are going to reduce the data again among all the entries in Page #854 (the root one), we’ll group by the Id property, but the Id property from the different pages is different. So we get two separate results here.

The issue is that the reduce function isn’t recursive, it assumes that in all invocations, it will have a document with the Name property. That isn’t valid, since RavenDB is free to shuffle the deck in the reduce process. The index should be robust to reducing the data multiple times.

Indeed, that is why we had different outputs on different nodes, since we don’t guarantee that will process results in the same order, only that the output should be identical, if the reduce function is correct. Here is the fixed version:

And the query is now showing the correct results:

image

That is much better Smile

time to read 3 min | 415 words

A customer called us, complaining that RavenDB isn’t supporting internationalization. That was a big term to unpack. It boiled down to a simple issue. They were using Hebrew text in their system, consuming us from a node.js client, and they observed that sometimes, RavenDB would corrupt the data.

They would get JSON similar to this:

{ “Status”: "�", “Logged: true }

That… is not good. And also quite strange. I’m a native Hebrew speaker, so I threw a lot of such texts into RavenDB in the past. In fact, one of our employees built a library project for biblical texts, naturally all in Hebrew. Another employee maintained a set of Lucene analyzers for Hebrew. I think that I can safely say that RavenDB and Hebrew has been done. But the problem persisted. What was worse, it was not consistent. Every time that we tried to see what is going on, it worked.

We added code inside of RavenDB to try to detect what is going on, and there was nothing there. Eventually we tried to look into the Node.js RavenDB client, because we exhausted everything else. It looked okay, and in our tests, it… worked.

So we sat down and thought about what it could be. Let’s consider the actual scenario we have on hand:

  • Hebrew characters in JSON are being corrupted.
  • RavenDB uses UTF-8 encoding exclusively.
  • That means that Hebrew characters are using multi byte characters

That line of thinking led me to consider that the problem is related to chunking. We read from the network in chunks, and if the chunk happened to fall on a character boundary, we mess it up, maybe?

Once I started looking into this, the fix was obvious:

image

Here we go: ‍!

This bug is a great example of how things can not show up in practice for a really long time. In order to hit this you need chunking to happen in just the wrong place, and if you are running locally (as we usually do when troubleshooting), the likelihood you’ll see this is far lower. Given that most JSON property names and values are in the ASCII set, you need a chunk of just the right size to see it. Once we know about it, reproducing it is easy, just create a single string that is full of multi byte chars (such as an emoji) and make it long enough that it must be chunked.

The fix was already merged and released.

time to read 3 min | 470 words

A customer opened a support call telling us that they reached the scaling limits of RavenDB. Given that they had a pretty big machine specifically to handle the load they were expecting, they were (rightly) upset about that.

A short back and forth caused us to realize that RavenDB started to fail shortly after they added a new customer to their system. And by fail I mean that it started throwing OutOfMemoryException in certain places. The system was not loaded and there weren’t any other indications of high load. The system had plenty of memory available, but critical functions inside RavenDB would fail because of out of memory errors.

We looked at the actual error and found this log message:

Raven.Client.Exceptions.Database.DatabaseLoadFailureException: Failed to start database orders-21
At /data/global/ravenData/Databases/orders-21
 ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal(ThreadHandle t, Int32 stackSize, Int32 priority, Char* pThreadName)
   at System.Threading.Thread.StartCore()
   at Raven.Server.Utils.PoolOfThreads.LongRunning(Action`1 action, Object state, String name) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Utils\PoolOfThreads.cs:line 91
   at Raven.Server.Documents.TransactionOperationsMerger.Start() in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\TransactionOperationsMerger.cs:line 76
   at Raven.Server.Documents.DocumentDatabase.Initialize(InitializeOptions options, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DocumentDatabase.cs:line 388
   at Raven.Server.Documents.DatabasesLandlord.CreateDocumentsStorage(StringSegment databaseName, RavenConfiguration config, Nullable`1 wakeup) in C:\Builds\RavenDB-5.3-Custom\53024\src\Raven.Server\Documents\DatabasesLandlord.cs:line 826 

This is quite an interesting error. To start with, this is us failing to load a database, because we couldn’t spawn the relevant thread to handle transaction merging. That is bad, but why?

It turns out that .NET will only consider a single failure scenario for a thread failing to start. If it fails, it must be because the system is out of memory. However, we are running on Linux, and there are other reasons why that can happen. In particular, there are various limits that you can set on your environment that would limit the number of threads that you can set.

There are global knobs that you should look at first, such as those:

  • /proc/sys/kernel/threads-max
  • /proc/sys/kernel/pid_max
  • /proc/sys/vm/max_map_count

Any of those can serve as a limit. There are also ways to set those limits on a per process manner.

There is also a per user setting, which is controlled via:

/etc/systemd/logind.conf: UserTasksMax

The easiest way to figure out what is going on is to look at the kernel log at that time, here is what we got in the log:

a-orders-srv kernel: cgroup: fork rejected by pids controller in /system.slice/ravendb.service

That made it obvious where the problem was, in the ravendb.service file, we didn’t have TasksMax set, which meant that it was set to 4915 (probably automatically set by the system depending on some heuristic).

When the number of databases and operations on the database reached a particular size, we hit the limit and started failing. That is not a fun place to be in, but at least it is easy to fix.

I created this post specifically so it will be easy to Google that in the future. I also created an issue to get a better error message in this scenario.

FUTURE POSTS

  1. My new interview task: Stop the flow - one day from now
  2. Architectural optimizations vs the profiler - about one day from now

There are posts all the way to Aug 12, 2022

RECENT SERIES

  1. Production postmortem (43):
    05 Aug 2022 - The allocating query
  2. Webinar recording (14):
    26 Jul 2022 - RavenDB & Messaging Transactions
  3. Recording (5):
    25 Jul 2022 - Build your own database at Cloud Lunch & Learn
  4. High performance .NET (7):
    19 Jul 2022 - Building a Redis Clone–Analysis II
  5. Upcoming webinar (2):
    14 Jul 2022 - Involving RavenDB in your Messaging Transactions
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats