Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 523 words

A RavenDB user called us with a very strange issue. They are running on RavenDB 3.5 and had run out of disk space. That was expected, since they are storing a lot of data. Instead of just increasing the disk size, they took the opportunity to increase the machine's overall capacity, moving from a 16-core machine to a 48-core machine with a much larger disk.

After the move, they found something worrying. RavenDB now used a lot more CPU. Where previously the average load was around 60% CPU utilization, they were now looking at 100% utilization on a much more powerful machine. That didn't make sense to us, so we set out to figure out what was going on. A couple of mini dumps later, we had our answer.

It got stranger still once we made the following observation:

  • Under minimal load / idle – no CPU at all
  • Under high load – CPU utilization in the 40% range
  • Under medium load – CPU utilization at 100%

That was strange. When there isn't enough load to keep us busy, we are at 100%? What gives?

The culprit was simple: BlockingCollection.

“Huh”, I can hear you say. “How can that be?”

A BlockingCollection should not be the cause of high CPU, right? It is in the name: it is blocking. Here is what happened. That blocking collection is used to manage tasks, and by default we spawn twice as many threads as available cores to handle them. All of these threads sit in a loop, calling Take() on the blocking collection.

Internally, the blocking collection is implemented using a SemaphoreSlim, calling Wait() and Release() on it as items are taken and added. Here is the Release() method notifying waiters:

[image: the SemaphoreSlim.Release() implementation]

What you can see is that if we have more than a single waiter, we'll update all of them. The system in question had 48 cores, so we had 96 threads waiting for work. When we add an item to the collection, all of them will wake up and try to pull an item from the collection. One of them will succeed, and the rest will not.

Here is the relevant code:

[image: the waiting side of the code]

As you can imagine, that means that we have 96 threads waking up and spending a full cycle just spinning. That is the cause of our high CPU.

If we have a lot of work, then the threads are busy actually doing work, but if there is just enough work to wake the threads, but not enough to give them something to do, they’ll set forth to see how hot they can make the server room.

The fix was to reduce the number of threads waiting on this queue to a more reasonable number.

The actual problem was fixed in .NET Core, where SemaphoreSlim will only wake as many threads as there are items to take, which avoids the spin storm that this code generates.

time to read 2 min | 254 words

We run a lot of benchmarks internally, and sometimes it feels like there is a roaming band of performance-focused optimizers going through the office, trying to find underutilized machines. Some people mine Bitcoin for fun; in our office, we benchmark RavenDB and try to see if we can either break a record or break RavenDB.

Recently a new machine was… repurposed to serve as a benchmarking server. You can call it a rite of passage for most new machines here. The problem with that machine was that the client would error out. Not only would it fail, it failed at the exact same interval. We tested from multiple clients and from multiple machines and found that every 30 minutes on the dot, we'd have an outage that lasted under one second.

Today I came into the office to news that the problem had been found:

[image: the network adapter's power saving configuration]

It seems that after 30 minutes of idle time (no user logged in), the machine would turn off the ethernet adapter, regardless of whether there were active connections. Shortly afterward it would be woken up, of course, but it would be down just long enough for us to notice.

In fact, I'm really happy that we got an error. I would hate to have to figure out latency spikes caused by something like this, and I still don't know how the team found the root cause.

time to read 2 min | 252 words

In my previous post, I asked you to find the bug in the following code:

This code looks okay at a glance, but it turns out to be a really nasty data corruption bug waiting to happen. Here is what the problematic usage looks like:

Do you see the error now?

If the operation times out, an exception will be raised, but the underlying I/O operation isn't over. We are using a shared pool, so the buffer we used may be handed over to someone else. At that point, we do something with the buffer, but the still-pending I/O operation will read data into it, meaning that there is probably going to be garbage in it when we actually use it.

For this to actually happen, you need a timed-out operation, reuse of the buffer, and the I/O operation completing at just the wrong time. In other words, a sequence of highly unlikely events that would assuredly happen within an hour of pushing something like this to production. For extra fun, it will reliably happen the moment you have network issues. So imagine that you have a slow node, which then causes memory corruption, which ends up as a visible bug (instead of, perhaps, an aborted request) very rarely, and with no indication of how it happened.

How do you fix this? Like this:

This will use a cancellation token, which causes the operation to be aborted at the stream level, meaning that we can safely reuse the values we passed to the underlying stream.

time to read 2 min | 400 words

Hadi had an interesting tweet:

This sounds reasonable on the surface, but it runs into hard issues with the vast difference between any amount of money and free. There have been quite a few studies on this; a good read on the subject can be found here. Take note of the Amazon experience: free shipping increases sales, while shipping that costs $0.20 does not, even though from a practical standpoint there is no real difference between the two prices. It matters, quite a lot.

Leaving aside the psychology of free, there are other issues. Let's say that there is a tool that can save someone on my team 10 minutes a day, priced at $1/year. By any measure you care to name, that is more than worth it.

But the process of actually getting it purchased is likely to cost more than the tool itself. In my case, the process is “go talk to Oren”. In other environments, it may involve “talk to your boss, who will submit it to accounts payable, which will pay it in 60 days”.

And at that point, charging $1 just doesn't matter. You could charge $50, and the “cost” for the person making the purchase would be pretty much the same. Note that this applies to corporate environments; the situation is (slightly) different for sales directly to consumers, but not significantly so. Free still trumps everything.

When talking about freemium models, the conversion rates you hear quoted are bad. For example, Evernote had a < 2% conversion rate, and that is considered a wildly successful example. Dropbox has a rate of about 4%, and the average seems to be 1%. And that is for businesses that are focused on optimizing the freemium funnel, which takes a lot of time and effort, mind you.

I don’t think that there is a practical option for turning this around, and I say that as someone who would love it if that were at all possible.

time to read 2 min | 245 words

I’ll be writing a lot more about our RavenDB C++ client, but today I was reviewing some code and I got a reply that made me go: “Ohhhhh! Nice”, and I just had to blog about it.

[image: the C++ client query code]

This is pretty much a direct translation of how you would write this kind of query in C#, and the output is an RQL query that looks like this:

[image: the generated RQL query]

The problem is that I know how the C# version works. It uses Reflection to extract the field names from the type, so we can figure out what fields you are interested in. In C++, you don’t have Reflection, so how can this possibly work?

What Alexander did was really nice. The user already has to provide us with the serialization routine for the type (so we can turn the JSON into the typed values that will be returned). Inside the select_fields() call, he constructs an empty object, serializes it, and then uses the field names in the resulting JSON to figure out which fields to project from the Users documents.

It makes perfect sense, requires no additional work from the user, and gives us a consistent API. It is also something that I would probably never have thought to do.

time to read 2 min | 274 words

After trying (and failing) to use rustls to handle client authentication, I tried the rust-openssl bindings. They crapped out on me with a really scary link error. I spent some time trying to figure out what was going on, but given that I wanted to write Rust code, not deal with link errors, I decided to see if the final alternative in the Rust ecosystem, the native-tls package, would work.

And… that is a no-go as well. Which is sad, because the actual API is quite nice. The reason it isn't going to work? The native-tls package simply has no support for client certificate authentication when running as a server, so it is not usable for me.

That leaves me with strike three out of three:

  • rustls – native Rust API, easy to work with, but doesn't allow accepting arbitrary client certificates, only ones from known issuers.
  • rust-openssl – I have built on top of OpenSSL before, so I know it works. However, trying to build it on Windows resulted in link errors, so that was out.
  • native-tls – doesn’t have support for client certificates, so not usable.

I think that at this point, I have three paths available to me:

  • Give up and maybe try doing something else with Rust.
  • Fork rustls and add support for accepting arbitrary client certificates. I'm not happy with this because it requires changing not just rustls but probably also the webpki package, and I'm not sure whether the changes I have in mind would hurt the security of the system.
  • Try to fix the OpenSSL link issue.

I think that I’ll go with the third option, but this is really annoying.

time to read 2 min | 328 words

In my previous post, I asked about the following code and what its output will be:

As it turns out, this code will output two different numbers:

  • On Debug – 134,284,904
  • On Release – 66,896

The numbers are reproducible in each mode.

I was pretty sure that I knew what was going on, but I asked around to verify. You can read the GitHub issue if you want the spoiler.

I attached to the running program in WinDBG and issued the following command:

We care about the last line. In particular, we can see that all the memory is indeed in the byte array, as expected.

Next, let’s dump the actual instances that take so much space:

There is one large instance here that we care about, let’s see what is holding on to this fellow, shall we?

It looks like we have a reference from a local variable. Let's verify that with the clrstack command, asking it to give us the parameters and local variables, like so:

The interesting line is 16, which shows:

[image: line 16 of the clrstack output]

In other words, here is the local variable, and it is set to null. What is going on? And why don't we see the same behavior in release mode?

As mentioned in the issue, the problem is that the JIT introduces a temporary local variable here, which the GC is aware of, but WinDBG is not. This causes the program to hold on to the value for longer than expected.

In general, this should only be a problem if you have a long-running loop. In fact, in some cases we do, and in debug mode that actually caused our memory utilization to go through the roof, leading to this investigation.

In release mode, these temporary variables are rarer (but can still happen, it seems).
