Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

time to read 3 min | 449 words

A user reported a bug to our support. When running on MacOS, they were unable to authenticate against a remote RavenDB instance.

That was strange, since we support running on MacOS, both as a client and as a server. We have had some issues around different behavior, but it is working, so what could the issue be?

RavenDB uses X509 certificates for authentication. That ensures mutual authentication for client and server, as well as securing the communication from any prying eyes. But on that particular system, it just did not work. RavenDB was accessible, but when attempting to access it, we weren't able to authenticate. When using the browser, we didn't get the "Choose the certificate" dialog either. That was really strange. Digging deeper, we verified that the certificate was set up properly in the keychain. We also tested Firefox, which has a separate certificate store; nothing worked.

Then we tested using curl, and were able to properly access and authenticate to the server. So something was really strange here. Testing from a different machine, we observed no issues.

The user mentioned that they recently moved to Catalina, which is known to have some changes in how it processes certificates. None of those applied to our scenario, however.

Eventually, we started comparing network traces and then we found something really interesting. Take a look at this:

[Image: comparison of the network traces]

That was an interesting discovery. The user had an antivirus installed, and the AV had installed a root CA and then set up a proxy to direct all traffic through the AV. Because it added a root CA, it was able to sniff all the traffic on the machine.

However, with a client certificate, that model doesn't work. The proxy would need the private key of the certificate to be able to authenticate to the remote system, which it obviously does not have. It silently stripped the request for a client certificate, which meant that, as far as RavenDB was concerned, there was no client certificate in the request, so we rightfully rejected it.
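To make the mechanics concrete, here is roughly what presenting a client certificate looks like from .NET. The file name, password and URL are placeholders, and the RavenDB client API wraps this plumbing for you; the point is only that the private key behind the certificate stays on the client, which is exactly what a TLS-intercepting proxy cannot reproduce:

using System;
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;

// Attach a client certificate to the HTTPS request. The proxy can re-encrypt
// the traffic with its own root CA, but it cannot present this certificate,
// because it does not hold the private key.
var handler = new HttpClientHandler();
handler.ClientCertificates.Add(
    new X509Certificate2("client.pfx", "certificate-password")); // placeholder path and password

using var client = new HttpClient(handler);
var response = await client.GetAsync("https://my-ravendb-server:443/databases"); // placeholder URL
Console.WriteLine(response.StatusCode);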

I found it interesting that we were able to use curl at all; I assume that Avast didn't set up the proxy to cover curl's traffic.

The solution was simple: exclude RavenDB from the inspected addresses. That immediately fixed the problem.

I spent some time trying to figure out if there was a good way for us to detect this automatically. Sadly, there is no way to tell from the client side which certificate was actually used. If there was, we could compare it to the expected result and alert on the mismatch.

time to read 5 min | 965 words

Recently JetBrains announced that dotTrace and dotMemory support profiling applications on Linux. We have been testing these out for production usage, analyzing production scenarios, and have been finding (and fixing) issues with about an order of magnitude less work and hassle. I have been a fan of JetBrains for over 15 years and they have been providing excellent software for a long time. So why this post? (And no, this isn't an ad; we pay for a JetBrains subscription for all our devs.)

I’m super excited about being able to profile on Linux, because this is where most of our customers are now running. Before, we had to use arcane tooling and try to figure things out. With better tools, we are able to pinpoint issues almost immediately.

Here is the result of a single profiling session:

[Image: profiler output from the session]

Here, we created a new List and then called ToArray() on it. All we had to do was create the array upfront, skipping the resizes and the additional allocations.
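For illustration, the shape of the change is roughly the following (hypothetical names and element type, not the actual RavenDB code):

using System.Collections.Generic;

static class CopyExample
{
    public static int[] Before(IReadOnlyList<int> source)
    {
        var list = new List<int>();          // starts small and resizes repeatedly
        for (int i = 0; i < source.Count; i++)
            list.Add(source[i]);
        return list.ToArray();               // one extra copy at the end
    }

    public static int[] After(IReadOnlyList<int> source)
    {
        var array = new int[source.Count];   // single allocation, no resizes, no final copy
        for (int i = 0; i < source.Count; i++)
            array[i] = source[i];
        return array;
    }
}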

[Image: profiler timings showing the time saved]

Here we had to do a much more involved operation, remembering values across transactions to reduce load. Note how much time we saved!

And of course, just as important:

[Image: profiler results for the attempted optimization]

We tried to optimize an operation. It looks like it didn't work out exactly as expected, so we need to take another look at that.

These examples are about performance, but even more important here is the ability to track memory and its sources. We were able to identify multiple places where we are using too much memory. In some cases, this is about configuration options. In one such case, limiting the amount of memory that we speculatively hold back resulted in a reduction of memory utilization over time from 3.5 GB to 800 MB. And these are much harder to figure out, because it is very hard to see the full picture.

Another factor that became obvious is that RavenDB's current infrastructure was written in 2015 / 2016. The lowest pieces of the code were written on what was then called DNX. It was written to be high performance and stable, and we haven't had to touch it much since. Profiling memory utilization showed us that we hold a large number of values in the large object heap. This is because we need to have a byte[] that we can also send to native code, so we need to pin it. That was because we needed to call methods such as Stream.Read(byte[], int, int) and then process the data using raw pointers.

When the code was written, that was the only choice. Allocate it on the large object heap (to reduce fragmentation) and run with it. Now, however, we have Memory<byte> and Span<byte>, which are much easier to work with for this scenario, so we can discard this specific feature and optimize our memory utilization further.
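As a rough sketch of the difference (buffer size and names are illustrative, not the RavenDB code):

using System;
using System.IO;

static class BufferExample
{
    public static int ReadOld(Stream stream)
    {
        // Old pattern: a large byte[] allocated on the LOH so it can be pinned
        // and handed to native code without the GC moving it around.
        var buffer = new byte[128 * 1024];           // > 85,000 bytes lands on the LOH
        int read = stream.Read(buffer, 0, buffer.Length);
        // ... fixed (byte* ptr = buffer) { /* process via raw pointers */ }
        return read;
    }

    public static int ReadNew(Stream stream, Memory<byte> pooledBuffer)
    {
        // New pattern: Stream.Read(Span<byte>) accepts any buffer, including
        // pooled memory, so there is no need to commit to a pinned LOH array.
        int read = stream.Read(pooledBuffer.Span);
        // ... process pooledBuffer.Span.Slice(0, read) directly
        return read;
    }
}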

I don't think we would have gotten to this without the profiler pointing a big red blinking arrow at the issue. And here is another good example: the memory state of RavenDB from dotTrace after running a benchmark consisting of a hundred thousand queries with JavaScript projections, this one:

[Image: the benchmarked query]

And here is the state:

[Image: the memory state]

That is… interesting. Why do we have so many of these engines? The JavaScript engine is single threaded, but we have concurrent queries, so we use multiple such engines. We cache them, so we have:

[Image: the engine cache]

And the mechanism there to clean them up will not be triggered normally. So we had these hanging around needlessly. After fixing the cache eviction logic, we get:

[Image: the memory state after fixing the cache eviction]

Looking at the byte arrays here, we can track this down further to see that the root cause is the pinning buffer I spoke about earlier. We're handling that fix in a different branch, so this is a pretty good sign that we are done with this kind of benchmark.

[Image: the byte array allocations]

dotTrace and dotMemory do have downsides. On Linux, dotTrace only supports sampling, not tracing. That sounds like a minor issue, but it means that when I'm running the same benchmark, I'm missing something that I find critical: the number of times a method was run. If a method is taking 50% of the runtime but is invoked a hundred million times, I need a very different approach than if it happens three times. The lack of a tracing option with call counts is a real issue for my usual workflow.

For dotMemory, if you have a highly connected graph (for example, a doubly linked list) and a lot of objects, the amount of resources that it takes to analyze the process is huge. I deployed a cloud machine with 128 GB of RAM and 32 cores to try to analyze an 18 GB process dump. That wasn’t sufficient to give all the details, sadly. It was still able to provide me with enough so I could figure out what the issue was, though.

Another issue is that the visuals are amazing, but when I have 120 million links in a node, trying to represent that in the UI is a hopeless cause. I wish that there was a way to dump the same information that dotMemory provides to text. That would allow analyzing ridiculous amounts of information in a structured manner.

time to read 2 min | 287 words

When you have an error code model for returning errors, you are going to be fairly limited in how you can report actual issues.

Here is a good example, taken from the ZStd source code:

[Image: ZStd source code returning the same error code for different failure conditions]

You can see that the resulting error is the same, but we have three very different errors. Well, to be fair, we have two types of errors.

The total size can be wrong, or the number of samples can be either too big or too small. There is no way to express that to the calling code, which may be far higher up the stack. There is just a "the source size is wrong" error.
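Here is a C# rendering of the pattern (not the ZStd code itself; the checks and thresholds are made up purely to show how distinct failures collapse into one code):

enum TrainError { None, SrcSizeWrong }

static class DictionaryTrainer
{
    public static TrainError TrainDictionary(byte[] samples, int[] sampleSizes)
    {
        long total = 0;
        foreach (var size in sampleSizes)
            total += size;

        if (total != samples.Length)       // sizes don't add up
            return TrainError.SrcSizeWrong;
        if (sampleSizes.Length < 10)       // too few samples (made-up threshold)
            return TrainError.SrcSizeWrong;
        if (sampleSizes.Length > 100_000)  // too many samples (made-up threshold)
            return TrainError.SrcSizeWrong;

        // The caller only ever sees SrcSizeWrong, never which check failed.
        // An exception could carry the context instead, for example:
        // throw new InvalidOperationException(
        //     $"Sample sizes add up to {total}, but the buffer is {samples.Length} bytes");
        return TrainError.None;
    }
}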

There is actually an attempt at proper error reporting. The DISPLAYLEVEL macro is a way to report more information about the error, but like any C library, we are talking about custom error reporting. The DISPLAYLEVEL macro writes to the standard output if a flag is set. That flag cannot be set from outside the compilation unit, as far as I can see. So consuming this from managed code means that I have to just guess what these errors are.

You can say a lot about the dangers and complexities of exceptions. But having a good way to report complex errors to the caller is very important. Note that in this case, complex means an arbitrary string generated at error time, not a complex object. An error code is great if you need to handle the error. But if you need to show it to the user, log it or handle it after the fact, a good clear error message is the key.

time to read 4 min | 654 words

A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:

[Image: the query, ordering by score() and then by sales]

Note that we first sort by score() and then by the amount of sales. The problem was that documents that should have had the same score were sorted in different locations.

Running the query, we would get:

[Image: the query results]

But all the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should rank higher than 62 thousand. What is going on?

We looked at the scores; here are the values for the part where we see the difference:

  • 1.702953815460205
  • 1.7029536962509155

Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why we have sorted the data as we did. The problem? Here is what we see:

[Image: the explanations output]

I'm only showing a small piece, but the values are identical for both documents. We managed to reduce the issue to a smaller data set (a few dozen documents, instead of tens of thousands), but the actual issue remained a mystery.

We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents and figured out the following. Here is how Lucene computes the score:

[Image: Lucene's score computation]

Lucene sums the various scoring factors to get the final value (highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let's look at things in more detail, shall we?

[Image: the individual score components]

Here is the deal: if we add all of those together in a calculator, we'll get 1.702953756.

This is close, but not quite what we get from the score. This is probably because the calculator uses arbitrary precision numbers, while we use floats. The problem is, all of the documents in the query have the exact same numbers, so why do we get different values?

I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring (part of the graph above) is computed on its own and then summed. That still doesn't explain what is going on. Then I realized that Lucene uses a heap of mutable values to store the scorers as it scores the values. Whenever it scores a document, it will mark the scorer as a match and then put it at the end of the heap. But the order of the items in the heap is not guaranteed.

Usually, this doesn’t matter, but please look at the above values and consider the following fact:

What do you think are the values of C and D in the code above?

  • c = 1.4082127
  • d = 1.4082128

Yes, the order of operations for addition matters a lot for floats. This is expected, given the way floating point numbers are represented in memory, but still surprising.
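A minimal demonstration of the effect (the constants here are illustrative, not the actual scoring factors above):

using System;

float big = 100_000_000f;    // at this magnitude, the spacing between floats is 8

float sum1 = big;
for (int i = 0; i < 8; i++)
    sum1 += 1f;              // each +1 is lost to rounding: still 100,000,000

float sum2 = 0f;
for (int i = 0; i < 8; i++)
    sum2 += 1f;              // accumulate the small values first: 8
sum2 += big;                 // 100,000,008

Console.WriteLine(sum1 == sum2); // False - same numbers, different order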

The fact that the order of operations in the Lucene scorer is not consistent means that you may get some very subtle bugs. To make this bug go away, you can do pretty much anything; reproducing it requires very careful setup and is incredibly delicate. And yet it tripped me hard enough to make me lose most of a day trying to figure out exactly what we did wrong.

Really annoying.

time to read 3 min | 523 words

A RavenDB user called us with a very strange issue. They are running on RavenDB 3.5 and have run out of disk space. That is expected, since they are storing a lot of data. Instead of simply increasing the disk size, they decided to take the time and increase the machine's overall capacity. They moved from a 16 core machine to a 48 core machine with a much larger disk.

After the move, they found out something worrying. RavenDB now used a lot more CPU. If previously the average load was around 60% CPU utilization, now they were looking at 100% utilization on a much more powerful machine. That didn't make sense to us, so we set out to figure out what was going on. A couple of mini dumps later, we had our answer.

It got really strange because there was the following interesting observation:

  • Under minimal load / idle – no CPU at all
  • Under high load – CPU utilization in the 40% range
  • Under medium load – CPU utilization at 100%

That was strange. When there isn't enough load, we are at 100%? What gives?

The culprit was simple: BlockingCollection.

“Huh”, I can hear you say. “How can that be?”

A BlockingCollection should not be the cause of high CPU, right? It is in the name: it is blocking. Here is what happened. That blocking collection is used to manage tasks, and by default we spawn threads to handle them at twice the number of available cores. All of these threads sit in a loop, calling Take() on the blocking collection.
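The pattern looks roughly like this (a sketch, not the actual RavenDB code):

using System;
using System.Collections.Concurrent;
using System.Threading;

var tasks = new BlockingCollection<Action>();

int workers = Environment.ProcessorCount * 2; // 96 threads on the 48 core machine
for (int i = 0; i < workers; i++)
{
    new Thread(() =>
    {
        while (true)
        {
            var work = tasks.Take(); // blocks until an item is added
            work();
        }
    }) { IsBackground = true }.Start();
}

tasks.Add(() => Console.WriteLine("work")); // wakes every waiter; only one wins the item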

The blocking collection is internally implemented using a SemaphoreSlim, which calls Wait() and Release() as needed. Here is the Release() method notifying waiters:

[Image: the Release() method in SemaphoreSlim]

What you can see is that if we have more than a single waiter, we'll update all of them. The system in question had 48 cores, so we had 96 threads waiting for work. When we add an item to the collection, all of them will wake and try to pull an item from the collection. One of them will succeed, and the rest will not.

Here is the relevant code:

[Image: the waiting code]

As you can imagine, that means that we have 96 threads waking up and spending a full cycle just spinning. That is the cause of our high CPU.

If we have a lot of work, then the threads are busy actually doing work, but if there is just enough work to wake the threads, but not enough to give them something to do, they’ll set forth to see how hot they can make the server room.

The fix was to reduce the number of threads waiting on this queue to a more reasonable number.

The actual problem was fixed in .NET Core, where the SemaphoreSlim will only wake as many threads as it has items to free, which will avoid the spin storm that this code generates.

time to read 3 min | 434 words

RavenDB makes extensive use of certificates for authentication and encryption. They allow us to safely communicate between distributed instances without worrying about a man in the middle or eavesdroppers. Given the choices we had to implement authentication, I’m really happy with the results of choosing certificates as the foundation of our authentication infrastructure.

It would be too good to be true, however, to expect to have no issues with certificates. The topic of this post is a puzzler. A user had chosen to use a self signed certificate for the nodes in the cluster, but was unable to authenticate between the servers unless they registered the certificate in the OS's certificate store.

That sounds reasonable, right? If this is a self signed certificate, we obviously don’t trust it, so we need this extra step to ensure that we do trust it. However, we designed RavenDB specifically to avoid this step. If you are using a self signed certificate, the server will trust its own certificate, and thus will trust anyone that is using the same certificate.

In this case, however, that wasn’t happening. For some reason, the code path that we use to ensure that we trust our own certificate was not being activated, and that was a puzzler indeed.

One of the things that RavenDB does on first startup is to try to connect to itself as a client. It checks whether that is successful or not. If not, we'll try again, ignoring the registered root CAs. If we are successful at that point, we know what the issue is and ensure that we ignore the untrusted signer on the certificate. We only enable this code path if we don't trust our own certificate by default.
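The probe described above would look something like this (a sketch under my own naming, not RavenDB's actual code):

using System;
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

static class CertificateTrustProbe
{
    public static async Task<bool> ShouldTrustOwnCertificate(string serverUrl, X509Certificate2 ourCertificate)
    {
        try
        {
            // First attempt: regular validation against the registered root CAs.
            using var normalClient = new HttpClient();
            await normalClient.GetAsync(serverUrl);
            return false; // the OS already trusts us, no special handling needed
        }
        catch (HttpRequestException)
        {
            // Second attempt: accept the connection only if the certificate
            // presented is exactly our own, ignoring the untrusted signer.
            var handler = new HttpClientHandler
            {
                ServerCertificateCustomValidationCallback =
                    (request, cert, chain, errors) =>
                        cert != null && cert.Thumbprint == ourCertificate.Thumbprint
            };
            using var selfTrustingClient = new HttpClient(handler);
            await selfTrustingClient.GetAsync(serverUrl);
            return true; // self signed and reachable: install the trust exception
        }
    }
}

Note that if the second attempt also fails, for example because the hostname doesn't resolve, the method never returns true and the trust exception is never installed. That detail matters for what follows.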

Looking at the logs, we could see that we got a failure when talking to ourselves, some sort of a "device not ready" issue. That was strange. We hooked up strace to look into what was going on, but nothing was wrong at the syscall level. Then we looked closer and realized that the server was configured to use https://ravendb-1.francecentral.cloudapp.azure.com/ but was actually hosted on https://ravendb-1-tst.francecentral.cloudapp.azure.com/

Do you see the difference?

The server was trying to contact itself using the configured hostname. It failed because of a DNS issue, so it couldn't contact itself to figure out that the certificate was untrusted. At that point, it didn't install the hook and wouldn't trust the self signed certificate.

So the issue started with investigating why the nodes in the cluster didn't trust each other with a self signed certificate, and was resolved by finding a simple configuration error.

time to read 2 min | 272 words

We have a bunch of smart people whose job description does not include breaking RavenDB, but nevertheless they manage to do so on a regular basis. This is usually done while subjecting RavenDB to new and interesting ways to distress it. The latest issue was being able to crash the system with an impossible bug. That was interesting to go through, so it is worth a post.

Take a look at the following code:
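The shape of the offending class is roughly this; only BufferedChannel and the _destinations field are named in the description below, everything else here is my guess:

using System;
using System.Collections.Generic;

public class BufferedChannel : IDisposable
{
    private readonly List<IDisposable> _destinations;

    public BufferedChannel()
    {
        // Under memory pressure, this allocation can throw an
        // OutOfMemoryException, leaving the object half constructed.
        _destinations = new List<IDisposable>(capacity: 1024);
    }

    ~BufferedChannel()
    {
        // The finalizer runs even though the constructor never completed.
        Dispose();
    }

    public void Dispose()
    {
        foreach (var destination in _destinations) // NullReferenceException here
            destination.Dispose();
        GC.SuppressFinalize(this);
    }
}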

This code will crash your application (you can discard multi threading as a cause, by the way) in a particularly surprising way.

Let’s assume that we have a system that is running under low memory conditions. The following sequence of events is going to occur:

  • BufferedChannel is allocated.
  • The BufferedChannel constructor is being run.
  • The attempt to allocate the _destinations list fails.
  • An exception is thrown upward and handled.

So far, so good, right? Except that now your system is living on borrowed time.

You see, this class has a finalizer. And the fact that the constructor didn't complete doesn't matter to the finalizer. It will still call the Finalize method, which calls Dispose, and Dispose expects to get a valid object. At this point, we are going to get a NullReferenceException from the finalizer, which is considered a critical error and fails the entire process.

For additional fun, this kind of failure will happen only under harsh conditions, when the OS refuses allocations, which doesn't happen very often. We have been running memory starvation routines for a while, and it takes a specific set of failures to get things failing in just the right manner to cause this.

time to read 1 min | 190 words

We ran into a hard failure during some diagnostics runs that was quite interesting. Here is the relevant code:
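Roughly speaking, the gist of it was something along these lines (the class and names are my own reconstruction):

using System;
using System.Collections.Generic;

public class CommandLog
{
    private readonly Dictionary<DateTime, string> _commands = new();

    public void Register(string command)
    {
        // Throws ArgumentException when two commands arrive within the same
        // clock tick, because the key is already present.
        _commands.Add(DateTime.UtcNow, command);
    }
}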

The code was meant to serve as a repository of commands, so we could see in what order they run through the system. (Yes, we could use a queue instead, but the code above isn't what we actually run; it is just enough to reproduce the problem I wanted to talk about.) And it failed, because there was already a value with the same key.

That was strange, because we didn’t expect that to happen, obviously. We wrote the following code:
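Again, this is a reconstruction of the kind of check we ran, counting how often the clock actually moves between consecutive reads (the measurement window is arbitrary):

using System;

int changes = 0, idle = 0;
var last = DateTime.UtcNow;
var until = DateTime.UtcNow.AddMilliseconds(100); // arbitrary window

while (DateTime.UtcNow < until)
{
    var now = DateTime.UtcNow;
    if (now == last)
    {
        idle++;           // the clock hasn't ticked since the last read
    }
    else
    {
        changes++;        // the clock moved to a new value
        last = now;
    }
}

Console.WriteLine($"Changes: {changes}, Idle: {idle}");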

And we got some really interesting results. On my laptop, I get:

Changes: 9388, Idle: 5442

We got widely varying results. Mostly, I believe, it is related to the system configuration and the CPU frequency. But the basic problem is that on that particular machine, we were fast enough to run commands faster than the machine could count.

That is a good problem to have, I think. We fixed it by making sure our time source is monotonically increasing, which is good enough for debug code.

time to read 3 min | 564 words

We got a bug report recently about RavenDB messing up a query. The query in question was:

from index TestIndex where Amount < 0

The query, for what it is worth, is meant to find accounts that have pre-paid. The problem was that this query returned bad results, which was very strange. Comparing numbers is a very well trodden field in RavenDB.

How can this be? Let's look a bit deeper into things, shall we? Luckily, the reproduction is pretty simple; let's take a look:

With an index like this, we can be fairly certain that the query above will return no results, right? After all, this isn’t any kind of complex math. But the query will return results for this index. And that is strange.

You might have noticed that the numbers we use are decimals (1.5m is the C# postfix for a decimal constant). The problem repeats with decimal, double and float. When we started looking at this, we assumed that the problem was floating point precision. After all, any comparison of floating point values will warn you about this. But using decimal, we are supposed to be protected from that.

When faced with such an issue, we dove deep into how RavenDB does numeric range comparisons. Sadly, this is not a trivial matter, but after reviewing everything, we were certain that the code was correct; it had to be something else.

I finally ended up with the following code:
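This is my best guess at the shape of that code: dump the raw bits of a regular zero and a negative zero decimal and compare them.

using System;
using System.Linq;

decimal zero = 0m;
decimal negativeZero = decimal.Negate(0.0m); // flips the sign bit, the value is still zero

Console.WriteLine(string.Join(" - ",
    decimal.GetBits(zero).Select(part => part.ToString("X8"))));
Console.WriteLine(string.Join(" - ",
    decimal.GetBits(negativeZero).Select(part => part.ToString("X8"))));

Console.WriteLine($"{negativeZero} == {zero} ? {negativeZero == zero}");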

And that gave me some interesting output:

00000000 - 00000000 - 00000000 - 00000000
00000000 - 00000000 - 00000000 - 80010000

0.0 == 0 ? True

And that was quite surprising. We can get the same behavior using doubles or floats:
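The equivalent check for double and float is a one-liner per value; something like this produces the output below:

using System;

Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(0.0)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(-0.0)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(0.0f)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(-0.0f)));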

And this code gives us:

00-00-00-00-00-00-00-00
00-00-00-00-00-00-00-80
00-00-00-00
00-00-00-80

What is going on here?

Well, let’s look at the format of floating point numbers for a sec, shall we? There is a great discussion of this here, but the key observation for this particular issue can be seen by looking at the binary representation.

Here is 2.0:

[Image: binary representation of 2.0]

And here is -2.0:

[Image: binary representation of -2.0]

As usual, the first bit is the sign marker. But unlike int32 or int64, with floating point, it is absolutely legal to have the following byte patterns:

[Image: the byte pattern for zero]

[Image: the byte pattern for negative zero]

The first one here is zero, and the second one is negative zero. They are equal to one another, but the binary representation is different. And that caused our issue.

Part of how RavenDB does numeric range queries on floating point values is to translate them to a lexical value that can be compared using memcmp(). There is some magic involved, but the key observation is that we didn't account for negative zero. The negative zero was translated to –9,223,372,036,854,775,808 (long.MinValue), and that is obviously smaller than zero.

The fix in our code was to handle this explicitly, like so:
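A sketch of what that explicit handling can look like (not necessarily the exact fix): normalize negative zero before translating the number into its sortable form.

static class NumericEncoder
{
    public static double NormalizeForIndexing(double value)
    {
        if (value == 0)
            return 0; // -0.0 == 0.0 is true, so both map to +0.0 before encoding
        return value;
    }
}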

And that is how I started out by chasing bugs and ended up with a sum total of negative zero.

time to read 5 min | 946 words

This story started a few years ago, in a very non technical setting. We changed the accountant that we use for Hibernating Rhinos. We outgrew the office we were using at the time and needed better services. Among the changes that were implemented as a result of this move was the usage of new accounting software. Nothing really that interesting, to be frank. I like my accounting boring. However, the new accounting software was an on-premise solution. In other words, we are the ones running it. Which is perfectly fine; we provisioned a VM in our data center (a fancy name for the single rack that we had at the time) and let it run.

As you can imagine, we consider our accounting data to be mission critical, so to speak. I don't mind not being able to access it for an hour, for example, but losing it is going to be Bad. So we had a backup, nothing really that interesting. We have a backup that goes to a local disk on the VM, a remote disk in the office and, just to be safe, an upload of the backup to S3. I asked one of our developers to take care of this, and aside from specifying that I want backups in triplicate, I didn't really pay attention. That was around 2017, I believe. I made sure that if the backup failed, we would get notified of that, and that was pretty much it.

One of the reasons that I like my accounting boring is that it simplifies my life and reduces stress. Unfortunately, it seems like my accounting practices have a cost. In particular, it means that I favor paying a bit too much to the taxman. That means that all of the taxes are going out immediately, and the company doesn't end the year with a large tax bill that we need to cover. But I overdid it a time or two, and we overpaid on our taxes. Well, that was by design; extra money showing up from the taxman is much better than a surprise bill. But at a certain point, we were supposed to get a refund for a non trivial amount. At which point the tax authorities came a-calling and audited us.

Remember that I talked about boring accounting practices. The day we started the audit, I was having dinner with my wife, and being audited was the third topic of the day, if I recall properly. They found a few things that we did wrong (we registered an invoice in the wrong currency, so we cancelled it and issued a new one, instead of refunding it and issuing a new one). That was a Thing, it seemed. But the end result was pretty much nothing. I loved it. Since then, we have been audited a few more times, always with no repercussions.

Given that the next audit is a question of when (usually every 18 – 30 months or so, it seems), not if, I really care about my accounting data. Hence the triple backups policy. You might have been going through this post expecting to hear that we lost the accounting data, the backup failed, and now my accounting outlook is decidedly not boring. I'm afraid that this is only half true. We did have a failed backup, but we caught it before we actually needed it.

At one point, I looked at our backup policies, and I noticed that the accounting backup was months old at this point. That was concerning, I gotta say. Here is the timeline, as best I could piece it together:

  • Q2 2017 – Backup process is defined and tested. This is a one off process that we use only for the accounting database.
  • Q1 2018 – Routine key rotation is performed on some of our keys. Unbeknownst to us, the backup process loses the ability to report failure. But given that it doesn't fail, no one notices.
  • Q4 2018 – The developer responsible for setting up the backup process leaves the company. As part of the outgoing employee process, we shut down the relevant user accounts.
  • Q1 2019 – The accounting server is rebooted. The backup process fails to start, because the user account is disabled.

You might notice the scale of this issue. The underlying problem was that the developer set up this one off process as a… well, one off process. That meant that it wasn't hooked into any of our usual monitoring / alert systems. It did have a way to report on errors, but the credentials for that went stale after a year. No one paid attention, since the backups continued to run.

The backup process was also running under the user account of the developer, not a service account. I guess it was easier than creating a dedicated user, but the end result was that when we deactivated the user account after the developer left the company, we also disabled the backup. But the process was running, and it continued to run for months. Only much later would the process fail to start, and by then there was no way to report errors, and we noticed it only because we looked for it during routine operations.

One of the reasons we built backups directly into the core of RavenDB was exactly this sort of situation. A backup process is not something that you cobble together (that's on us, to be fair); it is something that should be part and parcel of the operations of your database, and being able to do something like get backups in triplicate is essential for a good operations experience.
