Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 4 min | 654 words

A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:

[Image: the query, ordering by score() and then by Sales descending]

Note that we first sort by score(), and then by the amount of sales. The problem was that documents that should have had the same score were sorted into different locations.

Running the query, we would get:

[Image: the query results, in the unexpected order]

But all the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should rank higher than 62 thousand. What is going on?

We looked at the score, here are the values for the part where we see the difference:

  • 1.702953815460205
  • 1.7029536962509155

Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why we have sorted the data as we did. The problem? Here is what we see:

[Image: a snippet of the explanations output, identical for both documents]

I’m only showing a small piece, but the values are identical for both documents. We managed to reduce the issue to a smaller data set (a few dozen documents, instead of tens of thousands), but the actual issue was a mystery.

We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents. Here is how Lucene computes the score:

[Image: the Lucene code that computes the score]

It sums the various scoring factors to get the final value (highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let’s look at things in more detail, shall we?

[Image: the individual scoring factors for the two documents]

Here is the deal: if we add all of those together in a calculator, we’ll get: 1.702953756

This is close, but not quite what we get from the score. That is probably because the calculator uses arbitrary precision numbers, while we use floats. The problem is that all of the documents in the query have the exact same numbers, so why do we get different values?

I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring (part of the graph above) is computed on its own and then summed. That still didn’t explain what was going on. Then I realized that Lucene uses a heap of mutable values to store the scorers as it scores the documents. Whenever we score a document, we mark the scorer as a match and then put it at the end of the heap. But the order of the items in the heap is not guaranteed.

Usually, this doesn’t matter, but please look at the above values and consider the following fact:

What do you think are the values of C and D in the code above?

  • c = 1.4082127
  • d = 1.4082128

Yes, the order of operations for addition matters a lot for floats. This is expected behavior, given the way floating point numbers are represented in memory, but it still manages to surprise.
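
The actual scorer values aren’t shown above, but the effect is easy to demonstrate with a small, self-contained example (the numbers here are illustrative, not the ones from the query; float behaves the same way as double):

```csharp
using System;

// Floating point addition is not associative: summing the same values in a
// different order can produce results that differ in the last bit.
double a = 0.1, b = 0.2, c = 0.3;

Console.WriteLine((a + b) + c); // 0.6000000000000001
Console.WriteLine(a + (b + c)); // 0.6
```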

The fact that the order of operations in the Lucene scorer is not consistent means that you may get some very subtle bugs. If you want to avoid reproducing this bug, you can do pretty much anything and it will go away; it requires a very careful setup and is incredibly delicate. And yet it tripped me up hard enough to make me lose most of a day trying to figure out exactly where we went wrong.

Really annoying.

time to read 3 min | 523 words

A RavenDB user called us with a very strange issue. They are running on RavenDB 3.5 and ran out of disk space. That is expected, since they are storing a lot of data. Instead of simply increasing the disk size, they decided to take the time and increase the machine’s overall capacity. They moved from a 16-core machine to a 48-core machine with a much larger disk.

After the move, they found out something worrying. RavenDB now used a lot more CPU. If previously the average load was around 60% CPU utilization, they were now looking at 100% utilization on a much more powerful machine. That didn’t make sense to us, so we set out to investigate. A couple of mini dumps later, we were able to figure out what was going on.

It got really strange because there was the following interesting observation:

  • Under minimal load / idle – no CPU at all
  • Under high load – CPU utilization around 40%
  • Under medium load – CPU utilization at 100%

That was strange. When there isn’t enough load, we are at 100%? What gives?

The culprit was simple: BlockingCollection.

“Huh”, I can hear you say. “How can that be?”

A BlockingCollection should not be the cause of high CPU, right? It is in the name: it is blocking. Here is what happened. That blocking collection is used to manage tasks, and by default we spawn threads to handle them at twice the number of available cores. All of these threads sit in a loop, calling Take() on the blocking collection.
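
A rough sketch of that pattern (the names and structure here are mine, not the actual RavenDB code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

var tasks = new BlockingCollection<Action>();

// Twice the number of cores: 96 consumer threads on a 48-core machine.
int workers = Environment.ProcessorCount * 2;
for (int i = 0; i < workers; i++)
{
    new Thread(() =>
    {
        while (true)
        {
            // Take() blocks on an internal SemaphoreSlim until an item shows up.
            Action work = tasks.Take();
            work();
        }
    })
    { IsBackground = true }.Start();
}

// Every Add() wakes all of the waiters; only one of them gets the item.
tasks.Add(() => Console.WriteLine("work"));
```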

The blocking collection is internally implemented using a SemaphoreSlim, which calls Wait() and Release() as needed. Here is the Release() method notifying the waiters:

[Image: the SemaphoreSlim.Release() code that notifies the waiters]

What you can see is that if we have more than a single waiter, we’ll notify all of them. The system in question had 48 cores, so we had 96 threads waiting for work. When we add an item to the collection, all of them wake up and try to pull an item from the collection. One of them will succeed, and the rest will not.

Here is the relevant code:

[Image: the BlockingCollection wait loop that wakes and spins the waiting threads]

As you can imagine, that means that we have 96 threads waking up and spending a full cycle just spinning. That is the cause of our high CPU.

If we have a lot of work, then the threads are busy actually doing work, but if there is just enough work to wake the threads, but not enough to give them something to do, they’ll set forth to see how hot they can make the server room.

The fix was to reduce the number of threads waiting on this queue to a more reasonable number.

The actual problem was fixed in .NET Core, where the SemaphoreSlim will only wake as many threads as it has items to free, which will avoid the spin storm that this code generates.

time to read 3 min | 434 words

RavenDB makes extensive use of certificates for authentication and encryption. They allow us to safely communicate between distributed instances without worrying about a man in the middle or eavesdroppers. Given the choices we had to implement authentication, I’m really happy with the results of choosing certificates as the foundation of our authentication infrastructure.

It would be too good, however, to expect to have no issues with certificates. The topic of this post is a puzzler. A user chose to use a self signed certificate for the nodes in the cluster, but was unable to authenticate between the servers unless they registered the certificate in the OS’s certificate store.

That sounds reasonable, right? If this is a self signed certificate, we obviously don’t trust it, so we need this extra step to ensure that we do trust it. However, we designed RavenDB specifically to avoid this step. If you are using a self signed certificate, the server will trust its own certificate, and thus will trust anyone that is using the same certificate.

In this case, however, that wasn’t happening. For some reason, the code path that we use to ensure that we trust our own certificate was not being activated, and that was a puzzler indeed.

One of the things that RavenDB does on first startup is to try to connect to itself as a client. It checks whether that is successful or not. If not, we try again, this time ignoring the registered root CAs. If we are successful at that point, we know what the issue is and make sure to ignore the untrusted signer on the certificate. We only enable this code path if we don’t trust our own certificate by default.
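
Here is a minimal sketch of that startup check, assuming an HttpClient call and a thumbprint comparison; the names are hypothetical and the actual RavenDB code is more involved:

```csharp
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

public static class SelfTrustCheck
{
    public static async Task<bool> ShouldTrustOwnCertificateAsync(string serverUrl, X509Certificate2 ownCert)
    {
        try
        {
            // First attempt: a plain call to ourselves, validated against the OS trust store.
            using var client = new HttpClient();
            await client.GetAsync(serverUrl);
            return false; // the OS already trusts our certificate, no hook needed
        }
        catch (HttpRequestException)
        {
            // Second attempt: ignore the registered root CAs and accept only our own certificate.
            using var handler = new HttpClientHandler
            {
                ServerCertificateCustomValidationCallback =
                    (message, cert, chain, errors) => cert?.Thumbprint == ownCert.Thumbprint
            };
            using var permissive = new HttpClient(handler);
            await permissive.GetAsync(serverUrl); // throws again if the failure wasn't about trust
            return true; // install the hook that trusts our own certificate
        }
    }
}
```

Note that if the second call also fails, for a reason that has nothing to do with trust, the hook is never installed. That detail matters in a moment.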

Looking at the logs, we could see that we got a failure when talking to ourselves, some sort of a device not ready issue. That was strange. We hooked up strace to look into it, but there was nothing wrong at the syscall level. Then we dug deeper and realized that the server was configured to use: https://ravendb-1.francecentral.cloudapp.azure.com/ but was actually hosted on https://ravendb-1-tst.francecentral.cloudapp.azure.com/

Do you see the difference?

The server was trying to contact itself using the configured hostname. It failed because of a DNS issue, so it couldn’t contact itself to figure out that the certificate was untrusted. At that point, it didn’t install the hook and wouldn’t trust the self signed certificate.

So the issue started with investigating why nodes in the cluster don’t trust each other with a self signed certificate, and it was resolved by finding a simple configuration error.

time to read 2 min | 272 words

We have a bunch of smart people whose job description does not include breaking RavenDB, but nevertheless they manage to do so on a regular basis. This is usually done while subjecting RavenDB to new and interesting ways to distress it. The latest issue was being able to crash the system with an impossible bug. That was interesting to go through, so it is worth a post.

Take a look at the following code:
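
(The original snippet isn’t reproduced in this archive; the following is a hedged reconstruction based on the description below. BufferedChannel and _destinations come from the post, the rest is guesswork.)

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class BufferedChannel : IDisposable
{
    private readonly List<Stream> _destinations;

    public BufferedChannel()
    {
        // Under memory pressure this allocation can throw OutOfMemoryException,
        // leaving the object only partially constructed.
        _destinations = new List<Stream>(capacity: 1024);
    }

    ~BufferedChannel()
    {
        Dispose(); // the finalizer runs even if the constructor threw
    }

    public void Dispose()
    {
        GC.SuppressFinalize(this);
        foreach (var destination in _destinations) // NullReferenceException if the constructor failed
            destination.Dispose();
    }
}
```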

This code will crash your application (you can discard multi threading as a source, by the way) in a particularly surprising way.

Let’s assume that we have a system that is running under low memory conditions. The following sequence of events is going to occur:

  • BufferedChannel is allocated.
  • The BufferedChannel constructor is being run.
  • The attempt to allocate the _destinations list fails.
  • An exception is thrown upward and handled.

So far, so good, right? Except that now your system is living on borrowed time.

You see, this class has a finalizer. And the fact that the constructor never completed doesn’t matter to the finalizer. The Finalize method will still be called, and that method calls Dispose, and Dispose expects to get a valid object. At this point, we are going to get a NullReferenceException from the finalizer, which is considered a critical error and fails the entire process.

For additional fun, this kind of failure will happen only under harsh conditions, when the OS refuses allocations, which doesn’t happen very often. We have been running memory starvation routines for a while, and it takes a specific set of failures to get things to fail in just the right manner to cause this.

time to read 1 min | 190 words

We ran into a hard failure during some diagnostics runs that was quite interesting. Here is the relevant code:
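
(The original snippet isn’t reproduced in this archive; this is a hedged guess at its shape, based on the description that follows.)

```csharp
using System;
using System.Collections.Generic;

public class CommandLog
{
    private readonly SortedDictionary<DateTime, string> _commands = new();

    public void Record(string command)
    {
        // Throws ArgumentException when two commands arrive within the same
        // clock tick, because the key already exists.
        _commands.Add(DateTime.UtcNow, command);
    }
}
```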

The code was meant to serve as a repository of commands, so we can see in what order they went into the system. (Yes, we could have used a queue instead, but the code above isn’t what we actually run; it is just enough to reproduce the problem I wanted to talk about.) And it failed, because there was already a value with the same key.

That was strange, because we didn’t expect that to happen, obviously. We wrote the following code:
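
Something along these lines, assuming the point was to measure how often the clock actually advances between consecutive reads (the exact loop is a guess):

```csharp
using System;

int changes = 0, idle = 0;
var last = DateTime.UtcNow;
var deadline = DateTime.UtcNow.AddMilliseconds(100);

while (DateTime.UtcNow < deadline)
{
    var now = DateTime.UtcNow;
    if (now == last)
        idle++;    // the clock did not move between two reads
    else
        changes++; // the clock advanced
    last = now;
}

Console.WriteLine($"Changes: {changes}, Idle: {idle}");
```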

And we got some really interesting results. On my laptop, I get:

Changes: 9388, Idle: 5442

We got widely varying results. Mostly, I believe, this is related to the system configuration and the CPU frequency. But the basic problem is that on that particular machine, we were fast enough to run commands faster than the machine could count.

That is a good problem to have, I think. We fixed it by making sure our time source is monotonically increasing, which is good enough for debug code.
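
One way to do that, as a sketch rather than the actual fix: if the clock hasn’t moved since the last read, bump the previous value by a single tick.

```csharp
using System;
using System.Threading;

static class MonotonicClock
{
    private static long _lastTicks;

    public static DateTime UtcNow()
    {
        while (true)
        {
            long last = Interlocked.Read(ref _lastTicks);
            long proposed = Math.Max(DateTime.UtcNow.Ticks, last + 1);
            if (Interlocked.CompareExchange(ref _lastTicks, proposed, last) == last)
                return new DateTime(proposed, DateTimeKind.Utc);
        }
    }
}
```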

time to read 3 min | 564 words

We got a bug report recently about RavenDB messing up a query. The query in question was:

from index TestIndex where Amount < 0

The query, for what it is worth, is meant to find accounts that have pre-paid. The problem was that this query returned bad results. This was something very strange. Comparison of numbers is a very well trodden field in RavenDB.

How can this be? Let’s look a bit deeper into things, shall we? Luckily the reproduction is pretty simple, let’s take a look:

With an index like this, we can be fairly certain that the query above will return no results, right? After all, this isn’t any kind of complex math. But the query will return results for this index. And that is strange.

You might have noticed that the numbers we use are decimals (1.5m is the C# suffix for a decimal constant). The problem repeats with decimal, double and float. When we started looking at this, we assumed that the problem was floating point precision. After all, any discussion of comparing floating point values will warn you about this. But using decimal, we are supposed to be protected from that.

When faced with such an issue, we dove deep into how RavenDB does numeric range comparisons. Sadly, this is not a trivial matter, but after reviewing everything, we were certain that the code was correct; it had to be something else.

I finally ended up with the following code:
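
(The snippet itself isn’t shown here; a hedged reconstruction that produces this kind of output is to dump the raw bits of two decimal zeros and then compare them.)

```csharp
using System;
using System.Linq;

static string Bits(decimal d) =>
    string.Join(" - ", decimal.GetBits(d).Select(part => part.ToString("X8")));

decimal zero = 0m;
decimal negativeZero = decimal.Negate(0.0m); // sign bit set, zero magnitude

Console.WriteLine(Bits(zero));
Console.WriteLine(Bits(negativeZero));
Console.WriteLine($"0.0 == 0 ? {negativeZero == zero}");
```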

And that gave me some interesting output:

00000000 - 00000000 - 00000000 - 00000000
00000000 - 00000000 - 00000000 - 80010000

0.0 == 0 ? True

And that was quite surprising. We can get the same using double or float:
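
For double and float, the equivalent check is just dumping the raw bytes (again a sketch, not necessarily the original code):

```csharp
using System;

Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(0.0)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(-0.0)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(0.0f)));
Console.WriteLine(BitConverter.ToString(BitConverter.GetBytes(-0.0f)));
```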

And this code gives us:

00-00-00-00-00-00-00-00
00-00-00-00-00-00-00-80
00-00-00-00
00-00-00-80

What is going on here?

Well, let’s look at the format of floating point numbers for a sec, shall we? There is a great discussion of this here, but the key observation for this particular issue can be seen by looking at the binary representation.

Here is 2.0:

[Image: the binary representation of 2.0]

And here is -2.0:

[Image: the binary representation of -2.0]

As usual, the first bit is the sign marker. But unlike int32 or int64, with floating point, it is absolutely legal to have the following byte patterns:

[Image: the byte pattern for zero]

[Image: the byte pattern for negative zero]

The first one here is zero, and the second one is negative zero. They are equal to one another, but the binary representation is different. And that caused our issue.

Part of how RavenDB does numeric range queries on floating point values is to translate them to a lexical value that can be compared using memcmp(). There is some magic involved, but the key observation was that we didn’t account for negative zero. The negative zero was translated to –9,223,372,036,854,775,808 (long.MinValue), and that is obviously smaller than zero.

The fix in our code was to handle this explicitly, like so:
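
A minimal sketch of the shape of that fix (a hypothetical helper, not the actual RavenDB code): collapse negative zero into a plain zero before the value is translated into its sortable form.

```csharp
static double NormalizeZero(double value)
{
    // -0.0 == 0.0 evaluates to true, so this turns a negative zero into a
    // positive zero without touching any other value.
    return value == 0.0 ? 0.0 : value;
}
```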

And that is how I started out by chasing bugs and ended up with a sum total of negative zero.

time to read 5 min | 946 words

This story started a few years ago, in a very non technical setting. We changed the accountant that we use for Hibernating Rhinos. We had outgrown the office we were using at the time and needed better services. Among the changes that were implemented as a result of this move was the use of new accounting software. Nothing really that interesting, to be frank. I like my accounting boring. However, the new accounting software was an on-premise solution. In other words, we are the ones running it. Which is perfectly fine; we provisioned a VM in our data center (a fancy name for the single rack that we had at the time) and let it run.

As you can imagine, we consider our accounting data to be mission critical, so to speak. I don’t mind not being able to access it for an hour, for example, but losing it is going to be Bad. So we had a backup, nothing really that interesting. We have a backup that goes to a local disk on the VM, a remote disk in the office and, just to be safe, an upload of the backup to S3. I asked one of our developers to take care of this, and aside from specifying that I want backups in triplicate, I didn’t really pay attention. That was around 2017, I believe. I made sure that if the backup failed, we would get notified of that, and that was pretty much it.

One of the reasons that I like my accounting boring is that it simplifies my life and reduces stress. Unfortunately, it seems like my accounting practices have a cost. In particular, it means that I favor paying a bit too much to the taxman. That means that all of the taxes go out immediately, and the company doesn’t end the year with a large tax bill that we need to cover. But I overdid it a time or two, and we overpaid on our taxes. Well, that was by design, extra money showing up from the taxman is much better than a surprise bill. But at a certain point, we were supposed to get a refund for a non trivial amount. At which point the tax authorities came a-calling and audited us.

Remember that I talked about boring accounting practices. The day we started the audit, I was having dinner with my wife, and being audited was the third topic of the day, if I recall properly. They found a few things that we did wrong (we registered an invoice for the wrong currency, so we cancelled it and issued a new one, instead of refunding it and issuing a new one). That was a Thing, it seemed. But the end result was pretty much nothing. I loved it. Since then, we have been audited a few more times, always with no repercussions.

Given that the next audit is a question of when (usually every 18 – 30 months or so, it seems), not if, I really care about my accounting data. Hence the triple backups policy. You might have been going through this post expecting to hear that we lost the accounting data, that the backup failed, and that my accounting outlook is now decidedly not boring. I’m afraid that this is only half true. We did have a failed backup, but we caught it before we actually needed it.

At one point, I looked at our backup policies, and I noticed that the accounting backup was months old at this point. That was concerning, I gotta say. Here is the timeline, as best I could piece it together:

  • Q2 2017 – Backup process is defined and tested. This is a one off process that we use only for the accounting database.
  • Q1 2018 – Routine key rotation is performed on some of our keys. Unbeknownst to us, the backup process loses the ability to report failure. But given that it doesn’t fail, no one notices.
  • Q4 2018 – The developer responsible for setting up the backup process leaves the company. As part of the outgoing employee process, we shut down the relevant user accounts.
  • Q1 2019 – The accounting server is rebooted. The backup process fails to start, because the user account is disabled.

You might notice the scale of this issue. The underlying problem was that the developer set up this one off process as a… well, one off process. That meant that it wasn’t hooked into any of our usual monitoring / alert systems. It did have a way to report on errors, but the credentials on that went stale after a year. No one paid attention, since the backups continued to run.

The backup process was also running under the user account of the developer, not a service account. I guess it was easier than creating a user, but the end result was that when we deactivated the user account after the developer left the company, we also disabled the backup. But the process was running, and it continued to run for months. Only much later would the process fail to start, and by then there was no way to report errors, and we noticed it only because we looked for it during routine operations.

One of the reasons we built backups directly into the core of RavenDB was exactly this sort of situation. A backup process is not something that you cobble together (that’s on us, to be fair); it is something that should be part and parcel of the operations of your database, and being able to do something like get backups in triplicate is essential for a good operations experience.

time to read 3 min | 414 words

I was writing code, in the zone, slinging features around and in general having a great time. I was able to create the structure that I wanted, and things worked. So I started to do another pass on the code, to make sure that we got the error handling right.

I ended up writing code similar to the following:
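
(The original snippet isn’t reproduced here; this is a hedged stand-in that captures its shape. The dino API is fictional in the post to begin with, and all of the names below are made up.)

```csharp
using System;
using System.Collections.Generic;

class DinoSpawnException : Exception { }

class Park
{
    private readonly HashSet<string> _names = new();

    public void SpawnDino(string name, string size)
    {
        if (size == "huge")
            throw new DinoSpawnException(); // pretend this option never has enough resources

        if (!_names.Add(name))
            throw new InvalidOperationException($"Cannot change the horn VIN: '{name}'");
    }
}

class Program
{
    static void Main()
    {
        var park = new Park();
        foreach (var size in new[] { "huge", "large", "medium" })
        {
            try
            {
                park.SpawnDino("rexy", size);
                // Bug: a successful spawn should 'break;' out of the loop here.
                // Without it, the next iteration creates another dino with the
                // same name, which surfaces as a confusing low-level error.
            }
            catch (DinoSpawnException)
            {
                // expected: fall through and try the next option
            }
        }
    }
}
```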

As you can see, spawning of dinos can fail. Maybe there aren’t enough resources, maybe I have reached the limits of this particular type of dino. Regardless of what exactly was failing, we would try a bunch of options before giving up. That way, we can be sure that the usual suspects have already been accounted for.

This code failed with a strange error: “Cannot change the horn VIN: ‘horn-18238a8as81’”. 

I’m not setting the horn VIN anywhere, I’m using the high level API to spawn dinos, so why am I seeing things in this manner? I debugged this, but I couldn’t figure out what was going on. I ran the code in a separate project, and it worked. I ruled out web app vs. console app, I decompiled the code and looked at things, I sniffed the network and looked into what was going on. I was completely lost.

At that point, I called another dev over to look at things and tried to explain what was going on. As I was going on and on about all the things that I had tried and how much I hate the Jurassic period and how the Triassic has a much better API, he gently coughed and asked me if I wasn’t missing a break there.

I actually looked at the code and wanted to cry. My code did the exact thing that I asked it to. If I was able to successfully spawn a dino, I would go ahead and create another one, with the same name. It looks like deep in the guts of the system, there were assignments of ids to the various pieces, and trying to create an instance with the same name caused this error.

Cue head on keyboard a few times, a single statement added and everything worked.

I literally spent a few hours on this thing, which is really embarrassing. I would probably spend a lot more if I didn’t have a fresh set of eyes come and point it out.

time to read 4 min | 602 words

I spoke about this in the video, and it seems to have caught a lot of people’s attention, so I thought that I would talk here a bit about how we traced the root cause of a RavenDB critical issue to a printer being out of paper.

What is the relationship, I can hear you ask, between RavenDB, a document database, and a printer being out of paper? That is a good question. The answer is pretty much none. There is no DocPrint module inside of RavenDB and the last time yours truly wrote printer code was over fifteen years ago. But the story started, like all good tech stories, with a phone call. An inconveniently timed phone call, I might add.

Imagine, the year is 2013, and I’m enjoying the best part of the year, December. I love December for a few reasons. I was born in it, and it is a time where pretty much all our customers are busy doing other stuff and we can focus on pure development. So imagine my surprise when, on the other side of the line, I got a pretty upset admin that had to troubleshoot a RavenDB instance that would refuse to start. To make things interesting, this admin had drawn the short straw, I assume, and had to man the IT department while everyone else was on vacation. He had no relation to RavenDB during normal operations and only had the bare minimum information about it.

The symptoms were clear, luckily. RavenDB would simply refuse to start, hanging almost immediately, it seemed. No network access, nothing in the logs to indicate a problem. Just… hanging…

Here is the stack trace that we captured:

And that was… it. We started to look into it, and we ran into this StackOverflow question, which was awesome. Indeed, restarting the print spooler would fix the problem and RavenDB would immediately start running. But I couldn’t leave it like this, and I guess the admin on the other side was kind of bored, because he went along with my investigation.

We now have access to the code (we didn’t in 2013) and can look at exactly what is going on. This comment had me… upset:

[Image: the comment in the performance counter code]

The service manager in Windows will consider any service that didn’t finish initialization within 30 seconds to have failed, and will kill it. You might be putting things together at this point.

After restarting the print spooler, we were able to start RavenDB, but restarting it caused another failure. Eventually we tracked it down to a network printer that was out of paper (presumably there was no one in the office to notice / fill it up). My assumption is that the print spooler was holding a registry key hostage while making a network call to the printer, which would time out because it didn’t have any paper. If at that time you attempted to use a performance counter, you would hang, and if you were running as a service, you would be killed.

I’m left with nothing to do but quote Leslie Lamport:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Oh so true.

We ended up ditching performance counters entirely after running into so many issues around them. As of 2017, people were still running into this issue, so I think that was a great decision.

time to read 4 min | 704 words

One of the changes that we made in RavenDB 4.2 is a pretty radical one, even if we didn’t really talk about it. It is the fact that RavenDB now contains C code. Previously, we were completely managed (with a bunch of P/Invoke calls). But now we have some C code in the project. The question is why?

The answer isn’t actually about performance or the power of native C code. We are usually pretty happy with the kind of assembly instructions that we can get from C#. The actual problem was that we needed a proper abstraction. At this moment, RavenDB is running on the following platforms:

  • Windows x86-32 bits
  • Windows x86-64 bits
  • Linux x86-32 bits
  • Linux x86-64 bits
  • Linux ARM 32 bits
  • Linux ARM 64 bits
  • macOS 64 bits

And each of these platforms requires some changes in how we do things. The other problem is that .NET is a well specified system; all the type sizes are well known, for example. The same isn’t true for the underlying API. Windows does a really good job of maintaining proper APIs across versions and 32/64-bit editions. Linux basically doesn’t seem to care. Type sizes change quite often, sometimes in unpredictable ways.

Probably the most fun part was figuring out that on x86, Linux syscall #224 is gettid(), but on ARM, it maps to gettime(). If you are using C, all of that is masked for you, but handling it from C# got quite unwieldy. So we decided to create a PAL (platform abstraction layer) in C to handle these details.

The rules for the PAL are simple, we don’t make assumptions about types, sizes or underlying platform. For example, let’s take a look at some declarations.

[Image: the C declarations from the PAL]

All the types are explicit about their size, and where we need to pass a complex type (SYSTEM_INFORMATION), we define it ourselves rather than rely on any system type. And here are the P/Invoke definitions for these calls. Again, we are being explicit with the types, even though in C# the sizes of types are fixed.

[Image: the matching P/Invoke definitions]
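
A hedged illustration of the style described (hypothetical names and fields, not the actual librvnpal API):

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct SystemInformation
{
    public int PageSize;    // explicit, fixed-size fields on both sides of the boundary
    public int CanPrefetch;
}

public static class Pal
{
    [DllImport("librvnpal", CallingConvention = CallingConvention.Cdecl)]
    public static extern int rvn_get_system_information(
        out SystemInformation info,
        out int errorCode); // the OS error code, filled in on failure
}
```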

You might have noticed that I care about error handling. And error handling in C is… poor. We use the following convention in this kind of operation:

  • Each method does a single logical thing.
  • Each method returns either success or a flag indicating the internal step in which it failed.
  • On failure, the method also returns the system error code for the failure.

The idea is that on the calling side, we can reconstruct exactly where and why we failed and still produce good errors.
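
On the C# side, that convention can be turned into a readable error along these lines (a sketch with hypothetical names):

```csharp
using System;
using System.ComponentModel;

public enum PalFailure
{
    None = 0,
    OpenFile = 1,
    ReadSystemInformation = 2,
    AllocateMemory = 3,
}

public static class PalErrors
{
    public static void ThrowOnFailure(PalFailure step, int osErrorCode)
    {
        if (step == PalFailure.None)
            return;

        // Win32Exception turns the OS error code into a human readable message.
        throw new InvalidOperationException(
            $"PAL call failed at step '{step}'",
            new Win32Exception(osErrorCode));
    }
}
```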

Yesterday I ran into an issue where we hadn’t moved some code to the PAL, and we actually had a bug there. When running on an ARM32 machine, we would pass a C# struct to a syscall, but we had defined that struct based on the values in 64-bit Linux. When called on a 32-bit system, the values went to the wrong locations. Luckily, this was a call whose result was never acted on. It is used by our dashboard to let the admin know how much disk space is available, but there is no code that actually takes action based on this information.

Which was great, because when we actually run the code, we got this value in the Studio:

[Image: the free disk space value shown in the Studio]

When I dug deeper into the code, it gave really bad results. My Raspberry Pi thought it had 700 PB of disk space free. The actual reason we got this funny error? We send the number of bytes to the client, and a JavaScript number can only represent integers exactly up to 2^53, which works out to roughly 8 PB of free space in the browser.

I moved the code from C# P/Invoke to a simple method to calculate this:

[Image: the disk space calculation, moved into the PAL]

Implementing this for all platforms means that we have a much nicer interface and our C# code is abstracted from the gory details on how we actually compute this.
