Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

, @ Q j

Posts: 6,880 | Comments: 49,255

filter by tags archive
time to read 2 min | 272 words

We have a bunch of smart people whose job description does not include breaking RavenDB, but nevertheless they manage to do so on a regular basis. This is usually done while subjecting RavenDB to new and interesting ways to distress it. The latest issue was being able to crash the system with an impossible bug. That was interesting to go through, so it is worth a post.

Take a look at the following code:

This code will your application (you can discard multi threading as a source, by the way) in a particularly surprising way.

Let’s assume that we have a system that is running under low memory conditions. The following sequence of events is going to occur:

  • BufferedChannel is allocated.
  • The BufferedChannel constructor is being run.
  • The attempt to allocate the _destinations list fails.
  • An exception is thrown upward and handled.

So far, so good, right? Except that now your system is living on borrowed time.

You see, this class has a finalizer. And the fact that the constructor wasn’t called doesn’t matter to the finalizer. It will still call the Finalize method, but that method calls to Dispose, and Dispose expect to get a valid object. At this point, we are going to get a Null Reference Exception from the finalizer, which is considered to be a critical error and fail the entire process.

For additional fun, this kind of failure will happen only in under harsh conditions. When the OS refuses allocations, which doesn’t happen very often. We have been running memory starvation routines for a while, and it takes a specific set of failures to get it failing in just the right manner to cause this.

time to read 1 min | 190 words

We run into a hard failure during some diagnostics runs that was quite interesting. Here it the relevant code:

The code was meant to serve as a repository of commands so we can see in what order they run into the system. (Yes, we could use a queue, instead, but the code above isn’t what we actually run, it is just enough to reproduce the problem I wanted to talk about). And it failed, because there was already a value with the same key.

That was strange, because we didn’t expect that to happen, obviously. We wrote the following code:

And we got some really interesting results. On my laptop, I get:

Changes: 9388, Idle: 5442

We got widely variant results. Mostly, I believe, it related to the system configuration and the CPU frequency. But the basic problem is that on that particular machine, we were fast enough to be able to run commands faster than the machine could count Smile.

That is a good problem to have, I think. We fixed it by making sure our time source is monotonically increasing, which is good enough for debug code.

time to read 3 min | 564 words

We got a bug report recently about RavenDB messing up a query. The query in question was:

from index TestIndex where Amount < 0

The query, for what it is worth, is meant to find accounts that have pre-paid. The problem was that this query returned bad results. This was something very strange. Comparisons of numbers is a very well trodden field in RavenDB.

How can this be? Let’s look a bit deeper into things, shall we? Luckily the reproduction is pretty simple, let’s take a look:

With an index like this, we can be fairly certain that the query above will return no results, right? After all, this isn’t any kind of complex math. But the query will return results for this index. And that is strange.

You might have noticed that the numbers we use are decimals ( 1.5m is the C# postfix for decimal constant ). The problem repeats with decimal, double and float. When we started looking at this, we assumed that the problem was with floating point precision. After all, any comparison of floating point values will warn you about this. But using decimal, we are supposed to be protected from this.

When faced with such an issue, we dove deep into how RavenDB does numeric range comparisons. Sadly, this is not a trivial matter, but after reviewing everything, we were certain that the code is correct, it had to be something else.

I finally ended up with the following code:

And that gave me some interesting output:

00000000 - 00000000 - 00000000 - 00000000
00000000 - 00000000 - 00000000 - 80010000

0.0 == 0 ? True

And that was quite surprising.We can get the same using double of floats:

And this code gives us:

00-00-00-00-00-00-00-00
00-00-00-00-00-00-00-80
00-00-00-00
00-00-00-80

What is going on here?

Well, let’s look at the format of floating point numbers for a sec, shall we? There is a great discussion of this here, but the key observation for this particular issue can be seen by looking at the binary representation.

Here is 2.0:

 image

And is here -2.0:

image

As usual, the first bit is the sign marker. But unlike int32 or int64, with floating point, it is absolutely legal to have the following byte patterns:

image

image

The first one here is zero, and the second one is negative zero. They are equal to one another, but the binary representation is different. And that caused our issue.

Part of how RavenDB does range numeric queries on floating point it to translate them to a lexical value that can be compared using memcmp(), there is some magic involved, but the key observation was that we didn’t account for negative zero in the problem. The negative zero was translated to –9,223,372,036,854,775,808 (long.MinValue) and that obviously is smaller than zero.

The fix in our code was to handle this explicitly, like so:

And that is how I started out by chasing bugs and ended up with a sum total of negative zero.

time to read 5 min | 946 words

imageThis story started a few years ago, in a very non technical setting. We changed the accountant that we use for Hibernating Rhinos. We outgrew the office we were using at the time and needed better services. Among the changes that was implemented as a result of this move was the usage of new accounting software. Nothing really that interesting, to be frank. I like that my accounting is boring. However, the new accounting software was an on-premise solution. In other words, we are the one running it. Which is perfectly fine, we provisioned a VM in our data center (a fancy name to the single rack that we had at the time) and let it run.

As you can imagine, we consider our accounting data to be mission critical, so to speak. I don’t mind not being able to access it for an hour, for example, but losing it is going to be Bad. So we had a backup, nothing really that interesting. We have a backup that goes to local disk on the VM, remote disk in the office and just to be safe, uploading the backup to S3. I asked one of our developers to take care of this, and aside from specifying that I want backups in triplicates, I didn’t really pay attention. That was around 2017, I believe.  I made sure that if the backup failed, we would get notified of that, and that was pretty much it.

One of the reasons that I like my accounting boring is that it simplify my life and reduces stress. Unfortunately, it seems like my accounting practices has a cost. In particular, it means that I favor paying a bit too much to the taxman. That means that all of the taxes are going out immediately, and the company doesn’t end up the year with a large tax bill that we need to cover. But I overdid it a time or two, and we overpaid on our taxes. Well, that was by design, extra money showing up from the taxman is much better than a surprise bill. But at certain point, we were supposed to get a refund for a non trivial amount. At which point the tax authorities came a-calling and audited us.

Remember that I talked about boring accounting practices. The day we started the audit, I was having dinner with my wide and being audited was the third topic of the day, if I recall properly. They found a few things that we did wrong (we registered an invoice for the wrong currency, so we cancelled it and issued a new one, instead of refunding it and issuing a new one). That was a Thing, it seemed. But the end result was pretty much nothing. I loved it. Since then, we were audited a few more times, always with no repercussions.

Given that the next audit is a question of when (usually every 18 – 30 months or so, it seems), not if. I really care about my accounting data. Hence the triple backups policy. You might have been going through this post expecting to hear that we lost the accounting data, and the backup failed, and now my accountant outlook is decidedly not boring. I’m afraid that this is only half true. We did have a failed backup, but we caught it before we actually needed it.

At one point, I looked at out backup policies, and I noticed that the accounting backup was months old at this point. That was concerning, I gotta say. Here is the timeline, as I could piece is together:

  • Q2 2017 – Backup process is defined and tested. This is a one off process that we use only for the accounting database.
  • Q1 2018 – Routine key rotation is performed on some of our keys. Unbeknownst to us, the backup process lose the ability to report failure. But given that it doesn’t fail, no one notices.
  • Q4 2018 – The developer responsible for setting up the backup process leave the company. As part of the outgoing employee process, we shut down relevant user accounts.
  • Q1 2019 – The accounting server is rebooted. The backup process fails to start, because the user account is disabled.

You might notice the scale of this issue. The underlying problem was that the developer setup this one off process as a… well, one off process. That meant that it wasn’t hooked to any of our usual monitoring / alert systems. It did have a way to report on errors, but the credentials on that went stale after a year. No one paid attention, since the backups continued to run.

The backup process was also running under the user account of the developer, not a service account. I guess it was easier than creating a user, but the end result was that when we deactivated the user account after the developer left the company, we also disabled the backup. But the process was running, and it continued to run for months.  Only much later will the process fail to start, and by then there was no way to report errors, and we noticed it only because we looked for that during routine operations.

One of the reasons we had built backups directly into the core of RavenDB was exactly this sort of situations. A backup process is not something that you cobble together (that’s on us, to be fair), it is something that should be part and parcel of the operations of your database, and being able to do something like get backups in triplicate is essential for good operations experience.

time to read 3 min | 414 words

imageI was writing code, in the zone, slinging features around and in general having a great time. I was able to create the structure that I wanted, and things worked. So I started to do another pass on the code, to make sure that we go the error handling right.

I ended up writing code similar to the following:

As you can see, spawning of dinos can fail. Maybe there aren’t enough resources, maybe I have reached the limits of this particular type of dino. Regardless of what exactly was failing, we would try a bunch of options before giving up. That way, we can be sure that the usual suspects has already been accounted for.

This code failed with a strange error: “Cannot change the horn VIN: ‘horn-18238a8as81’”. 

I’m not setting the horn VIN anywhere, I’m using the high level API to spawn dinos, why am I seeing things in this manner. I debugged this, but I couldn’t figure out what was going on. I run the code on a separate project, and it worked. I rules out web app vs. console app, I decompiled the code and looked at things, I sniffed the network and looked into what was going on. I was completely lost.

At that point, I called another dev over to look at things and tried to explain what was going on. As I was going on and on about all the things that I tried and how much I hate the Jurassic period and how the Triassic has a much better API, he gently coughed and asked me if I’m not missing a break here.

I actually looked at the code and wanted to cry. My code did the exact thing that I asked it to. If I was able to successfully spawn a dino, I would go ahead and create another one, with the same name. It looks like deep in the guts of the system, there were assignments of ids to the various pieces, and trying to create an instance with the same name caused this error.

Cue head on keyboard a few times, a single statement added and everything worked.

I literally spent a few hours on this thing, which is really embarrassing. I would probably spend a lot more if I didn’t have a fresh set of eyes come and point it out.

time to read 4 min | 602 words

imageI spoke about this in the video, and it seems to have caught a lot of people yes, so I thought that I would talk here a bit how we trace the root cause of a RavenDB critical issue to a printer being out of paper.

What is the relationship, I can hear you ask, between RavenDB, a document database, and a printer being out of paper? That is a good question. The answer is pretty much none. There is no DocPrint module inside of RavenDB and the last time yours truly wrote printer code was over fifteen years ago. But the story started, like all good tech stories, with a phone call. An inconveniently timed phone call, I might add.

Imagine, the year is 2013, and I’m enjoying the best part of the year, December. I love December for a few reasons. I was born in it, and it is a time where pretty much all our customers are busy doing other stuff and we can focus on pure development. So imagine my surprise when, on the other side of the line, I got a pretty upset admin that had to troubleshoot a RavenDB instance that would refuse to start. To make things interesting, this admin had drawn the short straw, I assume, and had to man the IT department while everyone else was on vacation. He had no relation to RavenDB during normal operations and only had the bare minimum information about it.

The symptoms were clear, luckily. RavenDB would simply refuse to start, hanging almost immediately, it seemed. No network access, nothing in the logs to indicate a problem. Just… hanging…

Here is the stack trace that we captured:

And that was… it. We started to look into it, and we run into this StackOverflow question, which was awesome. Indeed, restarting the print spooler would fix the problem and RavenDB would immediately start running. But I couldn’t leave it like this, and I guess the admin on the other side was kinda bored, because he went along with my investigation.

We now have access to the code (we didn’t in 2013) and can look at exactly what is going on. This comment had me… upset:

image

The service manager in Windows will consider any service that didn’t finish initialization in 30 seconds to be failed and kill it.  You might be putting the things together at this point?

After restarting the print spooler, we were able to start RavenDB, but restarting it cause another failure. Eventually we tracked it down to a network printer that was out of paper (presumably no one in the office to notice / fill it up). My assumption is that the print spooler was holding a register key hostage while making a network call to the printer that would time out because it didn’t have any paper. If at that time you would attempt to use a performance counter, you would hang, and if you are running as service, would be killed.

I’m left with nothing to do but quote Leslie Lamport:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Oh so true.

We ended up ditching performance counters entirely after running into so many issues around them. As of 2017, people were still running into this issue, so I think that was a great decision.

time to read 4 min | 704 words

One of the changes that we made in RavenDB 4.2 is a pretty radical one, even if we didn’t really talk about it. It is the fact that RavenDB now contains C code. Previously, we were completely managed (with a bunch of P/Invoke) calls. But we now we have some C code in the project. The question is why?

The answer isn’t actually about performance or the power of native C code. We are usually pretty happy with the kind of assembly instructions that we can get from C#. The actual problem was that we needed a proper abstraction. At this moment, RavenDB is running on the following platforms:

  • Windows x86-32 bits
  • Windows x86-64 bits
  • Linux x86-32 bits
  • Linux x86-64 bits
  • Linux ARM 32 bits
  • Linux ARM 64 bits
  • macOS 64 bits

And each of this platform requires some changes in how we do things. The other problem is that .NET is a well specified system, all the types sizes are well known, for example. The same isn’t true for the underlying API. Windows does a really good job of maintaining proper APIs across versions and 32/64 editions. Linux basically doesn’t seem to care. Types sizes change quite often, sometimes in unpredictable ways.

Probably the most fun part was figuring out that on x86, Linux syscall #224 is gettid(), but on ARM, you’ll call to gettime(). The problem is that if you are using C, all of that is masked for you. And it got quite unwieldly. So we decided to create a PAL (platform abstraction layer) in C to handle these details.

The rules for the PAL are simple, we don’t make assumptions about types, sizes or underlying platform. For example, let’s take a look at some declarations.

image

All the types are explicit about their size, and where we need to pass a complex type (SYSTEM_INFORMATION) we define it ourselves, rather than rely on any system type. And here are the P/Invoke definitions for these calls. Again, we are being explicit with the types event though in C# the size of types are fixed.

image

You might have noticed that I care about error handling. And error handling in C is… poor. We use the following convention in these kind of operations:

  • Each method does a single logical thing.
  • Each method return either success or flag indication the internal step in which it fail.
  • On failure, the method will return the system error code for the failure.

The idea is that on the calling side, we can construct exactly where and why we failed and still get good errors.

Yesterday I run into an issue where we didn’t move some code to the PAL, and we actually had a bug there. The problem was that when running on ARM32 machine, we would pass a C# struct to a syscall. The problem was that we defined that struct based on the values in 64 bits Linux. When called on 32 bits system, the values went to the wrong location. Luckily, this was a call that was never used. It is used by our dashboard to let the admin know how much disk space is available. There is no code that actually take action based on this information.

Which was great, because when we actually run the code, we got this value in the Studio:

image

When I dug deeper into the code, it gave really bad results. My Raspberry PI thought it had 700 PB of disk space free. The actual reason we got this funny error? We send the number of bytes to the client, and under these conditions, we can only support up to about 8 PB of free space in the browser.

I moved the code from C# P/Invoke to a simple method to calculate this:

image

Implementing this for all platforms means that we have a much nicer interface and our C# code is abstracted from the gory details on how we actually compute this.

time to read 1 min | 142 words

Today my team tracked down, hunted and summarily executed a bug. It was a nasty one, because under certain (pretty rare) error conditions, we would get into pretty nasty state and not recover from it.

The problem was that the code to handle this and handle the state properly was there, and on when we tested it locally, worked just fine. But on production, it didn’t work. Unless we tried to figure out why it wasn’t working, in which case it did work.

As you can imagine, this can be a pretty frustrating state of affair. Eventually we tracked it down to the following problem. Here is the relevant code. And the problem is that the exception aren’t thrown (and therefor not handled). Can you see the issue?

image

time to read 3 min | 568 words

Our test process occasionally crashed with an access violation exception. We consider these to be Priority 0 bugs, so we had one of the most experience developers in the office sit on this problem.

Access violation errors are nasty, because they give you very little information about what is going on, and there is typically no real way to recover from them. We have a process to deal with them, though. We know how to setup things so we’ll get a memory dump on error, so the very first thing that we work toward is to reproduce this error.

After a fair bit of effort, we managed to get to a point where we can semi-reliably reproduce this error. This means, if you wanna know, that do “stuff” and get the error in under 15 minutes. That’s the reason we need the best people on those kind of investigations. Actually getting to the point where this fails is the most complex part of the process.

The goal here is to get to two really important pieces of information:

  • A few memory dumps of the actual crash – these are important to be able to figure out what is going on.
  • A way to actually generate the crash – in a reasonable time frame, mostly because we need to verify that we actually fixed the issue.

After a bunch of work, we were able to look at the dump file and found that the crash originated from Voron’s code. The developer in charge then faulted, because they  tried to increase the priority of an issue with Priority 0 already, and P-2147483648 didn’t quite work out.

We also figured out that this can only occur on 32 bits, which is really interesting. 32 bits is a constrained address space, so it is a great way to run into memory bugs.

We started to look even more closely at this. The problem happened while running memcpy(), and looking at the addresses that were passed to the function, one of them was Voron allocated memory, whose state was just fine. The second value pointed to a MEM_RESERVE portion of memory, which didn’t make sense at all.

Up the call stack we went, to try to figure out what we were doing. Here is where we ended up in (hint: The crash happened deep inside the Insert() call).

image

This is test code, mind you, exercising some really obscure part of Voron’s storage behavior. And once we actually looked at the code, it was obvious what the problem was.

We were capturing the addresses of an array in memory, using the fixed statement.

But then we used them outside the fixed. If there happened to be a GC between these two lines, and if it happened to move the memory and free the segment, we would access memory that is no longer valid. This would result in an access violation, naturally. I think we were only able to reproduce this in 32 bits because of the tiny address space. In 64 bits, there is a lot less pressure to move the memory, so it remains valid.

Luckily, this is an error only in our tests, so we reduce our DEFCON level to more reasonable value. The fix was trivial (move the Insert calls to the fixed scope), and we were able to test that this fixed the issue.

Bug of the day

time to read 1 min | 108 words

This one was in an experimental feature, as part of the extra testing process of making this into a stable feature, we run into a bug.

Here is the bug fix:

image

The code is meant to detect changes in a distributed environment and we checked the wrong location, meaning that we never actually did the check. In 99.999% of the time, this already happened, but it exposed us to some nasty race conditions.

This particular piece of code has been the subject to multiple code reviews, all of which never noticed the issue.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. re (22):
    19 Aug 2019 - The Order of the JSON, AKA–irresponsible assumptions and blind spots
  2. Design exercise (6):
    01 Aug 2019 - Complex data aggregation with RavenDB
  3. Reviewing mimalloc (2):
    22 Jul 2019 - Part II
  4. Production postmortem (26):
    07 Jun 2019 - Printer out of paper and the RavenDB hang
  5. Reviewing Sled (3):
    23 Apr 2019 - Part III
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats