Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Subtle corruption and the debugger

time to read 1 min | 172 words

We had a bug. If a certain method was called, we would do something very bad. Here is the fix for this issue:

image

Basically, we assumed that the passed pointer was a char pointer and not a UTF8 byte pointer. That led to a horrible mess down the line, including the fact that the length that was passed to the constructor was twice the size of the actual allocated memory.

In rare cases, that would be enough to push us to the next page of memory. If that page of memory wasn’t mapped, we would die with an access violation.
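To make the mistake concrete, here is a minimal sketch, assuming a hypothetical LazyString type; this is not the actual RavenDB code, just the shape of the bug:

    // Hypothetical type wrapping a UTF8 buffer; the real code differs.
    public unsafe struct LazyString
    {
        private readonly byte* _buffer; // UTF8 bytes
        private readonly int _size;     // size in bytes

        public LazyString(byte* buffer, int size)
        {
            _buffer = buffer;
            _size = size;
        }

        // The kind of method a debugger calls implicitly to render a friendly value.
        public override string ToString()
        {
            // BUG: treating the UTF8 byte pointer as a char* and the byte count as a
            // char count reads 2 * _size bytes, running past the allocation and
            // potentially onto an unmapped page.
            return new string((char*)_buffer, 0, _size);

            // Fix: decode the UTF8 bytes using the correct byte length instead:
            // return System.Text.Encoding.UTF8.GetString(_buffer, _size);
        }
    }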

There is just one problem with this scenario. We never called this method in our codebase. This method was implicitly called by the debugger to show nice strings. Which meant that during debugging, sometimes, we would corrupt our own state and end up killing ourselves. For fun, this is the kind of error that literally cannot be debugged.

When the code flat out lies to you

time to read 1 min | 154 words

Take a look at the following snippet. This is a table definition using Esent.

image

At a glance, this looks fine. It defines a table with an auto incrementing primary key whose type is long. Everything seems fine and dandy.

Unfortunately, there is a horrible bug in this code. The Long type here refers to the Win32 long type in C, which is… 32 bits. When I wrote this code, I assumed that this was C#’s long type, which is 64 bits in size.
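To illustrate the trap, here is a hedged sketch using the ManagedEsent wrapper; the column name and helper method are made up, and the real definition is the one shown in the image above:

    using Microsoft.Isam.Esent.Interop;

    static class TableSchema
    {
        public static void AddIdColumn(JET_SESID session, JET_TABLEID table)
        {
            var idColumn = new JET_COLUMNDEF
            {
                // JET_coltyp.Long is the Win32 long: a 32-bit signed integer,
                // so the auto incrementing key tops out at ~2.1 billion rows.
                coltyp = JET_coltyp.Long,
                grbit = ColumndefGrbit.ColumnAutoincrement
            };

            // What was intended: a 64-bit key. In Esent that is JET_coltyp.Currency,
            // an 8 byte integer that maps to C#'s long.
            // idColumn.coltyp = JET_coltyp.Currency;

            Api.JetAddColumn(session, table, "id", idColumn, null, 0, out JET_COLUMNID _);
        }
    }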

The difference matters, because after just over 2 billion inserts to this table, everything is going to break in a horrible horrible horrible manner. At which point the only fix is to run a schema upgrade on the internal data structure, which is… an interesting experience when you have literally billions of records there.

I want to see the QA process that catches this bug!

time to read 2 min | 344 words

When we get bug reports from the field, we routinely also do a small assessment to figure out why we missed the issue in our own internal tests and on the way to production.

We just got a bug report like that. RavenDB is not usable at all on a Raspberry Pi because of an error about Non ASCII usage.

This is strange. To start with, we test on the Raspberry Pi. To be rather more exact, we test on the same hardware and software combination that the user was running on. And what is this Non ASCII stuff? We don’t have any such thing in our code.

As we investigated, we figured out that the root cause was that we were trying to pass a Non ASCII value to the headers of the request. That didn’t make sense; the only things we write to the request in this case are well defined values, such as numbers and constant strings, all of which should be in ASCII. What was going on?

After a while, the mystery cleared. In order to reproduce this bug, you needed to have the following preconditions:

  • A file hashed to a negative Int64 value.
  • A system whose culture settings were set to sv-SE (Swedish).
  • Run on Linux.

This is detailed in this issue. On Linux (and not on Windows), when using the Swedish culture, negative numbers are formatted as “−1” and not “-1”.

For those of you with sharp eyes, you noticed that this is U+2212 (minus sign), and not U+002D (hyphen-minus). On Linux, for Unicode knows what reason, this is used as the negative sign. I would complain, but my native language has „.

Anyway, the fix was to force the use of the invariant culture when converting the Int64 to a string for the header, which is pretty obvious. We are also exploring how to fix this in a more global manner.
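Here is a rough sketch of the difference; the variable names are illustrative and this is not our actual request-writing code:

    using System;
    using System.Globalization;
    using System.Threading;

    class HeaderFormatting
    {
        static void Main()
        {
            long hash = -1234567890; // e.g. a file hash that came out negative

            Thread.CurrentThread.CurrentCulture = new CultureInfo("sv-SE");

            // Culture sensitive: on Linux with sv-SE this can yield "−1234567890"
            // (U+2212), which is not ASCII and blows up when written to a header.
            string bad = hash.ToString();

            // Culture invariant: always "-1234567890" (U+002D, hyphen-minus).
            string good = hash.ToString(CultureInfo.InvariantCulture);

            Console.WriteLine($"{bad} vs {good}");
        }
    }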

But I keep coming back to the set of preconditions that is required. Sometimes I wonder why we missed a bug; in this case, I can only say that I would have been surprised if we had found it.

RavenDB Security Vulnerability Advisory

time to read 3 min | 533 words

You can read the full details here. The short of it is that we discovered a security vulnerability in RavenDB. This post tells a story. For actionable operations, see the previous link and upgrade your RavenDB instance to a build that includes the fix.

Timeline:

  • June 6 – A routine code review inside RavenDB exposes a potential flaw in sanitizing external input. It is escalated and confirmed to be a security bug. Further investigation classifies it as a CRITICAL issue. A lot of sad faces show up on our Slack channels. The issue has the trifecta of security problems:
    • It is remotely exploitable.
    • It is enabled in the default configuration.
    • It provides privilege escalation (and hence, remote code execution).
  • June 6 – A fix is implemented. This is somewhat complicated by the fact that we don’t want it to look like a security fix, to avoid drawing attention to the issue before users have a chance to upgrade.
  • June 7 – The fix goes through triple code review by independent teams.
  • June 7 – An ad hoc team goes through all related functionality to see if similar issues are still present.
  • June 8 – Fixed version is deployed to our production environment.

We had to make a choice here: whether to alert all users immediately, or to first provide the fix and urge them to upgrade (while leaving them open to attacks in the meanwhile). We also wanted to avoid the fix, re-fix, for-real-this-time cycle that comes from rushing too much.

As this was discovered internally and there were no indications that this was known and/or exploited in the wild, we chose the more conservative approach and ran our full “pre release” cycle, including a full 72–96 hours in a production environment serving live traffic.

  • June 12 – The fix is now available in a publicly released version (4.0.5).
  • June 13 – Begin notification of customers. This was done by:
    • Emailing all RavenDB 4.0 users. One of the reasons that we ask for registration even for the free community edition is exactly this. We want to be able to notify users when such an event occurs.
    • Publishing security notice on our website.
    • Pushing a notification to all vulnerable RavenDB nodes warning about this issue. Here is what this looks like:
      image
  • Since June 13 – Monitoring of deployed versions and checking for vulnerable builds still in use.
  • June 18 – This blog post and a public notice in the mailing list to get more awareness of this issue. The website will also contain the following notice for the next couple of weeks to make sure that everyone knows that they should upgrade:
    image

We are also going to implement a better method to push urgent notices like that in the future, to make sure that we can better alert users. We have also inspected the same areas of the code in earlier versions and verified that this is a new issue and not something that impacts older versions.

I would be happy to hear what more we can do to improve both our security and our security practices.

And yes, I’ll discuss the actual vulnerability in detail in a month or so.

The case of the missing writes in Docker (a Data Corruption story)

time to read 6 min | 1017 words

image

We started to get reports from users running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event. You can see what this looks like on the right. As you can see, this ain’t a happy camper. In fact, this is a pretty scary one. The kind you see in movies that air on Friday the 13th.

The really strange part was that this is one of those errors that really should never be possible. RavenDB has a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touches the core of what we do: keeping data safe.

The really crazy part was that we couldn’t find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin up a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we had fallen down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. In a Linux / Linux or Windows / Windows setup, that is pretty easy: it basically re-routes namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume is using CIFS. In other words, it is effectively running on a network drive, even if the network is local to the machine.

It turned out that the reproduction is not only very specific for a particular deployment, but also for a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handled all errors. The redacted version, which is much more readable, is here:

This can be executed using:

And running the following command:

docker run --rm -v "$PWD":/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated
  • Write to the file using O_DIRECT | O_DSYNC
  • We then read (using another file descriptor) the data

The write operations are sequential, and so are the read operations; however, the read operation will read past the written area. This is key. At this point, we write again to the file, to an area that we have already read.

At this point, we attempt to re-read the data that was just written, but instead of getting the data back, we get just zeroes. What I believe is going on is that we are hitting cached data. Note that this is all done with system calls; there is no userland cache involved.

I reported this to Docker as a bug. I actually believe that this will be the same whenever we use a CIFS share (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and applies it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which are using direct I/O, are going to bypass that cache, and we are going to have to wait for the change notification from CIFS to know that this needs to be invalidated. That turns this into a race condition that looks like data corruption, of sorts.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk; we were just misled by the operating system when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB handles things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, registers an alert and tries again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we do much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we don’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us: developers using Docker to host RavenDB, and it makes a lot of sense that they will be using external volumes. We haven’t gotten around to testing it out, but I suspect that CIFS writes over a “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but it is still something that I want to at least be able to limp along on. Even with no real data loss, an error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefore, we have changed the way we handle I/O in this case: we’ll avoid using the two file descriptors and hold a bit more data in memory for the duration. This gives us more control, is actually likely to give us a small perf boost, and avoids the problematic I/O pattern entirely.

Challenge: The invisible concurrency bug – Answer

time to read 1 min | 162 words

The bug from yesterday would only show when a particular query is being run concurrently, and not always then.

Here is the code that is responsible for the bug:

It is quite hard to see, because it is so subtle. The code here creates a cached lambda that is global for the process. The lambda takes the current engine and the object to transform, and returns the converted object.

So far, so good, right?

Except that in this case, the lambda is capturing the engine parameter that was passed to the enclosing function. The engine is single threaded, and must not be used concurrently. The problem is that the code already handles this situation: the current engine instance is passed to the lambda, where it is never used. Instead, the original engine instance is being used concurrently, violating its invariants and causing errors down the line.

The fix was to simply use the current engine instance that was passed to us, but this was really hard to figure out.
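Here is a minimal sketch of the capture bug; the Engine type and the names are illustrative, not the actual RavenDB code:

    using System;

    class Engine // stand-in for the single threaded script engine
    {
        public object Convert(object value) => value;
    }

    static class Transformers
    {
        private static Func<Engine, object, object> _cached;

        public static object Transform(Engine engine, object item)
        {
            if (_cached == null)
            {
                // BUG: the lambda captures the 'engine' argument of whichever caller
                // got here first. The 'currentEngine' parameter it receives on every
                // later call is ignored, so all threads end up sharing a single
                // engine that is not safe for concurrent use.
                _cached = (currentEngine, toTransform) => engine.Convert(toTransform);

                // Fix: use the engine instance that is actually passed in:
                // _cached = (currentEngine, toTransform) => currentEngine.Convert(toTransform);
            }
            return _cached(engine, item);
        }
    }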

Production postmortem: The unavailable Linux server

time to read 6 min | 1019 words

As part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have 2 Windows servers and a single Linux machine. We set up the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week to see what the results were. Our Linux machine was effectively down. It was not responding to queries and seemed to be completely unresponsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see this in real life. It is bad because it isn’t supposed to be happening. What was really frustrating was that we were able to log into the machine and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue for us came from trying to diagnose the problem from inside the machine, where we were able to reproduce it by trying to access the server from the local machine, resulting in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was an interesting discovery, since it meant that we could rule out anything in the middle. This was a problem on the machine itself. But the problem remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority, with auto generated certificates.

So we started by figuring out where the problem was, whether on the TCP side or the SSL side. We issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed an immediate connection to the server and the client sending the ClientHello properly, but then it just hung there. What was really interesting is that if we waited about 2 minutes, the SSL connection would complete successfully. But we couldn’t figure out any reason why this would be the case. It occurred to me that it might be related to the system handling of reverse DNS lookups. The two minute timeout was very suspicious, and I assumed that it might be trying to look up the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configuration options directly relate to this led us on a merry chase.

Eventually we pulled out strace and looked into what was actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to 172.31.0.2 which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We are getting some data back, and we can see identrust there. And then we see something really suspicious. We have a TCP socket call that goes to 192.35.177.64 on port 80. In other words, this is an HTTP call. What is an HTTP call doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling to: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to jell together. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL certificates have the notion of Authority Information Access, which is basically a URL registered in the certificate that points to where you can find the certificate that signed this one.

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we would try to connect to IdenTrust to get the full chain to send to the client. The killer here is that if we fail to do so, we’ll send the certificate chain we have (without the missing root), and it will still work, since the other side usually already has this root installed. On Windows, this certificate is installed, so we didn’t see it. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code, are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on, we were able to take steps to handle this. We needed to pre-register the entire chain on the local machine, so any lookup will be able to find it locally, and not make a network call for each SSL connection.
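As a rough sketch of that mitigation, assuming the intermediate certificate has already been saved to a local file (the file name is illustrative):

    using System.Security.Cryptography.X509Certificates;

    class RegisterChain
    {
        static void Main()
        {
            // The Let's Encrypt intermediate that signed our server certificate,
            // downloaded once, ahead of time.
            var intermediate = new X509Certificate2("lets-encrypt-intermediate.cer");

            // The CA store holds intermediate certification authorities. Once the
            // certificate is registered here, chain building can resolve the full
            // chain locally instead of following the AIA URL over HTTP.
            using (var store = new X509Store(StoreName.CertificateAuthority, StoreLocation.CurrentUser))
            {
                store.Open(OpenFlags.ReadWrite);
                store.Add(intermediate);
            }
        }
    }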

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although there it seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where it actually happens. The really bad thing here is that from the outside, there is no way for us to control or disable this. So this is just something that you have to remember to do when you use certificates: make sure that the entire chain is registered on each machine. Otherwise you might have a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

Getting fatal out of memory errors because we are managing memory too well

time to read 4 min | 621 words

We got into a serious situation in one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked, in fact, it worked beautifully. Up until the point that we started to use too many resources and crashed. While normally that would be expected, it really bugged us; we had provisions in place to protect us against exactly that. Bulkheads were supposed to be closed, operations rolled back, etc. We were supposed to react properly, reduce the cost of operations, prefer being up to being fast, the works.

That did not happen. From the outside, what happened is that we got to the point where we would trigger the “sky is about to fall, let’s conserve everything we can” mode, but we didn’t see the reaction that we expected from the system. Oh, we did start to use a lot less resources, but the resources that we weren’t using? They weren’t going back to the OS; they were still being held.

It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers, that are resident on the large object heap.

When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing work, starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slowdown as RavenDB was willing to take less upon itself. But what we didn’t see was the actual release of resources as a result of this behavior.

And as it turned out, that was because we were too good at managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we run very few GCs. In fact, Gen 2 GC collections are rare in our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.

In this particular situation, we were careful to code so that we would get very few GC collections running, and that led us to crash, because we would run out of resources before the GC could realize that we were not actually using them at that point.

The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less zealous about keeping all the memory around: if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has a better chance to actually realize that it isn’t being used and recover the memory. And instead of throwing the buffers away all at once when we hit low memory and hoping that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.

Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario, but we are able to reduce allocations by serving the buffers that are already allocated. These two actions, being less rigorous about policing our memory and not freeing things when we get to low memory, confusingly enough both reduce the chance of getting into low memory and reduce the chance of actually using too much memory in such a case.
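A hypothetical sketch of that policy, not RavenDB’s actual code: idle buffers are discarded early in normal operation, but under low memory pressure the buffers we already hold are kept and reused instead of being thrown away and re-allocated.

    using System;
    using System.Collections.Concurrent;

    public sealed class RequestBufferPool
    {
        private readonly ConcurrentQueue<(byte[] Buffer, DateTime ReturnedAt)> _idle =
            new ConcurrentQueue<(byte[] Buffer, DateTime ReturnedAt)>();

        // Flipped by some low memory watcher; how that is detected is out of scope here.
        public static volatile bool LowMemory;

        private static readonly TimeSpan IdleDiscardAfter = TimeSpan.FromSeconds(15); // assumed value

        public byte[] Rent(int size)
        {
            // Reusing an existing buffer avoids a fresh large object heap allocation,
            // which matters most when memory is already tight.
            while (_idle.TryDequeue(out var entry))
            {
                if (entry.Buffer.Length >= size)
                    return entry.Buffer;
            }
            return new byte[size];
        }

        public void Return(byte[] buffer)
        {
            _idle.Enqueue((buffer, DateTime.UtcNow));

            if (LowMemory)
                return; // keep what we have; the GC would not reclaim it in time anyway

            // In normal operation, drop buffers that sat idle for a while, so the next
            // Gen 2 collection can actually reclaim the memory.
            while (_idle.TryPeek(out var oldest) &&
                   DateTime.UtcNow - oldest.ReturnedAt > IdleDiscardAfter)
            {
                _idle.TryDequeue(out _);
            }
        }
    }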

Production crash root cause analysis step by step

time to read 3 min | 470 words

One of our production servers crashed. They do that a lot, because we make them do evil things in harsh environments so they will crash on our end and not when running on your systems. This one is an interesting crash.

We are running the system with auto dump, so a crash triggers a mini dump that we can analyze. Here is the root cause of the error, an IOException:

    SP               IP               Function
    00000024382FF0D0 00007FFD8313108A System_Private_CoreLib!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)+0x3cf7ea
    00000024382FF140 00007FFD82D611A2 System_Private_CoreLib!System.IO.FileStream.FlushWriteBuffer(Boolean)+0x62
    00000024382FF180 00007FFD82D60FA1 System_Private_CoreLib!System.IO.FileStream.Dispose(Boolean)+0x41
    00000024382FF1D0 00007FFD82D60631 System_Private_CoreLib!System.IO.FileStream.Finalize()+0x11

And with this, we are almost done analyzing the problem, right? This is a crash because a finalizer has thrown, so we now know that we are leaking a file stream and that it throws from the finalizer. We are almost, but not quite, done. We are pretty careful about such things, and most if not all of our usages are wrapped in a using statement.

The reason for the error was quite apparent when we checked: we ran out of disk space. But that is something that we test extensively, and we are supposed to work fine with that. What is going on?

No disk space is an interesting IO error, because it is:

  • transient, in the sense that the admin can clear some space and does not expect you to crash.
  • persistent, in the sense that the application can’t really do anything to recover from it.

As it turns out, we have been doing everything properly. Here is the sample code that will reproduce this error:
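Something along these lines, as a hedged sketch rather than the exact sample; the path is illustrative and the target volume is assumed to have only a few bytes of free space left:

    using System;
    using System.IO;

    class Repro
    {
        static void Main()
        {
            var stream = new FileStream("/mnt/tiny-volume/file.bin", FileMode.Create);
            for (int i = 0; i < 12; i++)
            {
                // Goes into FileStream's in-memory buffer; nothing touches the disk yet.
                stream.WriteByte(1);
            }
            try
            {
                // Dispose flushes the buffer and throws an IOException (disk full)
                // before GC.SuppressFinalize is ever reached.
                stream.Dispose();
            }
            catch (IOException e)
            {
                // The application handles "no disk space" gracefully here...
                Console.WriteLine(e.Message);
            }
            // ...but the finalizer will still run, attempt the same flush, throw
            // again, and take the whole process down.
        }
    }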

For the moment, ignore things like file metadata, etc. We are writing 12 bytes, and we only have room for 10. Note that calling WriteByte does not go to the disk, extend the file, etc. Instead, it just writes to an in-memory buffer, and the buffer is flushed when you call Flush or Dispose.

Pretty simple, right?

Now let us analyze what will happen in this case, shall we?

We call Dispose on the FileStream, and that ends up throwing, as expected. But let us dig a little further, shall we?

Here is the actual Dispose code:
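The relevant part follows the standard dispose pattern; paraphrased from memory, it is roughly:

    // Roughly the shape of Stream's dispose path (paraphrased, not the exact source):
    public void Dispose()
    {
        Close();
    }

    public virtual void Close()
    {
        // If Dispose(true) throws, the SuppressFinalize call below never runs, so the
        // finalizer will later call Dispose(false) and hit the same failure.
        Dispose(true);
        GC.SuppressFinalize(this);
    }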

This is in Stream.Dispose. Now, consider the case of a class that:

  • Implements a finalizer, such as FileStream.
  • Has a failure during the Dispose(true) call.
  • Will have the same failure the next time Dispose is called, even with Dispose(false).

In this case, Dispose fails, but SuppressFinalize is never called, so Dispose will be called again from the finalizer, which will also fail, bringing our entire system down.

The issue was reported here, and I hope that it will be fixed by the time you read this.
