Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q j

Posts: 6,646 | Comments: 48,401

filter by tags archive

RavenDB Security Vulnerability Advisory

time to read 3 min | 533 words

You can read the full details here. The short of it is that we discovered a security vulnerability in RavenDB. This post tells a story. For actionable operations, see the previous link and upgrade your RavenDB instance to a build that includes the fix.

Timeline:

  • June 6 – A routine code review inside RavenDB expose a potential flaw in sanitizing external input. It is escalated and confirmed be a security bug. Further investigation classify it as CRTICIAL issue. A lot of sad faces on our slack channels show up. The issue has the trifecta of security problems:
    • It is remotely exploitable.
    • It is on in the default configuration.
    • It provide privilege escalation (and hence, remote code execution).
  • June 6 – A fix is implemented. This is somewhat complicated by the fact that we don’t want it to look like a security fix to avoid this issue.
  • June 7 – The fix goes through triple code review by independent teams.
  • June 7 – An ad hoc team goes through all related functionality to see if similar issues are still present.
  • June 8 – Fixed version is deployed to our production environment.

We had to make a choice here, whatever to alert all users immediately, or first provide the fix and urge them to upgrade (while opening them up to attacks in the meanwhile). We also want to avoid the fix, re-fix, for-real-this-time cycle from rushing too often.

As this was discovered internally and there are no indications that this is known and/or exploited in the wild, we chose the more conservative approach and run our full “pre release” cycle, including full 72-96 hours in a production environment serving live traffic.

  • June 12 – The fix is now available in a publicly released version (4.0.5).
  • June 13 – Begin notification of customers. This was done by:
    • Emailing all RavenDB 4.0 users. One of the reasons that we ask for registration even for the free community edition is exactly this. We want to be able to notify users when such an event occur.
    • Publishing security notice on our website.
    • Pushing a notification to all vulnerable RavenDB nodes warning about this issue. Here is what this looks like:
      image
  • Since June 13 – Monitoring of deployed versions and checking for vulnerable builds still in use.
  • June 18 – This blog post and public notice in the mailing list to get more awareness of this issue. The website will also contain the following notice for the next couple weeks to make sure that everyone know that they should upgrade:
    image

We are also going to implement a better method to push urgent notices like that in the future, to make sure that we can better alert users. We have also inspected the same areas of the code in earlier versions and verified that this is a new issue and not something that impacts older versions.

I would be happy to hear what more we can do to improve both our security and our security practices.

And yes, I’ll discuss the actual vulnerability in detail in a month or so.

The case of the missing writes in Docker (a Data Corruption story)

time to read 6 min | 1017 words

image

We started to get reports from users that are running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event.  You can see how this looks like on the right. As you can see, this ain’t a happy camper. In fact, this is a pretty scary one. The kind you see in movies that air of Friday the 13th.

The really strange part there was that this is one of those errors that really should never be possible. RavenDB have a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touch at the core of what we do, keeping data safe.

The really crazy part there was that we could find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we fell down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. I a Linux / Linux or Windows / Windows setup, that is pretty easy, it basically re-route namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume is using CIFS. In other words, it is effectively running on a network drive, even if the network is machine local only.

It turned out that the reproduction is not only very specific for a particular deployment, but also for a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handled all errors. The redacted version that is much more readable is here:

This can be executed using:

And running the following command:

docker run --rm -v PWD:/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated
  • Write to the file using O_DIRECT | O_DSYNC
  • We then read (using another file descriptor) the data

The write operations are sequential, and the read operations as well, however, the read operation will read past the written area. This is key. At this point, we write again to the file, to an area where we already previously read.

At this point, we attempt to re-read the data that was just written, but instead of getting the data, we get just zeroes.  What I believe is going on is that we are hitting the cached data. Note that this is doing system calls, not any userland cache.

I reported this to Docker as a bug. I actually believe that this will be the same whenever we use CIFS system (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and apply it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which are using direct I/O are going to bypass that cache and we are going to have to wait for the change notification from CIFS to know that this needs to be invalidated. That turn this issue into a race condition of data corruption,of sort.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk, we were just mislead by the operating system about that when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB is handling things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, register an alert and try again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we are doing much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we aren’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us, developers using Docker to host RavenDB, and it make a lot of sense they will be using external volumes. We haven’t gotten to testing it out, but I suspect that CIFS writes over “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but that is still something that I want to at least be able to limp on. Even with no real data loss, a error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefor, we have changed the way we are handling I/O in this case, we’ll avoid using the two file descriptors and hold a bit more data in memory for the duration. This give us more control, actually likely to give us a small perf boost and avoid the problematic I/O pattern entirely.

ChallengeThe invisible concurrency bug–Answer

time to read 1 min | 162 words

The bug from yesterday would only show when a particular query is being run concurrently, and not always then.

Here is the code that is responsible for the bug:

It is quite hard to see, because it is so subtle. The code here create a cached lambda that is global for the process. The lambda takes the current engine, the object to transform return the converted object.

So far, so good, right?

Except that in this case,  the lambda is capturing the engine parameter that is passed to the function. The engine is single threaded, and must not be used concurrently. The problem is that the code already handles this situation, and the current engine instance is passed to the lamda, where it is never used. The original engine instance is being used concurrently, violating its invariants and causing errors down the line.

The fix was to simply use the current engine instance that was passed to us, but this was really hard to figure out.

Production postmortemThe unavailable Linux server

time to read 6 min | 1019 words

imageAs part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have 2 Windows services and a single Linux machine. We setup the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week, to see what the results were.  Our Linux machine was effectively down. It was not responding to queries and it seemed like it was non responsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see this in real life. It is bad because it isn’t supposed to be happening. What was really frustrating was that we were able to log into the machine and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue for us was trying to diagnose the problem from inside the machine, where we able to reproduce it by trying to access the server from the local machine, resulting in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was a interesting discovery, since it meant that we can rule out anything in the middle. This is a problem in this machine. But the problem remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority with auto generated certificates.

So we started by seeing where the problem is, whatever this is on the TCP side or the SSL side, we issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed immediate connection to the server and the client sending the ClientHello properly, but then just hanging there. What was really interesting is that if we waited about 2 minutes, that SSL connection would complete successfully. But we couldn’t figure out any reason why this would be the case. It occurred to me that it might be related to the system handling of reverse DNS lookup. The two minutes timeout was very suspicious, and I assumed that it might be trying to lookup the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configuration directly relate to this has led us in a merry chase.

Eventually we pulled strace and looked into what is actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to 172.31.0.2 which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We are getting some data back, and we can see identrust there. And then we see something really suspicious. We have a TCP socket call that goes to 192.35.177.64 on port 80. In other words, this is an HTTP call. What does an HTTP call is doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling to: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to jell together. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL Certificates has the notion of Authority Information Access, which is basically a URL that is registered in the certificate that points to where you can find the certificate that signed this one.

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we’ll try to connect to IdenTrust to get the full chain to send to the client. The killer here is that if we fail to do so, we’ll send the certificate chain we have (without the missing root), but it will work, since the other side already have this root installed (usually). On Windows, this certificate is installed, so we didn’t see it. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on we were able to take steps to handle this. We needed to pre-register the entire chain on the local machine, so any lookup will be able to find it locally, and not do a network call per each SSL connection.

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although that seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where this is actually happening. The really bad thing here is that from the outside, there is no way for us to control or disable this, so this is just something that you have to remember to do when you use certificates, make sure that the entire chain is registered on each machine, otherwise you might have a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

Getting fatal out of memory errors because we are managing memory too well

time to read 4 min | 621 words

imageWe got a serious situation on one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked, in fact, it worked beautifully. Up until the point that we started to use too many resources and crashed. While normally that would be expected, it really bugged us, we had provisions in place to protect us against that. Bulkheads were supposed to be blocked, operations rolled back, etc. We were supposed to react properly, reduce costs of operations, prefer being up to being fast, the works.

That did not happen. From the outside, what happened is that we go to the point where we would trigger the “sky about the fall, let’s conserve everything we can”, but we didn’t see the reaction that we expected from the system. Oh, we were started to use a lot less resources, but the resources that we weren’t using? They weren’t going back to the OS, they were still held.

It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers, that are resident on the large object heap.

When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing works and starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slow down as RavenDB was willing to take less upon itself. But what we didn’t see is the actual release of resources as a result of this behavior.

And as it turned out, that was because we were too good about managing ourselves. A large part of the design of RavenDB 4.0 was around reducing the cost of garbage collections by reducing allocations as much as possible. This means that we are running very few GCs. In fact, GC Gen 2 collections are rare on our environment. However, we need these Gen 2 collections to be able to clean up stuff that is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.

In this particular situation, we were careful to code so we will get very few GC collections running, and that led us to crash because we would run out of resources  before the GC could realize that we are actually not really using them at this point.

The solution, by the way, was to change the way we respond to low memory conditions. We’ll be less good about keeping all the memory around and if it isn’t being used, we’ll start discarding it a lot sooner, so the GC has better chance to actually realize that is isn’t being used and recover the memory. An instead of throwing the buffers away all at once when we have low memory and hope that the GC will be fast enough in collecting them, we’ll keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.

Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario but are able to reduce allocations by serving the buffers that are already allocated. This two actions, being less rigorous about policing our memory and not freeing things when we get low memory are confusingly enough to get both reduce the chance of getting into low memory and reduce the chance of actually using too much memory in such a case.

Production crash root cause analysis step by step

time to read 3 min | 470 words

One of our production server crashed. They do that a lot, because we make them do evil things in harsh environment so they would crash on our end and not when running on your systems. This one is an interesting crash.

We are running the system with auto dump, so a crash will trigger a mini dump that we can run. Here is the root cause of the error, and IOException here:

    SP               IP               Function
    00000024382FF0D0 00007FFD8313108A System_Private_CoreLib!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)+0x3cf7ea
    00000024382FF140 00007FFD82D611A2 System_Private_CoreLib!System.IO.FileStream.FlushWriteBuffer(Boolean)+0x62
    00000024382FF180 00007FFD82D60FA1 System_Private_CoreLib!System.IO.FileStream.Dispose(Boolean)+0x41
    00000024382FF1D0 00007FFD82D60631 System_Private_CoreLib!System.IO.FileStream.Finalize()+0x11

And with this, we are almost done analyzing the problem, right? This is a crash because a finalizer has thrown, so we now know that we are leaking a file stream and that it throws on the finalizer.  We are almost, but not quite, done. We are pretty careful about such things, and most to all of our usages are wrapped in a using statement.

The reason for the error was quite apparent when we checked, we run out of disk space. But that is something that we extensively tested and we are supposed to work fine with that. What is going on?

No disk space is an interesting IO error, because it is:

  • transient, in the sense that the admin can clear some space, and does not expect you to crash.
  • persistent, in the sense that the application can’t really do something to recover from it.

As it turns out, we have been doing everything properly. Here is the sample code that will reproduce this error:

For the moment, ignore things like file metadata, etc. We are writing 12 bytes, and we have only place for 10. Note that calling WriteByte does not go to the disk, extend the file, etc. Instead, it will just write to the buffer and the buffer will be flushed when you call Flush or Dispose.

Pretty simple, right?

Now let us analyze what will happen in this case, shall we?

We call Dispose on the FileStream, and that ends up throwing, as expected. But let us dig a little further, shall we?

Here is the actual Dispose code:

This is in Stream.Dispose. Now, consider the case of a class that:

  • Implements a finalizer, such as FileStream.
  • Had a failure during the Dispose(true); call.
  • Will have the same failure the next time you call Dispose(), even if with Dispose(false);

In this case, the Dispose fails, but the SuppressFinalize is never called, so this will be called from the finalizer, which will also fail, bringing our entire system down.

The issue was reported here and I hope that they will be fixed by the time you read it.

Transactions, request processing and convoys, oh my!

time to read 5 min | 805 words

imageYou might have noticed that we have been testing, quite heavily, what RavenDB can do. Part of that testing focused on running on barely adequate at ridiculous loads.

Because we setup the production cluster like a drunk monkeys, a single high load database was able to affect all other databases.  To be fair, I explicitly gave the booze to the monkeys and then didn’t look when they made a hash of it, because we wanted to see how RavenDB does when setup by someone who has as little interest in how to properly setup things and just want to Get Things Done.

The most interesting part of  this experiment was that we had a wide variety of mixed workload on the system. Some databases are primarily read heavy, some are used infrequently, some databases are write heavy and some are both write heavy and indexing heavy. Oh, and all the nodes in the cluster were setup identically, with each write and index going to all the nodes.

That was really interesting when we started very heavy indexing loads and then pushed a lot of writes. Remember, we intentionally under provisioned, and these machines are 2 cores with 4GB RAM and they were doing heavy indexing, processing a lot of reads and writes, the words.

Here is the traffic summary from one of the instances:

image

A very interesting wrinkle that I didn’t mention is that this setup has a really interesting property that we never tested. It has fast I/O, but the number of tasks that are waiting for the two cores is high enough that they don’t get a lot of CPU time on an individual basis. What that means is that it looks like we have I/O that is faster than the CPU.

A core concept of RavenDB performance is that we can send work to the disk to run and in that timeframe we will be able to complete more operations in memory, then send a batch of them to disk. Rinse, repeat, etc.

This wasn’t the case here. By the time we finished a single operation we’ll already have the previous operation completed, and so we’ll proceed with a single operation every time. That killed our performance, and it meant that the transactions merger queue would grow and grow.

We fixed things so we’ll take into account the load on the system when this happens, and we gave a higher priority to the transaction merging thread over normal request processing or indexing. This is because write requests can’t complete before the transaction has been committed, so obviously we don’t want to process further requests at the expense of processing the current requests.

This problem can only occur when we are competing heavily for CPU time, something that we don’t typically see. We are usually constrained a lot more by network or disk. With enough CPU capacity, there is never an issue of the requests and the transaction merger competing for the same core and interrupting each other, so we didn’t consider this scenario.

Another problem we had was the kind of assumptions we made with regards to the processing power. Because we tested on higher end machines, we tested with some ridiculous performance numbers, including hundreds of thousands of writes per second, indexing, mixed read / write load, etc. But we tested that on high end hardware, which means that we got requests that completed fairly quickly. And that led to a particular pattern of resource utilization. Because we reuse buffers between requests, it is typically better to grab a single big buffer and keep using it, rather than having to enlarge it between requests.

If your requests are short, the number of requests you have in flight is small, so you get great locality of reference and don’t have to allocate memory from the OS so often. But if that isn’t the case…

We had a lot of requests in flight. A lot of them because they were waiting for the transaction merger to complete its work, and it was having to fight the incoming requests for some CPU time. So we have a lot of inflight requests, and they intentionally got a bigger buffer than they actually needed (pre-allocating). You can continue down this line of thought, but I’ll give you a hint, it ends with KABOOM.

All in all, I think that it was a very profitable experiment. This is going to be very relevant for users on the low end, especially those running Docker instances, etc. But it should also help if you are running production worthy systems and can benefit from higher utilization of the system resources.

The cost of finalizers

time to read 2 min | 231 words

A common pattern we use is to wrap objects that require disposing of resources with a dispose / finalizer. Here is a simple example:

This is a pretty simple case, and obviously we try to avoid these allocations, so we try to reuse the SecretItem class.

However, when peeking into high memory utilization issue, we run into a really strange situation. We had a lot of buffers in memory, and most of them were held by items that were held in turn by the Finalization Queue.

I started to write up what I figured out about finalizer in .NET, but this post does a great job of explaining everything. The issue here is that we retain a reference to the byte array after the dispose. Because the object is finalizable, the GC isn’t going to be able to remove it on the spot. That means that it is reachable, and anything that it holds is also reachable.

Now, we are trying very hard to be nice to the GC and not allocate too much, which means that the finalizer thread will only wake up occasionally, but that means that the buffer will be held up in memory for a much longer duration.

The fix was the set the Buffer to null during the Dispose call, which means that the GC can pick up that it isn’t actually being used much faster.

ChallengeFind the bug in the fix–answer

time to read 2 min | 347 words

I am writing this answer before people had a chance to answer the actual challenge, so I hope people caught it. This was neither easy nor obvious to catch, because it was hiding with a pile of other stuff and the bug is a monster to figure out.

In case you need a reminder, here is the before & after code:

Look at line 18 in the second part. If we tried to allocate native memory and failed, we would try again, this this with the requested amount.

The logic here is that we typically want to request memory in power of 2 increments. So if asked for 17MB, we’ll allocate 32MB. This code is actually part of our memory allocator, which request memory from the operating system, so it is fine if we allocate more, we’ll just use that in a bit. However, if we don’t have enough memory to allocate 32MB, maybe we do have enough to allocate 17MB. And in many cases, we do, which allow the system to carry on operating.

Everyone is happy, right? Look at line 21 in the second code snippet. We set the allocated size to the size we wanted to allocate, not the actual size we allocated.

We allocated 17MB, we think we allocated 32MB, and now everything can happen.

This is a nasty thing to figure out. If you are lucky, this will generate an access violation when trying to get to that memory you think you own. If you are not lucky, this memory was actually allocated to your process, which means that you are now corrupting some totally random part of memory in funny ways. And that means that in some other time you’ll be start seeing funny behaviors and impossible results and tear your hair out trying to figure it out.

To make things worse, this is something that only happens when you run out of memory, so you are already suspicious about pretty much everything that is going on there. Nasty, nasty, nasty.

I might need a new category of bugs: “Stuff that makes you want to go ARGH!”

FUTURE POSTS

  1. RavenDB 4.1 Features: Cluster wide ACID transactions - one day from now

There are posts all the way to Jun 20, 2018

RECENT SERIES

  1. RavenDB 4.1 features (6):
    19 Jun 2018 - Explain that choice
  2. Codex KV (2):
    06 Jun 2018 - Properly generating the file
  3. I WILL have order (3):
    30 May 2018 - How Bleve sorts query results
  4. Inside RavenDB 4.0 (10):
    22 May 2018 - Book update
  5. RavenDB Security Report (5):
    06 Apr 2018 - Collision in Certificate Serial Numbers
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats