Ayende @ Rahien


Challenge: The invisible concurrency bug–Answer

time to read 1 min | 162 words

The bug from yesterday would only show when a particular query is being run concurrently, and not always then.

Here is the code that is responsible for the bug:

It is quite hard to see, because it is so subtle. The code here creates a cached lambda that is global for the process. The lambda takes the current engine and the object to transform, and returns the converted object.

So far, so good, right?

Except that in this case, the lambda is capturing the engine parameter that was passed to the function on the first call. The engine is single threaded, and must not be used concurrently. The code already accounts for this situation: the current engine instance is passed to the lambda, but it is never used there. Instead, the original captured engine instance is being used concurrently, violating its invariants and causing errors down the line.

The fix was to simply use the current engine instance that was passed to us, but this was really hard to figure out.
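To make the pattern concrete, here is a minimal sketch of the bug and the fix, with hypothetical names rather than the actual code from the post:

using System;
using Jint;

static class TransformCache
{
    // Cached globally for the process, so it survives across requests.
    private static Func<Engine, object, object> _transformer;

    public static object Transform(Engine currentEngine, object item)
    {
        if (_transformer == null)
        {
            // BUG: the lambda captures 'currentEngine' from whichever call got
            // here first. Later calls pass their own engine as a parameter, but
            // the lambda ignores it and keeps using the captured instance, so a
            // single threaded engine ends up shared between threads.
            _transformer = (engine, o) => currentEngine.Invoke("transform", o).ToObject();

            // FIX: use the engine parameter that is passed in on each call:
            // _transformer = (engine, o) => engine.Invoke("transform", o).ToObject();
        }
        return _transformer(currentEngine, item);
    }
}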

Production postmortem: The unavailable Linux server

time to read 6 min | 1019 words

As part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have two Windows servers and a single Linux machine. We set up the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week to see what the results were. Our Linux machine was effectively down. It was not responding to queries and appeared to be completely unresponsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see this in real life. It is bad because it isn’t supposed to be happening. What was really frustrating was that we were able to log into the machine and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue came when we tried to diagnose the problem from inside the machine, where we were able to reproduce it by accessing the server from the local machine, resulting in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was an interesting discovery, since it meant that we could rule out anything in the middle; the problem was in this machine. But it remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority, with auto generated certificates.

To see whether the problem was on the TCP side or the SSL side, we issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed an immediate connection to the server, with the client sending the ClientHello properly, and then just hanging there. What was really interesting is that if we waited about two minutes, the SSL connection would complete successfully. We couldn’t figure out any reason why that would be the case. It occurred to me that it might be related to the system’s handling of reverse DNS lookups. The two minute timeout was very suspicious, and I assumed that it might be trying to look up the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configurations directly relate to this led us on a merry chase.

Eventually we pulled strace and looked into what is actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to 172.31.0.2 which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We are getting some data back, and we can see identrust there. And then we see something really suspicious: a TCP socket call that goes to 192.35.177.64 on port 80. In other words, this is an HTTP call. What is an HTTP call doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to gel together. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL certificates have the notion of Authority Information Access, which is basically a URL registered in the certificate that points to where you can find the certificate that signed this one.

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we would try to connect to IdenTrust to get the full chain to send to the client. The killer here is that if we fail to do so, we’ll send the certificate chain we do have (without the missing root), and it will still work, since the other side (usually) already has this root installed. On Windows, this certificate is installed, so we didn’t see the problem. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code, are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on, we were able to take steps to handle this. We needed to pre-register the entire chain on the local machine, so any lookup will be able to find it locally, rather than making a network call per SSL connection.
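For example, here is a sketch of pre-loading an intermediate certificate into the local store so chain building finds it without going to the network (the store choice and the path parameter are illustrative assumptions, not RavenDB’s exact code):

using System.Security.Cryptography.X509Certificates;

static class ChainRegistration
{
    public static void RegisterIntermediate(string certificatePath)
    {
        var intermediate = new X509Certificate2(certificatePath);
        // CertificateAuthority is the intermediate CA store; once the cert is
        // there, X509Chain can resolve the chain locally instead of fetching
        // the AIA URL over HTTP.
        using (var store = new X509Store(StoreName.CertificateAuthority, StoreLocation.CurrentUser))
        {
            store.Open(OpenFlags.ReadWrite);
            store.Add(intermediate);
        }
    }
}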

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although there it seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where this is actually happening. The really bad thing here is that from the outside, there is no way for us to control or disable this. This is just something that you have to remember to do when you use certificates: make sure that the entire chain is registered on each machine. Otherwise you might have a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

Challenge: Find the bug in the fix–answer

time to read 2 min | 347 words

I am writing this answer before people have had a chance to answer the actual challenge, so I hope they caught it. This was neither easy nor obvious to catch, because it was hiding within a pile of other stuff, and the bug is a monster to figure out.

In case you need a reminder, here is the before & after code:

Look at line 18 in the second part. If we tried to allocate native memory and failed, we would try again, this time with the exact requested amount.

The logic here is that we typically want to request memory in power of 2 increments, so if asked for 17MB, we’ll allocate 32MB. This code is actually part of our memory allocator, which requests memory from the operating system, so it is fine if we allocate more; we’ll just use it in a bit. However, if we don’t have enough memory to allocate 32MB, maybe we do have enough to allocate 17MB. And in many cases we do, which allows the system to carry on operating.

Everyone is happy, right? Look at line 21 in the second code snippet. We set the allocated size to the size we wanted to allocate, not the actual size we allocated.

We allocated 17MB, we think we allocated 32MB, and now anything can happen.
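Here is a minimal sketch of the pattern, with hypothetical names rather than the real before & after code:

using System;
using System.Runtime.InteropServices;

class NativeAllocator
{
    public IntPtr Pointer;
    public long AllocatedSize;

    public void Allocate(long size)
    {
        var sizeToAllocate = NearestPowerOfTwo(size); // ask for 17MB, round to 32MB
        if (TryAllocate(sizeToAllocate, out Pointer) == false)
        {
            // Low memory fallback: retry with the exact requested amount...
            if (TryAllocate(size, out Pointer) == false)
                throw new OutOfMemoryException();
            // ...and this is the line the buggy fix forgot. Without it we
            // record 32MB while only 17MB was actually allocated.
            sizeToAllocate = size;
        }
        AllocatedSize = sizeToAllocate;
    }

    private static bool TryAllocate(long size, out IntPtr ptr)
    {
        try
        {
            ptr = Marshal.AllocHGlobal((IntPtr)size);
            return true;
        }
        catch (OutOfMemoryException)
        {
            ptr = IntPtr.Zero;
            return false;
        }
    }

    private static long NearestPowerOfTwo(long v)
    {
        long p = 1;
        while (p < v) p <<= 1;
        return p;
    }
}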

This is a nasty thing to figure out. If you are lucky, this will generate an access violation when trying to get to the memory you think you own. If you are not lucky, that memory was actually allocated to your process, which means that you are now corrupting some totally random part of memory in funny ways. And that means that at some other time you’ll start seeing funny behaviors and impossible results, and tear your hair out trying to figure it out.

To make things worse, this is something that only happens when you run out of memory, so you are already suspicious about pretty much everything that is going on there. Nasty, nasty, nasty.

I might need a new category of bugs: “Stuff that makes you want to go ARGH!”

Challenge: Find the bug in the fix

time to read 1 min | 72 words

We are working on improving the reliability of RavenDB under a host of scenarios. This week it is low memory conditions. We made some fixes, and introduced a horrible bug.

Here is the code; can you see what the error is? Here are the first and second versions of the code. The second version is meant to be more robust under low memory conditions, but it is actually much worse.

Optimizing select projections: Part IV–Understand, don’t do

time to read 2 min | 382 words

This post is what the entire series has been building toward. In the previous post we refactored our code to make it easier to build additional behaviors. Here is one such behavior, which uses the Esprima project to parse the code and see if it uses a pattern that we can optimize for. Let us see the code, then discuss what it does.

The new stuff is mostly in the TryProjectOptimized method. This method makes heavy use of the new C# features for easy parsing of ASTs, and it really shows that the C# team is using C# to build the compiler (that is one of the common things compiler writers do: make their own lives easier). Take this code even a couple of language versions back, and it would be much harder to read and work with.

At any rate, what this code is doing is searching for an object literal that does simple assignment of properties. If it detects that, instead of calling into the JS engine, it will produce a delegate that does just that directly.
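A sketch of the shape of that fast path (illustrative names, not the actual RavenDB code): once the AST tells us a projection like { Name: o.Name, Age: o.Age } is just simple property copies, we can build a delegate that copies the fields directly.

using System;
using System.Collections.Generic;

static class ProjectionOptimizer
{
    // assignments: (target property, source property) pairs pulled from the AST
    public static Func<Dictionary<string, object>, Dictionary<string, object>>
        CompileSimpleProjection((string Target, string Source)[] assignments)
    {
        return source =>
        {
            var result = new Dictionary<string, object>(assignments.Length);
            foreach (var (target, sourceName) in assignments)
                result[target] = source[sourceName]; // no JS engine involved
            return result;
        };
    }
}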

The result is pretty much the same, but we have to do so much less. If we can’t detect a pattern that we recognize, we just fall back to the JS engine again and get the previous behavior.

This is a pretty awesome thing to have, because if we detect a particular common pattern, we can optimize it quite easily. The code really lends itself to that approach now.

What about the performance? Well…

And this runs in 0.31 seconds on the 10K, 5K, 1K run.

  • 10K in 83 ms
  • 5K in 200 ms
  • 1K in 31 ms

So that is two orders of magnitude for the 10K run.

As an exercise, try taking this code and teaching it how to recognize simple expressions like + and –. The second projection example is meant to give you some idea about that.

Oh, and one last thought: we are actually biased against the optimization. The Jint code produces results in formats that we need to pull the data from, and we aren’t accounting for that here. The optimized code can generate the output already in the format we need it to be in (the difference between the Dictionary and the JsValue).

Optimizing select projections: Interlude, refactoring

time to read 2 min | 256 words

So we got an order of magnitude performance boost, without really doing much, to be honest. But the code is starting to look really ugly, and future optimizations are out until we fix it.

Since we don’t really care about the internal details, we can do two things in one move. First, we’ll clean up the code by introducing a delegate that abstracts the handling; second, this delegate will also allow us to handle caching of additional values simply by way of capturing the delegate’s state.

Here is what the code looks like:
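(What follows is a rough sketch of its shape, using hypothetical names and a Jint 2.x-style API, not the code verbatim.)

using System.Collections.Generic;
using Jint;

delegate Dictionary<string, object> Projector(Dictionary<string, object> source);

static class Projections
{
    public static Projector GetProjector(string projection)
    {
        // Each strategy hands back a delegate; any caching it needs lives in
        // the closure's captured state.
        return ProjectViaJint(projection);
    }

    private static Projector ProjectViaJint(string projection)
    {
        var engine = new Engine()
            .Execute("function project(o){ return " + projection + "; }");
        return source =>
        {
            var jsResult = engine.Invoke("project", source).AsObject();
            var result = new Dictionary<string, object>();
            foreach (var property in jsResult.GetOwnProperties())
                result[property.Key] = property.Value.Value.ToObject();
            return result;
        };
    }
}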

Note that we have broken it apart into distinct steps, and the ProjectViaJint method is self contained; any caching it needs is done internally. This is much needed refactoring if we want to be able to continue optimizing this code, and it runs at about the same speed as the previous version, so we didn’t hurt anything there.

With this in place, we can see how much better we can make things.

Before we go on, I want to emphasize that this code is actually making a lot of assumptions. The passed dictionary must always have the same set of fields, for example, otherwise we may see a value from a previous iteration. There is no null handling or error handling in place, we aren’t showing the unpacking of the values from the JS environment to our code, and plenty of other stuff. That isn’t that important at this point, because I’m showing the progression of optimizations, rather than production code.

Optimizing select projections: Part III

time to read 2 min | 229 words

After improving the projection performance so much by just caching the engine, I got to thinking: we are also creating a new JS object to pass the arguments every single time. What would happen if we cached that as well?

Here is what I ended up with, again, all of this is pretty brute force (mainly because I’m writing these posts while the baby is asleep, and I’m trying to get as many of them out before she wakes up):

I wouldn’t have expected this to be a dramatic boost, but we got:

And this runs in 0.57 seconds on the 10K, 5K, 1K run.

  • 10K in 413 ms
  • 5K in 110 ms
  • 1K in 45 ms

That is also a third of the cost that we just saved.

This is interesting, so it is worth another check. There are two other potentially expensive operations here: the invocation of the method and the sending of the arguments.

Trawling the Jint code shows that we can remove some abstraction by using ICallable directly, and we can cache the array of arguments. This all leads to:
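(A sketch of the idea, using a Jint 2.x-style API, so the exact member names are assumptions rather than the actual code:)

using System.Collections.Generic;
using Jint;
using Jint.Native;
using Jint.Native.Object;
using Jint.Runtime;

class CachedProjector
{
    private readonly Engine _engine;
    private readonly ICallable _projectFunc; // invoke directly, skipping Engine.Invoke
    private readonly JsValue[] _args;        // cached argument array
    private readonly ObjectInstance _arg;    // cached JS object, refilled per item

    public CachedProjector(string projection)
    {
        _engine = new Engine()
            .Execute("function project(o){ return " + projection + "; }");
        _projectFunc = (ICallable)_engine.GetValue("project");
        _arg = _engine.Object.Construct(Arguments.Empty);
        _args = new JsValue[] { _arg };
    }

    public JsValue Project(Dictionary<string, object> source)
    {
        // Overwrite the cached object's properties instead of allocating a new one.
        foreach (var kvp in source)
            _arg.Put(kvp.Key, JsValue.FromObject(_engine, kvp.Value), false);
        return _projectFunc.Call(JsValue.Undefined, _args);
    }
}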

And this runs in 0.37 seconds on the 10K, 5K, 1K run.

  • 10K in 279 ms
  • 5K in 59 ms
  • 1K in 32 ms

And that is a wow. Because right now, without really doing much at all, we are already an order of magnitude faster than our first attempt, and we can do more.

Optimizing select projections: Part II

time to read 2 min | 236 words

In the previous post, I showed how we can handle select projections and set up a perf test. The initial implementation ran in about 4.4 seconds for our test, but it didn’t give any thought to performance.

Let us see if we can do better. The first thing to do is to avoid building the Jint engine and parsing the code all the time. The way we set things up, we wrap the actual object literal in a function, and there is no state, so we can reuse the previous engine instance without any issues.

That means that we don’t need to pay the cost of creating a new engine, parsing the code, etc. Here is what this looks like:
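(A sketch of the caching idea, with hypothetical names rather than the actual code:)

using System.Collections.Concurrent;
using Jint;

static class EngineCache
{
    private static readonly ConcurrentDictionary<string, Engine> _cache =
        new ConcurrentDictionary<string, Engine>();

    public static Engine GetEngine(string projection)
    {
        // Key on the raw projection string; the wrapping function string is
        // only built on a cache miss, so the hot path allocates no strings.
        return _cache.GetOrAdd(projection, p =>
            new Engine().Execute("function project(o){ return " + p + "; }"));
    }
}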

Note that the cache key here is the raw projection, not the function we send to the engine; this allows us to avoid any string allocations in the common (cached) path.

And this runs in 0.75 seconds on the 10K, 5K, 1K run.

  • 10K in 574 ms
  • 5K in 137 ms
  • 1K in 51 ms

Just this small change boosted our performance by a major margin.

Note that because the engines are not thread safe, to use this in a real application we’ll need to ensure thread safety. The way I did that is to have a pool of these engines for each projection, so an engine is always accessed in single threaded mode.
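A sketch of what such a pool can look like (a hypothetical illustration, not the actual RavenDB code):

using System.Collections.Concurrent;
using Jint;

class EnginePool
{
    private readonly ConcurrentQueue<Engine> _pool = new ConcurrentQueue<Engine>();
    private readonly string _script;

    public EnginePool(string script)
    {
        _script = script;
    }

    public Engine Rent()
    {
        // Each rented engine is owned by a single thread until it is returned,
        // so the single threaded engines are never used concurrently.
        return _pool.TryDequeue(out var engine)
            ? engine
            : new Engine().Execute(_script);
    }

    public void Return(Engine engine) => _pool.Enqueue(engine);
}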

Optimizing select projections: Part I

time to read 2 min | 240 words

I spoke about designing for performance in my previous post, and I thought it would make for an interesting series of blog posts. The task: we have a data source of some kind; in our case, I decided to make the task as simple as possible and define the source data as a Dictionary<string, object>, without allowing nesting of data.

We want the user to be able to provide a custom way to project data from that dictionary. Here are a few ways to do that:

And here is the data source in question:
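(The following are illustrative stand-ins, not the actual samples from the post:)

using System.Collections.Generic;

static class Samples
{
    // Three projections of increasing complexity; the first is the most common.
    public static readonly string[] Projections =
    {
        "{ Name: o.Name }",                                                    // run 10,000 times
        "{ FullName: o.FirstName + ' ' + o.LastName }",                        // run 5,000 times
        "{ Name: o.Name, Profile: { First: o.FirstName, Last: o.LastName } }", // run 1,000 times
    };

    // A flat data source, no nesting allowed.
    public static readonly Dictionary<string, object> Source =
        new Dictionary<string, object>
        {
            ["Name"] = "Oren Eini",
            ["FirstName"] = "Oren",
            ["LastName"] = "Eini",
        };
}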

Obviously the user can make the select projections as complex as they wish, but I think these three samples give us a good representation of the complexity. We’ll also treat them a bit differently with regard to their likelihood: when testing, the first projection is run 10,000 times, the second 5,000 times, and the last 1,000 times.

With the lay of the land settled, let us see how we can get this actually working. I’m going to use Jint and solve the issue in a very brute force manner.

Here is the initial implementation:
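(A sketch of the brute force approach, with hypothetical names rather than the actual code:)

using System.Collections.Generic;
using Jint;

static class NaiveProjection
{
    public static object Project(string projection, Dictionary<string, object> source)
    {
        // Brute force: build a fresh engine and parse the projection on every
        // single call. Correct, but wildly wasteful; this is what the
        // following posts optimize away.
        var engine = new Engine()
            .Execute("function project(o){ return " + projection + "; }");
        return engine.Invoke("project", source).ToObject();
    }
}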

And this runs in 4.4 seconds on the 10K, 5K, 1K run.

  • 10K in 2,756 ms
  • 5K in 1,310 ms
  • 1K in 314 ms

I’m pretty sure that we can do better, and we’ll look at that in the next post.
