Ayende @ Rahien


The case of the hanging request in the cache

time to read 2 min | 351 words

A user just reported that after a short time of using RavenDB, all requests started to hang and eventually timed out. We are talking about several dozen requests in the space of a minute or so, made by a user operating against a web application.

That was a cause for concern, obviously, and something that was very strange. Luckily, the user was able to send us a stripped-down version of his application and we were able to reproduce this state locally, but only on that particular application. We couldn’t reproduce it anywhere else.

In fact, we fixed the problem and still don’t know how to reproduce it. The underlying issue is that we are making a GET request using HttpClient and specifying HttpCompletionOption.ResponseHeadersRead in the SendAsync method. Now, the server might return us a 304 response, in which case we’ll use the cached value we have. If we make a few requests like that, all future requests will hang. In order to fix this problem we need to dispose the response, or read the (empty) content of the response.
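The snippet from the post isn’t included in this archive, but the pattern described above might look roughly like the following sketch (the URL, ETag and cache handling are placeholders, not the actual client code):

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    class CachedGetExample
    {
        static readonly HttpClient Client = new HttpClient();

        static async Task<string> GetWithCacheAsync(string url, string etag, string cachedValue)
        {
            var request = new HttpRequestMessage(HttpMethod.Get, url);
            request.Headers.TryAddWithoutValidation("If-None-Match", etag);

            // ResponseHeadersRead returns as soon as the headers arrive; the connection
            // is only released back to the pool once the response is disposed or fully read.
            var response = await Client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);

            if (response.StatusCode == HttpStatusCode.NotModified)
            {
                // Without this Dispose() (or reading the empty content), the connection
                // stays busy, and after a few such 304s every new request hangs.
                response.Dispose();
                return cachedValue;
            }

            using (response)
            {
                return await response.Content.ReadAsStringAsync();
            }
        }
    }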

Now, the crazy thing is that I can probably speculate pretty accurately on why this is going on, and we have fixed it, but I can’t reproduce the problem except within the application the user sent us. So there is no independent reproduction.

The likely issue is that HttpClient considers the request open until we read all of its content (even if there is none), and as we make more requests that get the same 304 response, and therefore never touch the content, we accumulate more and more requests that are considered open, up to the maximum number of concurrent requests allowed against the server. At that point, we are waiting for a timeout, and it looks like we are hanging.

All very reasonable and perfectly understandable (if annoying, because the fact that you only need to dispose the response sometimes is a trap waiting to happen), but I don’t know why we couldn’t reproduce it.

Find the security bug in this code

time to read 1 min | 80 words

This code tries very hard to ensure that the secret key provided to it is eradicated after it is properly saved.

This is because we are trying to reduce the attack surface that comes from keeping the encryption key in memory.

However, there are at least two different ways in which this code fails to do what it is trying to do. Can you find them?

For that matter, how much sense does it make to even attempt what it is doing?
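The code itself isn’t included in this archive; a minimal sketch of the general approach being described (hypothetical, not the original snippet) could look like this:

    using System;
    using System.IO;

    class KeySaver
    {
        public static void SaveAndEraseKey(byte[] secretKey, string path)
        {
            try
            {
                File.WriteAllBytes(path, secretKey);
            }
            finally
            {
                // Attempt to eradicate the key from memory once it has been saved.
                // Note that a managed byte[] can be moved or copied by the GC unless
                // it is pinned, and nothing here prevents other copies from existing,
                // which is part of why such attempts are hard to get right.
                Array.Clear(secretKey, 0, secretKey.Length);
            }
        }
    }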

Tripping on the TPL

time to read 2 min | 364 words

I thought that I had found a bug in the TPL, but it looks like it’s working (more or less) by design. Basically, when a task is completed, all awaiting tasks will be notified, which is pretty much what you would expect. What isn’t usually expected is that those tasks can interfere with one another. Consider the following code:

We have two tasks, which accept a parent task and do something with it. What do you think will happen when we run the following code?
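The original snippet isn’t included in this archive; a minimal sketch that should reproduce the behavior described below (the names and durations are illustrative) is:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main()
        {
            var tcs = new TaskCompletionSource<object>();
            Task parent = tcs.Task;

            // Child #1: awaits the parent and then blocks for a long time. By default,
            // its continuation runs inline on the thread that completes the parent.
            var blockingChild = BlockingChild(parent);

            // Child #2: waits on the parent with a timeout that is much longer than
            // the time it actually takes the parent to complete.
            var waitingChild = Task.Run(() =>
            {
                if (parent.Wait(TimeSpan.FromSeconds(5)) == false)
                    Console.WriteLine("Timeout!");
                else
                    Console.WriteLine("Parent observed in time");
            });

            Thread.Sleep(500);      // let both children register on the parent
            tcs.SetResult(null);    // complete the parent almost immediately

            waitingChild.Wait();
        }

        static async Task BlockingChild(Task parent)
        {
            await parent;
            Thread.Sleep(TimeSpan.FromSeconds(10)); // simulate heavy work in the continuation
        }
    }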

Unless you are very quick on the draw, running this code will result in a timeout message, but how? We know that we have a much shorter duration for the task than the timeout, so what is going on?

Well, effectively what is going on is that the parent task has a list of children that it will notify, and by default, it will do so synchronously and sequentially. If a child task blocks for whatever reason (for example, it might be processing a lot of work), the other children of the parent task will not be notified.

If there is a timeout set up, it will be triggered, even though the parent task has already completed. It took us a lot of time to figure out the repro for this issue, and we were certain that this was some sort of race condition in the TPL. I had a blog post talking all about it, but the Microsoft team was fast enough to literally answer my issue before I had time to complete my blog post. That is really impressive.

I should note that the suggestion, using RunContinuationsAsynchronously, works quite well when creating a new Task or using TaskCompletionSource, but there is no way to specify it when you are using Task.Run. What is worse for us is that since this is not the default (for perfectly good performance reasons, by the way), any code that we call into might trigger this. I would much rather be able to specify this when waiting on the task, rather than when creating it.
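For reference, this is roughly what the suggested option looks like when you control the task’s creation, for example with a TaskCompletionSource (a small sketch, not the code from the issue):

    using System.Threading.Tasks;

    // With this option, SetResult queues the continuations to the thread pool instead
    // of running them inline, so one blocking child can no longer delay the others.
    var tcs = new TaskCompletionSource<object>(
        TaskCreationOptions.RunContinuationsAsynchronously);
    tcs.SetResult(null);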

Overloading the Windows Kernel and locking a machine with RavenDB benchmarks

time to read 4 min | 781 words

While benchmarking RavenDB, we have run into several instances where the entire machine would freeze for a long duration, resulting in utter non-responsiveness.

This has been quite frustrating for us, since a frozen machine makes it kind of hard to figure out what is going on. But we finally figured it out, and all the details are right here in the screenshot.

(screenshot: disk activity on drives C and D during the benchmark)

What you can see is us running our current benchmark, importing the entire StackOverflow dataset into RavenDB. Drive C is the system drive, and drive D is the data drive that we are using to test RavenDB’s performance.

Drive D is actually a throwaway SSD. That is, an SSD that we use purely for benchmarking and not for real work. Given the amount of workout we give the drive, we expect it to die eventually, so we don’t want to trust it with any other data.

At any rate, you can see that due to a different issue entirely, we are now seeing data syncs in excess of 8.5 GB. So basically, we wrote 8.55 GB of data very quickly into a memory mapped file, and then called fsync. At the same time, we started increasing our scratch buffer usage, because calling fsync on 8.55 GB can take a while. Scratch buffers are a really interesting thing; they were born because of Linux’s crazy OOM design, and are basically a way for us to avoid paging. Instead of allocating memory on the heap like normal, which would then subject us to paging, we allocate a file on disk (marked as temporary & delete on close) and then we mmap the file. That gives us a way to guarantee that Linux will always have a place to page out any of our memory.

This also has the advantage of making it very clear how much scratch memory we are currently using, and on Azure / AWS machines, it is easier to place all of those scratch files on the fast temp local storage for better performance.
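A rough sketch of that approach in .NET terms (purely illustrative; the file name, size and API shape here are assumptions, not Voron’s actual implementation) might look like this:

    using System;
    using System.IO;
    using System.IO.MemoryMappedFiles;

    class ScratchBufferSketch
    {
        static void Main()
        {
            // A scratch file in the temp folder, deleted automatically when closed.
            var path = Path.Combine(Path.GetTempPath(), $"scratch-{Guid.NewGuid()}.buffers");
            var file = new FileStream(path, FileMode.CreateNew, FileAccess.ReadWrite,
                FileShare.None, bufferSize: 4096, FileOptions.DeleteOnClose);
            file.SetLength(256 * 1024 * 1024); // reserve 256 MB of scratch space

            // Map the file into memory and use the mapped region instead of heap
            // allocations, so the OS always has a backing file to page it out to.
            using (var mmf = MemoryMappedFile.CreateFromFile(file, null, 0,
                MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, leaveOpen: false))
            using (var accessor = mmf.CreateViewAccessor())
            {
                accessor.Write(0, 42L); // scratch data goes into the mapped region
            }
        }
    }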

So we have a very large fsync going on, and a large amount of memory mapped files, and a lot of activity (that modify some of those files) and memory pressure.

That forces the Kernel to evict some pages from memory to disk, to free something up. And under normal conditions, it would do just that. But here we run into a wrinkle: the memory we want to evict belongs to a memory mapped file, so the natural thing to do would be to write it back to its original file. This is actually what we expect the Kernel to do for us, and while for scratch files this is usually a waste, for the data file this is exactly the behavior that we want. But that is beside the point.

Look at the image above: we are supposed to be using only drive D, so why is C so busy? I’m not really sure, but I have a hypothesis.

Because we are currently running a very large fsync, I think that drive D is not currently processing any additional write requests. The “write a page to disk” operation has pretty strict runtime requirements; it can’t just wait for the I/O to return whenever that might be. Consider the fact that you can open a memory mapped file over a network drive: I think it is very reasonable that the Kernel will have a timeout mechanism for this kind of I/O. When the pager sees that it can’t write to the original file fast enough, it shrugs and writes those pages to the local page file instead.

This turns an otherwise very bad situation (a very long wait or a crash) into a manageable one. However, with the amount of work we put on the system, it effectively forces us to do heavy paging (on the order of GBs), and that in turn leads to a machine that appears to be locked up due to all the paging. So the fallback error handling is actually causing this issue by trying to recover, at least that is what I think.

When examining this, I wondered if this could be considered a DoS vulnerability, and after careful consideration, I don’t believe so. This issue involves using a lot of memory to cause enough paging to slow everything down; the fact that we are getting there in a somewhat novel way doesn’t expose us to anything that wasn’t there already.

Sometimes it really IS not our fault

time to read 1 min | 150 words

So we got an emergency support call during the Passover holiday, and as you can imagine, it was a strange one. Our investigation of the error basically boiled down to (cutting out a lot of effort in between): “This can’t be happening.”

I hate this kind of answer, because it usually means that we are missing something. Usually that can be a strange error code, some race condition or just something strange about the environment.

While we were working the problem, the customer came back with, “Oh, we found the issue. A memory unit went rogue, and the firmware wasn’t able to catch it.” When they updated the firmware, it apparently caught it immediately.

So I guess we can close this support incident. Smile

The bug in the platform: Partial HTTP requests & early response

time to read 3 min | 408 words

Some of our tests were randomly hanging on the Linux machine. This was a cause for concern, obviously, so we started investigating that.

The problem was that we couldn’t really reproduce this except by running our full code, in which case it would hang. It clearly seemed timing related, because it wouldn’t always hang, but it was pretty consistent.  We resorted to sprinkling printf throughout our code to figure out what was going on, and we found that a particular request would get to the server, the server would respond to it, and then the client would hang.

Looking at TCP traces, we could clearly see that the response was generated by the server and sent to the client, but something was funky there. We reported it and tried to figure out what was going on in the meantime.

We finally managed to get an isolated reproduction:

As a side note, the response from the Core FX & Kestrel guys is absolutely amazing. In a short while, it was determined to be an issue in Kestrel and fixed.

The details of the issue are really interesting.

  1. The client (libcurl) starts sending the request.
  2. Kestrel sends an error response.
  3. The client sees the error response and stops sending the request (this is a libcurl behavior), but does NOT close the sending side of the connection (HTTP clients typically wait until they receive the entire response before closing the connection, because most HTTP servers treat a mid-request FIN like a connection reset).
  4. After the application runs, Kestrel tries to consume the rest of the request body from the client. But the client has stopped sending the request, so Kestrel hangs here.
  5. The client hangs waiting for the chunked terminator from the server, which is only sent after it consumes the rest of the request body (we'll work on improving this behavior).

This behavior is specific to libcurl; the WinHTTP implementation sends the full body of the request before processing any of the response.

As an aside, this difference in behavior is actually pretty interesting to know about, because we need those low level details when designing interactions. For example, if I send a long request to the server, and the server wants to tell me that it is not going to accept it, being able to know about that early is very advantageous.
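One standard HTTP mechanism for exactly that is the Expect: 100-continue handshake, where the client asks the server to confirm before sending a large body. The post doesn’t use it, but with HttpClient it looks roughly like this (the URL and payload are made up):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class ExpectContinueExample
    {
        static async Task Main()
        {
            using (var client = new HttpClient())
            {
                var request = new HttpRequestMessage(HttpMethod.Put, "http://localhost:8080/docs/huge")
                {
                    Content = new ByteArrayContent(new byte[64 * 1024 * 1024]) // a large body
                };

                // Ask the server to respond with "100 Continue" before the body is sent,
                // so a rejection (e.g. 401 or 413) can be seen without uploading everything.
                request.Headers.ExpectContinue = true;

                using (var response = await client.SendAsync(request))
                {
                    Console.WriteLine(response.StatusCode);
                }
            }
        }
    }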


The memory leak in the network partition

time to read 2 min | 354 words

RavenDB is meant to be a service that just runs and runs, for very long periods of time and under pretty much all scenarios. That means that as part of our testing, we put a lot of emphasis on its behavior: the amount of CPU used, memory utilization, etc. And we do that in all sorts of scenarios, because getting the steady state working doesn’t help if you have an issue and then that issue kills you. So we put the system into a lot of weird states to see not only how it behaves, but what the second order effects would be.

One such configuration was a very slow network with a very short timeout setting, so effectively we’ll always be getting timeouts and need to respond accordingly. We had a piece of code that waits for something to happen (an internal event, a read from the network, or a timeout) and then does something accordingly. This is implemented as follows:
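The code from the post isn’t included in this archive; a hypothetical, stripped-down version of the pattern being described might look like this:

    using System;
    using System.Threading.Tasks;

    class MemoryLeakRepro
    {
        static void Main()
        {
            // A long-lived task standing in for an internal event (e.g. shutdown)
            // that, in this scenario, never completes.
            Task internalEvent = new TaskCompletionSource<object>().Task;

            while (true)
            {
                // Stand-in for a read from the (very slow) network.
                Task read = Task.Delay(TimeSpan.FromMilliseconds(5));

                // Each WaitAny call registers a new completion action on internalEvent.
                // Since internalEvent never completes, those registrations accumulate.
                Task.WaitAny(new[] { internalEvent, read }, TimeSpan.FromMilliseconds(1));
            }
        }
    }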

This is obviously extremely simplified, but it will reproduce the issue. If you run this code, it will start using more and more memory. But why? On the face of it, this looks like perfectly reasonable code.

What is actually happening is that WaitAny will call CommonCWAnyLogic, which will call AddCompletionAction on that task, which will track it, so we have a growing list of items there. If we have a lot of waits on the same task, that is going to cause us to track all of those waits.

Here is what it looks like after a short while in the debugger.

(debugger screenshot: the task’s list of registered completion actions)

And there is our memory leak.

The solution, by the way, was not to call WaitAny each time, but to call WhenAny, then call Wait() on the resulting task, and keep that task around until it is completed, so that we only register on the original event once.

The occasionally failing test

time to read 2 min | 211 words

This piece of code is part of a test that runs a scenario, and checks that the appropriate errors are logged. Very occasionally, this test would fail, and it would be nearly impossible to figure out why.

I’ve extracted the offending code from the test; ReadFromWebSocket returns a Task<string>, since that isn’t obvious from the code. Do you see the error?

(screenshot of the offending test code)

Think about it, what is this doing?

This is reading from a socket, and there is absolutely no guarantee about the amount of data that will go over the wire at any particular point. Because this test assumes that the entire string will be read in a single call from the socket, if the expected value we are looking for is actually returned across two calls, we’ll miss it, and this will never return, leading to sadness and worry everywhere*.

* At least everywhere that cares about our tests.

The fix was to remember the previous values and compare against all the data read from the socket, not just the values that were returned by the last call.
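A small sketch of what that fix could look like (readFromWebSocket here stands in for the post’s ReadFromWebSocket helper; the shape of the method is an assumption):

    using System;
    using System.Text;
    using System.Threading.Tasks;

    static class SocketTestHelper
    {
        public static async Task WaitForValue(Func<Task<string>> readFromWebSocket, string expected)
        {
            var accumulated = new StringBuilder();
            while (true)
            {
                // Each read may return only a fragment of the message, so check
                // everything read so far, not just the value from the last call.
                accumulated.Append(await readFromWebSocket());
                if (accumulated.ToString().Contains(expected))
                    return;
            }
        }
    }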

Feature intersection bugs are the hardest to predict

time to read 4 min | 720 words

This post is the story of RavenDB-6230, or as it is more commonly known: “Creating auto-index on non-existent field breaks querying via Id”. It isn’t a big or important bug, and it has very little real world impact. But it is an interesting story, because it shows one of the hardest things that we deal with: not an issue with a specific feature, but the behavior of the system as a whole, especially when we have multiple things that may affect the end result.

Let us see what the bug actually is. We have the following query:

(screenshot: a query by document id)

And this works and gets you the right results. Under the cover, it will select the appropriate index (and create one if it isn’t there) to query and get the right results.

Now, issue the following query on a non-existent field:

(screenshot: a query on the non-existent Nice_Doggy_Nice field)

And issue the first query again. You’ll get no results.

That is a bit surprising, I’ll admit, but it makes absolute sense when you break it up into its component parts.

First, a query by id can be answered by any index, but if you don’t have an index, one will be created for you. In this case, it will be on the “__document_id” field, since we explicitly queried on that. This isn’t typically done explicitly, which is important for understanding the bug.

Then, we have another query, on another field, so we generate an index on that as well. We do that while taking into account the historical behavior of the queries on the server. However, we ignore the “__document_id” field, because all indexes already contain it, so it is superfluous. That means we have an index with one explicit field (Nice_Doggy_Nice, in this case) and an implicit one (__document_id). This works great; even though one field is implicit and the other explicit, there is no actual difference in how we treat them.

So what is the problem? The problem is that the field Nice_Doggy_Nice doesn’t actually exist. So when the time comes to actually index documents using this index, we read the document and find that we have nothing to index. At this point, we have only a single field to index, just the document id, but as it is an implicit field, and we have nothing else, we skip indexing that document entirely. The example I used in the office is that you can’t get an answer to “when was the last time you gave birth” if you ask a male (except Schwarzenegger, in which case the answer is 1994).

So far, it all makes sense. But we need to introduce another feature into the mix. The RavenDB query optimizer.

That component is responsible for routing dynamic queries to the most relevant index, and it does that with the idea that we should direct work to the biggest index around, because that would make it the most active, at which point we can retire the smaller indexes (which are superfluous once the new, wider index is up to date).

All features are working as intended so far. The problem is that the query optimizer indeed selects the wider index, and it is an index that has filtered out all the results, so the query by id returns nothing.

Everything works as designed, and yet the user is surprised.

In practice, there are several mitigating factors here. The only way you can get this issue is if you never made any queries on any other (valid) fields of the documents in question. If you have made even a single such query, you won’t be able to reproduce it. So you have to work really hard to get it to fail. But the point isn’t so much the actual bug as pointing out how multiple unrelated behaviors can combine to cause a bit of a problem.

Highly recommended reading: How Complex Systems Fail – Cook 2000

When the code says you are stupid, but you are too stupid to know that

time to read 1 min | 102 words

We recently made some big changes in how we handle writing to the Voron journal. As part of that, we introduced a subtle bug. It would only happen on specific data, and only if you were unlucky enough to hit it at just the right time.

It took a lot of effort to track that down, but here is the offending line:

(screenshot of the offending line)

Sometimes, it just isn’t plain that the code is sniggering to itself and thinking “stupid”.
