Ayende @ Rahien


The memory leak in the network partition

time to read 2 min | 354 words

RavenDB is meant to be a service that just runs and runs, for very long periods of time and under pretty much all scenarios. That means that as part of our testing, we put a lot of emphasis on its behavior: the amount of CPU used, memory utilization, etc. And we do that in all sorts of scenarios, because getting the steady state working doesn’t help if you have an issue, and then that issue kills you. So we put the system into a lot of weird states to see not only how it behaves, but what the second order effects of that would be.

One such configuration was a very slow network with a very short timeout setting, so effectively we’ll always be getting timeouts and need to respond accordingly. We had a piece of code that waits for something to happen (an internal event, a read from the network, or a timeout) and then acts accordingly. This is implemented as follows:
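
Roughly, the pattern looks like this (a minimal sketch with hypothetical names: a long-lived task standing in for the internal event, waited on repeatedly together with a short timeout):

using System;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // A task that only completes when some rare internal event fires.
        var internalEvent = new TaskCompletionSource<object>();

        while (true)
        {
            // Every call to WaitAny registers a new continuation on
            // internalEvent.Task. Since that task never completes here,
            // the registrations pile up and memory keeps growing.
            Task.WaitAny(internalEvent.Task, Task.Delay(100));
        }
    }
}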

This is obviously extremely simplified, but it will reproduce the issue. If you run this code, it will start using more and more memory. But why? On the face of it, this looks like perfectly reasonable code.

What is actually happening is that WaitAny will call CommonCWAnyLogic, which will call AddCompletionAction on each task, which tracks it, so we have a list of items there. So if we have a lot of waits on the same task, that is going to cause us to track all of those waits.

Here is what it looks like after a short while in the debugger.

[image: debugger view of the task’s continuation list after a short while]

And there is our memory leak.

The solution, by the way, was to not call WaitAny each time, but to call WhenAny, call Wait() on the resulting task, and keep that task around until it is completed, so we only register on the original event once.
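
A sketch of that approach, under the same assumptions as the snippet above:

using System;
using System.Threading.Tasks;

class EventWaiter
{
    private readonly Task _internalEvent;
    private Task _pendingWait;

    public EventWaiter(Task internalEvent)
    {
        _internalEvent = internalEvent;
    }

    public void WaitForEventOrTimeout(TimeSpan timeout)
    {
        // Only register on the internal event when there is no wait in flight,
        // so the continuation list on that task stays bounded.
        if (_pendingWait == null || _pendingWait.IsCompleted)
            _pendingWait = Task.WhenAny(_internalEvent, Task.Delay(timeout));

        _pendingWait.Wait();
    }
}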

The occasionally failing test

time to read 2 min | 211 words

This piece of code is part of a test that runs a scenario, and checks that the appropriate errors are logged. Very occasionally, this test would fail, and it would be nearly impossible to figure out why.

I’ve extracted the offending code from the test. ReadFromWebSocket returns a Task&lt;string&gt;, since that isn’t obvious from the code. Do you see the error?

[image: the offending test code]
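
Roughly, the offending code has this shape (a hedged reconstruction; ReadFromWebSocket is the helper mentioned above, the rest of the names are hypothetical):

// Wait until the expected error text shows up in what we read from the socket.
private async Task WaitForErrorAsync(string expected)
{
    while (true)
    {
        // Each read returns whatever happened to arrive on the wire,
        // which may contain only part of the expected string.
        string msg = await ReadFromWebSocket();
        if (msg.Contains(expected))
            return; // never reached if the value is split across two reads
    }
}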

Think about it, what is this doing?

This is reading from a socket, and there is absolutely no guarantee about the amount of data that will go over the wire at any particular point. Because this test assumes that the entire string will be read in a single call from the socket, if the expected value we are looking for is actually returned across two calls, we’ll miss it, and this will never return, leading to sadness and worry everywhere*.

* At least everywhere that cares about our tests.

The fix was to remember the previous values and compare against all the data read from the socket, not just what was returned in the last call.
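
A sketch of that fix, under the same assumptions:

// Keep everything read so far and search the whole buffer, so a value
// split across reads is still found.
private async Task WaitForErrorAsync(string expected)
{
    var allData = new System.Text.StringBuilder();
    while (true)
    {
        allData.Append(await ReadFromWebSocket());
        if (allData.ToString().Contains(expected))
            return;
    }
}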

Feature intersection bugs are the hardest to predict

time to read 4 min | 720 words

This post is the story of RavenDB-6230, or as it is more commonly known: “Creating auto-index on non-existent field breaks querying via Id”. It isn’t a big or important bug, and it has very little real world impact. But it is an interesting story, because it shows one of the hardest things that we deal with: not an issue with a specific feature, but the behavior of the system as a whole, especially when we have multiple things that may affect the end result.

Let us see what the bug actually is. We have the following query:

[image: the query by document id]

And this works and gets you the right results. Under the covers, it will select the appropriate index (and create one if it isn’t there) to query and get the right results.

Now, issue the following query on a non-existent field:

[image: the query on the non-existent field]

And issue the first query again. You’ll get no results.
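
In client code, the sequence looks roughly like this (a hedged sketch with a hypothetical Doggy class and document id, and store being an initialized document store):

using (var session = store.OpenSession())
{
    // 1. Dynamic query on the document id. Works, and may create an
    //    auto-index on the __document_id field.
    var first = session.Query<Doggy>()
        .Where(d => d.Id == "doggies/1")
        .ToList();

    // 2. Dynamic query on a field that no document actually has. This
    //    creates a wider auto-index (Nice_Doggy_Nice plus the implicit id).
    var second = session.Query<Doggy>()
        .Where(d => d.Nice_Doggy_Nice == "woof")
        .ToList();

    // 3. The same id query again. The optimizer now routes it to the wider
    //    index, which indexed nothing, so this returns no results.
    var third = session.Query<Doggy>()
        .Where(d => d.Id == "doggies/1")
        .ToList();
}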

That is a bit surprising, I’ll admit, but it makes absolute sense when you break it up into its component parts.

First, a query by id can be answered by any index, but if you don’t have an index, one will be created for you. In this case, it will be on the “__document_id” field, since we explicitly queried on that. This isn’t typically done explicitly, which is important for understanding the bug.

Then we have another query, on another field, so we generate an index on that as well. We do that while taking into account the historical behavior of the queries on the server. However, we ignore the “__document_id” field, because all indexes already contain it, so it is superfluous. That means that we have an index with one explicit field (Nice_Doggy_Nice, in this case) and an implicit one (__document_id). This works great; even though one is implicit and the other explicit, there is no actual difference in how we treat them.

So what is the problem? The problem is that the field Nice_Doggy_Nice doesn’t actually exist. So when it comes time to actually index documents using this index, we read the document and find that we have nothing to index. At this point, we have only a single field to index, just the document id, but since it is an implicit field and we have nothing else, we skip indexing that document entirely.  The example I used in the office is that you can’t get an answer to “when was the last time you gave birth” if you ask a male (except Schwarzenegger, in which case the answer is 1994).

So far, it all makes sense. But we need to introduce another feature into the mix. The RavenDB query optimizer.

That component is responsible for routing dynamic queries to the most relevant index, and it does that with the idea that we should direct work to the biggest index around, because that would make it the most active, at which point we can retire the smaller indexes (which are superfluous once the new, wider index is up to date).

All features are working as intended so far. The problem is that the query optimizer indeed selects the wider index, and that index has filtered out all the documents, so the query by id returns nothing.

Everything works as designed, and yet the user is surprised.

In practice, there are several mitigating factors here. The only way you can get this issue is if you have never made any queries on any other (valid) fields of the documents in question. If you have made even a single such query, you won’t be able to reproduce it. So you have to really work hard to get it to fail. But the point isn’t so much the actual bug as pointing out how multiple unrelated behaviors can combine to cause a bit of a problem.

Highly recommended reading: How Complex Systems Fail – Cook 2000

When the code says you are stupid, but you are too stupid to know that

time to read 1 min | 102 words

We recently made some big changes in how we handle writing to the Voron journal. As part of that, we introduced a subtle bug. It would only happen on specific data, and only if you were unlucky enough to hit it at the right time.

It took a lot of effort to track that down, but here is the offending line:

[image: the offending line of code]

Sometimes, it just isn’t plain that the code is sniggering to itself and thinking “stupid”.

The crash at the Unicode text

time to read 4 min | 629 words

So a customer of ours has been putting RavenDB 4.0 through its paces. They have a large database with some pretty nasty indexes, and they reported that when importing their existing data, they managed to crash RavenDB. The problem was probably some interaction of indexing and documents, but it was pretty consistent.

That was lucky, because I was able to narrow down the problem to the following document.

[image: the offending document]

Literally just a week before getting this bug report we had found a bug with multi-byte Unicode characters (we were counting them in chars, but using that count to index into a byte array), so that was our very first suspicion. In fact, reviewing the relevant portions of the code, we were able to identify another location with a similar problem, and resolve it.
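
The char/byte mismatch is easy to demonstrate (an illustrative sketch, not the actual RavenDB code):

using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Hebrew text: 4 chars in .NET, but 8 bytes once encoded as UTF-8.
        string title = "שלום";
        byte[] utf8 = Encoding.UTF8.GetBytes(title);

        Console.WriteLine($"chars: {title.Length}, utf8 bytes: {utf8.Length}"); // chars: 4, utf8 bytes: 8

        // Buggy assumption: using the char count as a length into the byte
        // array only covers half of the encoded text here, so the slice is
        // truncated (or, going the other way, runs past the intended range).
        var wrongSlice = new byte[title.Length];
        Array.Copy(utf8, wrongSlice, title.Length); // copies only half the bytes
    }
}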

That didn’t solve the problem, however. I kept banging my head against the Unicode issue, until I finally just gave up and went to tracing the code very carefully. And then I found it.

The underlying issue was that the data here is part of an array inside a document, and we had an index that looked like this:

[image: the index definition]

The idea is that we gather all the titles from the items in the action and allow searching on those. And somehow, the title from the second entry above was always pointing to invalid memory.
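
For illustration, such an index definition typically looks something like this with the RavenDB client API (the collection and property names here are hypothetical, not the customer’s actual index):

using System.Linq;
using Raven.Client.Documents.Indexes; // namespace differs in older client versions

public class Actions_ByItemTitle : AbstractIndexCreationTask<ActionDoc>
{
    public Actions_ByItemTitle()
    {
        // Gather every item's title from the document's Items array
        // so the titles can be searched on.
        Map = actions => from action in actions
                         from item in action.Items
                         select new
                         {
                             item.Title
                         };
    }
}

public class ActionDoc
{
    public ActionItem[] Items { get; set; }
}

public class ActionItem
{
    public string Title { get; set; }
}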

It took a bunch more investigation before I found out the root cause.

Here is where we found the problem:

[image: the code where we found the problem]

But it is going to take a bit of explaining first. The second title is pretty long: in UTF-8 it is 176 bytes, which is long enough for the blittable format to try to compress it, and it was indeed successfully compressed to a smaller size.

So far, so good, and a pretty awesome feature that we have. But the problem was that when we index a document, we need to access its properties, and if a property is compressed, we need to decompress it, and that means we need to decompress it somewhere. At the time, using the native temp buffer seemed perfect; after all, it was exactly the kind of thing it was meant for.

But in this case, a bit further down the document we had another compressed field, one that was even larger, so we again asked for a temp buffer. And since this is a temp buffer, and the expectation is that the usage will be short lived, we returned the old buffer and allocated a new one. However, we still maintained a reference to the old buffer, and when we tried to index that title, it spat out garbage (because the memory was reused) or just crashed, depending on how strict we had set the memory protection mechanism.

It was actually something that we had already figured out and fixed, but we hadn’t yet merged that branch, so it was quite annoying when we found the root cause.

We’ll also be removing the temp buffer concept entirely; we’ll have a buffer that you can check out and return, not something that can vanish behind your back.

But I have to say, that bug was certainly misleading in the extreme. I was sure it had to do with Unicode handling.

Tricks of working with native memory

time to read 2 min | 362 words

I want to start by saying that this isn’t my idea. I read about it a few times, and I recently encountered it with sodium_malloc, so I decided to write my own implementation of the world’s most expensive memory allocator.

What this code does is pretty simple, and quite brutal. It allocates memory in such a fashion that it absolutely guarantees that you can’t get away with a whole host of memory problems.

For example, if you try to overwrite a buffer allocated by this method, you’ll immediately hit the guard page and die horribly (and predictably, in the place where the error actually happened, not a long way off). If you somehow write before the buffer, that will be detected on free if it is a small underwrite (which tends to be much rarer, by the way), or immediately if it is a big one.

What is more, once the memory is freed, it is poisoned and can never be used again. This pretty much relies on us running on 64 bits with effectively unlimited virtual memory, and has the nasty side effect of turning a 50-byte allocation into something requiring 12 KB. Having said that, as a debugging tool, this is invaluable.
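
Here is a minimal sketch of the idea (not the actual implementation), assuming Windows, a 64-bit process, and 4 KB pages; catching small underwrites on free would need an extra check, such as a canary before the buffer, which the sketch leaves out:

using System;
using System.Runtime.InteropServices;

public static unsafe class ElectricFencedMemory
{
    private const int PageSize = 4096;
    private const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000;
    private const uint PAGE_READWRITE = 0x04, PAGE_NOACCESS = 0x01;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, uint flAllocationType, uint flProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool VirtualProtect(IntPtr lpAddress, UIntPtr dwSize, uint flNewProtect, out uint oldProtect);

    public static byte* Allocate(int size)
    {
        // Round up to whole pages and add one extra page to act as a guard.
        int pages = (size + PageSize - 1) / PageSize;
        var total = (UIntPtr)(uint)((pages + 1) * PageSize);

        byte* start = (byte*)VirtualAlloc(IntPtr.Zero, total, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE).ToPointer();
        if (start == null)
            throw new OutOfMemoryException("VirtualAlloc failed");

        // The page right after the buffer is no-access: writing past the end
        // faults immediately, exactly where the error happens.
        VirtualProtect((IntPtr)(start + pages * PageSize), (UIntPtr)(uint)PageSize, PAGE_NOACCESS, out _);

        // Place the buffer so that it ends exactly at the guard page.
        return start + pages * PageSize - size;
    }

    public static void Free(byte* buffer, int size)
    {
        int pages = (size + PageSize - 1) / PageSize;
        byte* start = buffer + size - pages * PageSize;

        // Poison the whole range: any later use of this memory faults. The
        // address range is never reused, which is why this is only reasonable
        // on 64 bits, with a huge virtual address space to burn.
        VirtualProtect((IntPtr)start, (UIntPtr)(uint)((pages + 1) * PageSize), PAGE_NOACCESS, out _);
    }
}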

And yes, I’m aware that Windows already has that with the heap verifier. But as I’m using this in .NET code, I needed to write my own (this also pretty much works the same way on Linux, you just need to switch the API, but the functionality is the same).

This was built because we were chasing a memory corruption error, and when I ran this, it pointed me to a totally different location than suspected. So it is either doing a very good job, or it found me another issue.

… investigating …

Or, as the case may be, we found a bug in the actual memory guard (we didn’t handle allocations of exactly a page size correctly, and they broke), but at least it broke consistently and was pretty easy to find once I looked in the right place :-)

The edge case is in the timing

time to read 1 min | 141 words

In a previous post, I showed how to get 10x performance boost by batching remote calls. The code included the following method:
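
Roughly, the shape of that method is this (a hedged reconstruction with hypothetical names and constants: a blocking queue of pending items drained into a batch before the remote call):

// _queue is a BlockingCollection<Item> that callers add pending requests to.
private List<Item> GatherBatch()
{
    var batch = new List<Item>();
    // Keep gathering until the batch is full, waiting up to 200 ms for each
    // additional item. With a new request every 100 ms this never gives up
    // early, so the first request sits here for 256 * ~100 ms, about 25 seconds.
    while (batch.Count < 256 && _queue.TryTake(out var item, 200))
    {
        batch.Add(item);
    }
    return batch; // only now does anything go over the wire
}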

This code has a subtle bug in it. Take a look and see if you can find it.

Yes, I’ll wait, honestly.

Go read the code and think about interesting ways to break it.

Okay, found it? Awesome, now let me explain anyway :-)

This code suffers from the slow arrival syndrome. If we have a new request coming in every 100 ms, then the first request here will languish in the queue for over 25 seconds!

Luckily, the fix is simple:
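
A sketch of the fix, under the same assumptions:

private List<Item> GatherBatch()
{
    var batch = new List<Item>();
    var timer = System.Diagnostics.Stopwatch.StartNew();
    // Same gathering loop, but also bounded by total elapsed time, so the
    // batch goes out after less than half a second no matter how slowly
    // items trickle in (300 ms budget plus at most one 100 ms wait).
    while (batch.Count < 256 &&
           timer.ElapsedMilliseconds < 300 &&
           _queue.TryTake(out var item, 100))
    {
        batch.Add(item);
    }
    return batch;
}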

We just need to make sure that we aren’t waiting for too long. In this case, we will wait less than half a second, at most, for the messages to go over the wire.

A different sort of cross platform bug

time to read 2 min | 334 words

During the move of RavenDB to Linux we had to deal with the usual share of cross platform bugs. The common ones are things like file system case sensitivity, concatenating paths with \ separators, drive letters, etc. The usual stuff.

Note that I’m not counting the hairy stuff (different threading model, different I/O semantics, etc.); we had already dealt with that before.

The real problems had to do with the file systems. In particular, ext3 doesn’t support fallocate, which we kind of want to use to allocate our disk space in advance (which gives the file system information so it can allocate all of that together). But that was pretty straightforward. ENOTSUP is clear enough for me, and we could just use ext4, which works.

But this bug was special. To start with, it snuck in and lay there dormant for quite a while, and it was a true puzzle. Put simply, a certain database would work just fine as long as we didn’t restart it. On restart, some of its indexes would be in an error state.

It took a while to figure out what was going on. At first we thought that there was some I/O error, but the issue turned out to be much simpler. We assumed that we would be reading indexes in order; in fact, our code named the index directories in such a way that they would sort lexically. That worked quite well, and had the nice benefit that on NTFS we would always read the indexes in order.

But on Linux, there is no guaranteed order, and on one particular machine we indeed hit the out of order issue. In that case, an error was raised because we were trying to create an index whose id was too low. Perfectly understandable once you realize what is going on, and trivial to fix, but quite hard to figure out, because we kept assuming that we had done something horrible to the I/O system.
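
The underlying issue is easy to demonstrate (an illustrative sketch, not the actual fix):

using System;
using System.IO;
using System.Linq;

static class IndexDirectories
{
    // Directory.GetDirectories makes no promise about ordering. NTFS happens
    // to hand entries back lexically sorted; ext4 and friends may not.
    // Sorting explicitly makes the index load order deterministic everywhere.
    public static string[] GetInLoadOrder(string indexesPath)
    {
        return Directory.GetDirectories(indexesPath)
            .OrderBy(dir => dir, StringComparer.Ordinal)
            .ToArray();
    }
}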

The Red Alert Sleeper Agent Bug

time to read 6 min | 1170 words

Today started out like most recent days: I was working on improving performance and running benchmarks. I made a small change in how we handle file allocation and mapping inside Voron. This is the kind of change that should have no observable effects. And indeed, except for making us run faster, everything worked.

Except that later today I merged some stuff from a colleague and suddenly I started getting invalid memory accesses. After quickly blaming my colleague for the issue, we eventually figured out that it was my change that caused it.

Unfortunately, the problem was a lot more serious than it immediately appeared. It wasn’t just that I needed to fix my code; what had happened was that I made a certain situation (a new file mapping, and thus exercising the cleanup routine) a lot more frequent. Which is all well and good, except that this is something that will happen routinely in Voron anyway; my change just made it much more likely.

And the problem is that we couldn’t for the life of us figure out why it was failing. Oh, we quickly figured out that we are accessing memory that has been unmapped, but how? The Voron codebase is really careful about such things, and we have quite enough production usage to know that this doesn’t really happen. But again, it might just be a sleeper…

The real problem wasn’t actually the access violation; that was pretty obvious and would have come to our immediate attention. The problem was that the error looked like the Page Translation Table had a race condition. In this case, because we were much more eager about cleanup, it was obvious that we were accessing old information; but without this to trigger our attention, the fear was that we were actually racy, and that the Page Translation Table would serve incorrect information.

That means that Voron would violate its consistency rule; we’d effectively be returning random garbage to the user and… Bad Things to Follow.

At various points during the day, we had five different people working on it, on three continents, because it is that kind of bug. And we couldn’t figure it out. We traced the code that did this every which way. It is old code, which has been worked on repeatedly, and it has been stable for years. And none of us could figure out what was going on. Theories ranging from cosmic rays to the wrath of Murphy were thrown out.

Something was very rotten in Voron. Okay, after all of this exposition, let me explain what was going on. We started out with a Page Table that looked like this:

[image: the initial Page Table]

So the first number is the page number, and the second is the page number inside the scratch file (#1 or #2, above).

Basically, this means that when Transaction #3 asks for Page #0, it will actually get Page #238 from scratch #1. And when Transaction #4 asks for the same page, it will get it from Page #482 on scratch #2. If you got lost with the numbers, don’t worry, we did too.

The problem was… the failing issue was in transaction #5, and it was accessing page #412 on scratch #1. Due to my change, we had actually closed scratch #1. The problem was that we couldn’t figure out how this could actually happen. Crazy stuff. We tried reproducing this in all sorts of crazy ways, but it would only fail on the most trivial of tests, and very unpredictably. And then we finally figured it out. Basically by tracing everything, putting locks on everything that moved or looked like it would move, kicking things, and plain banging our heads against the wall, rinse & repeat.

Eventually we focused on what happened around the location of the error. It always happened during a query; that was very consistent, when it happened. And finally we figured it out. We now use Lucene indexes stored inside Voron, and Lucene has some funky ideas about how it should be able to access the data. So we have to put a Voron transaction around the whole thing, and we have to flow the same transaction across multiple Lucene index input instances. So we put the transaction inside a thread local variable. And the query method is async.

I think that you can figure out what happened from here, right? When the async machinery jumped us between threads, we would end up with a totally foreign transaction; our old transaction would be gone, and all of the carefully thought out premises that we had for the transaction scope went out the window. Much cursing was to be heard.

So we did a quick fix and changed the ThreadLocal&lt;Transaction&gt; to an AsyncLocal&lt;Transaction&gt;, so it would flow through the async calls. And then we ran the tests, confident it would solve the problem. But it didn’t; in fact, we got the exact same error, in the exact same place, and we went back to head butting the wall to see who is smarter.
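
A minimal illustration of that change (hypothetical holder class; Transaction stands in for Voron’s transaction type):

using System.Threading;

public class Transaction { /* stand-in for Voron's transaction type */ }

static class CurrentTransaction
{
    // Before: a ThreadLocal<T> value stays with the thread, so when the async
    // machinery resumes the query on another thread the transaction is "gone".
    // private static readonly ThreadLocal<Transaction> _current = new ThreadLocal<Transaction>();

    // After: an AsyncLocal<T> value flows with the async execution context,
    // so the same transaction is visible after an await, whatever the thread.
    private static readonly AsyncLocal<Transaction> _current = new AsyncLocal<Transaction>();

    public static Transaction Current
    {
        get { return _current.Value; }
        set { _current.Value = value; }
    }
}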

And then I realized that we were doing something else there. Lucene has the notion of cloning an input, which allows for multi-threaded usage of an input. When we do that, we check whether we are in the old transaction scope and can reuse previous work, or whether we need to do the initialization for a new transaction. The problem was that we were doing this check by id.

Now, two transactions with the same id will always show the same data, period. However, how they do it is very different. Let us take a look at the Page Table diagram above. It shows that Page #412 is located on scratch #2 in position #8327. Now, we have a flushing background process that will take the data from the scratch file and move it to the data file. So the new Page Table will look like this:

[image: the Page Table after the flush]

Note that because the data in the data file at position #412 and on scratch #2 at position #8327 is the same, that doesn’t actually matter. Except when you started in one transaction and began reading from scratch #2, then were bumped to another thread, and are now trying to keep on reading from the same place, only to end up blowing up entirely.

Once we have fixed this problem as well, all was well with the world. The sky wasn’t falling, and I was writing a blog post at midnight for relaxation.
