Ayende @ Rahien


Tricks of working with native memory

time to read 2 min | 362 words

I want to start by saying that this isn’t my idea, I read about it a few times, and I recently encountered it with sodium_malloc, so I decided to write my own implementation of the world’s most expensive memory allocator.

What this code does is pretty simple, and quite brutal. It allocates memory in a fashion that absolutely guarantees you can't get away with a whole host of memory problems.

For example, if you try to overwrite a buffer allocated by this method, you'll immediately hit the guard page and die horribly (and predictably, in the place where the error actually happened, not a long way off). If you somehow write before the buffer, that will be detected on free if it is a small under-write (which tends to be much rarer, by the way), or immediately if it is a big one.

What is more, once the memory is freed, it is poisoned, and can never be used again. This pretty much relies on us running on 64 bits with effectively unlimited virtual memory, and has the nasty side effect of turning a 50 byte allocation into something requiring 12 KB. Having said that, as a debugging tool, this is invaluable.
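
For reference, here is a minimal sketch of the idea, assuming Windows, 64 bits and 4 KB pages. The class name and the P/Invoke plumbing are illustrative, not the actual implementation, and the under-write canary check on free is omitted:

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class GuardedAllocator
{
    private const int PageSize = 4096;
    private const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000;
    private const uint PAGE_READWRITE = 0x04, PAGE_NOACCESS = 0x01;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern byte* VirtualAlloc(byte* addr, UIntPtr size, uint allocType, uint protect);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool VirtualProtect(byte* addr, UIntPtr size, uint newProtect, out uint oldProtect);

    public static byte* Allocate(int size)
    {
        // A guard page below, the data pages, and a guard page above.
        int dataPages = (size + PageSize - 1) / PageSize;
        int total = (dataPages + 2) * PageSize;

        byte* start = VirtualAlloc(null, (UIntPtr)total, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (start == null)
            throw new OutOfMemoryException();

        VirtualProtect(start, (UIntPtr)PageSize, PAGE_NOACCESS, out _);
        VirtualProtect(start + (dataPages + 1) * PageSize, (UIntPtr)PageSize, PAGE_NOACCESS, out _);

        // Place the buffer so it ends exactly at the upper guard page:
        // any overwrite faults immediately, at the point of the error.
        return start + PageSize + (dataPages * PageSize - size);
    }

    public static void Free(byte* buffer, int size)
    {
        int dataPages = (size + PageSize - 1) / PageSize;
        byte* start = buffer - (dataPages * PageSize - size) - PageSize;

        // Poison the whole range and never hand these addresses out again,
        // so any use-after-free also faults (this is the part that relies
        // on 64 bits and effectively unlimited virtual address space).
        VirtualProtect(start, (UIntPtr)((dataPages + 2) * PageSize), PAGE_NOACCESS, out _);
    }
}
```

This is where the overhead mentioned above comes from: a 50 byte allocation costs one data page plus two guard pages, 12 KB in total.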

And yes, I'm aware that Windows already has that with the heap verifier. But as I'm using this in .NET code, I needed to write my own (this also works pretty much the same way on Linux, you just need to switch the API, but the functionality is the same).

This was built because we were chasing a memory corruption error. I ran this, and it pointed me to a totally different location than suspected. So it is either doing a very good job, or it found me another issue.

… investigating …

Or, as the case may be, we found a bug in the actual memory guard (we didn't handle allocations of exact page size correctly, and they broke), but at least it broke consistently and was pretty easy to find once I looked in the right place.

The edge case is in the timing

time to read 1 min | 141 words

In a previous post, I showed how to get a 10x performance boost by batching remote calls. The code included the following method:

This code has a subtle bug in it. Take a look and see if you can find it.

Yes, I’ll wait, honestly.

Go read the code and think about interesting ways to break it.

Okay, found it? Awesome, now let me explain anyway.

This code suffers from the slow arrival syndrome. If we have a new request coming in every 100 ms, then the first request here will languish in the queue for over 25 seconds!
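
The original method isn't reproduced above, but the shape of the problem looks roughly like this. This is a sketch with assumed names (a BufferBlock<Message> as _queue, SendBatchAsync for the remote call), not the actual code:

```csharp
// Assumes: using System.Collections.Generic; using System.Threading.Tasks;
//          using System.Threading.Tasks.Dataflow;
private async Task FlushBatchesAsync()
{
    while (true)
    {
        var batch = new List<Message> { await _queue.ReceiveAsync() };

        // The bug: keep waiting until the batch is full. If a message
        // arrives only every 100 ms, filling 256 slots takes ~25.6 seconds,
        // and the first message sits in the queue for that entire time.
        while (batch.Count < 256)
            batch.Add(await _queue.ReceiveAsync());

        await SendBatchAsync(batch);
    }
}
```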

Luckily, the fix is simple:

We just need to make sure that we aren't waiting for too long. In this case, we will wait at most a bit under half a second for the messages to go over the wire.
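
Continuing the sketch above, one way to bound the wait is to cap the accumulation window with a cancellation token (again, an illustration of the idea rather than the actual fix):

```csharp
private async Task FlushBatchesAsync()
{
    while (true)
    {
        var batch = new List<Message> { await _queue.ReceiveAsync() };

        // Stop accumulating after 200 ms even if the batch isn't full,
        // so nothing waits more than a fraction of a second to be sent.
        using (var window = new CancellationTokenSource(TimeSpan.FromMilliseconds(200)))
        {
            try
            {
                while (batch.Count < 256)
                    batch.Add(await _queue.ReceiveAsync(window.Token));
            }
            catch (OperationCanceledException)
            {
                // Time is up; send whatever we have.
            }
        }

        await SendBatchAsync(batch);
    }
}
```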

A different sort of cross platform bug

time to read 2 min | 334 words

During the move of RavenDB to Linux we had to deal with the usual share of cross platform bugs. The common ones are things like file system case sensitivity, concatenating paths with \ separators, drive letters, etc. The usual stuff.

Note that I'm not counting the hairy stuff (different threading model, different I/O semantics, etc.); we had already dealt with that before.

The real problems had to do with the file systems. In particular, ext3 doesn't support fallocate, which we want to use to allocate our disk space in advance (that gives the file system enough information to allocate all of it together). But that was pretty straightforward. EOPNOTSUPP is clear enough for me, and we could just use ext4, which works.
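
A rough sketch of that pattern, assuming a raw P/Invoke into glibc (this is not the code we actually ship):

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

internal static class Preallocation
{
    // int fallocate(int fd, int mode, off_t offset, off_t len);
    [DllImport("libc", SetLastError = true)]
    private static extern int fallocate(int fd, int mode, long offset, long length);

    private const int EOPNOTSUPP = 95; // errno value on Linux

    public static void PreallocateIfSupported(int fd, long size)
    {
        if (fallocate(fd, 0, 0, size) == 0)
            return; // space reserved up front, the file system can lay it out together

        int errno = Marshal.GetLastWin32Error();
        if (errno == EOPNOTSUPP)
            return; // ext3 and friends: no fallocate, just grow the file as we write

        throw new IOException($"fallocate failed with errno {errno}");
    }
}
```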

But this bug was special. To start with, it snuck in and lay dormant for quite a while, and it was a true puzzle. Put simply, a certain database would work just fine as long as we didn't restart it. On restart, some of its indexes would be in an error state.

It took a while to figure out what was going on. At first, we thought that there was some I/O error, but the issue turned out to be much simpler. We assumed that we would be reading indexes in order; in fact, our code wrote the index directories out in such a way that they would be lexically sorted. That worked quite well, and had the nice benefit that on NTFS, we would always read the indexes in order.

But on Linux, there is no guaranteed order, and on one particular machine, we indeed hit the out of order issue. And in that case, we had an error raised because we were trying to create an index whose id was too low. Perfectly understandable, when you realize what is going on, and trivial to fix, but quite hard to figure out, because we kept assuming that we did something horrible to the I/O system.
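
A fix along these lines is to stop relying on whatever order the OS happens to return and sort explicitly (a sketch, with hypothetical names; assumes System.IO and System.Linq):

```csharp
// Directory.GetDirectories makes no ordering guarantee; NTFS just happens
// to hand entries back sorted, ext4 does not. Sort before opening.
var indexDirectories = Directory.GetDirectories(indexesPath)
    .OrderBy(path => path, StringComparer.Ordinal)
    .ToList();

foreach (var directory in indexDirectories)
    OpenIndex(directory); // index ids now arrive in ascending order on any OS
```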

The Red Alert Sleeper Agent Bug

time to read 6 min | 1170 words

Today I started out like most recent days, I was working on improving performance and running benchmarks. I made a small change in how we handle file allocation and mapping inside Voron. This is the kind of change that should have no observable effects. And indeed, except for making us run faster, everything worked.

Except that later today I merged some stuff from a colleague and suddenly I started getting invalid memory accesses. After quickly blaming my colleague for the issue, we eventually figured out that it was my change that caused it.

Unfortunately, the problem was a lot more serious than it immediately appeared. It wasn’t just that I needed to fix my code, what was happening there was that I made a certain situation (a new file mapping, and thus, exercising the cleanup routine) a lot more frequent. Which is all well and good, except that this is something that will happen routinely in Voron anyway, it just means that this is now much more likely.

And the problem is that we couldn’t for the life of us figure out why it was failing. Oh, we quickly figured out that we are accessing memory that has been unmapped, but how? The Voron codebase is really careful about such things, and we have quite enough production usage to know that this doesn’t really happen. But again, it might just be a sleeper…

The real problem wasn't actually with the access violation, that was pretty obvious and would have come to our immediate attention. The problem was that the error looked like the Page Translation Table had a race condition. In this case, because we are much more eager about cleanup, it was obvious that we were accessing old information, but without this to trigger our attention, the fear was that we were actually racy, and that the Page Translation Table would serve incorrect information.

That means that Voron would violate its consistency rule, we’ll effectively be returning random garbage to the user and… Bad Things to Follow.

At various points during the day, we had five different people working on it on three continents, because it is that kind of bug. And we couldn't figure it out. We traced the code that did this every which way. It is old code, that has been worked upon repeatedly, and it has been stable for years. And none of us could figure out what was going on. Theories ranging from cosmic rays to the wrath of Murphy were thrown around.

Something was very rotten in Voron. Okay, after all of this exposition, let me explain what was going on. We started out with a Page Table that looked like this:

[image: the Page Translation Table, mapping page numbers per transaction to positions in the scratch files]

So the first number is the page number, and the second is the page number inside the scratch file (#1 or #2, above).

Basically, this means that when Transaction #3 asks for Page #0, it will actually get Page #238 from scratch #1. And when Transaction #4 asks for the same page, it will get it from Page #482 on scratch #2. If you got lost with the numbers, don't worry, we did too.

The problem was… the failing issue was in Transaction #5, and the problem was that it was accessing Page #412 on scratch #1. And due to my change, we had actually closed scratch #1. The problem is that we couldn't figure out how this thing could actually happen. Crazy stuff. We tried reproducing this in all sorts of crazy ways, but it would only fail on the most trivial of tests, and very unpredictably. And then we finally figured it out. Basically by tracing everything, putting locks on everything that moved or looked like it might move, kicking it, banging my head against the wall, rinse & repeat.

Eventually we focused on what happened around the location of the error. It always happened during a query; that was very consistent, when it happened. And finally we figured it out. We now use Lucene indexes stored inside Voron, and Lucene has some funky ideas about how it should be able to access the data. So we have to put a Voron transaction around the whole thing. And we have to flow the same transaction across multiple Lucene index input instances. So we put the transaction inside a thread local variable. And the query method is async.

I think that you can figure out what happened from here, right? When the async machinery jumped us threads, we would end up with a totally foreign transaction, our old transaction would be gone, and all of the carefully thought out premises that we have for transaction scope went out the window. Much cursing was to be heard.

So we did a quick fix and changed the ThreadLocal<Transaction> to an AsyncLocal<Transaction>, so it would flow through the async calls. And then we ran the tests, confident it would solve it. But it didn't; in fact, we got the exact same error, in the exact same place, and we went back to head butting the wall to see who is smarter.
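
The mechanics of that first fix, in isolation, look something like this. This is a simplified illustration; Transaction here is just a stand-in type, and the snippet assumes System.Threading, System.Threading.Tasks and System.Diagnostics:

```csharp
private static readonly ThreadLocal<Transaction> CurrentTxThreadLocal = new ThreadLocal<Transaction>();
private static readonly AsyncLocal<Transaction> CurrentTxAsyncLocal = new AsyncLocal<Transaction>();

private static async Task QueryAsync()
{
    var tx = new Transaction();
    CurrentTxThreadLocal.Value = tx;
    CurrentTxAsyncLocal.Value = tx;

    await Task.Delay(10); // may resume on a different thread pool thread

    // AsyncLocal flows with the logical call context across the await.
    Debug.Assert(CurrentTxAsyncLocal.Value == tx);

    // ThreadLocal does not: after the await we may be on another thread,
    // seeing null or a completely foreign transaction in this slot.
    var maybeForeign = CurrentTxThreadLocal.Value;
}
```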

And then I realized that we were doing something else there. Lucene has the notion of cloning an input, which allows for multi threaded usage of an input. When we do that, we check if we are in the old transaction scope and can reuse previous work, or if we need to do the initialization for a new transaction. The problem was that we were doing this check by id.

Now, two transactions with the same id will always show the same data, period. However, how they do it is very different. Let us take a look at the Page Table diagram above. It shows that Page #412 is located on scratch #2 in position #8327. Now, we have a flushing background process that will take the data from the scratch file and move it to the data file. So the new Page Table will look like this:

[image: the updated Page Table, with Page #412 now pointing to the data file rather than the scratch file]

Note that because the data in the data file at position #412 and in scratch #2 at position #8327 is the same, that doesn't actually matter. Except when you have started in one transaction and begun reading from scratch #2, then were bumped to another thread, and are now trying to keep on reading from the same place, only to end up blowing up entirely.

Once we have fixed this problem as well, all was well with the world. The sky wasn’t falling, and I was writing a blog post at midnight for relaxation.

The bug in the budgeting process

time to read 5 min | 841 words

In my previous post, I mentioned that removing artificial batch limits has caused us to double our performance. But what are those artificial batch limits?

Well, anything that doesn't involve actual system resources. For example, limiting batch size by time or by document count is artificial. We used to have to do that as a correlation to the amount of managed memory we used, and because it allowed us to parallelize I/O and computation work. Now, each index is actually working on its own, so if one index is stalling because it needs to fetch data, other indexes will use the available cores, and everyone will be happy.

Effectively, an indexing batch stopped being a global database event that we had to fetch data for specifically and became something much smaller. That fact alone gave us leeway to remove drastic amounts of code to handle things like prefetching, I/O / memory / time / CPU balancing and a whole bunch of really crazy stuff that we had to do.

So all of that went away, and we learned that anything that would artificially reduce a batch size is bad, that we should make the batch size as big as possible to benefit from economy of scale effects.

But wait, what about non-artificial limits? For example, running an indexing batch takes some memory. We can now track it much better, and most of it is in unmanaged memory anyway, so we don't worry about keeping it around for a long time. We do worry about running out of it, though.

If we have six indexes all running at the same time, each will try to use as much of the system's resources as it possibly can. Of course, if we actually let them do that, they would allocate enough memory to push us into the page file, resulting in all our beautiful code spending its time just paging in and out from disk, and our performance looking like it was hit in the face repeatedly with the hard disk needle.

So we have a budget. In fact, we have a pretty complete heuristics system in place.

  • Start by giving each index 16 MB to run.
  • Whenever the index exceeds that budget, allow it to complete the current operation (typically a single document, so pretty small).
  • Check if there is enough memory available* that we can still use, and if so, increase the budget by another 16 MB.

* Enough memory available is actually a really complex idea, enough so that I’ll dedicate the next post to it.

So that leads us to all indexes competing with one another to get more memory, until we hit the predefined limit (which is supposed to allow us memory to do other work as well). At that point, we hit a real limit, and we stop the batch, complete our work and carry on. After the batch is completed, we could release all of that memory and start from scratch, but that would probably be a waste, we already know that we haven’t gone too badly over budget, so why release all that precious memory just to immediately require it again?
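
Put as code, the heuristic looks roughly like this. The 16 MB increment and the decision to keep the memory after the batch come from the description above; the method names (IndexDocument, AllocatedForThisIndex, EnoughMemoryAvailable, CompleteBatch) are illustrative assumptions, not the actual implementation:

```csharp
const long BudgetIncrement = 16 * 1024 * 1024;
long budget = BudgetIncrement;

foreach (var document in itemsToIndex)
{
    IndexDocument(document); // always finish the current (small) operation

    if (AllocatedForThisIndex() <= budget)
        continue;

    if (EnoughMemoryAvailable()) // the tricky part, discussed in the next post
    {
        budget += BudgetIncrement;   // grow the budget and keep going
    }
    else
    {
        CompleteBatch();             // hit the real limit: stop, finish up, carry on
        // the memory itself is deliberately kept around for the next batch
        break;
    }
}
```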

So that is what we did, and we ran our benchmarks again. And the performance was not nice to us.

It took a while to figure out what happened, but you can see this on the following graph.

[image: memory usage per index over time, climbing until it hits the configured ceiling]

We started allocating memory, and as you can see, we have some indexes that have high memory requirements. At some point, we hit the memory ceiling we specified, and started completing batches so we wouldn't use too much memory.

All well and good. Except that the act of completing the batch will also (sometimes) release memory. This is typically done because we have found the ideal sizes we need for processing, so we discard everything that is too small. But the allocator is free to release memory if it thinks that this is the best for the system.

Unfortunately, we didn't adjust the budget in this case. Consider the case of indexes C & F, both of which released a significant amount of memory after the batch was completed. Index B, which was forced to make do with whatever memory it managed to grab, suddenly finds itself in a position to grab more memory, and it will slowly increase its budget and allocations.

At the same time, indexes C & F are also going to allocate more memory; after all, they are well within their budget, since we didn't account for the released memory that was gobbled up by index B. The fact that this starts happening only about 45 minutes into the batch, and only actually shows up as higher memory utilization about 4 hours after that, is really quite annoying when you need to debug it.

How the debugger lied to my face with no shame at all

time to read 1 min | 155 words

Take a look at the following code:

[image: code snippet — a value is stored, and the debugger shows a completely different value when reading it back]

As you can see, something pretty freaky is going on here.

We put a value in there, and then when we try to get it out, we get a totally different value.

The real repro happened in a pretty complex piece of code, and because the old value was different from the value in the debugger, we were certain that we had something that was changing data behind our back. Only after we marked the memory as read only and still saw that behavior did we realize that this is actually a debugger eval bug. Instead of reading the Id property as a long, it is reading it as a byte, leading to this misleading value.

Only cost me most of my hair.

Production postmortem: The insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that on a memory constrained system, a certain operation was taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn't take long to figure out what was going on: we were reading too much, we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent that; it checks that the size we load from the disk isn't too big, and that we aren't doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:

[image: the sample JSON document]

The problem happens when we convert it to a .NET object:

[image: the memory taken by the deserialized .NET object]

Yep, when we de-serialize it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:

[image: the JSON in question]

This generates a string whose size is less than 1KB.

But when parsing it:

[image: the memory taken when parsing it]

The reason, by the way, is the structure of the document:

[image: the allocation breakdown]

So for each two bytes of object creation in JSON (the {}), we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.
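
You can see the effect in a crude way, in a controlled test, with something like the following (Json.NET assumed; the exact numbers will vary by runtime, bitness and document shape):

```csharp
using System;
using Newtonsoft.Json.Linq;

class Program
{
    static void Main()
    {
        var json = "{\"Name\":\"order/1\",\"Lines\":[{\"Product\":\"p/1\",\"Qty\":2},{\"Product\":\"p/2\",\"Qty\":1}]}";

        long before = GC.GetTotalMemory(forceFullCollection: true);
        var parsed = JObject.Parse(json);
        long after = GC.GetTotalMemory(forceFullCollection: true);

        // The parsed object graph is far larger than the text it came from.
        Console.WriteLine($"Text: {json.Length:N0} bytes, managed object graph: ~{after - before:N0} bytes");
        GC.KeepAlive(parsed);
    }
}
```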

We resolved it by watching not only the size of the documents that we are reading, but also the amount of free memory available on the machine (aborting if it gets too low), but that is a really awkward way of doing it. I'm pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending very little data to it.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have a one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map the file, there is not even an allocation for the data, we just access it directly).

I tell you, that thing is a bona fide ZEBRA, or a tale of being utterly stupid

time to read 3 min | 414 words

We run our test suite in a loop to discover any race conditions, timing issues, errors, etc. When doing so, we got a hard crash from dotnet.exe, and investigating the issue produced a stack trace inside the GC.

So I took a dump of the process memory and created an issue about it in the CoreCLR repository, while giving it a very high priority internally and having someone look at it very closely. We are using unsafe code extensively, so it was either a real GC bug or we had messed up somewhere and corrupted our own state.

Very quickly Jan Kotas was able to point out that it was a heap corruption issue as well as the likely avenues for investigation.

After looking at this, we found that the problem was in our tests. In particular, in one specific test. In order to test the memory corruption, we changed it to add markers on where it overwrote the buffer, and the test passed.

This caused us additional concern, because the only thing we could think about was that maybe there is some invariant that is being broken. Our suspicion focused on the fixed statement in C# not working properly. Yes, I know, “hoof beats, horses, not zebras”.

So I went to the issue again and reported my findings, and Andy Ayers was kind enough to find the problem and point it out to me.

Here is the relevant test code:

[image: the test code in the debugger, showing size = 40 and an input buffer of 100 bytes]

This is during debugging, so you can see what the problem is. We defined size to be 40, and we defined an input buffer, whose size is 100.

A little bit below, we created an output buffer based on the size variable (40), and then wrote to it with the expected size of input.Length, which is 100. Everything behaved as it should: we had a buffer overrun in the test, the heap was corrupted, and sometimes the GC died.
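
In spirit, the test did something like this (a reconstruction of the shape of the bug, not the actual test code; the fragment lives inside an unsafe method):

```csharp
var size = 40;
var input = new byte[100];
new Random(1337).NextBytes(input);

var output = new byte[size]; // 40 bytes, sized from `size`

fixed (byte* src = input)
fixed (byte* dst = output)
{
    // The bug: copies input.Length (100) bytes into a 40 byte buffer,
    // silently trashing the managed heap; the GC crashes much later.
    for (int i = 0; i < input.Length; i++)
        dst[i] = src[i];
}
```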

Also, I feel very stupid about spouting all sorts of nonsense about bugs in the CLR when our code is unable to do simple arithmetic.

The good news: the bug was only in the tests, and the kind of support that you get from Microsoft on the CoreCLR is absolutely phenomenal. Thank you very much, guys.

Non reproducible / intermittent error handling

time to read 2 min | 370 words

We recently had to deal with a stress test that was failing (very) occasionally. We looked into that, and we figured out that this was actually exposing a high severity bug. So we looked into it pretty seriously.

And we keep hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common issue is if you have a race. For example, if two threads are modifying the same variable without proper synchronization, that will cause those kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of behavior is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn't that.

Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:

Because this is a long running operation, under lock, we do it in stages, and make sure that other things can run while we do that. But this was exactly what introduced the variability in the test results, and that is what made it so random and annoying. Once we figured out that this was the cause of the difference, all we had to do was write a proper log of the operations and execute it again.
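
The shape of it was roughly this (hypothetical names, not the actual code):

```csharp
foreach (var stage in pendingStages)
{
    lock (_globalLock)
    {
        ProcessStage(stage); // the actual work, protected by the lock
    }

    // The lock is released between stages so other operations can make
    // progress. Exactly what sneaks in during this window depends on
    // timing, which is what made the failure show up only occasionally.
}
```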

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work.
