Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


How the debugger lied to my face with no shame at all

time to read 1 min | 155 words

Take a look at the following code:

[Image: the code in question, with the debugger showing an unexpected value for the Id property]

As you can see, something pretty freaky is going on here.

We put a value in, and when we try to get it out, we get a totally different value.

The real repro happened in a pretty complex piece of code, and because the old value was different from the value in the debugger, we were certain that we had something changing data behind our back. Only after we marked the memory as read-only and still saw that behavior did we realize that this was actually a debugger eval bug. Instead of reading the Id property as a long, it reads it as a byte, leading to this misleading value.
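To make the failure mode concrete, here is a hypothetical illustration (the values are mine, not from the original code): read only the low byte of a long and you get something that looks nothing like what was stored.

// Hypothetical illustration, not the original code: storing a long but
// evaluating it as a single byte, the way the debugger's expression evaluator did.
long id = 0x1000000AB;       // the value we actually stored: 4294967467
byte asByte = (byte)id;      // what a byte-sized read of the same memory sees

Console.WriteLine(id);       // 4294967467
Console.WriteLine(asByte);   // 171 -- a "totally different value"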

Only cost me most of my hair.

Production postmortem: The insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that on a memory-constrained system, a certain operation was taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn't take long to figure out what was going on: we were reading too much, we started swapping, and everything went to hell after that.

The problem is that we have code that is there specifically to prevent this: it checks that the size we load from disk isn't too big and that we aren't doing something foolish. But something broke here.

Here is a sample document, it is simple JSON (without indentation), and it isn’t terribly large:

[Image: the sample JSON document]

The problem happens when we convert it to a .NET object:

[Image: the size of the deserialized .NET object, compared to the text]

Yep, when we deserialize it, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:

[Image: the JSON used for this experiment]

This generates a string whose size is less than 1KB.

But when parsing it:

[Image: the memory used by the parsed document]

The reason, by the way? It is the structure of the document:

[Image: the per-object memory breakdown]

So for every two bytes of object creation in the JSON (the {} ), we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of seeing how much you allocated.
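You can get a rough sense of the gap after the fact, though. Here is a minimal sketch (not the original measurement code; it assumes Json.NET is referenced, and the exact numbers will vary by runtime), comparing the text size of a document with the rough managed size of the parsed object graph:

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

class AllocationDemo
{
    static void Main()
    {
        // Lots of tiny objects: each {"a":1} is 7 characters of text, but parses
        // into a JObject + JProperty + JValue on the managed heap.
        string json = "[" + string.Join(",", Enumerable.Repeat("{\"a\":1}", 10000)) + "]";

        Console.WriteLine($"Text size:    {json.Length * sizeof(char):N0} bytes");

        long before = GC.GetTotalMemory(forceFullCollection: true);
        JToken parsed = JToken.Parse(json);
        long after = GC.GetTotalMemory(forceFullCollection: true);

        Console.WriteLine($"Managed size: {after - before:N0} bytes (rough)");
        GC.KeepAlive(parsed);
    }
}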

We resolved it by watching not only the size of the documents we are reading, but also the amount of free memory available on the machine (aborting if it gets too low). That is a really awkward way of doing it, though. I'm pretty sure this is also something you could use to attack a server, forcing it to allocate a lot of memory by sending it very little data.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved this entirely by using the blittable format, and we have a one-to-one mapping between the size of the document on disk and the allocated size (actually, since we map it, there is no allocation for the data at all, we just access it directly).

I tell you, that thing is a bona fide ZEBRA, or a tale of being utterly stupid

time to read 3 min | 414 words

We run our test suite in a loop to discover any race conditions, timing issues, errors, etc. When doing so, we got a hard crash from dotnet.exe, and investigating the issue produced a stack trace inside the GC.

So I took a dump of the process memory and created an issue about it in the CoreCLR repository, while giving it a very high priority internally and having someone look at it very closely. We are using unsafe code extensively, so it was either a real GC bug or we had messed up somewhere and corrupted our own state.

Very quickly Jan Kotas was able to point out that it was a heap corruption issue as well as the likely avenues for investigation.

After looking at this, we found that the problem was in our tests. In particular, in one specific test. In order to pin down the memory corruption, we changed it to add markers showing where it overwrote the buffer, and the test passed.

This caused us additional concern, because the only thing we could think of was that maybe some invariant was being broken. Our suspicion focused on the fixed statement in C# not working properly. Yes, I know, "hoof beats, horses, not zebras".

So I went to the issue again and reported my findings, and Andy Ayers was kind enough to find the problem and point it out to me.

Here is the relevant test code:

[Image: the relevant test code, as seen in the debugger]

This is during debugging, so you can see what the problem is. We defined size to be 40, and we defined an input buffer, whose size is 100.

A little bit below, we created an output buffer based on the size variable (40), and then wrote to it with the expected size of input.Length, which is 100. Everything behaved as instructed, so we had a buffer overrun in the test, the heap was corrupted, and sometimes the GC died.
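Since the test code is only shown as a screenshot, here is a hypothetical reconstruction of that class of mistake (the names and sizes are mine); the point is the mismatch between the variable used to size the buffer and the variable used for the copy:

static class OverrunDemo
{
    // Warning: this really does corrupt the managed heap; run it only in a throwaway process.
    public static unsafe void Overrun()
    {
        int size = 40;
        byte[] input = new byte[100];
        byte[] output = new byte[size];          // sized with 'size' (40 bytes)...

        fixed (byte* src = input)
        fixed (byte* dst = output)
        {
            // ...but written with 'input.Length' (100 bytes): the last 60 writes land
            // past the end of 'output', which is exactly the kind of overrun that
            // corrupts the heap and occasionally kills the GC.
            for (int i = 0; i < input.Length; i++)
                dst[i] = src[i];
        }
    }
}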

Also, I feel very stupid about spouting all sorts of nonsense about bugs in the CLR when our code was unable to do simple arithmetic.

The good news is that the bug was only in the tests, and the kind of support that you get from Microsoft on the CoreCLR is absolutely phenomenal. Thank you very much, guys.

Non reproducible / intermittent error handling

time to read 2 min | 370 words

We recently had to deal with a stress test that was failing (very) occasionally. We looked into it and figured out that it was actually exposing a high severity bug, so we took it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common is a race. For example, if two threads are modifying the same variable without proper synchronization, it will cause these kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading, since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of behavior is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn't that.

Eventually we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:
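Roughly, the pattern looks like this (a sketch with hypothetical names, not the actual code):

// A long background operation that takes the lock, does a bounded chunk of
// work, then releases the lock so other operations can interleave.
bool hasMoreWork = true;
while (hasMoreWork)
{
    lock (_syncRoot)
    {
        // How much gets done in each stage depends on timing, which is what
        // made the test results vary from run to run.
        hasMoreWork = ProcessNextBatch(maxItems: 1024);
    }

    Thread.Sleep(0); // yield, so other work can grab the lock between stages
}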

Because this is a long running operation under lock, we do it in stages and make sure that other things can run in between. But that was exactly what introduced the variability in the test results, and that is what made it so random and annoying. Once we figured out that this was the cause of the difference, all we had to do was write a proper log of operations and execute it again.

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work :).

Debugging CoreCLR applications in WinDBG

time to read 2 min | 208 words

One of the nicest tools that you have as a developer is the ability to debug. WinDBG isn’t what I call the best debugger in the world, but it is certainly among the most powerful. This post is meant just to walk you through setting up WinDBG with a CoreCLR application.

In particular, this is important when you are getting a crash dump from somewhere. In order to actually debug things properly, you need to load the crash dump into WinDBG and then run the following commands:

  1. .sympath srv*c:\Symbols*https://msdl.microsoft.com/download/symbols
  2. .load C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0\sos.dll
  3. .reload

The first step sets up the symbols, so you can see method names instead of addresses. The second loads the CoreCLR SOS dll (note: if you have a different version, you might need to get the sos.dll from the user's machine, along with the dump). And finally you reload the symbols (this will be slow the first time).
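Once SOS is loaded, a few standard SOS commands will get you oriented (a general starting list, not something specific to this post):

  1. !threads – list the managed threads in the dump
  2. !clrstack – the managed call stack for the current thread
  3. !dumpheap -stat – managed heap statistics, grouped by type
  4. !pe – print the last exception on the current thread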

That is it, now you can start working with WinDBG relatively normally, although I did notice that some of the commands have a better UI. Looks like SOS and SOSEX had a meeting :).

Challenge: The race condition in the TCP stack, answer

time to read 3 min | 410 words

In my previous post, I discussed a problem with missing data over a TCP connection that happened in a racy manner, only once every few hundred runs. As it turns out, there is a simple way to make the code run into the problem every single time.

The full code for the repro can be found here.

Change these lines:

[Image: the lines to change in the repro]

And voila, you will consistently run into the problem. Wait, run that by me again, what is going on here?

As it turns out, the issue is in the server, more specifically, here and here. We use a StreamReader to read the first line from the client, do some processing, and then hand it to the ProcessConnection method, which also uses a StreamReader. More significantly, it uses a different StreamReader.

Why is that significant? Well, the StreamReader buffers its reads; by default, the buffer is 1KB in size. So here is what happens in the case above: we send a single packet to the server, and when the first StreamReader reads from the stream, it fills its buffer with both messages. But since there is a line break between them, when we call ReadLineAsync, we only get the first one.

Then, when we get to the ProcessConnection method, we have another StreamReader, which also reads from the stream. But the second message has already been read (and is waiting in the first StreamReader's buffer), so we are waiting for more information from the client, which will never come.
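To make that concrete, here is a minimal sketch of the failure mode (not the original repro; the listener setup and the two-message client are assumed):

using System;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

static class TwoReadersBug
{
    // 'client' is assumed to be an accepted TcpClient whose peer sent two
    // newline-terminated messages in a single packet.
    public static async Task HandleAsync(TcpClient client)
    {
        NetworkStream stream = client.GetStream();

        var first = new StreamReader(stream);
        string greeting = await first.ReadLineAsync();  // fills first's buffer with BOTH lines, returns only line one
        Console.WriteLine(greeting);

        // A second reader over the same stream, like ProcessConnection does:
        var second = new StreamReader(stream);
        string message = await second.ReadLineAsync();  // hangs: line two is sitting in 'first's buffer
        Console.WriteLine(message);
    }
}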

So how come it sort of works if we do this in two separate calls? Well, it is all about speed. In most cases, when we split it into two separate calls, the server socket has only the first message in it when the first StreamReader runs, so the second StreamReader succeeds in reading the second line. But in some cases, the client manages to be fast enough to send both messages to the server before the server reads them, and voila, we get the same behavior, only much more unpredictable.

The key problem was that it wasn't obvious we were reading too much from the stream, and until we figured that out, we were looking in a completely wrong direction.

Challenge: The race condition in the TCP stack

time to read 3 min | 463 words

Occasionally, one of our tests hangs. Everything seems to be hunky dory, but it just freezes and does not complete. This is a new piece of code, and thus it is suspect unless proven otherwise, but an exhaustive review of it looked fine. It took over two days of effort to narrow it down, but eventually we managed to point the finger directly at this line of code:

[Image: the line of code where the server read nothing]

In certain cases, this line would simply not read anything on the server. Even though the client most definitely sent the data. Now, given that TCP is being used, dropped packets might be expected. But we are actually testing on the loopback device, which I expect to be reliable.

We spent a lot of time investigating this, ending up with a very high degree of certainty that the problem was in the TCP stack somewhere. Somehow, on the loopback device, we were losing packets. Not always, and not consistently, but we were absolutely losing packets, which led the server to wait indefinitely for the client to send the message it had already sent.

Now, I'm as arrogant as the next developer, but even I don't think I found that big a bug in TCP. I'm pretty sure that if it were that broken, I would have known about it. Besides, TCP is supposed to retransmit lost packets, so even if there were lost packets on the loopback device, we should have recovered from that.

Trying to figure out what was going on there sucked. It is hard to watch packets on the loopback device in WireShark, and tracing just told me that a message was sent from the client to the server, but the server never got it.

But we continued, and we ended up with a small reproduction of the issue. Here is the code, and my comments are below:

This code is pretty simple. It starts a TCP server, and listens to it, and then it reads and writes to the client. Nothing much here, I think you’ll agree.

If you run it, however, it will mostly work, except that sometimes (anywhere between 10 and 500 runs on my machine) it will just hang. I'll save you some time and let you know that there are no dropped packets; TCP is working properly in this case. But the code just doesn't. What is frustrating is that it is mostly working; it takes a lot of work to actually get it to fail.

Can you spot the bug? I’ll continue discussion of this in my next post.

The deadlock in the fairness assumption

time to read 4 min | 744 words

Looking into some benchmark code, I noticed that the developer had put some stuff there that shouldn't be there (and was likely slowing down the benchmark). When I commented on that, the answer was: "If I don't do that, this breaks".

Indeed, he was able to show me that it would actually break. And that was very bad news, because that was a pretty sensitive spot. In fact, afterward he filed several bugs, including a scary looking "deadlock detected" one. The problem is that this code works. In fact, it has been deployed, in production, for years, and I have very high confidence that it works.

But eventually I was able to reproduce this locally and track it down to the actual root cause. First, let me show you the reproduction. Here is the sample code:

using System;
using System.Threading;

public class Program
{
    public static void Main(string[] args)
    {
        var fair = new FairPractice();
        var thread = new Thread(fair.OtherWork);
        thread.Start();

        while (thread.IsAlive)
        {
            fair.DoMainWork();
        }
    }
}

public class FairPractice
{
    SemaphoreSlim _s = new SemaphoreSlim(1);

    public void OtherWork()
    {
        while (true)
        {
            Thread.Sleep(100);
            if (_s.Wait(1000) == false)
            {
                Console.WriteLine("\r\nCan't get it!");
                return;
            }
            Console.WriteLine("\r\nGot IT!!!!");
            _s.Release();
        }
    }

    private int _work;
    public void DoMainWork()
    {
        if (_s.Wait(500) == false)
            throw new TimeoutException("Can't get semaphore");

        Console.Write("\r{0}", ++_work);
        Thread.Sleep(30);
        _s.Release();
    }
}

Now, this code seems pretty simple. We have a bunch of work that happens in the main loop. Taking the semaphore and holding it, then doing a small amount of work, then releasing the semaphore.

On the other thread, we have a (long) wait for this semaphore, and that is it. Now, the interesting thing here is that we would expect the 2nd thread to register a wait on the semaphore, so that when it is released, the 2nd thread wakes up and gets the semaphore. In practice, it looks like the main thread is the one having all the fun, and it is fast enough to re-acquire the semaphore before the 2nd thread can catch it. Looking into the code of SemaphoreSlim seems to confirm it. There is actually a bunch of stuff going on in there, and this particular scenario (a very short wait, just a single entry in the semaphore, a second waiter with a high threshold) seems to hit it very hard and violate all fairness assumptions.
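A small experiment you can try against the sample above (my addition, not something from the original code): back off briefly after the release, and the starvation should mostly disappear.

// In DoMainWork, pause for a moment after releasing the semaphore. Even a 1ms
// sleep gives the woken-up OtherWork thread a realistic chance to acquire it
// before the main loop calls Wait() again, so "Got IT!!!!" should start showing
// up instead of the timeout.
Console.Write("\r{0}", ++_work);
Thread.Sleep(30);
_s.Release();
Thread.Sleep(1);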

The good thing is that in order to hit this issue, you need very high speed. The reason this is good is that we are benchmarking, and that means our code (which actually also does disk I/O) is now fast enough to hit this issue.

Logging yourself to death

time to read 4 min | 753 words

One of our clients had opened a support incident. Under certain conditions, RavenDB will issue so many I/O requests that it will effectively lock the I/O system. The condition was a cold boot when you have many hundreds of databases on the server and you force all of them to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and were done.

Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.

Then the customer called back: "ever since we applied the hotfix, everything became slower". A code review of the changes by two different team members, neither involved in the original hotfix, produced a similar conclusion: the affected code was only on the db load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally were unsuccessful.

Working with the customer, we managed to figure out that the problem was related to high CPU usage, which also ruled out the hotfix, since the hotfix could only have caused slowness through a lock, not CPU. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4GB / second.

That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.

But what was causing all of this angst?

As it turned out, during the previous support incidents, the log level had been changed to Debug. And we output a lot to the log. Now, while that is pretty obvious, and easily resolved, we dug deeper and found out exactly what was causing this.

To understand what is going on, you have to remember the situation: we are on a busy server, with hundreds of databases, and a log level set to Debug. The logs were written to the same disk as the databases, which happened to be on a remote SAN. Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.

Now, RavenDB is configured to use async logging exactly for this purpose, so we won't pay for slow I/O in the logs. And NLog is a mature project, which has already considered the case of flooding and the potential memory leak that can result from too many log messages.

This is implemented in the following method:

public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    this.logEventInfoQueue.Dequeue();
                    break;
                // other options removed for brevity
            }
        }

        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}

The log is configured to use Discard on overflow, which is why I’m showing only this code.

I wasn't aware that NLog just took a lock like that, but that isn't what is bothering me. What this code is doing is saying: "Okay, I ran into the limit, let us discard the stuff that has been in the queue the longest".

What this effectively did was keep references to the log entries in memory just long enough for them to hit Gen2, and then discard them. In the meantime, any new entry would go into the queue, where it would mature into Gen2 and then likely be discarded as well.

There is a name for this sort of behavior: the classic GC "mid-life crisis", where objects survive just long enough to be promoted to Gen2 and then die.

The case of the queue in the pool

time to read 2 min | 363 words

Today we found a bug in the way our connection pool works. The code in question:

public HttpClient GetClient(TimeSpan timeout, OperationCredentials credentials, Func<HttpMessageHandler> handlerFactory)
{
    var key = new HttpClientCacheKey(timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());

    Tuple<long, HttpClient> client;
    while (queue.TryDequeue(out client))
    {
        if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
        {
            client.Item2.Dispose();
            continue;
        }
        client.Item2.CancelPendingRequests();
        client.Item2.DefaultRequestHeaders.Clear();
        return client.Item2;
    }

    return new HttpClient(handlerFactory())
    {
        Timeout = timeout
    };
}
public void ReleaseClient(HttpClient client, OperationCredentials credentials)
{
    var key = new HttpClientCacheKey(client.Timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());
    queue.Enqueue(Tuple.Create(Stopwatch.GetTimestamp(), client));
}

Can you see the bug?

I'll let you think about it for a while.

 

 

 

 

Done?

Okay, here we go!

The basic idea is that we'll keep a bunch of clients connected, and free them if they are too old.

The issue is with the use of the concurrent queue. Consider the case of a sudden spike in traffic, where we suddenly need 100 connections. This code will generate them, and then put them back in the pool.

We then go back to the normal level, handling only 20 concurrent requests, but because we are using a queue, we'll actually cycle all 100 clients through the queue all the time, keeping all of them open.
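One way to fix it (a sketch, not the actual fix we shipped): hand out the most recently returned client first, so the surplus clients age past the idle limit instead of being constantly refreshed. A LIFO structure gives you that, assuming the cache is re-declared as a dictionary of stacks:

var stack = cache.GetOrAdd(key, i => new ConcurrentStack<Tuple<long, HttpClient>>());

// Pop the most recently released client first. Under normal load, only the
// hottest ~20 clients keep cycling; the other 80 sink to the bottom, exceed
// _maxIdleTimeInStopwatchTicks, and are disposed the next time they are popped
// (a periodic sweep could dispose them even sooner).
Tuple<long, HttpClient> client;
while (stack.TryPop(out client))
{
    if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
    {
        client.Item2.Dispose();
        continue;
    }
    client.Item2.CancelPendingRequests();
    client.Item2.DefaultRequestHeaders.Clear();
    return client.Item2;
}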
