Ayende @ Rahien



Challenge: The race condition in the TCP stack, answer

time to read 3 min | 410 words

In my previous post, I discussed a problem with missing data over a TCP connection that happened in a racy manner, only once every few hundred runs. As it turns out, there is a simple way to make the code run into the problem every single time.

The full code for the repro can be found here.

Change these lines:

[image: the modified client code, sending both messages in a single call]

And voila, you will consistently run into the problem. Wait, run that by me again, what is going on here?

As it turns out, the issue is in the server, more specifically, here and here. We use a StreamReader to read the first line from the client, do some processing, and then hand it to the ProcessConnection method, which also uses a StreamReader. More significantly, it uses a different StreamReader.

Why is that significant? Well, the StreamReader has an internal buffer, 1KB in size by default. So here is what happens in the case above: we send a single packet to the server, and when the first StreamReader reads from the stream, it fills its buffer with both messages. But since there is a line break between them, when we call ReadLineAsync, we actually only get the first one.

Then, when we get to the ProcessConnection method, we have another StreamReader, which also reads from the stream, but the second message has already been read (and is waiting in the first StreamReader's buffer), so we are waiting for more information from the client, which will never come.
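
To make the mechanism concrete, here is a minimal sketch of that failure mode (my own illustration, not the original repro; the port, messages, and names are made up): two StreamReaders layered over the same NetworkStream, where the first reader's buffer swallows the line the second reader is waiting for.

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

class DoubleReaderRepro
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 9123);
        listener.Start();

        var serverTask = Task.Run(async () =>
        {
            using (var client = await listener.AcceptTcpClientAsync())
            using (var stream = client.GetStream())
            {
                // First reader: its 1KB buffer swallows BOTH lines from the single packet.
                var handshakeReader = new StreamReader(stream, Encoding.UTF8);
                Console.WriteLine("handshake: " + await handshakeReader.ReadLineAsync());

                // Second reader over the same stream: the second line is stuck in the first
                // reader's buffer, so this await never completes.
                var bodyReader = new StreamReader(stream, Encoding.UTF8);
                Console.WriteLine("body: " + await bodyReader.ReadLineAsync());
            }
        });

        using (var client = new TcpClient())
        {
            await client.ConnectAsync(IPAddress.Loopback, 9123);
            var bytes = Encoding.UTF8.GetBytes("hello\nworld\n");
            // Both messages go out in a single write, so they arrive in a single packet.
            await client.GetStream().WriteAsync(bytes, 0, bytes.Length);
            await serverTask; // hangs: the server is stuck on the second ReadLineAsync
        }
    }
}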

So how come it sort of works if we do this in two separate calls? Well, it is all about the speed. In most cases, when we split it into two separate calls, the server socket has only the first message in it when the first StreamReader runs, so the second StreamReader is successful in reading the second line. But in some cases, the client manages to be fast enough to send both messages to the server before the server reads them, and voila, we get the same behavior, only much more unpredictable.

The key problem was that it wasn’t obvious we were reading too much from the stream, and until we figured that one out, we were looking in a completely wrong direction. 

Challenge: The race condition in the TCP stack

time to read 3 min | 463 words

Occasionally, one of our tests hangs. Everything seems to be hunky dory, but it just freezes and does not complete. This is a new piece of code, and thus it is suspect unless proven otherwise, but an exhaustive review of it looked fine. It took over two days of effort to narrow it down, but eventually we managed to point the finger directly at this line of code:

[image: the server-side read call that would sometimes never return]

In certain cases, this line would simply not read anything on the server. Even though the client most definitely sent the data. Now, given that TCP is being used, dropped packets might be expected. But we are actually testing on the loopback device, which I expect to be reliable.

We spent a lot of time investigating this, ending up with a very high degree of certainty that the problem was in the TCP stack somewhere. Somehow, on the loopback device, we were losing packets. Not always, and not consistently, but we were absolutely losing packets, which led the server to wait indefinitely for a message the client had already sent.

Now, I’m as arrogant as the next developer, but even I don’t think I found that big a bug in TCP. I’m pretty sure that if it was this broken, I would have known about it. Besides, TCP is supposed to retransmit lost packets, so even if there were lost packets on the loopback device, we should have recovered from that.

Trying to figure out what was going on there sucked. It is hard to watch packets on the loopback device in Wireshark, and tracing just told me that a message was sent from the client to the server, but the server never got it.

But we continued, and we ended up with a small reproduction of the issue. Here is the code, and my comments are below:

This code is pretty simple. It starts a TCP server, and listens to it, and then it reads and writes to the client. Nothing much here, I think you’ll agree.

If you run it, however, it will mostly work, except that sometimes (anywhere between 10 runs and 500 runs on my machine), it will just hang. I’ll save you some time and let you know that there are no dropped packets; TCP is working properly in this case. But the code just doesn’t. What is frustrating is that it is mostly working; it takes a lot of work to actually get it to fail.

Can you spot the bug? I’ll continue discussion of this in my next post.

The deadlock in the fairness assumption

time to read 4 min | 744 words

Looking into some benchmark code, I noticed that the developer had put some stuff there that shouldn't be there (and was likely slowing down the benchmark). When I commented on that, the answer was: "If I don't do that, this breaks".

Indeed, he was able to show me that this will actually break. And that was very bad news, because that was a pretty sensitive spot. In fact, afterward he actually filed several bugs, including a scary looking “deadlock detected” one. The problem is that this code works. In fact, it has been deployed, in production, for years, and I have very high confidence that it works.

But eventually I was able to reproduce this locally, and then track it down to the actual root cause. Let me show you the reproduction. First, the sample code:

public class Program
{
    public static void Main(string[] args)
    {
        var fair = new FairPractice();
        var thread = new Thread(fair.OtherWork);
        thread.Start();

        while (thread.IsAlive)
        {
            fair.DoMainWork();
        }
    }
}

public class FairPractice
{
    SemaphoreSlim _s = new SemaphoreSlim(1);

    public void OtherWork()
    {
        while (true)
        {
            Thread.Sleep(100);
            if (_s.Wait(1000) == false)
            {
                Console.WriteLine("\r\nCan't get it!");
                return;
            }
            Console.WriteLine("\r\nGot IT!!!!");
            _s.Release();
        }
    }

    private int _work;
    public void DoMainWork()
    {
        if (_s.Wait(500) == false)
            throw new TimeoutException("Can't get semaphore");

        Console.Write("\r{0}", ++_work);
        Thread.Sleep(30);
        _s.Release();
    }
}

Now, this code seems pretty simple. We have a bunch of work that happens in the main loop. Taking the semaphore and holding it, then doing a small amount of work, then releasing the semaphore.

On the other thread, we have a (long) wait for this semaphore, and that is it. Now, the interesting thing here is that we would expect the 2nd thread to register a wait on the semaphore, so that when it is released, the 2nd thread wakes up and gets the semaphore. In practice, it looks like the main thread is the one having all the fun, and it is fast enough to re-acquire the semaphore before the 2nd thread can catch it. Looking into the code of SemaphoreSlim seems to confirm it. There is actually a bunch of stuff going on there, and this particular scenario (a very short wait, just a single entry in the semaphore, a second waiter with a long timeout) seems to hit it very hard, and violates all fairness assumptions.

The good thing is that in order to hit this issue, you need to have very high speed. The reason this is good is that we are benchmarking, and that means that our code (which actually also does disk I/O) is now fast enough to hit this issue.
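
For illustration only, one way to see that this really is a scheduling race rather than a lost release is to stop hammering the semaphore back to back in the sample above: a tiny pause after releasing it gives the waiting thread a window in which the semaphore is actually free. This is a sketch against the repro code, not a claim about the proper production fix:

public void DoMainWork()
{
    if (_s.Wait(500) == false)
        throw new TimeoutException("Can't get semaphore");

    Console.Write("\r{0}", ++_work);
    Thread.Sleep(30);
    _s.Release();

    // Without this, the main loop re-acquires the semaphore before the other
    // thread (which was just signaled) gets a chance to run and take it.
    Thread.Sleep(1);
}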

Logging yourself to death

time to read 4 min | 753 words

One of our clients had opened a support incident. Under certain conditions, RavenDB will issue so many I/O requests that it will effectively lock the I/O system. The condition was a cold boot when you have many hundreds of databases on the server, and you force all of them to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and were done.

Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.

Then the customer called back: “ever since we applied the hotfix, everything became slower”. A code review of the changes by two different team members uninvolved in the original hotfix produced the same conclusion: the affected code was only on the database load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally were unsuccessful.

Working with the customer, we managed to figure out that the problem was related to high CPU usage, which also ruled out the hotfix, since that would have shown up as slowness caused by a lock, not CPU. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4GB / second.

That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.

But what was causing all of this angst?

As it turned out, during the previous support incidents, the log level had been changed to Debug. And we output a lot to the log. Now, while that is pretty obvious, and easily resolved, we dug down deeper and found out exactly what was causing this.

To understand what is going on, you have to remember the situation: we are on a busy server, with hundreds of databases, and a log level set to Debug. The logs were written to the same disk as the databases, which happened to be on a remote SAN. Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.

Now, RavenDB is configured to use async logging exactly for this purpose, so we won’t pay for slow I/O in the logs. And NLog is a mature project, which has already considered the case of flooding, and the potential memory leak that too many log messages can cause.

This is implemented in the following method:

public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    this.logEventInfoQueue.Dequeue();
                    break;
                // other options removed for brevity
            }
        }

        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}

The log is configured to use Discard on overflow, which is why I’m showing only this code.

I wasn’t aware that NLog just took a lock like that, but that isn’t what is bothering me. What this code is doing is saying: “Okay, I ran into the limit, let us discard the stuff that has been in the queue the longest”.

What this effectively did was keep references to the log entries in memory just long enough for them to hit Gen2, and then discard them. In the meantime, any new entry would go into the queue, where it would also mature into Gen2, and likely be discarded as well.
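
Here is a small stand-alone illustration of that pattern (this is my sketch, not NLog or RavenDB code): entries sit in a bounded queue just long enough to survive a few collections, so by the time they are discarded, many of them have already been promoted.

using System;
using System.Collections.Generic;

class QueuePromotionDemo
{
    static void Main()
    {
        const int queueLimit = 10_000;
        var queue = new Queue<byte[]>();

        for (int i = 1; i <= 1_000_000; i++)
        {
            if (queue.Count >= queueLimit)
            {
                var discarded = queue.Dequeue();
                // Depending on GC timing, the discarded entries show up in Gen1/Gen2,
                // which is exactly the expensive garbage described above.
                if (i % 100_000 == 0)
                    Console.WriteLine($"discarding an entry from generation {GC.GetGeneration(discarded)}");
            }
            queue.Enqueue(new byte[512]); // stand-in for a formatted log event
        }
    }
}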

There is a word for it, and it goes like this:

The case of the queue in the pool

time to read 2 min | 363 words

Today we found a bug in the way our connection pool works. The code in question:

public HttpClient GetClient(TimeSpan timeout, OperationCredentials credentials, Func<HttpMessageHandler> handlerFactory)
{
    var key = new HttpClientCacheKey(timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());

    Tuple<long, HttpClient> client;
    while (queue.TryDequeue(out client))
    {
        if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
        {
            client.Item2.Dispose();
            continue;
        }
        client.Item2.CancelPendingRequests();
        client.Item2.DefaultRequestHeaders.Clear();
        return client.Item2;
    }

    return new HttpClient(handlerFactory())
    {
        Timeout = timeout
    };
}
public void ReleaseClient(HttpClient client, OperationCredentials credentials)
{
    var key = new HttpClientCacheKey(client.Timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());
    queue.Enqueue(Tuple.Create(Stopwatch.GetTimestamp(), client));
}

Can you see the bug?

I'll let you think about it for a while.

 

 

 

 

Done?

Okay, here we go!

The basic idea is that we'll keep a bunch of clients connected, and free them if they are too old.

The issue is with the use of the concurrent queue. Consider the case of a sudden spike in traffic, where we suddenly need 100 connections. This code will generate them, and then put them back in the pool.

We then go back to the normal level, handling only 20 concurrent requests, but because we are using a queue, we'll actually cycle all 100 clients through the queue all the time, refreshing their timestamps on every release and keeping all of them open.
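
One way to address this, sketched below as an assumption rather than the actual patch, is to make the pool LIFO: with a ConcurrentStack (and the cache holding stacks instead of queues), the 20 clients we actually need keep getting reused and re-timestamped, while the other 80 sink to the bottom and are found to be past the idle limit, and disposed, the next time they are reached.

public HttpClient GetClient(TimeSpan timeout, OperationCredentials credentials, Func<HttpMessageHandler> handlerFactory)
{
    var key = new HttpClientCacheKey(timeout, credentials);
    var stack = cache.GetOrAdd(key, i => new ConcurrentStack<Tuple<long, HttpClient>>());

    Tuple<long, HttpClient> client;
    while (stack.TryPop(out client))
    {
        // Oldest clients only surface after the hot ones, so their timestamps
        // are stale by then and they get disposed instead of reused.
        if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
        {
            client.Item2.Dispose();
            continue;
        }
        client.Item2.CancelPendingRequests();
        client.Item2.DefaultRequestHeaders.Clear();
        return client.Item2;
    }

    return new HttpClient(handlerFactory())
    {
        Timeout = timeout
    };
}

public void ReleaseClient(HttpClient client, OperationCredentials credentials)
{
    var key = new HttpClientCacheKey(client.Timeout, credentials);
    var stack = cache.GetOrAdd(key, i => new ConcurrentStack<Tuple<long, HttpClient>>());
    stack.Push(Tuple.Create(Stopwatch.GetTimestamp(), client));
}

A periodic sweep would still be needed to close clients that sit at the bottom of the stack for a long time, but at least the pool no longer keeps the entire spike alive forever.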

Find the bug: When you can't rely on your own identity

time to read 2 min | 234 words

We had a need to generate unique ids for connections in our db. We used the following code:

public class NotificationsClientConnection : IDisposable
{
    private static int _counter = 0;

    private readonly WebSocket _webSocket;
    private readonly DocumentDatabase _documentDatabase;
    private readonly DateTime _startedAt;


    public NotificationsClientConnection(WebSocket webSocket, DocumentDatabase documentDatabase)
    {
        _webSocket = webSocket;
        _documentDatabase = documentDatabase;
        _startedAt = SystemTime.UtcNow;
    }

    public int Id => Interlocked.Increment(ref _counter);

    public TimeSpan Age => SystemTime.UtcNow - _startedAt;
}

Do you see the bug? It is a subtle one.

When you are ready, peek below.

.

.

.

.

.

.

.

.

.

.

.

.

.

The issue is in the following line:

    public int Id => Interlocked.Increment(ref _counter);

The intent was to create a field that would be initialized once, on startup. Unfortunately, the usage of "=>" (which makes this an expression-bodied property) instead of an assignment "=" turned that into a computed property, which will return a new value on every call.
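
For reference, one way to express the original intent is to compute the value once, at construction time. A minimal sketch, assuming a C# 6 read-only auto-property is acceptable here (_counter is static, so the initializer can reference it):

    // Computed once per instance, instead of on every access.
    public int Id { get; } = Interlocked.Increment(ref _counter);

    // Or, equivalently, pre-C# 6:
    // private readonly int _id = Interlocked.Increment(ref _counter);
    // public int Id => _id;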

Production postmortem: A null reference in our abstraction

time to read 4 min | 781 words

This little nugget has caused a database upgrade to fail. Consider the following code for a bit. We have CompoundKey, which has two versions, slow and fast. The idea is that we use this as keys into a cache, and there are two types because we can “query” the cache cheaply, but constructing the actual key for the cache is more expensive. Hence, the names.

public class CompoundKey
{
    public int HashKey;

}

public sealed class FastCompoundKey : CompoundKey
{
    public int Val;

}

public sealed class SlowCompoundKey : CompoundKey
{
    public int SlowVal;

}
public class CompoundKeyEqualityComparer : IEqualityComparer<CompoundKey>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool Equals(CompoundKey x, CompoundKey y)
    {
        if (x == null || y == null)
            return false;

        SlowCompoundKey k;
        FastCompoundKey self;
        var key = x as FastCompoundKey;
        if (key != null)
        {
            self = key;
            k = y as SlowCompoundKey;
        }
        else
        {
            self = y as FastCompoundKey;
            k = x as SlowCompoundKey;
        }

        return self.Val == k.SlowVal;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int GetHashCode(CompoundKey obj)
    {
        return obj.HashKey;
    }
}

The problem was a null reference exception in the Equals method.  And I believe the term for how I felt was:

The problem is that it is obvious why a null reference exception can be thrown: if the values passed to this method are both SlowCompoundKey or both FastCompoundKey, either self or k will be null. But the reason the Equals method looks the way it does is that this is actually a very performance sensitive portion of our code, and we have worked to make sure that it runs as speedily as possible. We considered the case where both keys would be of the same type, but handling it had a measured performance impact, and after checking the source of the dictionary class, we were certain that such comparisons were not possible.

We were wrong.

Boiling it all down, here is how we can reproduce this issue:

 var dic = new Dictionary<CompoundKey, object>(new CompoundKeyEqualityComparer());
 dic[new SlowCompoundKey { HashKey = 1 }] = null;
 dic[new SlowCompoundKey { HashKey = 1 }] = null;

What is going on here?

Well, it is simple. We add another value with the same hash code. That means that the dictionary needs to check whether they are the same key, or just a hash collision. It does so by passing both the first key (slow) and the second key (slow) into the Equals method, and then hilarity ensued.

In retrospect, this is pretty much how it has to behave, and we should have considered that, but we were focused on looking at just the read side of the operation, and utterly forgot about how insert must work.

Luckily, this was an easy fix, although we still need to do regression perf testing.
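
For illustration, here is a sketch of what handling the same-type comparisons might look like (an assumption about the shape of the fix, not the actual patch that shipped):

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool Equals(CompoundKey x, CompoundKey y)
{
    if (x == null || y == null)
        return false;

    var fastX = x as FastCompoundKey;
    var fastY = y as FastCompoundKey;

    // Both keys are of the same type - the case the original code assumed
    // could never happen, but which the dictionary hits on insert.
    if (fastX != null && fastY != null)
        return fastX.Val == fastY.Val;
    if (fastX == null && fastY == null)
        return ((SlowCompoundKey)x).SlowVal == ((SlowCompoundKey)y).SlowVal;

    // Mixed case, as before: one fast key compared against one slow key.
    var fast = fastX ?? fastY;
    var slow = (SlowCompoundKey)(fastX != null ? y : x);
    return fast.Val == slow.SlowVal;
}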

Production postmortem: Houston, we have a problem

time to read 4 min | 660 words

As you read this post, you might want to also consider letting this play in the background. We had a UDP port leak in RavenDB. We squashed it like a bug, but somehow it kept repeating.

 

We found one cause of it (and fixed it), finally. That was after several rounds of looking at the code and fixing a few “this error condition can lead to the socket not being properly disposed” issues.

Finally, we pushed to our own internal systems, monitored things, and saw that it was good. But the bloody bug kept repeating. Now, instead of manifesting as thousands of UDP ports, we had just a dozen or so, but they were (very) slowly increasing. And it drove us nuts. We had logging there, and we could see that we didn’t have the kind of problems that we had before. And everything looked good.

A full reproduction of the issue can be found here, but the relevant piece of code is here:

Timer timer = new Timer(async state =>
{
    try
    {
        var addresses = await Dns.GetHostAddressesAsync("time.nasa.gov");
        var endPoint = new IPEndPoint(addresses[0], 123);

        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(endPoint);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Client.SendTimeout = 100;
            await udpClient.SendAsync(new byte[] {12, 32, 43}, 3);
            await udpClient.ReceiveAsync();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        Environment.Exit(-1);
    }
});
timer.Change(500, 500);

As you can see, we are issuing a request to a time server, wrapping the usage of the UDP socket in a using statement, making sure to have proper error handling, setting up the proper timeouts, the works.

Our real code is actually awash with logging and detailed error handling, and we pored over it for a crazy amount of time trying to figure out what was going on.

If you run this code, and watch the number of used UDP ports, you’ll see a very curious issue. It is always increasing. What is worse, there are no errors, nothing. It just goes into a black hole in the sky and doesn’t work.

In this case, I’ve explicitly created a malformed request, so it is expected that the remote server will not reply to me. That allows us to generate the proper scenario. In production, of course, we send the right value, and we typically get the right result, so we didn’t see this.

The error we had was the timeout values. The documentation quite clearly states that they apply to the synchronous methods only, and since it doesn’t say a word about the async methods, they do not apply there. Given how UDP works, that makes perfect sense. To support a timeout on the async methods, the UdpClient would need to start a timer to do so. However, given the API, it is very easy to see how we kept missing this.

The real issue is that when we make a request to a server, and for whatever reason, the UDP reply packet is dropped, we just hang in an async manner. That is, we have an async call that will never return. That call holds the UDP port open, and over time, that shows up as a leak. That is pretty horrible, but the good thing is that once we knew what the problem was, fixing it was trivial.
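
For illustration, here is a sketch of one way to put a timeout around the async receive, by racing it against a Task.Delay (an assumption about the shape of the fix, not the actual change we made):

private static async Task QueryTimeServerAsync(IPEndPoint endPoint)
{
    using (var udpClient = new UdpClient())
    {
        udpClient.Connect(endPoint);
        await udpClient.SendAsync(new byte[] { 12, 32, 43 }, 3);

        var receiveTask = udpClient.ReceiveAsync();
        var completed = await Task.WhenAny(receiveTask, Task.Delay(TimeSpan.FromMilliseconds(100)));
        if (completed != receiveTask)
        {
            // The reply was lost (or never sent); give up instead of awaiting forever.
            // Leaving the using block disposes the client and releases the UDP port.
            return;
        }

        var result = await receiveTask;
        Console.WriteLine($"got {result.Buffer.Length} bytes back");
    }
}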

Find the bug: The case of the degrading system – Answer

time to read 3 min | 583 words

In my previous post I showed the following code, and asked what the bug was, and what the implications of that would be.

class Program
{
    private Timer nextcheck;
    public event EventHandler ServerSigFailed;
    static void Main(string[] args)
    {
        var program = new Program();
        if(program.ValidateServerSig() == false) 
            return;
        program.DoOtherStuff();
    }

    public bool ValidateServerSig()
    {
        nextcheck = new Timer(state => ValidateServerSig());

        var response = DoRequest("http://remote-srv/signature");
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        var result = Utils.CheckPublicKeySignatureMatches(response);
        if(result.Valid == false)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        
        // setup next check
        nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
        return true;
    }
}

The answer is quite simple: look at the first line of the ValidateServerSig function. It sets up a new timer that recursively calls this function. So every 15 seconds, we are going to have a new timer to run this function.

And 15 seconds after that, we are going to have 4 timers, and 15 seconds after that…

But the interesting thing here is that those are timers, so they get scheduled on the thread pool. So while the growth rate of the number of tasks is phenomenal, in practice they don't all get to run together. After 20 minutes of this, it will only have run about 800 times. Because the thread pool is growing slowly, we'll end up with a much slower growth rate, but eventually the thread pool queues are going to be filled with nothing but this task. And since we typically also process requests on the thread pool, eventually requests cannot be handled, because the threads are always busy running the validation code.

In the real codebase, the timing was 5 minutes, and this issue typically manifested itself only after several weeks of continuous running.
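
For completeness, here is a sketch of the straightforward fix: create the timer once and have ValidateServerSig merely re-arm it, so each validation cycle schedules exactly one follow-up check (an illustration of the idea, not necessarily the actual production change):

public bool ValidateServerSig()
{
    // Created once; every later call only re-arms it.
    if (nextcheck == null)
        nextcheck = new Timer(state => ValidateServerSig());

    var response = DoRequest("http://remote-srv/signature");
    if (response.Failed)
    {
        var copy = ServerSigFailed;
        if (copy != null) copy(this, EventArgs.Empty);
        return false;
    }

    var result = Utils.CheckPublicKeySignatureMatches(response);
    if (result.Valid == false)
    {
        var copy = ServerSigFailed;
        if (copy != null) copy(this, EventArgs.Empty);
        return false;
    }

    // setup next check as a one-shot; it is re-armed on each successful validation
    nextcheck.Change(TimeSpan.FromSeconds(15), Timeout.InfiniteTimeSpan);
    return true;
}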

Find the bug: The case of the degrading system

time to read 2 min | 395 words

The following code contains a bug, can you spot it? 

class Program
{
    private Timer nextcheck;
    public event EventHandler ServerSigFailed;
    static void Main(string[] args)
    {
        var program = new Program();
        if(program.ValidateServerSig() == false) 
            return;
        program.DoOtherStuff();
    }

    public bool ValidateServerSig()
    {
        nextcheck = new Timer(state => ValidateServerSig());

        var response = DoRequest("http://remote-srv/signature");
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        var result = Utils.CheckPublicKeySignatureMatches(response);
        if(result.Valid == false)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        
        // setup next check
        nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
        return true;
    }
}

What are the implications of this bug?
