Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


The deadlock in the fairness assumption

time to read 4 min | 744 words

Looking into some benchmark code, I noticed that the developer had put some code there that shouldn't be there (and was likely slowing down the benchmark). When I commented on that, the answer was: "If I don't do that, this breaks".

Indeed, he was able to show me that this will actually break. And that was very bad news, because that was a pretty sensitive spot. In fact, afterward he actually filed several bugs, including a scary looking "deadlock detected" one. The problem is that this code works. In fact, it has been deployed, in production, for years, and I have very high confidence that it works.

But eventually I was able to reproduce this locally, and then track it down to the actual root cause. First, let me show you the reproduction. Here is the sample code:

using System;
using System.Threading;

public class Program
{
    public static void Main(string[] args)
    {
        var fair = new FairPractice();
        var thread = new Thread(fair.OtherWork);
        thread.Start();

        while (thread.IsAlive)
        {
            fair.DoMainWork();
        }
    }
}

public class FairPractice
{
    SemaphoreSlim _s = new SemaphoreSlim(1);

    // Runs on the second thread: repeatedly tries to acquire the semaphore,
    // giving up (and ending the program) if it can't get it within a second.
    public void OtherWork()
    {
        while (true)
        {
            Thread.Sleep(100);
            if (_s.Wait(1000) == false)
            {
                Console.WriteLine("\r\nCan't get it!");
                return;
            }
            Console.WriteLine("\r\nGot IT!!!!");
            _s.Release();
        }
    }

    private int _work;

    // Runs on the main thread in a tight loop: acquire the semaphore,
    // do a little work while holding it, then release it.
    public void DoMainWork()
    {
        if (_s.Wait(500) == false)
            throw new TimeoutException("Can't get semaphore");

        Console.Write("\r{0}", ++_work);
        Thread.Sleep(30);
        _s.Release();
    }
}

Now, this code seems pretty simple. We have a bunch of work that happens in the main loop: taking the semaphore, doing a small amount of work while holding it, then releasing it.

On the other thread, we have a (long) wait for this semaphore, and that is it. Now, the interesting thing here is that we would expect the 2nd thread to register a wait on the semaphore, so that when it is released, the 2nd thread wakes up and gets the semaphore. In practice, it looks like the main thread is the one having all the fun, and it is fast enough to re-acquire the semaphore before the 2nd thread can catch it. Looking into the code of SemaphoreSlim seems to confirm it. There is actually a bunch of stuff going on there, and this particular scenario (very short wait, just a single entry in the semaphore, a second waiter with a high threshold) seems to hit it very hard and violate all fairness assumptions.

The good thing is that in order to hit this issue, you need to have very high speed. The reason this is good is that we are benchmarking, and that means that our code (which actually also does disk I/O) is now fast enough to hit this issue.
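The post doesn't say how this was addressed, but a simple way to work around the lack of fairness (a sketch, not the author's actual fix) is to explicitly give up the CPU after releasing the semaphore, so the waiting thread gets a real chance to acquire it. Here is a variant of DoMainWork from the class above:

public void DoMainWork()
{
    if (_s.Wait(500) == false)
        throw new TimeoutException("Can't get semaphore");

    Console.Write("\r{0}", ++_work);
    Thread.Sleep(30);
    _s.Release();

    // Give any waiter a window to grab the semaphore before this loop
    // re-acquires it. Thread.Yield() may not be enough on a busy box;
    // Sleep(1) forces an actual context switch.
    Thread.Sleep(1);
}

With that change, the second thread should acquire the semaphore regularly instead of timing out, at the cost of a small hit to the main loop's throughput.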

Logging yourself to death

time to read 4 min | 753 words

One of our clients had opened a support incident: under certain conditions, RavenDB will issue so many I/O requests that it will effectively lock up the I/O system. The condition was a cold boot, when you have many hundreds of databases on the server and force all of them to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and were done.

Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.

Then the customer called back: "ever since we applied the hotfix, everything became slower". A code review of the changes by two different team members uninvolved in the original hotfix produced the same conclusion: the affected code was only on the database load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally were unsuccessful.

Working with the customer, we managed to figure out that the problem was related to high CPU usage, which also ruled out the hotfix, since that would have shown up as slowness because of a lock. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4GB / second.

That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.

But what was causing all of this angst?

As it turned out, during the previous support incidents, the log level had been changed to Debug. And we output a lot to the log. Now, while that is pretty obvious, and easily resolved, we dug down deeper and found out exactly what was causing this.

To understand what is going on, you have to remember the situation: we are on a busy server, with hundreds of databases, and a log level set to Debug. The logs were written to the same disk as the databases, which happened to be on a remote SAN. Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.

Now, RavenDB is configured to use async logging exactly for this purpose, so we won't pay for slow I/O in the logs. And NLog is a mature project, which has already considered the case of flooding and a potential memory leak as a result of too many log messages.

This is implemented in the following method:

public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    this.logEventInfoQueue.Dequeue();
                    break;
                // other options removed for brevity
            }
        }

        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}

The log is configured to use Discard on overflow, which is why I’m showing only this code.

I wasn't aware that NLog just took a lock like that, but that isn't what's bothering me. What this code is doing is saying: "Okay, I've hit the limit, let us discard the stuff that has been in the queue the longest time".

What this effectively did was keep references to the log entries in memory just long enough for them to hit Gen2, and then discard them. In the meantime, any new entry would go into the queue, where it would mature into Gen2, but would likely be discarded as well.
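To make that concrete, here is a small, self-contained sketch (not RavenDB or NLog code) that mimics a bounded, discard-oldest queue and prints which GC generation entries have reached by the time they are thrown away. Under sustained allocation they typically show up as Gen1 or Gen2, i.e. the generations that are expensive to collect:

using System;
using System.Collections.Generic;

public static class DiscardOldestDemo
{
    public static void Main()
    {
        const int limit = 10_000;
        var queue = new Queue<byte[]>();

        for (var i = 0; i < 200_000; i++)
        {
            if (queue.Count >= limit)
            {
                // The oldest entry has sat in the queue long enough to survive
                // one or more collections, so it has usually been promoted.
                var oldest = queue.Dequeue();
                if (i % 20_000 == 0)
                    Console.WriteLine("Discarded an entry that reached Gen{0}", GC.GetGeneration(oldest));
            }

            // Stand-in for a formatted log event that will never be written.
            queue.Enqueue(new byte[4096]);
        }
    }
}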

There is a word for it, and it goes like this:

The case of the queue in the pool

time to read 2 min | 363 words

Today we found a bug in the way our connection pool works. The code in question?

public HttpClient GetClient(TimeSpan timeout, OperationCredentials credentials, Func<HttpMessageHandler> handlerFactory)
{
    var key = new HttpClientCacheKey(timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());

    Tuple<long, HttpClient> client;
    while (queue.TryDequeue(out client))
    {
        if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
        {
            client.Item2.Dispose();
            continue;
        }
        client.Item2.CancelPendingRequests();
        client.Item2.DefaultRequestHeaders.Clear();
        return client.Item2;
    }

    return new HttpClient(handlerFactory())
    {
        Timeout = timeout
    };
}
public void ReleaseClient(HttpClient client, OperationCredentials credentials)
{
    var key = new HttpClientCacheKey(client.Timeout, credentials);
    var queue = cache.GetOrAdd(key, i => new ConcurrentQueue<Tuple<long, HttpClient>>());
    queue.Enqueue(Tuple.Create(Stopwatch.GetTimestamp(), client));
}

Can you see the bug?

I'll let you think about it for a while.

 

 

 

 

Done?

Okay, here we go!

The basic idea is that we'll keep a bunch of clients connected, and free them if they are too old.

The issue is with the use of the concurrent queue. Consider the case of a sudden spike in traffic where we suddenly need 100 connections. This code will generate them, and then put them back in the pool.

We then go back to the normal level and handle only 20 concurrent requests, but because we are using a queue, we'll actually cycle all 100 clients we have through the queue all the time, keeping all of them open.
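The post doesn't show the fix, but one way to avoid this (a sketch, not necessarily the actual patch, and it assumes the cache field's value type changes to ConcurrentStack accordingly) is to pool clients LIFO. Under normal load only the most recently returned clients keep getting reused; the rest sit idle, and the next time they are popped they fail the max-idle check and get disposed:

// Assumes: cache is a ConcurrentDictionary<HttpClientCacheKey, ConcurrentStack<Tuple<long, HttpClient>>>
public HttpClient GetClient(TimeSpan timeout, OperationCredentials credentials, Func<HttpMessageHandler> handlerFactory)
{
    var key = new HttpClientCacheKey(timeout, credentials);
    var stack = cache.GetOrAdd(key, i => new ConcurrentStack<Tuple<long, HttpClient>>());

    Tuple<long, HttpClient> client;
    while (stack.TryPop(out client))
    {
        if (Stopwatch.GetTimestamp() - client.Item1 >= _maxIdleTimeInStopwatchTicks)
        {
            client.Item2.Dispose();
            continue;
        }
        client.Item2.CancelPendingRequests();
        client.Item2.DefaultRequestHeaders.Clear();
        return client.Item2;
    }

    return new HttpClient(handlerFactory())
    {
        Timeout = timeout
    };
}

public void ReleaseClient(HttpClient client, OperationCredentials credentials)
{
    var key = new HttpClientCacheKey(client.Timeout, credentials);
    var stack = cache.GetOrAdd(key, i => new ConcurrentStack<Tuple<long, HttpClient>>());
    stack.Push(Tuple.Create(Stopwatch.GetTimestamp(), client));
}

A periodic sweep of idle clients would tighten this up further, but the LIFO change alone stops the pool from keeping every client from the spike hot.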

Find the bug: When you can't rely on your own identity

time to read 2 min | 234 words

We had a need to generate unique ids for connections in our db. We used the following code:

public class NotificationsClientConnection : IDisposable
{
    private static int _counter = 0;

    private readonly WebSocket _webSocket;
    private readonly DocumentDatabase _documentDatabase;
    private readonly DateTime _startedAt;


    public NotificationsClientConnection(WebSocket webSocket, DocumentDatabase documentDatabase)
    {
        _webSocket = webSocket;
        _documentDatabase = documentDatabase;
        _startedAt = SystemTime.UtcNow;
    }

    public int Id => Interlocked.Increment(ref _counter);

    public TimeSpan Age => SystemTime.UtcNow - _startedAt;
}

Do you see the bug? It is a subtle one.

When you are ready, peek below.

.

.

.

.

.

.

.

.

.

.

.

.

.

The issue is in the following line:

    public int Id => Interlocked.Increment(ref _counter);

The intent was to create a field that would be initialized on startup. Unfortunately, the use of the expression-bodied member syntax "=>" instead of an assignment "=" turned that into a computed property, which returns a new value on every call.
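A minimal sketch of the corrected line (one obvious way to do it; the post doesn't show the actual patch) is a read-only auto-property with an initializer, so the increment runs once per connection instead of once per access:

// Evaluated exactly once, when the connection object is constructed.
public int Id { get; } = Interlocked.Increment(ref _counter);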

Production postmortem: A null reference in our abstraction

time to read 4 min | 781 words

This little nugget has caused a database upgrade to fail. Consider the following code for a bit. We have CompoundKey, which has two versions, slow and fast. The idea is that we use these as keys into a cache, and there are two types because we can "query" the cache cheaply, but constructing the actual key for the cache is more expensive. Hence the names.

public class CompoundKey
{
    public int HashKey;

}

public sealed class FastCompoundKey : CompoundKey
{
    public int Val;

}

public sealed class SlowCompoundKey : CompoundKey
{
    public int SlowVal;

}
public class CompoundKeyEqualityComparer : IEqualityComparer<CompoundKey>
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool Equals(CompoundKey x, CompoundKey y)
    {
        if (x == null || y == null)
            return false;

        SlowCompoundKey k;
        FastCompoundKey self;
        var key = x as FastCompoundKey;
        if (key != null)
        {
            self = key;
            k = y as SlowCompoundKey;
        }
        else
        {
            self = y as FastCompoundKey;
            k = x as SlowCompoundKey;
        }

        return self.Val == k.SlowVal;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int GetHashCode(CompoundKey obj)
    {
        return obj.HashKey;
    }
}

The problem was a null reference exception in the Equals method.  And I believe the term for how I felt was:

The problem is that it is obvious why a null reference exception can be thrown: if the values passed to this method are both SlowCompoundKey or both FastCompoundKey, one of the casts produces null and is then dereferenced. But the reason the Equals method looks the way it does is that this is a very performance sensitive portion of our code, and we have worked to make sure that it runs as speedily as possible. We considered the case where we would have both of them, but that code had a measured performance impact, and after checking the source of the dictionary class, we were certain that such comparisons were not possible.

We were wrong.

Boiling it all down, here is how we can reproduce this issue:

 var dic = new Dictionary<CompoundKey, object>(new CompoundKeyEqualityComparer());
 dic[new SlowCompoundKey { HashKey = 1 }] = null;
 dic[new SlowCompoundKey { HashKey = 1 }] = null;

What is going on here?

Well, it is simple. We add another value with the same hash code. That means that the dictionary needs to check whether they are the same value, or just a hash collision. It does so by passing both the first key (slow) and the second key (slow) into the Equals method, and then hilarity ensued.

In retrospect, this is pretty much how it has to behave, and we should have considered that, but we were focused on looking at just the read side of the operation, and utterly forgot about how insert must work.

Luckily, this was an easy fix, although we still need to do regression perf testing.
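For reference, here is a minimal sketch of how the same-type comparisons could be handled explicitly. This is an assumption about the shape of the fix, not the actual patch:

public bool Equals(CompoundKey x, CompoundKey y)
{
    if (x == null || y == null)
        return false;

    // Same-type comparisons do happen: the dictionary compares stored keys
    // against each other on insert, not just during lookups.
    var fastX = x as FastCompoundKey;
    var fastY = y as FastCompoundKey;
    if (fastX != null && fastY != null)
        return fastX.Val == fastY.Val;

    var slowX = x as SlowCompoundKey;
    var slowY = y as SlowCompoundKey;
    if (slowX != null && slowY != null)
        return slowX.SlowVal == slowY.SlowVal;

    // Mixed case: exactly one of each.
    var fast = fastX ?? fastY;
    var slow = slowX ?? slowY;
    return fast.Val == slow.SlowVal;
}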

Production postmortem: Houston, we have a problem

time to read 4 min | 660 words

As you read this post, you might want to consider letting this play in the background. We had a UDP port leak in RavenDB. We squashed it like a bug, but somehow it kept repeating.

We found one cause of it (and fixed it), finally. That was after several rounds of looking at the code and fixing a few "this error condition can lead to the socket not being properly disposed" issues.

Finally, we pushed to our own internal systems, monitored things, and saw that it was good. But the bloody bug kept repeating. Now, instead of manifesting as thousands of UDP ports, we had just a dozen or so, but they were (very) slowly increasing. And it drove us nuts. We had logging there, and we could see that we didn't have the kind of problems that we had before. And everything looked good.

A full reproduction of the issue can be found here, but the relevant piece of code is:

Timer timer = new Timer(async state =>
{
    try
    {
        var addresses = await Dns.GetHostAddressesAsync("time.nasa.gov");
        var endPoint = new IPEndPoint(addresses[0], 123);

        using (var udpClient = new UdpClient())
        {
            udpClient.Connect(endPoint);
            udpClient.Client.ReceiveTimeout = 100;
            udpClient.Client.SendTimeout = 100;
            await udpClient.SendAsync(new byte[] {12, 32, 43}, 3);
            await udpClient.ReceiveAsync();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
        Environment.Exit(-1);
    }
});
timer.Change(500, 500);

As you can see, we are issuing a request to a time server, wrapping the usage of the UDP socket in a using statement, making sure to have proper error handling, setting up the proper timeouts, the works.

Our real code is actually awash with logging and detailed error handling, and we pored over it for a crazy amount of time to figure out what was going on.

If you run this code and watch the number of used UDP ports, you'll see a very curious issue: it is always increasing. What is worse, there are no errors, nothing. It just goes into a black hole in the sky and doesn't work.

In this case, I’ve explicitly created a malformed request, so it is expected that the remote server will not reply to me. That allows us to generate the proper scenario. In production, of course, we send the right value, and we typically get the right result, so we didn’t see this.

The error we had was with the timeout values. The documentation quite clearly states that they apply to the synchronous methods only, and since it doesn't say a word about the async methods, they do not apply to the async methods. Given how UDP works, that makes perfect sense: to support a timeout on the async methods, the UdpClient would need to start a timer to do so. However, given the API, it is very easy to see how we kept missing this.

The real issue is that when we make a request to a server, and for whatever reason, the UDP reply packet is dropped, we just hang in an async manner. That is, we have an async call that will never return. That call holds the UDP port open, and over time, that shows up as a leak. That is pretty horrible, but the good thing is that once we knew what the problem was, fixing it was trivial.
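The post doesn't include the fix, but the usual shape of it is to race the receive against a delay, so that a dropped reply turns into a timeout (and the using block disposes the socket) instead of an await that never returns. A sketch, with a hypothetical helper name:

using System;
using System.Net.Sockets;
using System.Threading.Tasks;

internal static class UdpTimeouts
{
    // Hypothetical helper, not the actual RavenDB fix: give up if the UDP
    // reply doesn't arrive within the given timeout.
    public static async Task<UdpReceiveResult> ReceiveWithTimeoutAsync(this UdpClient client, TimeSpan timeout)
    {
        var receiveTask = client.ReceiveAsync();
        var completed = await Task.WhenAny(receiveTask, Task.Delay(timeout));
        if (completed != receiveTask)
            throw new TimeoutException("No UDP reply within " + timeout);

        return await receiveTask; // already completed, just unwraps the result
    }
}

In the repro above, the call to udpClient.ReceiveAsync() would become udpClient.ReceiveWithTimeoutAsync(TimeSpan.FromMilliseconds(100)), and when the timeout fires the using block disposes the client, releasing the port instead of leaking it.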

Find the bug: The case of the degrading system – Answer

time to read 3 min | 583 words

In my previous post I showed the following code, and asked what the bug was, and what the implications of that would be.

class Program
{
    private Timer nextcheck;
    public event EventHandler ServerSigFailed;
    static void Main(string[] args)
    {
        var program = new Program();
        if(program.ValidateServerSig() == false) 
            return;
        program.DoOtherStuff();
    }

    public bool ValidateServerSig()
    {
        nextcheck = new Timer(state => ValidateServerSig());

        var response = DoRequest("http://remote-srv/signature");
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        var result = Utils.CheckPublicKeySignatureMatches(response);
        if(result.Valid == false)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }

        // setup next check
        nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
        return true;
    }
}

The answer is quite simple; look at the first line of the ValidateServerSig function. It sets up a new timer that recursively calls this function, so every time the function runs, it creates yet another timer. 15 seconds later, we are going to have two timers running this function.

And 15 seconds after that, we are going to have 4 timers, and 15 seconds after that…

But the interesting thing here is that those are timers, so they get scheduled on the thread pool. So while the growth rate of the number of tasks is phenomenal, in practice they don't all get to run together. After 20 minutes of this, it will only have run about 800 times. Because the thread pool grows slowly, we'll end up with a much slower growth rate, but eventually the thread pool queues are going to be filled with nothing but this task. And since we typically also process requests on the thread pool, eventually requests cannot be handled, because the threads are always busy running the validation code.

In the real codebase, the timing was 5 minutes, and this issue typically manifested itself only after several weeks of continuous running.
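A minimal sketch of the fix (one obvious shape for it; the post doesn't show the actual change) is to create the timer once and only reschedule it on later runs, so the number of timers stays constant:

public bool ValidateServerSig()
{
    // Create the timer once; later calls only reschedule the existing one,
    // so the number of timers stays at one instead of doubling every run.
    if (nextcheck == null)
        nextcheck = new Timer(state => ValidateServerSig());

    var response = DoRequest("http://remote-srv/signature");
    if (response.Failed)
    {
        var copy = ServerSigFailed;
        if (copy != null) copy(this, EventArgs.Empty);
        return false;
    }

    // ... signature check elided, same as above ...

    // setup next check
    nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
    return true;
}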

Find the bug: The case of the degrading system

time to read 2 min | 395 words

The following code contains a bug, can you spot it? 

class Program
{
    private Timer nextcheck;
    public event EventHandler ServerSigFailed;
    static void Main(string[] args)
    {
        var program = new Program();
        if(program.ValidateServerSig() == false) 
            return;
        program.DoOtherStuff();
    }

    public bool ValidateServerSig()
    {
        nextcheck = new Timer(state => ValidateServerSig());

        var response = DoRequest("http://remote-srv/signature");
        if(response.Failed)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }
        var result = Utils.CheckPublicKeySignatureMatches(response);
        if(result.Valid == false)
        {
            var copy = ServerSigFailed;
            if(copy!=null) copy(this, EventArgs.Empty);
            return false;
        }

        // setup next check
        nextcheck.Change(TimeSpan.FromSeconds(15), TimeSpan.FromSeconds(15));
        return true;
    }
}

What are the implications of this bug?

Production postmortem: The case of the “it is slow on that machine (only)”

time to read 4 min | 661 words

A customer called with a major issue: on a customer machine, a particular operation took too long. In fact, it took quite a bit more than too long. Instead of the few milliseconds, or (at worst) seconds, the customer saw a value in the many minutes.

At first, we narrowed it down to an extreme load on the indexing engine. The customer had a small set of documents that were referenced, using LoadDocument, by a large number of other documents. That meant that whenever those documents were updated, we would need to reindex all the referencing documents.

In their case, that was in the tens to hundreds of thousands of referencing documents. So an update to a single document could force re-indexing of a quarter of a million documents. Except… that this wasn’t actually the case. What drove everyone crazy was that this was a reasonable, truthful and correct answer, and yet on one machine the exact same thing took 20 – 30 seconds, while on the customer machine the process took 20 minutes.

The customer also assured us that those documents that everyone referenced were very rarely, if ever, touched or modified, so that shouldn’t have been the issue.

The machine with the problem was significantly more powerful than the one without the problem. This issue also started to occur recently, out of the blue. Tracing the resource utilization in the system showed moderate CPU usage, low I/O and memory consumption, and nothing much really going on. We looked at the debug logs, and we couldn’t really figure out what it was doing. There were very large gaps in the log where nothing seemed to be happening. % Time in GC was low, so that ruled out a long GC pause that would explain the gaps in the logs.

This is in version 2.5, which predates all of our introspection efforts, so figuring out what was going on was pretty hard. I’ll have another post talking about that in this context later.

Eventually we gained access to the machine, were able to reproduce this, and took a few mini dumps along the way. Looking at the stack traces, we found this:

[Stack trace screenshot: the time is being spent in the suggestions indexing code]

And now it all became clear. Suggestions in RavenDB is a really cool feature, which allows you to ask RavenDB to figure out what the user actually meant to ask. It is also extremely CPU intensive during indexing, which is really visible when you try to pump a large number of documents through it. And it is a single threaded process.

Except that the customer wasn’t using Suggestions in their application…

So, what happened? In order to hit this issue, the following things all needed to happen:

  • Suggestions had to be enabled on the relevant index/indexes. Check – while the customer wasn’t currently using them, they had used them in the past, and unfortunately that stuck.
  • A very large number of documents needed to be indexed. Check – that happened when they updated one of the commonly referenced documents.
  • A commonly referenced document needed to be modified. Check – that happened when they started work for the next year, which modified those rarely touched documents.

Now, why didn’t it manifest itself on the other machines? Simple: on those machines, they used the current version of the application, which didn’t use suggestions. On the problematic machines, they had upgraded to the new version, so even though they weren’t using suggestions, the setting was still in effect, and still running.

According to a cursory check, those suggestions had been running there for over 6 months, and no one noticed, because you needed the confluence of all three aspects to actually get this issue.

Removing the suggestions once we knew they were there was very easy, and the problem was resolved.

Configuration is user input too

time to read 3 min | 587 words

Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.

The problem is that developers generally don’t consider that the system configuration is also user input, and it should be treated with suspicion. It can be something as obvious as a connection string that is malformed, or just wrong. Or it can be that the user has put executionFrequency=“Monday 9AM” into the field. Typically, at least, those are fairly easy to discover and handle, although most systems have fairly poor behavior when their configuration doesn’t match their expectations (you typically get some error, usually with very few details, and frequently the system won’t start, so you need to dig into the logs…).

Then there are the perfectly valid and legal configuration items, such as dataDirectory=”H:\Databases\Northwind” when the H: drive is inaccessible for some reason (a SAN drive that was snatched away), or listenPort=”1234” when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially if this is an unattended application (a service, for example).

The problem is that we tend to assume that the people modifying the system configuration actually know what they are doing, because they are the system administrators. That is not always the case, unfortunately, and I’m not talking about just issues of a wrong path or a busy port.

In a recent case, we had a sysadmin call us about high memory usage when creating an index. As it turned out, they had set the minimum number of items to load from disk to 10,000. And they have large documents (in the MBs range). The problem was that this configuration meant that before we could index, we had to load 10,000 documents into memory (each of them about 1 MB on average), and only then could we start indexing (which has its own memory costs). That means 10GB of documents to load before indexing even starts. That resulted in pushing other stuff out of memory and in general slowed things down considerably, because each indexing batch had to be at least 10GB in size.

We also couldn’t reduce the memory usage by reducing the batch size (as would normally be the case under memory pressure), because the minimum amount was set so high.

In another case, a customer was experiencing a high I/O write rate. When we investigated, it looked like this was because of a very high fan out rate in the indexes. There is a limit on the number of fan out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They had disabled this limit and gone on with very high fan out rates, with the predictable result of issues because of it.
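For illustration, here is the kind of sanity check this implies: validate a configured value against limits derived from the machine instead of accepting it verbatim. The setting name and limits here are hypothetical, not actual RavenDB configuration, and as the rest of the post argues, this still leaves the question of how far to second-guess the administrator:

using System;

public static class ConfigSanity
{
    // Hypothetical example: clamp a user-supplied "minimum items per indexing
    // batch" setting so the minimum batch can always fit comfortably in memory.
    public static int GetMinBatchSize(int configuredValue, long availableMemoryInMb, int averageDocSizeInMb)
    {
        const int absoluteMin = 16;

        // Never allow a minimum batch that needs more than a quarter of the
        // memory currently available on the machine.
        var maxByMemory = (int)Math.Max(absoluteMin,
            availableMemoryInMb / 4 / Math.Max(1, averageDocSizeInMb));

        var effective = Math.Min(Math.Max(configuredValue, absoluteMin), maxByMemory);
        if (effective != configuredValue)
        {
            Console.WriteLine(
                "Configured minimum batch size {0} is outside the safe range [{1}, {2}]; using {3}.",
                configuredValue, absoluteMin, maxByMemory, effective);
        }

        return effective;
    }
}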

So now the problem is what to do?

On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote a recent ops guy I spoke to, referring to a service that his company wrote: “How this works? I have no idea, I’m in operations, I just need it to work.”

So the question is, how do you handle such scenario? And no, writing warnings to the log won’t do, no one reads that.
