Ayende @ Rahien

It's a girl

The bug that ruined my weekend

This is bloody strange. I have a test failing in our unit tests, which isn't an all too uncommon occurrence after a big piece of work. The only problem is that this test shouldn't fail; no one touched this part.

For reference, here is the commit where this is failing. You can reproduce this by running the Raven.Tryouts console project.

Note that it has to be done in Release mode. When that happens, we consistently get the following error:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at Raven.Client.Connection.MultiGetOperation.<TryResolveConflictOrCreateConcurrencyException>d__b.MoveNext() in c:\Work\ravendb\Raven.Client.Lightweight\Connection\MultiGetOperation.cs:line 156
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

Here is the problem with this stack trace:

image

This only happens in release mode, but it happens there consistently. I've verified that this isn't an issue of me running an old version of the code, so that isn't it. There is no concurrency going on; all the data this method is touching is only touched by this thread.

What is more, the exception is not thrown from inside the foreach loop on line 139. I've verified that by putting a try/catch around the body of the loop, and the NRE is still thrown outside of it. In fact, I have tried to figure it out in pretty much every way I can. Attaching a debugger makes the problem disappear, as does disabling optimizations, or anything like that.

In fact, at this point I'm going to throw up my hands in disgust, because this is not something that I can figure out. Here is why: this is my "fix" for this issue. I replaced the following failing code:

image

With the following passing code:

image

And yes, this should make zero difference to the actual behavior, but it does. I’m suspecting a JIT issue.


Solve this bug!

This took some head scratching before I figured it out.

image

And here is the relevant code:

image


Fixing a production issue

So we had a problem in our production environment. It showed up like this.

image

The first thing that I did was log into our production server, and look at the logs for errors. This is part of the new operational features that we have, and it was a great time to try it under real world conditions:

image

This gave me a subscription to the log, which gave me the exact error in question:

image

From there, I went to look at the exact line of code causing the problem:

image

Can you see the issue?

We created a dictionary that was case sensitive, and then used that to create a dictionary that was case insensitive. The actual fix was adding the ignore case comparer to the group by clause, and then pushing to production.
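
The screenshots are gone, so here is a minimal, hypothetical sketch of the pattern (the names are invented for illustration, this isn't the actual production code): grouping with the default, case sensitive comparer and then building a case insensitive dictionary from those groups blows up on keys that differ only by case, while adding the comparer to the GroupBy fixes it.

using System;
using System.Collections.Generic;
using System.Linq;

class CaseSensitivityRepro
{
    static void Main()
    {
        var headers = new[] { "Content-Type", "content-type", "Accept" };

        try
        {
            // Buggy: the groups are case sensitive, the dictionary is not, so
            // "Content-Type" and "content-type" collide when building the dictionary.
            var buggy = headers
                .GroupBy(h => h)
                .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
        }
        catch (ArgumentException e)
        {
            Console.WriteLine(e.Message); // "An item with the same key has already been added."
        }

        // Fixed: make the grouping itself use the ignore case comparer as well.
        var corrected = headers
            .GroupBy(h => h, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

        Console.WriteLine(corrected["CONTENT-TYPE"]); // 2
    }
}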


Google Domains won’t take my money

This is frustrating. I'm trying to use the new Google Domains (if only because I'm sick of GoDaddy's aggressive upsell approach).

This is what happened when I tried it.

image

That sucks, but I'm actually a business user for Google Apps, so I called their support. I'm currently on the 32nd minute of the call, and so far they have explained to me numerous times that this is not possible to do, and are still unable to tell me why they are refusing to take my card.

I can understand, beta service and all of that, but using Wallet isn’t beta, and not being able to tell me what is wrong is not cool.


All I wanted was to set my shortcuts

I'm trying to set the ReSharper keyboard shortcuts, and I got this:

image

So I waited, and waited, and waited.

Then I downloaded Process Monitor and tried to see what the hell was going on in here.

All I could figure out was that it was reading some persistent solution cache. So I closed Visual Studio and opened it without opening a solution. I then tried the same thing again, and it completed in a mere few tens of seconds.


Mixing Sync & Async calls

Take a look at the following code:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();   
for (int i = 0; i < 500; i++)
{
    var t = Task.Run(() =>
    {
        var webRequest = WebRequest.Create(new Uri("http://google.com"));
        webRequest.GetResponse().Close();
        Interlocked.Increment(ref count);
    });
    tasks.Add(t);
}
var whenAll = Task.WhenAll(tasks.ToArray());
while (whenAll.IsCompleted == false && whenAll.IsFaulted == false)
{
    Thread.Sleep(1000);
    Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x => x.IsCompleted == false));
}
Console.WriteLine(sp.Elapsed);

As you can see, it is a pretty silly example of making 500 queries to Google. There is some stuff here about reporting & managing, but the key point is that we start 500 tasks to do some I/O. How long do you think this is going to run?

On my machine, this code completes in roughly 22 seconds. Now, WebRequest is old school, and we want to use HttpClient, because it has a much better interface, so we wrote:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();
for (int i = 0; i < 500; i++)
{
    var t = Task.Run(() =>
    {
        var client = new HttpClient();
        client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com"))).Wait();
 
        Interlocked.Increment(ref count);
    });
    tasks.Add(t);
}
var whenAll = Task.WhenAll(tasks.ToArray());
while (whenAll.IsCompleted == false && whenAll.IsFaulted == false)
{
    Thread.Sleep(1000);
    Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x => x.IsCompleted == false));
}
Console.WriteLine(sp.Elapsed); 

I've been running this code on my machine for the past 7 minutes, and it hasn't yet issued a single request.

It took a while to figure it out, but the problem is in how this is structured. We are creating a lot of tasks, more than the available threads in the thread pool. Then we make an async call, and block on that. That means that we block the thread pool thread. Which means that to process the result of this call, we'll need to use another thread pool thread. However, we have scheduled more tasks than we have threads for. So there is no thread available to handle the reply; all the threads are stuck waiting for replies, and there is no free thread to unstick them.

The thread pool notices that, and will decide to allocate more threads, but they are also taken up by the already existing tasks that will immediately block.

Now, surprisingly, eventually the thread pool will allocate enough threads (although it will probably take it several hours to do so) to start handling the requests, and the issue is resolved. Except that this ends up basically crippling the application while it is happening.

Obviously, the solution is to not block on async calls inside a task like that; indeed, we can use the following code quite easily:

var t = Task.Run(async () =>
{
    var client = new HttpClient();
    await client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com")));

    Interlocked.Increment(ref count);
});

And this shows performance comparable to the WebRequest method.

However, that isn't very helpful when we need to expose a synchronous API. In our case, in RavenDB we have the IDocumentSession, which is a simple synchronous API. We want to use a common codebase for sync and async operations, but we need to expose the same interface, even if we changed things. To make things worse, we have no control over how the http client is scheduling the async operations, and it has no sync operations.

That means that we are left with the choice of writing everything twice, once for synchronous operations and once for async ops, or just living with this issue.

A workaround we managed to find is to run the root tasks (which we do control) on our own task scheduler, so they won't compete with the I/O operations. But that is hardly something that I am thrilled about.
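
For reference, here is a rough sketch of that workaround idea, not the actual RavenDB code: if the blocking root work runs on a dedicated thread instead of a pool thread (the default scheduler treats TaskCreationOptions.LongRunning as a request for one), the pool stays free to run the async I/O continuations.

var t = Task.Factory.StartNew(() =>
{
    var client = new HttpClient();
    // Blocking here no longer starves the pool: this lambda runs on a dedicated
    // thread, so a pool thread is still available to complete SendAsync.
    client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com"))).Wait();

    Interlocked.Increment(ref count);
}, CancellationToken.None, TaskCreationOptions.LongRunning, TaskScheduler.Default);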


My poor little blameless Voron

I’m currently working on a project using Voron (although only incidentally), and I was horrified to get the FatalExecutionEngineException that was discussed previously. Naturally, I assumed that this is something that I did wrong in Voron.

image

After a lot of work, I managed to narrow it down to… not Voron. To be rather more exact, it is Voron that is causing it, but it isn’t Voron’s fault.

The actual issue is the fact that I accidentally used Json.NET to serialize a Stream that I got from Voron. And Voron is giving us an UnmanagedMemoryStream. I kept thinking that I was doing something wrong with releasing memory, but as it turns out, here is a very small repro:

using Newtonsoft.Json;

unsafe static void Main(string[] args)
{
    JsonConvert.SerializeObject(new Foo { Ptr = (byte*)0 });
}

public unsafe class Foo
{
    public byte* Ptr { get; set; }
}

And that is enough to get the above mentioned error.

What actually happens is that Json.NET uses dynamic IL generation to optimize accessing properties, and it just doesn't know how to handle pointer properties. What it ended up doing is generating invalid IL, which resulted in a crash when we tried to actually use it.

Nothing to do with Voron at all, just a class with a pointer property that we attempted to serialize. Nasty bug, and very annoying to figure out.
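
If you run into the same thing, one possible way to sidestep it (a sketch, not something covered in the original investigation) is simply to keep Json.NET away from the raw pointer, so the serializer never generates an accessor for it:

using System;
using Newtonsoft.Json;

public unsafe class Foo
{
    [JsonIgnore]  // keep the serializer from ever touching the pointer accessor
    public byte* Ptr { get; set; }

    // If the numeric value is actually wanted in the output, expose it separately.
    public long PtrValue
    {
        get { return (long)Ptr; }
    }
}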


Today’s work is…

It isn’t enough to have just an ExecutionEngineException, I had to have this:

image


Stack overflow… yeah, right!

My app crashed with a stack overflow exception. That would be fine, except:

image

 

To be fair, that was under low memory conditions, since I was leaking virtual memory. So that makes sense, still...


Performance counters sucks

A while ago we added monitoring capabilities to RavenDB via performance counters. The intent was to give our users the ability to easily see exactly what is going on with RavenDB.

The actual usage, however, was a lot more problematic.

  • The performance counters API can just hang, effectively killing us (since we try to initialize it as part of the db setup routine).
  • They require specific system permissions, and can fail without them.
  • They get corrupted, for mysterious reasons, and then you need to reset them all.
  • Even after you created them, they can still die on you for no apparent reason.

I would have been willing to assume that we are doing something really stupid. Except that SignalR had similar issues.

What is worse, it appears that using the performance counters needs to iterate the list of printers on the machine. 

This is really ridiculous. And it is to the point in which I am willing to just give it up entirely, if only I had something that I could use to replace it with.

Currently I have thrown a lot of try/catch and a background thread at it to hide the problem, but it is ugly and brittle.
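
The "hide it" approach currently looks roughly like the sketch below (names are illustrative, this isn't the actual RavenDB code): create the counter on a background thread, swallow any failure, and fall back to a no-op so the database setup routine can never hang or die because of it.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

public class SafeCounter
{
    private volatile PerformanceCounter counter; // stays null if creation never succeeds

    public SafeCounter(string category, string name, string instance)
    {
        // Fire and forget: if the performance counter API hangs or throws,
        // we simply never get a counter and Increment() remains a no-op.
        Task.Run(() =>
        {
            try
            {
                counter = new PerformanceCounter(category, name, instance, readOnly: false);
            }
            catch (Exception)
            {
                // missing permissions, corrupted counters, etc. - give up silently
            }
        });
    }

    public void Increment()
    {
        var current = counter;
        if (current == null)
            return;
        try
        {
            current.Increment();
        }
        catch (Exception)
        {
            // counters can die even after they were successfully created
        }
    }
}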


Is select() broken? Memory mapped files with unbuffered writes == race condition?

Let me start this post by stating that I am not even sure if what I am trying to do is legal here. But from reading the docs, it does appear to be a valid use of the API, and it does work, most of the time.

The full code can be found here: https://gist.github.com/ayende/7495987

The gist of it is that I am trying to do two things:

  • Write to a file opened with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING.
  • Read from this file using a memory map.
Occasionally, I get into situations where, after I wrote to the file, I am not reading what I wrote.

I have a repro, and we reproduced this on multiple machines. Both Windows 7 and Windows 8.

Here is the relevant code (the full code is in the link), explanation on it below:

const uint nNumberOfBytesToWrite = 4096*3;
var buffer = (byte*)(VirtualAlloc(IntPtr.Zero, new UIntPtr(nNumberOfBytesToWrite), AllocationType.COMMIT, MemoryProtection.READWRITE)
            .ToPointer());

for (int i = 0; i < nNumberOfBytesToWrite; i++)
{
    *(buffer + i) = 137;
}

var g = Guid.NewGuid().ToString();

var safeHandle = CreateFile(g,
    NativeFileAccess.GenericRead | NativeFileAccess.GenericWrite,
    NativeFileShare.Read, IntPtr.Zero,
    NativeFileCreationDisposition.OpenAlways,
    NativeFileAttributes.Write_Through | NativeFileAttributes.NoBuffering | NativeFileAttributes.DeleteOnClose,
    IntPtr.Zero);

var fileStream = new FileStream(safeHandle, FileAccess.ReadWrite);
fileStream.SetLength(1024 * 1024 * 1024); // 1gb

if (safeHandle.IsInvalid)
{
    throw new Win32Exception();
}

FileStream mms = fileStream;
//mms = new FileStream(g, FileMode.Open, FileAccess.Read, FileShare.ReadWrite | FileShare.Delete);
var mmf = MemoryMappedFile.CreateFromFile(mms, Guid.NewGuid().ToString(), fileStream.Length,
    MemoryMappedFileAccess.Read, null, HandleInheritability.None, true);

MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor(0, fileStream.Length, MemoryMappedFileAccess.Read);
byte* ptr = null;
accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);

Task.Factory.StartNew(() =>
{
    long lastPos = 0;
    while (true)
    {
        int count = 0;
        while (true)
        {
            if (*(ptr + lastPos) != 137)
            {
                break;
            }
            lastPos += 4096;
            count++;
        }
        Console.WriteLine();
        Console.WriteLine("Verified {0} MB", count * 4 / 1024);
        Console.WriteLine();
        Thread.Sleep(2000);
    }
});

for (int i = 0; i < 1024*64; i++)
{
    var pos = i*nNumberOfBytesToWrite;
    if (i%100 == 0)
        Console.Write("\r{0,10:#,#} kb", pos/1024);
    var nativeOverlapped = new NativeOverlapped
    {
        OffsetHigh = 0,
        OffsetLow = (int) pos
    };

    uint written;
    if (WriteFile(safeHandle, buffer, nNumberOfBytesToWrite, out written, &nativeOverlapped) == false)
        throw new Win32Exception();

    for (int j = 0; j < 3; j++)
    {
        if (*(ptr + pos) != 137)
        {
            throw new Exception("WTF?!");
        }
        pos += 4096;
    }
}

This code is doing the following:

  • We setup a file handle using NoBuffering | Write_Through, and we also map the file using memory map.
  • We write 3 pages (12Kb) at a time to the file.
  • After the write, we are using memory map to verify that we actually wrote what we wanted to the file.
  • _At the same time_ we are reading from the same memory in another thread.
  • Occasionally, we get an error where the data we just wrote to the file cannot be read back.

Now, here is what I think is actually happening:

  • When we do an unbuffered write, Windows has to mark the relevant pages as invalid.
  • I _think_ that it does so before it actually performs the write.
  • If you have another thread that accesses that particular range of memory at the same time, it can load the _previously_ written data.
  • The WriteFile call actually performs the write, but the pages that map to that portion of the file have already been marked as loaded.
  • At that point, when we use the memory mapped pointer to access the data, we get the data that was there before the write.

As I said, the code above can reproduce this issue (you might have to run it multiple times).

I am not sure if this is a valid issue or just me misusing the API. The docs are pretty clear about mixing regular I/O & memory mapped I/O: the OS is responsible for keeping them coherent with respect to one another. However, that is certainly not the case here.

It might be that I am using a single handle for both, and Windows does less checking when that happens? For what it is worth, I have also tried it using different handles, and I don’t see the problem in the code above, but I have a more complex scenario where I do see the same issue.

Of course, FILE_FLAG_OVERLAPPED is not specified, so what I would actually expect here is serialization of the I/O, according to the docs. But mostly I need a sanity check to tell me if I am crazy.


Spot the bug: Unmanaged memory traps

Sometimes I forget how good life is in the managed lane. Then I do some unmanaged work and get a good reality check.

Let us look at the following structures:

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct PageHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public ushort Lower;

        [FieldOffset(10)] public ushort Upper;

        [FieldOffset(12)] public int OverflowSize;

        [FieldOffset(16)] public int ItemCount;
    }


    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct FileHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public LogHeader Active;

        [FieldOffset(44)] public LogHeader Backup;

        [FieldOffset(80)] public TreeRootHeader Root;

        [FieldOffset(142)] public TreeRootHeader FreeSpace;
    }

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct TreeRootHeader
    {
        [FieldOffset(0)] public long RootPageNumber;
        [FieldOffset(8)] public long BranchPages;
        [FieldOffset(16)] public long LeafPages;
        [FieldOffset(32)] public long OverflowPages;
        [FieldOffset(40)] public long PageCount;
        [FieldOffset(48)] public long EntriesCount;
        [FieldOffset(56)] public int Depth;
        [FieldOffset(60)] public TreeFlags Flags;
    }

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct LogHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public long LastLog;

        [FieldOffset(16)] public long LastLogPage;

        [FieldOffset(24)] public int ItemCount;

        [FieldOffset(28)] public long Options;
    }

And now we have the following code:

private static unsafe void Main()
{
    IntPtr pagePtr = Marshal.AllocHGlobal(4096);

    var pageHeader = (PageHeader*) pagePtr.ToPointer();
    pageHeader->ItemCount = 2;
    pageHeader->Marker = 0x128314543423;
    pageHeader->OverflowSize = 32;

    FileHeader* fileHeader = (FileHeader*) pageHeader + sizeof (PageHeader);

    fileHeader->Root.BranchPages = 0;


    Marshal.FreeHGlobal(pagePtr);
}

The fun part about this code is that it would silently corrupt the state of the process.

Here is what happens when you run it under the debugger:

image

Can you figure out why?


Sites outage

We had an outage that appears to have lasted roughly 12 hours.

The reason it took so long to fix is that it was after business hours, and while we have production support for our clients, we never hooked up our own websites to our own system. A typical case of the barefoot shoemaker.

The reason for the outage? Also pretty typical:

image

The reason for that? We had a remote backup process that put some temp files and didn’t clean them up properly. The growth rate was about 3-6 MB a day, so no one really noticed.

The fix:

image

All is working now, and I'm sorry for the delay in fixing this. We'll be having some discussions here to see how we can avoid repeat issues like that.


Diagnosing RavenDB problem in production

We have run into a problem in our production system. Luckily, this is a pretty obvious case of misdirection.

Take a look at the stack trace that we have discovered:

image

The interesting bit is that this is an “impossible” error. In fact, this is the first time that we have actually seen this error ever.

But looking at the stack trace tells us pretty much everything. The error happens in Dispose, but that one is called from the constructor. Because we are using native code, we need to make sure that an exception in the ctor will properly dispose all unmanaged resources.

The code looks like this:

image

And here we can see the probable cause of the error. We try to open a transaction, then an exception happens, and Dispose is called, but it isn't ready to handle this scenario, so it throws.

The original exception is masked, and you have a hard debugging scenario.
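
Since the actual code is only in the screenshots, here is a hypothetical sketch of the pattern (invented names): a constructor that acquires an unmanaged resource and calls Dispose() on failure, and a Dispose() that must tolerate a partially constructed object, or it will mask the original exception.

using System;
using System.Runtime.InteropServices;

public class NativeSession : IDisposable
{
    private IntPtr handle; // still IntPtr.Zero if the ctor failed before acquiring it

    public NativeSession(int size)
    {
        try
        {
            if (size < 0)
                throw new ArgumentOutOfRangeException("size"); // the "real" error
            handle = Marshal.AllocHGlobal(size);
        }
        catch
        {
            Dispose(); // release whatever unmanaged state we did manage to acquire...
            throw;     // ...but if Dispose() itself throws, the original exception is lost
        }
    }

    public void Dispose()
    {
        // Defensive version: tolerate a half built object, so the exception that
        // escapes the ctor is the real one, not a NullReferenceException from here.
        if (handle != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(handle);
            handle = IntPtr.Zero;
        }
    }
}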


The saga of the race condition re-fix

This is a story about a bug, RavenDB-1280, to be exact. It is really bad when we start with a race condition bug, and it only gets worse from there.

The story begins with the following index:

public class EmailIndex : AbstractIndexCreationTask<EmailDocument, EmailIndexDoc>
{
    public EmailIndex()
    {
        Map =
            emails => from email in emails
                    let text = LoadDocument<EmailText>(email.Id + "/text")                 
                    select new
                            {
                                email.To,
                                email.From,
                                email.Subject,
                                Body = text == null ? null : text.Body
                            };
    }
}

So far, this  is a pretty standard index using LoadDocument. The LoadDocument feature allows us to also index data from related documents, and the database will ensure that when the loaded document is changed, the requesting document will also be re-indexed.

What does this mean? It means that when we change the text of an email, we will also be re-indexing that email's document. Internally, we store it in something that looks like this (using a relational model because it is probably easier to follow for most of you):

CREATE TABLE DocumentReferences
{
   Source,
   Reference
}

So in the case above, we will have (Source: emails/1 and Reference: emails/1/text).

Whenever we modify a document, we check the stored references for matches. If we were using a relational database, the code would be something like:

foreach ( var src in (SELECT Source FROM DocumentReferences WHERE Reference = @ref))
{
   TouchDocument(src);
}

So far, so good. This ensures that we always keep the index up to date. Obviously, this also means that the indexing process needs to update the references as well. Which brings us to the problematic part:

Parallel.For(0, iterations, i =>
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailDocument {Id = "Emails/"+ i,To = "root@localhost", From = "nobody@localhost", Subject = "Subject" + i});
        session.SaveChanges();
    }

    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailText { Id = "Emails/" + i + "/text", Body = "MessageBody" + i });
        session.SaveChanges();
    }
});

This code will result in an interesting edge case. Because we are using two different sessions (and thus two different transactions), it is possible for the index to pick up the emails/1 document and start indexing it, but not pick up the emails/1/text document (it hasn’t been saved yet).

However, during the save of emails/1/text document, there wouldn’t be anything in the references storage, so we won’t know that we need to re-index emails/1. The result is that we violated our promise of re-indexing if the document changed.

As I said, just setting up the problem requires parallel thinking and a severe case of headache. Luckily, we had a great bug report from Barry Hagan, which included an occasionally failing test. (Even just writing "occasionally failing" causes me to throw up a little bit.)

After we identified the problem, we tried to solve it by holding an in-memory list of modified documents that were also required (and missing) during indexing using LoadDocument. Don't bother to follow up on this statement; it is complex, and it is hard, and we did it. And it worked, except that it didn't. That just moved the race condition from the entire process to the edges of the process. We had two guys sitting on this for a couple of days, in danger of tearing their hair out, with no luck narrowing it down.

Eventually I sat down and tried to figure out how to deal with it. I was able to build on their work, and narrow down exactly where we had the race condition now. And just thinking about trying to solve it was way too hard. It would require multiple locks and pretty strange behavior overall. I wasn’t happy with the implications, and that code already created quite a lot of complications as it was.

So I started by reverting all of that code and had a clean slate to work with. Then I sat down and thought about it, and finally figured out a better way to do it. The problem was only relevant when we had a missing reference (an existing reference would properly generate the re-indexing under all conditions). So I started with that: I gathered up all of the missing references and stored them in a db task. A db task is a RavenDB notion of a transactionally safe way of registering things to run in another transaction. So after the indexing transaction was done, we would then go and execute that task.

The idea here is that doing it this way prevents us from having to deal with any issues regarding "what if the server crashes midway", etc. Anyway, it also means that this task is running in a separate transaction, after the indexing transaction is done. The only thing that this task is doing is forcing re-indexing of all the documents that tried to call LoadDocument but got null as a result. Because we are now in a separate transaction, there are only two options:

  • The related document is still not there, in which case we are merely doing some extra work.
  • The related document is there, and will be indexed.

The first case deserves some more attention. Even if the document that was missing arrives while we are re-indexing, we don't care; we already set up the references and committed them, so it would force yet another re-indexing, and in the end, everyone will be happy.
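
Here is a rough, self-contained sketch of that idea with invented names, not the actual RavenDB implementation: during indexing we remember every key for which LoadDocument() came back null, and after the indexing transaction commits, a separate transaction touches the documents that asked for them, forcing re-indexing.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public class MissingReferenceTracker
{
    // missing referenced key -> the documents that tried to LoadDocument() it
    private readonly ConcurrentDictionary<string, HashSet<string>> missing =
        new ConcurrentDictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    public void ReportMissing(string sourceDoc, string referencedKey)
    {
        var sources = missing.GetOrAdd(referencedKey, _ => new HashSet<string>());
        lock (sources)
            sources.Add(sourceDoc); // e.g. ("emails/1", "emails/1/text")
    }

    // This runs as the "db task", in its own transaction, after indexing committed.
    // Worst case the referenced document is still missing and we only did extra work.
    public void ExecuteTask(Action<string> touchDocument)
    {
        foreach (var sources in missing.Values)
        {
            lock (sources)
            {
                foreach (var source in sources)
                    touchDocument(source); // bump the document so it gets indexed again
            }
        }
    }
}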

And now, I have written this blog post while I was running a stress test on this feature, but it took me long enough that I am sure it works properly. So I’ll call it a day and go do something more fun, maybe 3.0 work.


Challenge: Spot the bug

Speaking of unfair interview questions, this would be a pretty evil one.

image

Can you see the bug? And how would you fix it?


Diagnosing a production issue on our servers

One of the steps that we take before releasing a stable version is to push the latest build into our own production servers, and see what goes on. By far, this has been a pretty pleasant process, and mostly served to increase our confidence that we can go to production with that version. But sometimes it does what it is supposed to do and finds the sort of bugs that are very hard to run into outside of production.

In this case, after several hours (8 – 12, I am guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn’t really any way that you could get this error, indeed, that is pretty much the first time we saw that outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls. This gave us the ability to have a commit pending between the Prepare & Commit phases of the operation.

Reviewing the code, I found some places where we weren't properly disposing the sessions, which could explain that. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.

The good thing about having this happen on our production servers is that I have full access there. Of course, it is production, so outright debugging it is out, but taking a dump and transferring that to my machine was easy enough.

Now, open it with WinDBG, run “.loadby sos clr” and start investigating.

First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting, and said that we caught the problem as it was happening, which was great.

Next, it was time to look a bit at the actual memory. I run: !DumpHeap -type Session

image

My reaction is Huh!!! There is absolutely zero justification for that.

Now, the only question is why. So I decided to look at the class that is holding the transaction state, assuming that this is probably what is holding onto all those sessions. I run: !DumpHeap -type EsentTransactionContext

image

And that tells me quite a lot. There appear to be a total of 317 in-flight DTC transactions. Considering that I know what our usage is, that is a huge number, and it tells me that something isn't right here. This is especially true when you consider that we don't have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState -stat

image

In other words, we have 8 loaded dbs, each of them holding their in-flight transactional state. And we have 317 open transactions and 35 thousand sessions. That is pretty ridiculous. Especially given that I know that we are supposed to have at most a single-digit number of concurrent DTC transactions at any one time. So somehow we are leaking transactions & sessions. But I am still very unhappy with just "we are leaking sessions". That is something that I knew before we started debugging everything.

I can already tell that we probably need to add a more robust way of expiring transactions, and I added that, but the numbers don’t add up to me. Since this is pretty much all I can do with WinDBG, I decided to use another tool, MemProfiler. This gives me the ability to import the dump file, and then analyze that in a much nicer manner. Doing so, I quickly found this out:

image

Huh?!

Sessions are finalizable, sure, but I am very careful about making sure to dispose of them. Especially after the previous code change. There should be just 317 undisposed sessions. And having that many items in the finalizer queue can certainly explain things. But I don't know how they got there. And the numbers don't match up, either. We are missing about 7K items from the WinDBG numbers.

Okay, next, I pulled ClrMD and wrote the following:

var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
var rt = dt.CreateRuntime(moduleInfo);

var clrHeap = rt.GetHeap();

var finalized = new HashSet<ulong>();

int cnt = 0;
foreach (var ptr in rt.EnumerateFinalizerQueue())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        finalized.Add(ptr);
    }
}
Console.WriteLine(finalized.Count);

var live = new HashSet<ulong>();
foreach (var ptr in clrHeap.EnumerateObjects())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        if (finalized.Contains(ptr) == false)
            live.Add(ptr);
    }
}
Console.WriteLine(live.Count);

This gave me 28,112 sessions in the finalizer queue and 7,547 sessions that are still live. So something is creating a lot of instances, but not using or referencing them?

I did a code review over everything once again, and I think that I got it. The culprit is this guy:

image

Where createContext is defined as:

image

Now, what I think is going on is that the concurrent dictionary, which is what transactionContexts is, might be calling createContext multiple times inside the GetOrAdd method. But those calls create values that have to be disposed… Now, in the normal course of things, the worst case scenario is that we would have them in the finalizer queue and they would be disposed in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.

At least, this is my current theory. I changed the code to be like this:

image

So if the value I created wasn't the one that was actually added, I'll properly dispose of it. I'll be pushing this to production in a bit and seeing what is happening. Note that there isn't any locking, so we might be generating multiple sessions. That is fine, as long as only one of them survives.
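
Here is a hedged sketch of that fix with hypothetical names (the real code is in the screenshots above): GetOrAdd may call the factory more than once under contention, so any context we created that did not end up in the dictionary gets disposed immediately instead of waiting for the finalizer.

using System;
using System.Collections.Concurrent;

public class TransactionContexts
{
    private readonly ConcurrentDictionary<string, EsentTransactionContext> transactionContexts =
        new ConcurrentDictionary<string, EsentTransactionContext>();

    public EsentTransactionContext GetOrCreate(string txId)
    {
        EsentTransactionContext created = null;
        var context = transactionContexts.GetOrAdd(txId, _ => created = CreateContext());

        // If GetOrAdd raced and kept another thread's value, release the one we made,
        // instead of leaving its session to linger in the finalizer queue.
        if (created != null && ReferenceEquals(created, context) == false)
            created.Dispose();

        return context;
    }

    private EsentTransactionContext CreateContext()
    {
        return new EsentTransactionContext();
    }
}

// Stand-in for the real context, which owns an Esent session.
public class EsentTransactionContext : IDisposable
{
    public void Dispose() { }
}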


Reasons to hate C/C++ work

The following code won't link. For fun, set_mapsize and set_flags are both defined in the same file, both in the .h and the .c, and set_mapsize works just fine.

image


Optimized, but way out of line

Yesterday I showed the following profiler results.

image

The first place someone would look at optimizing is the CaseInsensitiveComparator.Compare call, right? That is clearly taking a really long time.

In fact, we re-wrote that method to use pointer magic in order to get it to work faster. But that isn’t the issue. Look at the numbers. Compare is called 68 million times. That means that under the profiler, it is still completing an operation in 0.000073 ms. I am not even sure what time unit that is.

However, we can see that the major caller for Compare was FindGreaterOrEqual, right? And that is called only 11 thousand times. What does this tell you?

That tells me we have a Schlemiel the Painter's algorithm. So I double checked, and SkipList insert perf should be O(log N) on average. But what we were seeing is O(N) costs. Indeed, it appeared that we had a bug in which our skip list implementation would never do skips, essentially turning it into a sorted linked list, with insert perf of O(N). Once we fixed that bug… well, things got better, a lot better.
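
To make the "never do skips" failure mode concrete, here is a hypothetical illustration (the actual bug in the managed leveldb code may have looked different): if the random level generator can never return more than one level, every node lives only on the bottom lane and the skip list is just a sorted linked list.

using System;

public static class SkipListLevels
{
    private static readonly Random rnd = new Random();

    // Buggy: Next(2) returns 0 or 1, never 2, so the loop never promotes a node
    // and every insert degenerates to an O(N) walk of the bottom level.
    public static int BuggyRandomLevel(int maxLevel)
    {
        int level = 1;
        while (rnd.Next(2) == 2 && level < maxLevel)
            level++;
        return level;
    }

    // Fixed: promote with probability 1/2 per level, giving O(log N) inserts on average.
    public static int RandomLevel(int maxLevel)
    {
        int level = 1;
        while (rnd.Next(2) == 1 && level < maxLevel)
            level++;
        return level;
    }
}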

image

As you can see, the cost of this went down really hard. And now we can focus on other stuff.


Optimize this, dude!

So, we have a perf problem in the managed leveldb implementation. I pulled out dotTrace and looked at the numbers:

image

Do you see where the problem is?


Spot the bug…

image

Do you see it? It took me ages to find it.


TPL and the case of the !@#(*@! hung process

So, here I am writing some really fun code, when I found out that I am running into deadlocks in the code. I activated emergency protocols and went into deep debugging mode.

After being really thorough in figuring out several possible causes, I was still left with what is effectively a WTF @!(*!@ DAMN !(@*#!@* YOU !@*!@( outburst and a sudden longing for something to repeatedly hit.

Eventually, however, I figured out what was going on.

I have the following method: Aggregator.AggregateAsync(), inside which we have a call to the PulseAll method. That method will then go and execute the following code:

public void PulseAll()
{
    Interlocked.Increment(ref state);
    TaskCompletionSource<object> result;
    while (waiters.TryDequeue(out result))
    {
        result.SetResult(null);
    }
}

After that, I return from the method. In another piece of the code (Aggregator.Dispose) I am waiting for the task that is running the AggregateAsync method to complete.

Nothing worked! It took me a while before I figured out that I wanted to check the stack, where I found this:

image

Basically, I had a deadlock because when I called SetResult on the completion source (which freed the Dispose code to run), I actually switched over to that task and allowed it to run. Still on the same thread, but in a different task, I ran through the rest of the code and eventually got to Aggregator.Dispose(). Now, I could only get to it if the PulseAll() method was called. But, because we are on the same thread, that task hasn't been completed yet!

In the end, I “solved” that by introducing a DisposeAsync() method, which allowed us to yield the thread, and then the AggregateAsync task was completed, and then we could move on.

But I am really not very happy about this. Any ideas about the proper way to handle async & IDisposable?
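
For reference, the DisposeAsync() workaround mentioned above looks roughly like this (a sketch with invented field names, not the actual code): instead of blocking in Dispose(), we await the aggregation task, which yields the thread and lets AggregateAsync() actually finish.

public async Task DisposeAsync()
{
    _done = true;
    PulseAll();
    // Awaiting (instead of _aggregateTask.Wait()) yields this thread back,
    // so the AggregateAsync task can complete and we don't deadlock on ourselves.
    await _aggregateTask;
}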


Deadlocking with the TPL, how to

As I mentioned, I ran into a very nasty issue with the TPL. I am not sure if it is me doing things wrong, or an actual issue.

Let us look at the code, shall we?

We start with a very simple code:

public class AsyncEvent
{
    private volatile TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();

    public Task WaitAsync()
    {
        return tcs.Task;
    }

    public void PulseAll()
    {
        var taskCompletionSource = tcs;
        tcs = new TaskCompletionSource<object>();
        taskCompletionSource.SetResult(null);
    }
}

This is effectively an auto reset event. All the waiters will be released when PulseAll is called. Then we have this runner, which just executes work:

public class Runner : IDisposable
{
    private readonly ConcurrentQueue<TaskCompletionSource<object>> items =
        new ConcurrentQueue<TaskCompletionSource<object>>();
    private readonly Task<Task> _bg;
    private readonly AsyncEvent _event = new AsyncEvent();
    private volatile bool _done;

    public Runner()
    {
        _bg = Task.Factory.StartNew(() => Background());
    }

    private async Task Background()
    {
        while (_done == false)
        {
            TaskCompletionSource<object> result;
            if (items.TryDequeue(out result) == false)
            {
                await _event.WaitAsync();
                continue;
            }

            //work here, note that we do NOT use await!

            result.SetResult(null);
        }
    }

    public Task AddWork()
    {
        var tcs = new TaskCompletionSource<object>();
        items.Enqueue(tcs);

        _event.PulseAll();

        return tcs.Task;
    }

    public void Dispose()
    {
        _done = true;
        _event.PulseAll();
        _bg.Wait();
    }
}

And finally, the code that causes the problem:

public static async Task Run()
{
    using (var runner = new Runner())
    {
        await runner.AddWork();
    }
}

So far, it is all pretty innocent, I think you would agree. But this code hangs with a deadlock. Here is why:

image

Because tasks can share threads, we are on the Background task's thread, and we are trying to wait on that background task's completion.

Result, deadlock.

If I add:

await Task.Yield();

the problem goes away, because that forces this method to be completed on another thread. But that looks more like something that you add after you discover the bug, to be honest.
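
For what it is worth, a later framework version offers a cleaner option (a sketch assuming .NET 4.6 or newer, so not something that was available back then): create the TaskCompletionSource handed back from AddWork() with RunContinuationsAsynchronously, so completing it never runs the awaiter inline on the background task's own thread.

public Task AddWork()
{
    // RunContinuationsAsynchronously keeps Background()'s SetResult() from running
    // the awaiter of AddWork() (which goes on to call runner.Dispose()) inline on
    // the background task's own thread; it is scheduled on the thread pool instead.
    var tcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
    items.Enqueue(tcs);

    _event.PulseAll();

    return tcs.Task;
}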


Challenge: The problem of locking down tasks…

The following code has a very subtle bug:

public class AsyncQueue
{
    private readonly Queue<int> items = new Queue<int>();
    private volatile LinkedList<TaskCompletionSource<object>> waiters = new LinkedList<TaskCompletionSource<object>>();

    public void Enqueue(int i)
    {
        lock (items)
        {
            items.Enqueue(i);
            while (waiters.First != null)
            {
                waiters.First.Value.TrySetResult(null);
                waiters.RemoveFirst();
            }
        }
    }

    public async Task<IEnumerable<int>> DrainAsync()
    {
        while (true)
        {
            TaskCompletionSource<object> taskCompletionSource;
            lock (items)
            {
                if (items.Count > 0)
                {
                    return YieldAllItems();
                }
                taskCompletionSource = new TaskCompletionSource<object>();
                waiters.AddLast(taskCompletionSource);
            }
            await taskCompletionSource.Task;
        }
    }

    private IEnumerable<int> YieldAllItems()
    {
        while (items.Count > 0)
        {
            yield return items.Dequeue();
        }
    }
}

I’ll even give you a hint, try to run the following client code:

for (int i = 0; i < 1000 * 1000; i++)
{
    q.Enqueue(i);
    if (i % 100 == 0)
    {
        Task.Factory.StartNew(async () =>
            {
                foreach (var result in await q.DrainAsync())
                {
                    Console.WriteLine(result);
                }
            });
    }
}

Can you figure out what the problem is?
