Ayende @ Rahien

Refunds available at head office

All I wanted was to set my shortcuts

I'm trying to set the ReSharper keyboard shortcuts, and I got this:

image

So I waited, and waited, and waited.

Then I downloaded Process Monitor and tried to see what the hell was going on here.

All I could figure out was that it was reading some persistent solution cache. So I closed Visual Studio and opened it without opening a solution. I then tried the same thing again, and it completed in a mere few tens of seconds.


Mixing Sync & Async calls

Take a look at the following code:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();
for (int i = 0; i < 500; i++)
{
    var t = Task.Run(() =>
    {
        var webRequest = WebRequest.Create(new Uri("http://google.com"));
        webRequest.GetResponse().Close();
        Interlocked.Increment(ref count);
    });
    tasks.Add(t);
}
var whenAll = Task.WhenAll(tasks.ToArray());
while (whenAll.IsCompleted == false && whenAll.IsFaulted == false)
{
    Thread.Sleep(1000);
    Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x => x.IsCompleted == false));
}
Console.WriteLine(sp.Elapsed);

As you can see, it is a pretty silly example of making 500 requests to Google. There is some stuff here about reporting & managing, but the key point is that we start 500 tasks to do some I/O. How long do you think this is going to run?

On my machine, this code completes in roughly 22 seconds. Now, WebRequest is old school, and we want to use HttpClient, because it has a much better interface, so we wrote:

int count = 0;
var sp = Stopwatch.StartNew();
var tasks = new List<Task>();
for (int i = 0; i < 500; i++)
{
    var t = Task.Run(() =>
    {
        var client = new HttpClient();
        client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com"))).Wait();
 
        Interlocked.Increment(ref count);
    });
    tasks.Add(t);
}
var whenAll = Task.WhenAll(tasks.ToArray());
while (whenAll.IsCompleted == false && whenAll.IsFaulted == false)
{
    Thread.Sleep(1000);
    Console.WriteLine("{0} - {1}, {2}", sp.Elapsed, count, tasks.Count(x => x.IsCompleted == false));
}
Console.WriteLine(sp.Elapsed); 

I've been running this code on my machine for the past 7 minutes, and it hasn't yet issued a single request.

It took a while to figure it out, but the problem is in how this is structured. We are creating a lot of tasks, more than the available threads in the thread pool. Then we make an async call and block on it. That blocks the thread pool thread. Which means that to process the result of this call, we'll need another thread pool thread. However, we have scheduled more tasks than we have threads for. So there is no thread available to handle the reply, and all the threads end up stuck waiting for replies that there is no free thread to process.

The thread pool notices that, and will decide to allocate more threads, but they are also taken up by the already existing tasks that will immediately block.

Now, surprisingly, the thread pool will eventually allocate enough threads (although it will probably take it several hours to do so) to start handling the requests, and the issue resolves itself. Except that this ends up basically crippling the application while it is happening.
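If you want to watch the starvation happen, a quick way is to poll the thread pool from the reporting loop while the tasks are supposedly running; something like:

int workers, iocp;
ThreadPool.GetAvailableThreads(out workers, out iocp);
int maxWorkers, maxIocp;
ThreadPool.GetMaxThreads(out maxWorkers, out maxIocp);
// the number of busy worker threads keeps climbing while the completed request count stays at zero
Console.WriteLine("Busy worker threads: {0}/{1}", maxWorkers - workers, maxWorkers);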

Obviously, the solution is not to wait on async calls inside a task like that. Indeed, we can use the following code quite easily:

var t = Task.Run(async () =>
{
    var client = new HttpClient();
    await client.SendAsync(new HttpRequestMessage(HttpMethod.Get, new Uri("http://google.com")));

    Interlocked.Increment(ref count);
});

And this shows performance comparable to the WebRequest method.

However, that isn't very helpful when we need to expose a synchronous API. In our case, in RavenDB we have the IDocumentSession, which is a simple synchronous API. We want to use a common codebase for sync and async operations, but we need to expose the same interface, even if we change things internally. To make things worse, we have no control over how the HTTP client schedules the async operations, and it has no sync operations.

That means that we are left with the choice of writing everything twice, once for synchronous operations and once for async ones, or just living with this issue.

A workaround we managed to find is to run the root tasks (which we do control) on our own task scheduler, so they won't compete with the I/O operations. But that is hardly something that I am thrilled about.
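To give an idea of what that looks like, here is a minimal sketch of a scheduler that runs its tasks on dedicated threads instead of the thread pool; this is a simplified illustration, not the scheduler we actually ended up with:

public class DedicatedThreadsTaskScheduler : TaskScheduler
{
    private readonly BlockingCollection<Task> _tasks = new BlockingCollection<Task>();

    public DedicatedThreadsTaskScheduler(int threadCount)
    {
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(() =>
            {
                // each dedicated thread just pulls tasks off the queue and executes them
                foreach (var task in _tasks.GetConsumingEnumerable())
                    TryExecuteTask(task);
            }) { IsBackground = true };
            thread.Start();
        }
    }

    protected override void QueueTask(Task task)
    {
        _tasks.Add(task);
    }

    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued)
    {
        return false; // always go through our own threads
    }

    protected override IEnumerable<Task> GetScheduledTasks()
    {
        return _tasks.ToArray();
    }
}

The root tasks then get scheduled explicitly on it, e.g. Task.Factory.StartNew(work, CancellationToken.None, TaskCreationOptions.None, scheduler), so blocking them cannot take a thread pool thread away from the I/O completions.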


My poor little blameless Voron

I’m currently working on a project using Voron (although only incidentally), and I was horrified to get the FatalExecutionEngineException that was discussed previously. Naturally, I assumed that this is something that I did wrong in Voron.

image

After a lot of work, I managed to narrow it down to… not Voron. To be rather more exact, it is Voron that is causing it, but it isn’t Voron’s fault.

The actual issue is that I had accidentally been using Json.NET to serialize a Stream that I got from Voron. And Voron is giving us an UnmanagedMemoryStream. I kept thinking that I was doing something wrong with releasing memory, but as it turns out, here is a very small repro:

unsafe static void Main(string[] args)
{
    JsonConvert.SerializeObject(new Foo { Ptr = (byte*)0 });
}

public unsafe class Foo
{
    public byte* Ptr { get; set; }
}

And that is enough to get the above mentioned error.

What actually happens is that Json.NET is using dynamic IL generation to optimize accessing properties. And it just doesn't know how to handle pointer properties. What it ends up doing is generating invalid IL, which results in a crash when we try to actually use it.

Nothing to do with Voron at all, just a class with a pointer property being serialized. Nasty bug, and very annoying to figure out.
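One simple way to sidestep this, assuming you don't actually need the pointer in the serialized output, is to never hand the pointer-bearing type to Json.NET at all and serialize a plain DTO instead. Something along these lines (the DTO is made up for the example):

public class FooSnapshot // hypothetical DTO, holds no pointers
{
    public long PtrValue { get; set; } // a plain long is perfectly safe to serialize
}

// usage sketch:
// string json;
// unsafe { json = JsonConvert.SerializeObject(new FooSnapshot { PtrValue = (long)foo.Ptr }); }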


Today’s work is…

It isn't enough to have just an ExecutionEngineException; I had to have this:

image


Stack overflow… yeah, right!

My app crashed with a stack overflow exception. That would be fine, except:

image

 

To be fair, that was under low memory conditions, since I was leaking virtual memory. So that makes sense, still...


Performance counters suck

A while ago we added monitoring capabilities to RavenDB via performance counters. The intent was to give our users the ability to easily see exactly what is going on with RavenDB.

The actual usage, however, was a lot more problematic.

  • The performance counters API can just hang, effectively killing us (since we try to initialize it as part of the database setup routine).
  • They require specific system permissions, and can fail without them.
  • They get corrupted, for mysterious reasons, and then you need to reset them all.
  • Even after you created them, they can still die on you for no apparent reason.

I would have been willing to assume that we are doing something really stupid. Except that SignalR had similar issues.

What is worse, it appears that using the performance counters requires iterating over the list of printers on the machine.

This is really ridiculous. And it has gotten to the point where I am willing to just give them up entirely, if only I had something that I could use to replace them with.

Currently I have thrown a lot of try/catch blocks and a background thread at this to hide the problem, but it is ugly and brittle.
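The shape of that workaround is roughly the following; this is a sketch with made-up names, not the actual RavenDB code:

static PerformanceCounter TryCreateCounter(string category, string counterName, string instanceName)
{
    PerformanceCounter counter = null;
    var init = Task.Factory.StartNew(() =>
    {
        try
        {
            counter = new PerformanceCounter(category, counterName, instanceName, readOnly: false);
        }
        catch (Exception)
        {
            // missing permissions, corrupted counter data, etc. - give up on monitoring
        }
    });

    // the API can hang outright, so never wait on it without a timeout
    if (init.Wait(TimeSpan.FromSeconds(5)) == false || counter == null)
        return null; // callers treat null as "no performance counters available"

    return counter;
}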


Is select() broken? Memory mapped files with unbuffered writes == race condition?

Let me start this post by stating that I am not even sure if what I am trying to do is legal here. But from reading the docs, it does appear to be a valid use of the API, and it does work, most of the time.

The full code can be found here: https://gist.github.com/ayende/7495987

The gist of it is that I am trying to do two things:

  • Write to a file opened with FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING.
  • Read from this file using a memory map.

Occasionally, I get into situations where, after I wrote to the file, I am not reading back what I wrote.

I have a repro, and we reproduced this on multiple machines. Both Windows 7 and Windows 8.

Here is the relevant code (the full code is in the link), with an explanation below:

const uint nNumberOfBytesToWrite = 4096*3;
var buffer = (byte*)(VirtualAlloc(IntPtr.Zero, new UIntPtr(nNumberOfBytesToWrite), AllocationType.COMMIT, MemoryProtection.READWRITE)
            .ToPointer());

for (int i = 0; i < nNumberOfBytesToWrite; i++)
{
    *(buffer + i) = 137;
}

var g = Guid.NewGuid().ToString();

var safeHandle = CreateFile(g,
    NativeFileAccess.GenericRead | NativeFileAccess.GenericWrite,
    NativeFileShare.Read, IntPtr.Zero,
    NativeFileCreationDisposition.OpenAlways,
    NativeFileAttributes.Write_Through | NativeFileAttributes.NoBuffering | NativeFileAttributes.DeleteOnClose,
    IntPtr.Zero);

var fileStream = new FileStream(safeHandle, FileAccess.ReadWrite);
fileStream.SetLength(1024 * 1024 * 1024); // 1gb

if (safeHandle.IsInvalid)
{
    throw new Win32Exception();
}

FileStream mms = fileStream;
//mms = new FileStream(g, FileMode.Open, FileAccess.Read, FileShare.ReadWrite | FileShare.Delete);
var mmf = MemoryMappedFile.CreateFromFile(mms, Guid.NewGuid().ToString(), fileStream.Length,
    MemoryMappedFileAccess.Read, null, HandleInheritability.None, true);

MemoryMappedViewAccessor accessor = mmf.CreateViewAccessor(0, fileStream.Length, MemoryMappedFileAccess.Read);
byte* ptr = null;
accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);

Task.Factory.StartNew(() =>
{
    long lastPos = 0;
    while (true)
    {
        int count = 0;
        while (true)
        {
            if (*(ptr + lastPos) != 137)
            {
                break;
            }
            lastPos += 4096;
            count++;
        }
        Console.WriteLine();
        Console.WriteLine("Verified {0} MB", count * 4 / 1024);
        Console.WriteLine();
        Thread.Sleep(2000);
    }
});

for (int i = 0; i < 1024*64; i++)
{
    var pos = i*nNumberOfBytesToWrite;
    if (i%100 == 0)
        Console.Write("\r{0,10:#,#} kb", pos/1024);
    var nativeOverlapped = new NativeOverlapped
    {
        OffsetHigh = 0,
        OffsetLow = (int) pos
    };

    uint written;
    if (WriteFile(safeHandle, buffer, nNumberOfBytesToWrite, out written, &nativeOverlapped) == false)
        throw new Win32Exception();

    for (int j = 0; j < 3; j++)
    {
        if (*(ptr + pos) != 137)
        {
            throw new Exception("WTF?!");
        }
        pos += 4096;
    }
}

This code is doing the following:

  • We setup a file handle using NoBuffering | Write_Through, and we also map the file using memory map.
  • We write 3 pages (12KB) at a time to the file.
  • After the write, we are using memory map to verify that we actually wrote what we wanted to the file.
  • _At the same time_ we are reading from the same memory in another thread.
  • Occasionally, we get an error where the data we just wrote to the file cannot be read back.

Now, here is what I think is actually happening:

  • When we do an unbuffered write, Windows has to mark the relevant pages as invalid.
  • I _think_ that it does so before it actually performs the write.
  • If you have another thread that access that particular range of memory at the same time, it can load the _previously_ written data.
  • The WriteFile call actually performs the write, but the pages that map to that portion of the file have already been marked as loaded.
  • At that point, when we use the memory mapped pointer to access the data, we get the data that was there before the write.

As I said, the code above can reproduce this issue (you might have to run it multiple times).

I am not sure if this is a valid issue or just me misusing the API. The docs are pretty clear about mixing regular I/O & memory mapped I/O: the OS is responsible for keeping them coherent with respect to one another. However, that is certainly not the case here.

It might be that I am using a single handle for both, and Windows does less checking when that happens? For what it is worth, I have also tried it using different handles, and I don’t see the problem in the code above, but I have a more complex scenario where I do see the same issue.

Of course, FILE_FLAG_OVERLAPPED is not specified, so what I would actually expect here is serialization of the I/O, according to the docs. But mostly I need a sanity check to tell me if I am crazy.


Spot the bug: Unmanaged memory traps

Sometimes I forget how good life is in the managed lane. Then I do some unmanaged work and get a good reality check.

Let us look at the following structures:

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct PageHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public ushort Lower;

        [FieldOffset(10)] public ushort Upper;

        [FieldOffset(12)] public int OverflowSize;

        [FieldOffset(16)] public int ItemCount;
    }


    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct FileHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public LogHeader Active;

        [FieldOffset(44)] public LogHeader Backup;

        [FieldOffset(80)] public TreeRootHeader Root;

        [FieldOffset(142)] public TreeRootHeader FreeSpace;
    }

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct TreeRootHeader
    {
        [FieldOffset(0)] public long RootPageNumber;
        [FieldOffset(8)] public long BranchPages;
        [FieldOffset(16)] public long LeafPages;
        [FieldOffset(32)] public long OverflowPages;
        [FieldOffset(40)] public long PageCount;
        [FieldOffset(48)] public long EntriesCount;
        [FieldOffset(56)] public int Depth;
        [FieldOffset(60)] public TreeFlags Flags;
    }

    [StructLayout(LayoutKind.Explicit, Pack = 1)]
    public struct LogHeader
    {
        [FieldOffset(0)] public long Marker;

        [FieldOffset(8)] public long LastLog;

        [FieldOffset(16)] public long LastLogPage;

        [FieldOffset(24)] public int ItemCount;

        [FieldOffset(28)] public long Options;
    }

And now we have the following code:

private static unsafe void Main()
{
    IntPtr pagePtr = Marshal.AllocHGlobal(4096);

    var pageHeader = (PageHeader*) pagePtr.ToPointer();
    pageHeader->ItemCount = 2;
    pageHeader->Marker = 0x128314543423;
    pageHeader->OverflowSize = 32;

    FileHeader* fileHeader = (FileHeader*) pageHeader + sizeof (PageHeader);

    fileHeader->Root.BranchPages = 0;

    Marshal.FreeHGlobal(pagePtr);
}

The fun part about this code is that it would silently corrupt the state of the process.

Here is what happens when you run it under the debugger:

image

Can you figure out why?


Sites outage

We had an outage that appears to have lasted roughly 12 hours.

The reason it took so long to fix is that it happened after business hours, and while we have production support for our clients, we never hooked up our own websites to our own system. A typical case of the barefoot shoemaker.

The reason for the outage? Also pretty typical:

image

The reason for that? We had a remote backup process that created some temp files and didn't clean them up properly. The growth rate was about 3-6 MB a day, so no one really noticed.

The fix:

image

All is working now. I am sorry for the delay in fixing this. We'll be having some discussions here to see how we can avoid repeat issues like that.


Diagnosing a RavenDB problem in production

We have run into a problem in our production system. Luckily, this is a pretty obvious case of misdirection.

Take a look at the stack trace that we have discovered:

image

The interesting bit is that this is an “impossible” error. In fact, this is the first time that we have actually seen this error ever.

But looking at the stack trace tells us pretty much everything. The error happens in Dispose, but that one is called from the constructor. Because we are using native code, we need to make sure that an exception in the ctor will properly dispose all unmanaged resources.

The code looks like this:

image

And here we can see the probable cause of the error. We try to open a transaction, then an exception happens, and Dispose is called, but it isn't ready to handle this scenario, so it throws.

The original exception is masked, and you have a hard-to-debug scenario.
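To make the scenario concrete, here is a minimal sketch of the pattern involved (the names are hypothetical, the real code is in the screenshot above): the constructor disposes on failure, so Dispose has to tolerate a partially constructed instance, otherwise its own exception masks the original one.

public class StorageEnvironment : IDisposable // hypothetical name
{
    private Transaction _txn; // may still be null if the ctor failed early

    public StorageEnvironment()
    {
        try
        {
            _txn = OpenTransaction(); // native call that may throw
            // ... more initialization that may also throw ...
        }
        catch
        {
            Dispose(); // release whatever we did manage to allocate
            throw;     // and let the original exception escape
        }
    }

    public void Dispose()
    {
        // guard every member: we may be running against a half-constructed instance
        if (_txn != null)
            _txn.Dispose();
    }

    private Transaction OpenTransaction() { return new Transaction(); }

    public class Transaction : IDisposable
    {
        public void Dispose() { }
    }
}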


The saga of the race condition re-fix

This is a story about a bug, RavenDB-1280, to be exact. It is really bad when we start with a race condition bug, and it only gets worse from there.

The story begins with the following index:

public class EmailIndex : AbstractIndexCreationTask<EmailDocument, EmailIndexDoc>
{
    public EmailIndex()
    {
        Map =
            emails => from email in emails
                    let text = LoadDocument<EmailText>(email.Id + "/text")                 
                    select new
                            {
                                email.To,
                                email.From,
                                email.Subject,
                                Body = text == null ? null : text.Body
                            };
    }
}

So far, this  is a pretty standard index using LoadDocument. The LoadDocument feature allows us to also index data from related documents, and the database will ensure that when the loaded document is changed, the requesting document will also be re-indexed.

What does this mean? It means that when we change the text of an email, we will also re-index that email's document. Internally, we store it in something that looks like this (using a relational model because it is probably easier for most of you to follow):

CREATE TABLE DocumentReferences
{
   Source,
   Reference
}

So in the case above, we will have (Source: emails/1 and Reference: emails/1/text).

Whenever we modify a document, we check the stored references for matches. If we were using a relational database, the code would be something like:

foreach ( var src in (SELECT Source FROM DocumentReferences WHERE Reference = @ref))
{
   TouchDocument(src);
}

So far, so good. This ensures that we always keep the index up to date. Obviously, this also means that the indexing process needs to update the references as well. Which brings us to the problematic part:

Parallel.For(0, iterations, i =>
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailDocument {Id = "Emails/"+ i,To = "root@localhost", From = "nobody@localhost", Subject = "Subject" + i});
        session.SaveChanges();
    }

    using (var session = documentStore.OpenSession())
    {
        session.Store(new EmailText { Id = "Emails/" + i + "/text", Body = "MessageBody" + i });
        session.SaveChanges();
    }
});

This code will result in an interesting edge case. Because we are using two different sessions (and thus two different transactions), it is possible for the index to pick up the emails/1 document and start indexing it, but not pick up the emails/1/text document (it hasn’t been saved yet).

However, during the save of the emails/1/text document, there wouldn't be anything in the references storage, so we wouldn't know that we need to re-index emails/1. The result is that we violated our promise of re-indexing when the document changed.

As I said, just setting up the problem requires parallel thinking and a severe case of headache. Luckily, we had a great bug report from Barry Hagan, which included an occasionally failing test. (Even just writing "occasionally failing" causes me to throw up a little bit.)

After we identified the problem, we tried to solve it by holding an in-memory list of modified documents that were also required (and missing) during indexing via LoadDocument. Don't bother trying to follow that statement; it is complex, and it is hard, and we did it. And it worked, except that it didn't. It just moved the race condition from the entire process to the edges of the process. We had two guys sitting on this for a couple of days, in danger of tearing their hair out, with no luck narrowing it down.

Eventually I sat down and tried to figure out how to deal with it. I was able to build on their work, and narrow down exactly where we had the race condition now. And just thinking about trying to solve it was way too hard. It would require multiple locks and pretty strange behavior overall. I wasn’t happy with the implications, and that code already created quite a lot of complications as it was.

So I started by reverting all of that code and had a clean slate to work with. Then I sat down and thought about it, and finally figured out a better way to do it. The problem was only relevant when we had a missing reference (an existing reference would properly generate the re-indexing under all conditions). So I started with that: I gathered up all of the missing references and stored them in a db task. A db task is a RavenDB notion of a transactionally safe way of registering things to run in another transaction. So after the indexing transaction was done, we would then go and execute that task.

The idea here is that doing it this way prevents us from having to deal with any issues with regard to "what if the server crashes midway", etc. Anyway, it also means that this task is running in a separate transaction, after the indexing transaction is done. The only thing that this task is doing is forcing re-indexing of all the documents that tried to call LoadDocument but got null as a result. Because we are now in a separate transaction, there are only two options:

  • The related document is still not there, in which case we are merely doing some extra work.
  • The related document is there, and will be indexed.

The first case deserves some more attention: even if the document that was missing shows up while we are re-indexing, we don't care; we have already set up the references and committed them, so it would force yet another re-indexing, and in the end, everyone will be happy.
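To make the shape of the solution a bit more concrete, here is a rough sketch with hypothetical names (the actual RavenDB implementation differs): during indexing we collect the references that came back null, persist a task for them as part of the indexing transaction, and a later transaction executes it.

public interface IStorageActions // hypothetical storage API, for the sketch only
{
    IEnumerable<string> GetDocumentsReferencing(string reference);
    void TouchDocument(string key);
}

public class TouchMissingReferenceDocumentsTask // hypothetical name
{
    public HashSet<string> MissingReferences = new HashSet<string>();

    // runs in its own transaction, after the indexing transaction has committed
    public void Execute(IStorageActions actions)
    {
        foreach (var missing in MissingReferences)
        {
            // force re-indexing of every document that tried to LoadDocument the missing one
            foreach (var source in actions.GetDocumentsReferencing(missing))
                actions.TouchDocument(source);
        }
    }
}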

And now: I have written this blog post while running a stress test on this feature, and it took me long enough that I am fairly sure the fix works properly. So I'll call it a day and go do something more fun, maybe 3.0 work.


Challenge: Spot the bug

Speaking of unfair interview questions, this would be a pretty evil one.

image

Can you see the bug? And how would you fix it?


Diagnosing a production issue on our servers

One of the steps that we take before releasing a stable build is to push the latest version to our own production servers and see what goes on. By and large, this has been a pretty pleasant process, and it has mostly served to increase our confidence that we can go to production with that version. But sometimes it does what it is supposed to do and finds the sort of bugs that are very hard to catch anywhere but production.

In this case, after several hours (8 - 12, I am guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn't really any way that you could get this error; indeed, this is pretty much the first time we have seen it outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls. This gave us the ability to have a commit pending between the Prepare & Commit phases of the operation.

Reviewing the code, I found some places where we weren't properly disposing the sessions, which could explain that. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.

The good thing about having this happen on our production servers is that I have full access there. Of course, it is production, so debugging it outright was out of the question, but taking a dump and transferring it to my machine was easy enough.

Now, open it with WinDBG, run “.loadby sos clr” and start investigating.

First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting, and said that we caught the problem as it was happening, which was great.

Next, it was time to look a bit at the actual memory. I ran: !DumpHeap -type Session

image

My reaction is Huh!!! There is absolutely zero justification for that.

Now, the only question is why. So I decided to look at the class that holds the transaction state, assuming that this is probably what is holding onto all those sessions. I ran: !DumpHeap -type EsentTransactionContext

image

And that tells me quite a lot. There appear to be a total of 317 in-flight DTC transactions. Considering that I know what our usage is, that is a huge number, and it tells me that something isn't right here. This is especially true when you consider that we don't have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState -stat

image

In other words, we have 8 loaded dbs, each of them holding its in-flight transactional state. And we have 317 open transactions and 35 thousand sessions. That is pretty ridiculous, especially given that I know we are supposed to have at most a single-digit number of concurrent DTC transactions at any one time. So somehow we are leaking transactions & sessions. But I am still very unhappy with just "we are leaking sessions". That is something that I knew before we started debugging anything.

I can already tell that we probably need to add a more robust way of expiring transactions, and I added that, but the numbers don’t add up to me. Since this is pretty much all I can do with WinDBG, I decided to use another tool, MemProfiler. This gives me the ability to import the dump file, and then analyze that in a much nicer manner. Doing so, I quickly found this out:

image

Huh?!

Sessions are finalizable, sure, but I am very careful about making sure to dispose of them, especially after the previous code change. There should be just 317 undisposed sessions. And having that many items in the finalizer queue can certainly explain things. But I don't know how they got there. And the numbers don't match up, either. We are missing about 7K items from the WinDBG numbers.

Okay, next, I pulled ClrMD and wrote the following:

var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
var rt = dt.CreateRuntime(moduleInfo);

var clrHeap = rt.GetHeap();

var finalized = new HashSet<ulong>();

int cnt = 0;
foreach (var ptr in rt.EnumerateFinalizerQueue())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        finalized.Add(ptr);
    }
}
Console.WriteLine(finalized.Count);

var live = new HashSet<ulong>();
foreach (var ptr in clrHeap.EnumerateObjects())
{
    var type = clrHeap.GetObjectType(ptr);
    if (type.Name == "Microsoft.Isam.Esent.Interop.Session")
    {
        if (finalized.Contains(ptr) == false)
            live.Add(ptr);
    }
}
Console.WriteLine(live.Count);

This gave me 28,112 sessions in the finalizer queue and 7,547 sessions that are still live. So something is creating a lot of instances, but not using or referencing them?

I did a code review over everything once again, and I think that I got it. The culprit is this guy:

image

Where createContext is defined as:

image

Now, what I think is going on is that the concurrent dictionary, which is what transactionContexts is, may be calling createContext multiple times inside the GetOrAdd method. And because those calls create values that have to be disposed… Now, in the normal course of things, the worst case scenario is that we would have them in the finalizer queue and they would be disposed in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.

At least, this is my current theory. I changed the code to be like this:

image

So if my value wasn't the one that got added, I'll properly dispose of it. I'll be pushing this to production in a bit and seeing what happens. Note that there is no locking, so we might still generate multiple sessions. That is fine, as long as only one of them survives.
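For reference, the shape of the change is along these lines (hypothetical names, the actual diff is in the screenshot above):

// create the context up front, outside of GetOrAdd
var newContext = createContext();
var context = transactionContexts.GetOrAdd(transactionId, newContext);
if (ReferenceEquals(context, newContext) == false)
{
    // we lost the race; another thread's context is the one in the dictionary,
    // so dispose ours instead of letting its session leak to the finalizer queue
    newContext.Dispose();
}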


Reasons to hate C/C++ work

The following code won't link. For fun, set_mapsize and set_flags are both defined in the same file, both in the .h & the .c, and set_mapsize links just fine.

image


Optimized, but way out of line

Yesterday I showed the following profiler results.

image

The first place someone would look at optimizing is the CaseInsensitiveComparator.Compare call, right? That is clearly taking a really long time.

In fact, we re-wrote that method to use pointer magic in order to get it to work faster. But that isn’t the issue. Look at the numbers. Compare is called 68 million times. That means that under the profiler, it is still completing an operation in 0.000073 ms. I am not even sure what time unit that is.

However, we can see that the major caller of Compare is FindGreaterOrEqual, right? And that is called only 11 thousand times. What does this tell you?

That tells me we have a Schlemiel the Painter's algorithm. So I double checked: SkipList insert performance should be O(log N) on average. But what we were seeing was O(N) cost. Indeed, it turned out that we had a bug in which our skip list implementation would never do skips, essentially turning it into a sorted linked list, with insert performance of O(N). Once we fixed that bug… well, things got better, a lot better.

image

As you can see, the cost of this went down really hard. And now we can focus on other stuff.
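To illustrate the kind of bug involved (this is a generic sketch, not the actual managed leveldb code): a skip list relies on its random level generator occasionally promoting nodes to higher levels. If that never happens, every node sits on the bottom level and every insert degenerates into a linear scan.

static int RandomLevel(Random rnd, int maxLevel)
{
    int level = 1;
    // correct: with probability 1/4, promote the node one more level
    while (level < maxLevel && rnd.Next(4) == 0)
        level++;
    return level;
}

// A buggy variant that, say, checks the wrong bound or reuses a stale value so the loop
// never executes will return 1 for every node - and a skip list where every node is at
// level 1 is just a sorted linked list, with O(N) inserts.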


Optimize this, dude!

So, we have a perf problem in the managed leveldb implementation. I pulled out dotTrace and looked at the numbers:

image

Do you see where the problem is?


Spot the bug…

image

Do you see it? It took me ages to find it.


TPL and the case of the !@#(*@! hung process

So, here I am writing some really fun code, when I found out that I was running into deadlocks. I activated emergency protocols and went into deep debugging mode.

After being really thorough in figuring out several possible causes, I was still left with what was effectively a WTF @!(*!@ DAMN !(@*#!@* YOU !@*!@( outburst and a sudden longing for something to repeatedly hit.

Eventually, however, I figured out what was going on.

I have the following method: Aggregator.AggregateAsync(), inside which we have a call to the PulseAll method. That method will then go and execute the following code:

public void PulseAll()
{
    Interlocked.Increment(ref state);
    TaskCompletionSource<object> result;
    while (waiters.TryDequeue(out result))
    {
        result.SetResult(null);
    }
}

After that, I return from the method. In another piece of the code (Aggregator.Dispose) I am waiting for the task that is running the AggregateAsync method to complete.

Nothing worked! It took me a while before I figured out that I wanted to check the stack, where I found this:

image

Basically, I had a deadlock: when I called SetResult on the completion source (which freed the Dispose code to run), I actually switched over to that task and allowed it to run. Still on the same thread, but in a different task, I ran through the rest of the code and eventually got to Aggregator.Dispose(). Now, I could only get there if the PulseAll() method had been called. But, because we are on the same thread, that task hasn't completed yet!

In the end, I "solved" that by introducing a DisposeAsync() method, which allowed us to yield the thread; then the AggregateAsync task completed, and we could move on.
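Roughly, that workaround looks like this (the body is a reconstruction with hypothetical field names; the actual Aggregator code isn't shown here):

public async Task DisposeAsync()
{
    _stopped = true;      // hypothetical flag that tells AggregateAsync to exit its loop
    PulseAll();           // wake it up so it can observe the flag

    await Task.Yield();   // get off the current task before waiting on it

    await _aggregateTask; // now the AggregateAsync task can actually complete
}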

But I am really not very happy about this. Any ideas about the proper way to handle async & IDisposable?


Deadlocking with the TPL, how to

As I mentioned, I ran into a very nasty issue with the TPL. I am not sure if it is me doing things wrong, or an actual issue.

Let us look at the code, shall we?

We start with some very simple code:

public class AsyncEvent
{
    private volatile TaskCompletionSource<object> tcs = new TaskCompletionSource<object>();

    public Task WaitAsync()
    {
        return tcs.Task;
    }

    public void PulseAll()
    {
        var taskCompletionSource = tcs;
        tcs = new TaskCompletionSource<object>();
        taskCompletionSource.SetResult(null);
    }
}

This is effectively an auto reset event. All the waiters will be released when PulseAll is called. Then we have this runner, which just executes work:

public class Runner : IDisposable
{
    private readonly ConcurrentQueue<TaskCompletionSource<object>> items =
        new ConcurrentQueue<TaskCompletionSource<object>>();
    private readonly Task<Task> _bg;
    private readonly AsyncEvent _event = new AsyncEvent();
    private volatile bool _done;

    public Runner()
    {
        _bg = Task.Factory.StartNew(() => Background());
    }

    private async Task Background()
    {
        while (_done == false)
        {
            TaskCompletionSource<object> result;
            if (items.TryDequeue(out result) == false)
            {
                await _event.WaitAsync();
                continue;
            }

            //work here, note that we do NOT use await!

            result.SetResult(null);
        }
    }

    public Task AddWork()
    {
        var tcs = new TaskCompletionSource<object>();
        items.Enqueue(tcs);

        _event.PulseAll();

        return tcs.Task;
    }

    public void Dispose()
    {
        _done = true;
        _event.PulseAll();
        _bg.Wait();
    }
}

And finally, the code that causes the problem:

public static async Task Run()
{
    using (var runner = new Runner())
    {
        await runner.AddWork();
    }
}

So far, it is all pretty innocent, I think you would agree. But this hangs with a deadlock. Here is why:

image

Because tasks can share threads, we are on the Background task's thread, and we are trying to wait on that background task's completion.

Result: deadlock.

If I add:

await Task.Yield();

it works, because that forces the rest of the method to be completed on another thread. But that looks more like something that you add after you discover the bug, to be honest.
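The underlying behavior is easy to demonstrate in isolation: a continuation that is allowed to execute synchronously runs inline, on the thread that completed the TaskCompletionSource, and that is also how await registers its continuations. A minimal sketch:

var tcs = new TaskCompletionSource<object>();
tcs.Task.ContinueWith(
    _ => Console.WriteLine("continuation on thread {0}", Thread.CurrentThread.ManagedThreadId),
    TaskContinuationOptions.ExecuteSynchronously);

Console.WriteLine("calling SetResult on thread {0}", Thread.CurrentThread.ManagedThreadId);
tcs.SetResult(null); // the continuation above runs here, before SetResult returns
Console.WriteLine("SetResult returned");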


Challenge: The problem of locking down tasks…

The following code has a very subtle bug:

public class AsyncQueue
{
    private readonly Queue<int> items = new Queue<int>();
    private volatile LinkedList<TaskCompletionSource<object>> waiters = new LinkedList<TaskCompletionSource<object>>();

    public void Enqueue(int i)
    {
        lock (items)
        {
            items.Enqueue(i);
            while (waiters.First != null)
            {
                waiters.First.Value.TrySetResult(null);
                waiters.RemoveFirst();
            }
        }
    }

    public async Task<IEnumerable<int>> DrainAsync()
    {
        while (true)
        {
            TaskCompletionSource<object> taskCompletionSource;
            lock (items)
            {
                if (items.Count > 0)
                {
                    return YieldAllItems();
                }
                taskCompletionSource = new TaskCompletionSource<object>();
                waiters.AddLast(taskCompletionSource);
            }
            await taskCompletionSource.Task;
        }
    }

    private IEnumerable<int> YieldAllItems()
    {
        while (items.Count > 0)
        {
            yield return items.Dequeue();
        }
    }
}

I'll even give you a hint: try to run the following client code:

for (int i = 0; i < 1000 * 1000; i++)
{
    q.Enqueue(i);
    if (i%100 == 0)
    {
        Task.Factory.StartNew(async () =>
            {
                foreach (var result in await q.DrainAsync())
                {
                    Console.WriteLine(result);
                }
            });
    }
}

Can you figure out what the problem is?


Bug hunting in a massively multi-threaded environment

We got a really nasty bug report from a user. Sometimes, out of the blue, RavenDB would throw an error:

System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
at Raven.Json.Linq.RavenJArray.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJArray.cs:line 174
at Raven.Json.Linq.RavenJObject.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJObject.cs:line 275
at Raven.Json.Linq.RavenJObject.WriteTo(JsonWriter writer, JsonConverter[] converters) in c:\Builds\RavenDB-Stable\Raven.Abstractions\Json\Linq\RavenJObject.cs:line 275

Unfortunately, this error only happened once in a while, usually when the system was under load. But they weren’t able to provide a repro for that.

Luckily, they were able to tell us that they suspected that this is related to the replication support. I quickly set up a database with replication and wrote the following code to try to reproduce this:

using(var store = new DocumentStore
{
    Url = "http://localhost:8080",
    DefaultDatabase = "hello"
}.Initialize())
{
    using(var session = store.OpenSession())
    {
        session.Store(new ReadingList
        {
            UserId = "test/1",
            Id = "lists/1",
            Books = new List<ReadingList.ReadBook>()
        });
        session.SaveChanges();
    }
    Parallel.For(0, 100, i =>
    {
        while (true)
        {
            try
            {
                using (var session = store.OpenSession())
                {
                    session.Advanced.UseOptimisticConcurrency = true;
                    session.Load<ReadingList>("lists/1")
                            .Books.Add(new ReadingList.ReadBook
                            {
                                ReadAt = DateTime.Now,
                                Title = "test " + i
                            });
                    session.SaveChanges();
                }
                break;
            }
            catch (ConcurrencyException)
            {
                
            }
        }
    });
}

And that reproduced the bug! Hurrah! Done deal, we can move on, right?

Except that the bug was only there when we had a massive number of threads hitting the db at once, and trying to figure out what was actually going on was next to impossible using standard debugging tools. Instead, I reached down to my tracing toolbelt and started pulling stuff out. First, we identified that the problem occurred when iterating over RavenJArray, which is our own object, so we added the following:

        ConcurrentQueue<StackTrace>  addStackTraces = new ConcurrentQueue<StackTrace>();

        public void Add(RavenJToken token)
        {
            if (isSnapshot)
                throw new InvalidOperationException("Cannot modify a snapshot, this is probably a bug");

            addStackTraces.Enqueue(new StackTrace(true));

            Items.Add(token);
        }

And this one (where the exception is raised):

public override void WriteTo(JsonWriter writer, params JsonConverter[] converters)
{
    writer.WriteStartArray();

    if (Items != null)
    {
        try
        {
            foreach (var token in Items)
            {
                token.WriteTo(writer, converters);
            }
        }
        catch (InvalidOperationException e)
        {
            foreach (var stackTrace in addStackTraces)
            {
                Console.WriteLine(stackTrace);
            }
            throw;
        }
    }

    writer.WriteEndArray();
}

The idea was that we would actually be able to see what was going on there. By tracking down who added items to this particular instance, I hoped to be able to figure out why we had an instance that was shared among multiple threads.

When I had that, it was pretty easy to see that it was indeed the replication bundle that was causing the issue. The problem was that the replication bundle was modifying an inner array inside the document metadata. We protect the root properties of the metadata from concurrent modifications, and most of the time, that works just fine. But the problem was that now we had a bundle modifying a nested array, which wasn't protected.
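The failure mode itself is easy to reproduce in isolation: a List<T> that is enumerated while another thread mutates it will usually throw this exact exception. A minimal sketch:

var items = new List<int>(Enumerable.Range(0, 1024));
var writer = Task.Factory.StartNew(() =>
{
    for (int i = 0; i < 1024 * 1024; i++)
        items.Add(i); // unprotected concurrent mutation
});

try
{
    while (writer.IsCompleted == false)
    {
        foreach (var item in items) // concurrent enumeration
        {
        }
    }
}
catch (InvalidOperationException e)
{
    // Collection was modified; enumeration operation may not execute.
    Console.WriteLine(e.Message);
}
writer.Wait();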

This is one of those bugs that are really hard to catch:

  • My understanding of the code said that this is not possible, since I believed that we protected the nested properties as well*.
  • This bug will only surface if and only if:
    • You have the replication bundle enabled.
    • You have a great deal of concurrent modifications (with optimistic concurrency enabled) to the same document.
    • You are unlucky.

I was grateful that the user figured out the replication connection, because I had already sat on that bug previously, and there was no way I could have figured out what was going on without the trace pointing me to where the actual problem was.

Production issue: ASP.Net Cache kills the application

In one of our production deployments, we occasionally get a complete server process crash. Investigating the event log, we have this:

Exception: System.InvalidOperationException

Message: Collection was modified; enumeration operation may not execute.

StackTrace:
   at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
   at System.Web.Hosting.ObjectCacheHost.TrimCache(Int32 percent)
   at System.Web.Hosting.HostingEnvironment.TrimCache(Int32 percent)
   at System.Web.Hosting.HostingEnvironment.TrimCache(Int32 percent)
   at System.Web.Hosting.ApplicationManager.TrimCaches(Int32 percent)
   at System.Web.Hosting.CacheManager.CollectInfrequently(Int64 privateBytes)
   at System.Web.Hosting.CacheManager.PBytesMonitorThread(Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
   at System.Threading._TimerCallback.PerformTimerCallback(Object state)

As you can see, this appears to be a run-of-the-mill race condition, translated into a process-killing exception because it was thrown from a separate thread.

This thread, by the way, is the ASP.NET Cache cleanup thread, and we have no control whatsoever over it. To make things worse, this application doesn't even use the ASP.NET Cache in any way, shape or form.

Any ideas how to resolve this would be very welcome.


Silverlight streaming - the race condition is already included

I am not sure what to call this issue, except maddening. For a simple repro, you can check this GitHub repository.

The story is quite simple: let us assume that you need to send a set of values from the server to the client. For example, they might be tick values, or updates, or anything of this sort.

You can do this by keeping an HTTP connection open and sending data periodically. This is a well known technique and it works quite well. Except in Silverlight, where it works, but only if you put the appropriate Thread.Sleep() in crucial places.

Here is an example of the server behavior:

var listener = new HttpListener
{
    Prefixes = {"http://+:8080/"}
};
listener.Start();
while (true)
{
    var ctx = listener.GetContext();
    using (var writer = new StreamWriter(ctx.Response.OutputStream))
    {
        writer.WriteLine("first\r\nsecond");
        writer.Flush();
    }
    Console.ReadKey();
}

In this case, note that we are explicitly flushing the response, then just waiting. If you look at the actual network traffic, you can see that this will actually be sent, the connection will remain open, and we can send additional data as well.

But how do you consume such a thing in Silverlight?

var webRequest = (HttpWebRequest)WebRequestCreator.ClientHttp.Create(new Uri("http://localhost:8080/"));
webRequest.AllowReadStreamBuffering = false;
webRequest.Method = "GET";

Task.Factory.FromAsync<WebResponse>(webRequest.BeginGetResponse, webRequest.EndGetResponse, null)
    .ContinueWith(task =>
    {
        var responseStream = task.Result.GetResponseStream();
        ReadAsync(responseStream);
    });

 

We start by making sure that we disable read buffering, then we get the response and start reading from it. The read method is a bit complex, because it has to deal with partial responses, but it should still be fairly obvious what is going on:

byte[] buffer = new byte[128];
private int posInBuffer;
private void ReadAsync(Stream responseStream)
{
    Task.Factory.FromAsync<int>(
        (callback, o) => responseStream.BeginRead(buffer, posInBuffer, buffer.Length - posInBuffer, callback, o),
        responseStream.EndRead, null)
        .ContinueWith(task =>
        {
            var read = task.Result;
            if (read == 0) 
                throw new EndOfStreamException();
            // find \r\n in newly read range

            var startPos = 0;
            byte prev = 0;
            bool foundLines = false;
            for (int i = posInBuffer; i < posInBuffer + read; i++)
            {
                if (prev == '\r' && buffer[i] == '\n')
                {
                    foundLines = true;
                    // yeah, we found a line, let us give it to the users
                    var data = Encoding.UTF8.GetString(buffer, startPos, i - 1 - startPos);
                    startPos = i + 1;
                    Dispatcher.BeginInvoke(() =>
                    {
                        ServerResults.Text += data + Environment.NewLine;
                    });
                }
                prev = buffer[i];
            }
            posInBuffer += read;
            if (startPos >= posInBuffer) // read to end
            {
                posInBuffer = 0;
                return;
            }
            if (foundLines == false)
                return;

            // move remaining to the start of buffer, then reset
            Array.Copy(buffer, startPos, buffer, 0, posInBuffer - startPos);
            posInBuffer -= startPos;
        })
        .ContinueWith(task =>
        {
            if (task.IsFaulted)
                return;
            ReadAsync(responseStream);
        });
}

While I am sure that you could find bugs in this code, that isn’t the crucial point.

If we run the server, then run the SL client, we can see that we get just one lousy byte, and that is it. Now, reading about this, it appears that in some versions of some browsers, you need to send 4KB of data to get the connection going. But that isn't what I have observed. I tried sending 4KB+ of data, and I still saw the exact same behavior: we got called for the first byte, and nothing else.

Eventually, I boiled it down to the following non working example:

writer.WriteLine("first");
writer.Flush();
writer.WriteLine("second");
writer.Flush();

Versus this working example:

writer.WriteLine("first");
writer.Flush();
Thread.Sleep(50);
writer.WriteLine("second");
writer.Flush();

Yes, you got it right, if I put the thread sleep in the server, I’ll get both values in the client. Without the Thread.Sleep, we get only the first byte. It seems like it isn’t an issue of size, but rather of time, and I am at an utter loss to explain what is going on.

Oh, and I have been awake for 27 hours straight at this point, most of them spent trying to figure out what the )(&#@!)(DASFPOJDA(FYQ@YREQPOIJFDQ#@R(AHFDS:OKJASPIFHDAPSYUDQ)(RE is going on.


The HIGH cost of ConcurrentBag in .NET 4.0

I got some strange results when using concurrent collections, so I decided to try to track them down, and wrote the following code:

var count = ?; // placeholder: I varied this value per run, see the results below
var list = new List<object>(count);
var sp = Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
    list.Add(new ConcurrentBag<int>());
}
sp.Stop();
Console.WriteLine("{0} {2} items in {1:#,#;;0}ms = {3:#,#;;0}ms per item",
    sp.Elapsed, sp.ElapsedMilliseconds, count, sp.ElapsedMilliseconds / count);

And then I started to play with the numbers, and it is not good.

  • 10 items in 2ms = 0ms per item

This is an incredibly high number, you have to understand. Just to compare, List<int> takes 8 ms to create 100,000 items.

Let us see what happens when we create more of them.

  • 100 items in 5ms = 0ms per item
  • 1,000 items in 37ms = 0ms per item
  • 10,000 items in 2,319ms = 0ms per item

Note the numbers, will you?

1,000 items in 37 ms, but 10,000 items? 2.3 seconds!

  • 20,000 items in 21,331ms = 1ms per item

And doubling the amount took ten times as long?

  • 25,000 items in 32,588ms = 1ms per item

And at this point, I stopped trying, because I didn’t have the patience.

Note that the other concurrent collections, ConcurrentStack, ConcurrentQueue and ConcurrentDictionary, do not suffer from the same problem.

I contacted Microsoft about this, and it is already resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have a lot of instances. That has been fixed, and it can now run fairly fast.
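If you want to see that for yourself, you can take ConcurrentBag out of the picture entirely and run the same measurement loop against bare ThreadLocal<int> instances; on .NET 4.0 it should show the same non-linear growth. A sketch:

var count = 10000; // try the same values as in the results above
var list = new List<object>(count);
var sp = Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
    list.Add(new ThreadLocal<int>()); // the type ConcurrentBag allocates internally
}
sp.Stop();
Console.WriteLine("{0} ThreadLocal<int> instances in {1:#,#;;0}ms", count, sp.ElapsedMilliseconds);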
