Ayende @ Rahien

Refunds available at head office

Micro optimization decision process

There are some parts of our codebase that are simply going to have to be called a large number of times. Those are the ones that we want to optimize, but at the same time, unless they are ridiculously inefficient, there isn’t that much room for improvement.

Let us look at this for a second:

image_thumb

The numbers are pretty hard to read in this manner, so I generally translate it to the following table:

Method name Cost per 1,000 invocations
StringEqualsToBuffer 7 ms
get_Item 0.2 ms
get_Length 0.2 ms
GetHashCode 4 ms
Equals 1 ms

It is important to note that what I am trying to check here is relative cost of calling a method. I use the thousands invocation just to give us back a number that we can actually understand easily, instead of dealing with nanoseconds.

As you can see, all of the methods in this piece of code are actually pretty fast, the slowest will complete in under ten nanoseconds. The problem is that they are called a lot. StringEqualsToBuffer cost me 90 seconds in this test run. This means that to improve its performance, we need to get it to drop to even fewer nanoseconds, or reduce the number of times it is called. Both of which are going to be hard.

You can look at how I dealt with this particular case in this post, but right now I want to talk about the decision process, not just the action that I took.

Usually, in such situations, I find the most costly function (StringEqualsToBuffer in this case) and then find any functions that it called, in this case, we can see that get_Item and get_Length are both costly functions called from StringEqualsToBuffer. Stupid micro optimization tactics, like referencing a field directly instead of through a property have enormous consequences in this type of scenario.

Next, we have things like GetHashCode, which looks to be very slow (it takes 4 nanoseconds to complete, I have hard time calling it slow :-)). This function is slow not because we are doing something that can be optimized, but simply because of what it does. Since we can’t optimize the code itself, we want to do the next best thing, and see if we can optimize the number of times that this code is called. In other words, apply caching to the issue. Applying caching means that we need to handle invalidation, so we need to consider whatever we will gain something from that, mind you. Often, the cost of managing the cache can be higher than the cost of calculating the value from scratch when we are talking about this kind of latencies.

Another issue to consider is the common memory vs. time argument, it is easy to err into one side of them when you are focused on micro benchmarks. You get a routine that completes in 1 nanosecond in the common case but uses up 10 Mb of cache. Sometimes you want that, sometimes it is a very bad tradeoff.

I generally start with simple performance tuning, finding out hotspots and figuring out how to fix them. Usually, it is some sort of big O problem, either in the function itself or what it is called on. Those tend to be easy to fix and produce a lot of benefit. Afterward, you get to true algorithmic fixes (find a better algo for this problem). Next, I run tests for memory usage, seeing if under the most extreme likely conditions, I am hitting my specified limits.

I’ll talk about reducing memory usage in a separate post, but once you run through that, another run to verified that you haven’t traded off in the other direction (reduced memory at the expense of running time) would complete the process.

Memory obesity and the curse of the string

I believe that I have mentioned that my major problem with the memory usage in the profiler is with strings. The profiler is doing a lot with strings, queries, stack traces, log messages, etc are all creating quite a lot of strings that the profiler needs to inspect, analyze and finally produce the final output.

Internally, the process looks like this:

image

On my previous post, I talked about the two major changes that I made so far to reduce memory usage, you can see them below. I introduced string interning in the parsing stage and serialized the model to disk so we wouldn’t have to keep it all in memory, which resulted in the following structure:

image

However, while those measures helped tremendously, there is still more that I can do. The major problem with string interning is that you first have to have the string in order to check it in the interned table. That means that while you save on memory in the long run, in the short run, you are still allocating a lot of strings. My next move is to handle interning directly from buffered input, skipping the need to allocate memory for a string to use as the key for interning.

Doing that has been a bit hard, mostly because I had to go deep into the serialization engine that I use (Protocol Buffers) and add that capability. It is also fairly complex to handle something like this without having to allocating a search key in the strings table. But, once I did that, I noticed three things.

First, while memory increased during operation, there weren’t any jumps & drops, that is, we couldn’t see any periods in which the GC kicked in and released a lot of garbage. Second, memory consumption was relatively low through the operation. Before optimizing the memory usage, we are talking about 4 GB for processing and 1.5 GB for final result, after the previous optimization it was 1.9 GB for processing and 1.3 for final result. But after this optimization, we have a fairly simple upward spike up to 1.3 GB. You can see the memory consumption during processing in the following chart, memory used in in GB on the Y axis.

image

As you can probably tell, I am much happier with the green line than the other. Not only just because it takes less memory in general, but because it is much more predictable, it means that the application’s behavior is going to be easier to reason about.

But this optimization brings to mind the question, since I just introduced interning at the serialization level, do I really need to have interning at the streaming level? On the face of it, it looks like an unnecessary duplication. Indeed, removing the string interning that we did in the streaming level reduce overall memory usage from 1.3GB to 1.15GB.

Overall, I think this is a nice piece of work.

When mini benchmarks are important

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" - Donald Knuth

I have expressed my dislike for micro benchmarks in the past, and in general, I still have this attitude, but sometimes, you really care.

A small note, while a lot of namespaces you are going to see are Google.ProtocolBuffers, this represent my private fork of this library that was customized to fit UberProf’s needs. Some of those things aren’t generally applicable (like string interning at the serialization level), so please don’t try to project from the content of this post to the library itself.

Let me show you what I mean:

image

The following is profiling this piece of code:

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    if(byteString.Length != byteBuffer.Length)
        return false;
    for (int i = 0; i < byteString.Length; i++)
    {
        if(byteString[i] != byteBuffer.Buffer[byteBuffer.Offset+i])
            return false;
    }
    return true;
}

This looks pretty simple right?

Now, it is important to understand that this isn’t some fake benchmark that I contrived, this is the profile results from testing a real world scenario. In general, methods such as Equals or GetHashCode, or anything that they call, is likely to be called a lot of times, so paying attention to its performance is something that you should think about.

This are a couple of very easy things that I can do to make this easier, remove the call to the ByteString indexer (which show up as get_Item in the profiler results) to a direct array access and consolidate the calls to the ByteString.Length property.

After applying those two optimizations, we get the following code:

private static bool StringEqaulsToBuffer(ByteString byteString, ByteBuffer byteBuffer)
{
    var strLen = byteString.Length;
    if(strLen != byteBuffer.Length)
        return false;
    for (int i = 0; i < strLen; i++)
    {
        if(byteString.bytes[i] != byteBuffer.Buffer[byteBuffer.Offset+i])
            return false;
    }
    return true;
}

And this profiler result:

image

You can see that the this simple change resulted in drastic improvement to the StringEqualsToBuffer mehtod. As it stands now, I don’t really see a good way to optimize this any further, so I am going to look at the other stuff that showed up. Let us take a look at ByteBuffer.GetHashCode() now:

public override int GetHashCode()
{
    var ret = 23;
    for (var i = Offset; i < Offset+Length; i++)
    {
        ret = (ret << 8) | Buffer[i];
    }
    return ret;
}

The problem is that I don’t really see a way to optimize that, instead, I am going to cache that in a field. There is some problem here with the fact that ByteBuffer is mutable, but I can handle that by forcing all call sites that change it to call a method that will force hash recalculation. Note how different this decision is from the usual encapsulation that I would generally want. Placing additional burdens on call sites is a Bad Thing, but by doing so, I think that I can save quite significantly on the hash code calculation overhead.

Next, let us look at the DoCleanupIfNeeded method and see why it is taking so much time.

private void DoCleanupIfNeeded()
{
    if (strings.Count <= limit)
        return;

    // to avoid frequent thrashing, we will remove the bottom 10% of the current pool in one go
    // that means that we will hit the limit fairly infrequently
    var list = new List<KeyValuePair<ByteStringOrByteBuffer, Data>>(strings);
    list.Sort((x, y) => x.Value.Timestamp - y.Value.Timestamp);

    for (int i = 0; i < limit/10; i++)
    {
        strings.Remove(list[i].Key);                
    }
}

From the profiler output, we can see that it is an anonymous method that is causing the holdup, that is pretty interesting, since this anonymous method is the sort lambda. I decided to see if the BCL can do better, and changed that to:

private void DoCleanupIfNeeded()
{
    if (strings.Count <= limit)
        return;

    // to avoid frequent thrashing, we will remove the bottom 10% of the current pool in one go
    // that means that we will hit the limit fairly infrequently
    var toRemove = strings.OrderBy(x=>x.Value.Timestamp).Take(limit/10).ToArray();

    foreach (var valuePair in toRemove)
    {
        strings.Remove(valuePair.Key);                
    }
}

This isn’t really what I want, since I can’t take a dependency on v3.5 on this code base, but it is a good perf test scenario. Let us see what the profiler output is after those two changes:

image

This is much more interesting, isn’t it?

First, we can see that the call to ByteBuffer.GetHashCode went away, but we have a new one, ByteBuffer.ResetHash. Note, however, that ResetHash only took half as much time as the previous appearance of GetHashCode and that it is called only half as many times. I consider this a net win.

Now, let us consider the second change that we made, where previously we spend 11.1 seconds on sorting, we can see that we now spend 18 seconds, even if the number of calls is so much lower. That is a net lose, so we will revert that.

And now, it is the time for the only test that really matters, is it fast enough? I am doing that by simply running the test scenario outside of the profiler and checking to see if its performance is satisfactory. And so far, I think that it does meet my performance expectation, therefore, I am going to finish with my micro optimizations and move on to more interesting things.

Fighting the profiler memory obesity

When I started looking into persisting profiler objects to disk, I had several factors that I had to take into account:

  • Speed in serializing / deserializing.
  • Ability to intervene in the serialization process at a deep level.
  • Size (also effect speed).

The first two are pretty obvious, but the third requires some explanation. The issue is, quite simply, that I can apply some strategies to significantly reduce both speed & size of serialization by making sure that the serialization pipeline knows exactly what is going on (string tables & flyweight objects).

I started looking into the standard .NET serialization pipeline, but that was quickly ruled out. There are several reasons for that, first, you literally cannot hook deep enough into the serialization pipeline to do the sort of things that I wanted to do (you cannot override how System.String get persisted), and it is far too slow for my usages.

My test data started as a ~900Mb of messages, which I loaded into the profiler (resulting in a 4 GB footprint during processing and a 1.5GB footprint when processing is done). Persisting the in memory objects using BinaryFormatter resulted in a file whose size is 454Mb and whose deserialization I started before I started writing this post and at this point in time has not completed yet. Currently the application (simple cmd line test app that only does deserialization, takes 1.4 GB).

So that was utterly out. So I set out to write my own serialization format. Since I wanted it to be fast, I couldn’t use reflection, (BF app currently takes 1.6 GB) but by the same token, writing serialization by hand is labor intensive, error prone method. That lives aside the question of handling changes in the objects down the road, that is not something that I would like to do.

Having come to that conclusion, I decided to make use of CodeDOM to generate a serialization assembly on the fly. That would give me the benefits of no reflection, handle addition of new members to the serialized objects and would allow me to incrementally improve how (BF app now takes 2.2 GB, and I am getting ready to kill it). My first attempt in doing so, applying absolutely not optimization techniques, result in a 381 Mb file and an 8 seconds parsing time.

That is pretty good, but I wanted to do a bit more.

Now, note that this is an implementation specific for a single use. After applying a simple string table optimization, the results of the serialization are two files, the string table is 10Mb in length and the actual saved data is 215Mb and de-serialization takes ~10 seconds. Taking a look at what actually happened, it looked like the cost of maintaining string table is quite high. Since I care more about responsiveness than file size, and since the code for maintaining the string table is complex, I dropped that in favor of in memory only MRU string interning.

Initial testing shows that this should be quite efficient in reducing memory usage. In fact, in my test scenario, memory consumption during processing dropped down 4 GB to just 1.8 – 1.9 GB and 1.2 GB when processing is completed. And just using the application shows that the user level performance is pretty good, even if I say so myself.

There are additional options that I intend to take, but I’ll talk about them in a later post.

The operation was successful, but the patient is still dead… deferring the obvious doesn’t work

So, I have a problem with the profiler. At the root of things, the profiler is managing a bunch of strings (SQL statements, stack traces, alerts, etc). When you start pouring large amount of information into the profiler, the number of strings that it is going to keep in memory is going to increase, until you get to say hello to OutOfMemoryException.

During my attempt to resolve this issue, I figured out that string interning was likely to be the most efficient way to resolve my problem. After all, most of the strings that I have to display are repetitive. String interning has one problem, it exists forever. I spent a few minutes creating a garbage collectible method of doing string interning. In my first test, which was focused on just interning stack traces, I was able to reduce memory consumption by 50% (about 800Mb, post GC) and it is fully garbage collectible, so it won’t hung around forever.

Sounds good, right?

Well, not really. While it is an interesting thought experiment, using interning is a great way of handling things, but it only mask the problem, and that only for a short amount of time. The problem is still an open ended set of data that I need to deal with, and while there are a whole bunch of stuff that I can do to delay the inevitable, defeat is pretty much ensured. The proper way of doing that is not trying to use hacks to reduce memory usage, but to deal with the root cause, keeping everything in memory.

UberProf performance improvements, nothing helps if you are stupid

The following change took a while to figure out, but it was a huge performance benefit (think, 5 orders of magnitude). The code started as:

private readonly Regex startOfParametersSection = 
            new Regex(@"(;\s*)[@:?]p0 =", RegexOptions.Compiled);

And the optimization is:

private static readonly Regex startOfParametersSection = 
            new Regex(@"(;\s*)[@:?]p0 =", RegexOptions.Compiled);

The story behind this is interesting, this piece of code (and a few others like it) used to be in a class that has a singleton lifestyle. At some point, it was refactored into a command class that is created often, which obviously had… drastic effect on the system performance.

UberProf performance improvements, beware of linq query evaluation

This is a diff from the performance improvement effort of UberProf. The simple addition of .ToList() has significantly improved the performance of this function:

image

Why?

Before adding the ToList(), each time we try to run our aggregation functions on the statements enumerable, we would force re-evaluation of the filtering (which can be quite expensive). By adding ToList() I am now making the filtering run only once.

There is another pretty obvious performance optimization that can be done here, can you see it? And why did I choose not to implement it?

UberProf performance improvements, or: when O(N^3 + N) is not fast enough

While working on improving the performance of the profiler, I got really annoyed. The UI times for updating sessions with large amount of statements was simply unacceptable. We already virtualized the list UI, so it couldn’t be that. I started tracking it down, and it finally came down to the following piece of code:

protected override void Update(IEnumerable<IStatementSnapshot> snapshots)
{
    foreach (var statementSnapshot in snapshots)
    {
        var found = (from model in unfilteredStatements
                     where model.Id == statementSnapshot.StatementId
                     select model).FirstOrDefault();

        if (found == null)
        {
            unfilteredStatements.Add(
                new StatementModel(applicationModel, statementSnapshot) {CanJumpToSession = false});
        }
        else
        {
            found.UpdateWith(statementSnapshot);
        }
    }

    if (snapshots.Any())
    {
        hasPolledForStatements = true;
        StatementsChanged(this, 
            new StatementsChangedEventArgs {HasPolledForStatments = hasPolledForStatements});
    }
}

This looks pretty innocent, right? At a guess, what is the performance characteristics of this code?

It isn’t O(N), that is for sure, look at the linq query, it will perform a linear search in the unfilteredStatements, which is an O(N) operation. *

At first glance, it looks like O(N*N), right? Wrong!

unfilteredStatements is an ObservableCollection, and guess what is going on in the CollectionChanged event? A statistics function that does yet another O(N) operation for every single added statement.

Finally, we have the StatementsChanged handler, which will also perform an O(N) operation, but only once.

protected override void Update(IEnumerable<IStatementSnapshot> snapshots)
{
    var hasStatements = false;
    foreach (var statementSnapshot in snapshots)
    {
        IStatementModel found;
        statementsById.TryGetValue(statementSnapshot.StatementId, out found);
        hasStatements = true;

        if (found == null)
        {
            unfilteredStatements.Add(new StatementModel(applicationModel, statementSnapshot) {CanJumpToSession = false});
            statementsById.Add(statementSnapshot.StatementId, found);
        }
        else
        {
            found.UpdateWith(statementSnapshot);
        }
    }

    if (hasStatements == false) 
        return;
    HandleStatementsOrFiltersChanging(true);
}

The changes here are subtle, first, we killed off the O(N) linq query in favor of an O(1) lookup in a hashtable. Second, unfilteredStatements is now a simple List<T>, and HandleStatementsOrFilterChanging is the one responsible for notifications. Just these two changes were sufficent from the O(N^3+N) that we had before to a simple O(N+N) (because we still run some statistical stuff at the end) which collapses so a simple O(N).

Once that is done, showing sessions with 50,000 statements in them became instantaneous.

* Yes, it is supposed to be O(M) because the unfilteredStatements.Count is different than snapshots.Count, but I am simplifying here to make things easier.

UberProf performance improvements

I mentioned before that I run into some performance problems, I thought it would be interesting to see how I solved them.

The underlying theme was finding O(N) operations are eradicating them. In almost all the cases, the code is more complex, but significantly faster

Old code New code
public static class TimestampExtensions
{
    public static void AddInPlace<T>(this IList<T> self, T model)
        where T : IHaveTimestamp
    {
        int index = self.Count;
        for (var i = 0; i < self.Count; i++)
        {
            if (self[i].Timestamp > model.Timestamp)
            {
                index = i;
                break;
            }
        }
        self.Insert(index, model);
    }    
}
public static class TimestampExtensions
{
    public static void AddInPlace(this IList<SqlStatement> self, SqlStatement model)
    {
        var timestamp = model.Timestamp;
        var index = self.Count;
        if (index > 0 && timestamp >= self[index - 1].Timestamp)
        {
            self.Add(model);
            return;
        }
        for (var i = 0; i < self.Count; i++)
        {
            if (self[i].Timestamp < timestamp) 
                continue;
            index = i;
            break;
        }
        self.Insert(index, model);
    }    
}

Here we optimize for the common case where we add the statement to the end of the list. We also changed from generic method with interface constraint to the actual type used. That allows the JIT to apply additional optimizations. That method alone was responsible for almost 10 minutes in my test runs.

Old code New code
var statement = session.Statements
    .Reverse()
    .Where(x => x.SessionId.Timestamp <= queryDuration.SessionId.Timestamp)
    .FirstOrDefault();
var statement = session.Statements.LastOrDefault();

session.Statements is already ordered by the statements timestamps. The code perform a slightly differently here, and it is less robust to async message streaming (when events that happened sequentially arrive in different order than they occured), but that is rare enough that I feel that the performance boost from this is acceptable.

Old code New code
public override void AfterAttachingToSession(
SessionInformation sessionInformation, SqlStatement statement) { var isSelectStatement = statement.IsSelectStatement; if (isSelectStatement == false) return; if (statement.RawSql == mySqlSelectLastIdentity) return; int countOfStatementsWithIdenticalSql = sessionInformation.Statements .Count(x => x.RawSql == statement.RawSql); if (countOfStatementsWithIdenticalSql <
configuration.MaxRepeatedStatementsForSelectNPlusOne) return; statement.AcceptAlert(AlertInformation.SelectNPlusOne()); }
public int GetCountOfStatementsByRawSql(string rawSql)
{
    int value;
    countsByRawSql.TryGetValue(rawSql, out value);
    return value;
}

public void AddStatement(SqlStatement statement)
{
    int count;
    if (countsByRawSql.TryGetValue(statement.RawSql, out count) == false)
        count = 0;
    countsByRawSql[statement.RawSql] = count + 1;
    
    statements.AddInPlace(statement);
    StatementsChanged();
}


public override void AfterAttachingToSession(
    SessionInformation sessionInformation, SqlStatement statement)
{
    var isSelectStatement = statement.IsSelectStatement;
    if (isSelectStatement == false)
        return;

    if (statement.RawSql == mySqlSelectLastIdentity)
        return;

    if (sessionInformation.GetCountOfStatementsByRawSql(statement.RawSql) < 
        configuration.MaxRepeatedStatementsForSelectNPlusOne)
        return;
    statement.AcceptAlert(AlertInformation.SelectNPlusOne());
}

On the left, we have the old code, that did the calculations manually. On the right, the two top methods will do the calculation up front, and then it is just a matter of reading it in the appropriate times.

Old code New code
protected void ClearStackTraceFromInfrastructureFrameworks(StackTraceInfo stackTrace)
{
    if (configuration.InfrastructureNamespaces == null)
        return;
    
    stackTrace.Frames =stackTrace.Frames.Where(
        frame =>
        {
            var isInfrastructureNamespace = 
                configuration.InfrastructureNamespaces.Any(
                ns => frame.Namespace != null && frame.Namespace.StartsWith(ns)
                );

            return isInfrastructureNamespace == false ||
                   configuration.NonInfrastructureNamespaces.Any(
                    ns => frame.Namespace != null && frame.Namespace.StartsWith(ns)
                    );
        }).ToArray();
}
public StackTraceProcessor(ProfilerConfiguration configuration)
{
    this.configuration = configuration;
    var sb = new StringBuilder();
    foreach (var infrastructureNamespace in 
        configuration.InfrastructureNamespaces ?? new List<string>())
    {
        sb.Append("(^")
            .Append(Regex.Escape(infrastructureNamespace))
            .Append(")|");
    }
    sb.Length = sb.Length - 1;
    namespacesToClear = new Regex(sb.ToString(),
        RegexOptions.Compiled | RegexOptions.IgnoreCase);
    sb.Length = 0;

    foreach (var infrastructureNamespace in 
        configuration.NonInfrastructureNamespaces ?? new List<string>())
    {
        sb.Append("(^")
            .Append(Regex.Escape(infrastructureNamespace))
            .Append(")|");
    }
    sb.Length = sb.Length - 1;
    namespacesToKeep = new Regex(sb.ToString(), 
        RegexOptions.Compiled | RegexOptions.IgnoreCase);
}

protected void ClearStackTraceFromInfrastructureFrameworks(StackTraceInfo stackTrace)
{
    if (configuration.InfrastructureNamespaces == null)
        return;

    stackTrace.Frames =
        stackTrace.Frames.Where(IsValidStackFrameMethod).ToArray();
}

private bool IsValidStackFrameMethod(StackTraceFrame frame)
{
    if (frame.Namespace == null)
        return true;
    if(namespacesToClear.IsMatch(frame.Namespace) == false)
        return true;
    return namespacesToKeep.IsMatch(frame.Namespace);
}

Interestingly enough, this code produced much better runtime performance, simply because, while it is more complex, the number of iterations that are happening are so much smaller.

Overall, the change result in order of magnitude or two difference for processing messages in the profiler pipeline.

UberProf performance challenges

One of the things that the profiler is supposed to do is to handle large amount of data, so I was pretty bummed to hear that users are running into issues when they try to use the profiler in high load scenarios, such as load tests.

Luckily, one of the things that the profiler can do is output to a file, which let me simulate what is going on on the customer site very easily.

The first performance problem is that the profiler isn’t processing things fast enough:

image

The second is that there are some things that just totally breaks my expectations

image

There are solutions for those problems, and I guess that I know what I’ll be doing in the near future :-)

Transactional queuing system perf test

After running into sever performance issues when using MSMQ and transactions, I decided to run a more thorough test case.

Writing 10,000 messages to MSMQ Transactional Queue, separate transactions:

private static void AddData(MessageQueue q1, byte[] bytes)
{
    Console.WriteLine("{0:#,#}", bytes.Length);
    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        using (var msmqTx = new MessageQueueTransaction())
        {
        
            msmqTx.Begin();
            
            q1.Send(new Message
            {
                BodyStream = new MemoryStream(bytes),
                Label = i.ToString()
            }, msmqTx);

            msmqTx.Commit();
        }

        if (i%10 == 0) Console.WriteLine(i);
    }
    q1.Dispose();

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This took 6.08 seconds, or about 1,600 messages per second.

Writing 10,000 messages to MSMQ Transaction Queue, single transaction:

private static void AddData(MessageQueue q1, byte[] bytes)
{
    Console.WriteLine("{0:#,#}", bytes.Length);
    var sp = Stopwatch.StartNew();
    using (var msmqTx = new MessageQueueTransaction())
    {

        msmqTx.Begin(); for (int i = 0; i < 10000; i++)
        {


            q1.Send(new Message
            {
                BodyStream = new MemoryStream(bytes),
                Label = i.ToString()
            }, msmqTx);


            if (i % 10 == 0) Console.WriteLine(i);
        }
        msmqTx.Commit();
    }
    q1.Dispose();

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This took 0.825 seconds for 10,000 messages, so it should be able to process about 12,000 messages per second. But this is not a realistic scenario.

Now, to my test scenario, which is touching two queues for 10,000 messages…

Wait! I accidently run the wrong code, and that led me down the following path:

private static void JustTx()
{
    var sp = Stopwatch.StartNew();
    int msgs = 0;
    for (int i = 0; i < 1000; i++)
    {
        using (var msmqTx = new MessageQueueTransaction())
        {
            msmqTx.Begin();
            msgs = msgs + 1;
            if (msgs % 10 == 0)
                Console.WriteLine(msgs);
            
            msmqTx.Commit();
        }
    }
    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This code is just running & committing MSMQ transaction, no actual operation is done on the queue, and there shouldn’t be any work for the system. We open & close a 1,000 transactions. That takes 17 seconds.

Remember, this is just open/close MSMQ local transaction, with no work done, and that gives me about 60 transactions per second. I think that I found my culprit.  I have other independent verification of this, and I find it extremely sad.

I am still waiting to hear from MSMQ experts about what is going on.

In the meantime, I tested this with my own queuing project, Rhino Queues. The first thing that I tried was the simplest, unrealistic scenario of sending 10,000 messages in a single transaction:

private static void AddData(IQueueManager manager, byte[] bytes)
{
    var sp = Stopwatch.StartNew();
    using (var tx = new TransactionScope())
    {
        for (int i = 0; i < 10000; i++)
        {
            manager.Send(new Uri("rhino.queues://localhost:2222/test1"), new MessagePayload
                {
                    Data = bytes
                });


            if (i % 10 == 0)
                Console.WriteLine(i);
        }
        tx.Complete();
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This code takes 12.2 seconds to run, giving about 800 message per second. Not bad, but not really good either. In this scenario MSMQ finished everything in less than a second.

Let us see a more realistic scenario, of sending 10,000 messages in 10,000 separated transactions:

private static void AddData(IQueueManager manager, byte[] bytes)
{
    var sp = Stopwatch.StartNew();

    for (int i = 0; i < 10000; i++)
    {
        using (var tx = new TransactionScope())
        {
            manager.Send(new Uri("rhino.queues://localhost:2222/test1"), new MessagePayload
            {
                Data = bytes
            });


            if (i % 10 == 0)
                Console.WriteLine(i);
            tx.Complete();
        }
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This completes in 3.7 minutes, for about 45 messages per second. Slightly worse than what MSMQ can do, which isn’t good.

However, the good part about using Rhino Queues is that I know what is going on there and I intentionally left some points of optimization out (get it working, get it working right, get it working fast). After exploring some of those optimizations, the same code base run for 2.7 minutes, so we saved 60 seconds on the runtime, bringing us to 60 messages per second.

Rhino Queues is now comparable to MSMQ performance in this scenario. I find this spooky, to tell you the truth. Profiling Rhino Queues tells me that most of the time with Rhino Queues (over 40%!) is spent not in Rhino Queues, but inside System.Transactions.Transaction.Dispose().

I wonder how I can reduce that load.

The next thing I tried was implementing ISinglePhaseNotification. This means that if there are no other durable enlistments for the DTC, Rhino Queues will be able to take advantage of the lightweight transactions support in System.Transactions.

That change had a dramatic effect. I can’t really grasp the perf difference!

The code (same code!) now executes in 16.7 seconds! That means 600 messages per second.

Of course, things aren’t as good when you compare the CopyData routine:

private static void CopyData(IQueueManager manager)
{

    Stopwatch sp = Stopwatch.StartNew();
    for (int i = 0; i < 10000; i++)
    {
        using (var tx = new TransactionScope())
        {
            var message = manager.Receive("test1");
            manager.Send(new Uri("rhino.queues://localhost/test2"), new MessagePayload
            {
                Data = message.Data,
                Headers = message.Headers
            });

            if (i % 10 == 0)
                Console.WriteLine(i);
            tx.Complete();
        }
    }

    Console.WriteLine("{0:#,#}", sp.ElapsedMilliseconds);
}

This guy takes 5.7 minutes to complete, at a rate 30 messages per second. That is pretty lame. Similar code with MSMQ did 60 message per second on my machine, and I considered that utterly unacceptable.

Firing up dotTrace showed me that I was being STUPID (as in, not just stupid, or Stupid with a capital s, but full fledged stupid in all caps) and didn’t clean out the queue from read messages. After I fixed that, the same code run in 1.5 minutes, or about 110 messages per second.

Nice.

And I still consider being almost twice as fast as MSMQ to be very spooky.

Asynchronous order processing

One of the more common challenges that I run into when discussing the notion of async as the main communication mechanism is that there seems to be an entrenched belief that things should be synchronous. It appear to make things simpler, from a conceptual level, while making them significantly more difficult to actually implement them in a production worthy way.

Arguably the most common issue that I hear about is with downloadable materials, and it can be summed up as some variation of:

What do you mean we aren’t going to just start downloading stuff immediately?

That was why I was heartened when I tried to get a few Kindle books (ebooks) and saw that:

image

After about 5 minutes, it changed to this and allowed downloading this:

image

It is sad to say it, but applying to authority may suck as a argument tactic in a moral sense, but it is a damn effective one. Being able to say, here are all the reasons that you want to do that, and this is how Amazon is doing things, make it so much easier to sell this approach.

Analyzing a performance problem – Is a prisoner dangerous?

Recently I run into a performance problem in an application that I was reviewing, and I thought that it would make a great post. Since I can’t use the actual application model, I decided that I am tired of using the same old online shop model and turned to the one domain in which I am a domain expert. Prisons.

Let us imagine that this is part of the Prison 10’s* Dashboard. It looks pretty simple, right?

image

Let us talk about this as SQL, ignoring all layers in the middle. We can express this as:

SELECT TOP 10 Prisoner.Id, Prisoner.FirstName, Prisoner.LastName, Prisoner.Status
FROM Prisoner JOIN Imprisonment on Prisoner.Id = Imprisonment.Prisoner
WHERE Imprisonment.Prion = 6 AND Imprisonment.IsCurrent = 1
ORDER BY Imprisonment.ArrivalDate DESC

Seems simple enough, right? But notice that we don’t have Prisoner.Dangerous column on the result. That is because there is no such column in the table. So what does our good developer do? He goes to the domain expert and ask him where he is supposed to get the data from. The domain expert answer is that this is not an attribute of the prisoner itself, it is a decision based on many factors, any of which can flag a prisoner as a dangerous. He then gives the developer a list of the few common ones:

  • If the prisoner is not convicted
  • If the prisoner has a disciplinary notice in the last 2 months
  • If the prisoner has more than one disciplinary notice in the last 6 months
  • If the prisoner is charged with violent crime
  • If the prisoner is currently in withdrawal
  • If the prisoner request for vacation was denied in the last 3 months

As you can see, the list goes on (and on, and on). Our developer starts working on the problem. Since he is working on a small local database, he is starting out with a spike of everything as in memory filters. It is beautifully Object Oriented, and it looks like this:

public class NotConvicted : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.Status != PrisonerStatus.Convicted;
    }
}

public class DisciplinaryNoticeInLast2Months : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.DisciplinaryNotices
                .Any(x=>x.Date > DateTime.Today.AddMonths(-2));
    }
}

public class MorethanOneDisciplinaryNoticeInLast6Months : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.DisciplinaryNotices
                .Where(x=>x.Date > DateTime.Today.AddMonths(-6))
                .Count() > 1;
    }
}

public class ChargedWithViolentCrime : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.Charges
                .Any(x=>x.ChargeStatus.IsViolent);
    }
}

public class CurrentlyInDrugWithdrawal : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.MedicalRecord
                .Withdrawals
                .Any(x=>x.IsCurrent);
    }
}

public class VacationRequestDeniedInLast3Months : IPrisonerDangerousEvaluator
{
    public bool Eval(Prisoner p)
    {
        return p.Requests
                .Where(x=>x.RequestType == RequestType.Vacation && x.Date > DateTime.Today.AddMonths(-3))
                .Any(x=>x.Status.Approved == false);
    }
}

Now, this works, and from OO perspective, it works just great. From a performance perspective, this is horrible. Let us do the math of how many queries this is going to generate, oaky?

  • Get all prisoners – 1
    • For each prisoner – N(Prisoners)
      • NotConvicted – 0 (load data in the prisoner entity, which was already loaded)
      • DisciplinaryNoticeInLast2Months – 1 (load DisciplinaryNotices)
      • MorethanOneDisciplinaryNoticeInLast6Months – 0 (DisciplinaryNotices already loaded)
      • ChargedWithViolentCrime – 1 (load Charges)
      • CurrentlyInDrugWithdrawal – 2 (load Medical Record, load Withdrawals)
      • VacationRequestDeniedInLast3Months – 2 (load Requests, load Status)

This isn’t SELECT N+1 ! This is so much worse. What we have is actually: SELECT 1 + N(1+1+2+2)

Or, in other words, assuming we want to display the first ten inmates, showing that little grid up above is going to result in 41 queries!

This actually show us several important worst practices:

  • In memory filtering should be avoided if possible.
  • Trying to ignore the realities of data access is going to bite you, hard.
  • There is a good reason for the call for aggregate roots and the avoidance of highly connected object graphs.

So, what is the obvious approach? We can change the way that we load the data. Instead of trying to go through the object model, we can query the data directly from the database. It is going to look something like this:

SELECT TOP 10 Prisoner.Id, Prisoner.FirstName, Prisoner.LastName, Prisoner.Status,
(
    SELECT 1 
    WHERE Prisoner.Status != 'Convicted'
    OR EXISTS (SELECT 1 FROM DisciplinaryNotices 
        WHERE DisciplinaryNotices.Prisoner = Prisoner.Id
        AND   DisciplinaryNotices.Date > getdate()-60)
    OR 1 < (SELECT COUNT(*) FROM DisciplinaryNotices 
        WHERE DisciplinaryNotices.Prisoner = Prisoner.Id
        AND   DisciplinaryNotices.Date > getdate()-180)
    OR  EXISTS (SELECT 1 FROM Charges
        WHERE Charges.Prisoner = Prsioner.Id
        AND Charges.IsViolent = 1)
    OR  EXISTS (SELECT 1 FROM MedicalRecords JOIN Withdrawals ON MedicalRecords.Id = Withdrawal.Record
        WHERE MedicalRecords.Prisoner = Prisoner.Id
        AND Withdrawal.IsCurrent = 1)
    OR  EXISTS (SELECT 1 FROM Requests JOIN RequestsStatus ON Requests.Id = RequestsStatus.Request
        WHERE Requests.Prisnoer = Prisoner.Id
        AND   Request.Date > getdate() - 90
        AND   RequestsStatus.Approved = 0
        AND   Request.Type = 'Vacation')
) as Dangerous 
FROM Prisoner JOIN Imprisonment on Prisoner.Id = Imprisonment.Prisoner
WHERE Imprisonment.Prion = 6 AND Imprisonment.IsCurrent = 1
ORDER BY Imprisonment.ArrivalDate DESC

From the point of view of database access, we are in a much better position, we now have only a single query to execute. For that matter, we can change the IPrisonerDangerousEval to perform their evaluation in the database, retaining the nice OO model that we have. It would look something like this:

public class NotConvicted : IPrisonerDangerousEvaluator
{
    public void AddSelectionCriteria(DetachedCriteria crit)
    {
        return crit.Add(Restrictions.NotEq("Status", PrisonerStatus.Convicted));
    }
}

public class DisciplinaryNoticeInLast2Months : IPrisonerDangerousEvaluator
{
    public void AddSelectionCriteria(DetachedCriteria crit)
    {
        return crit.CreateCriteria("DisciplinaryNotices)
                   .Add(Restrictions.Gt("Date", DateTime.Today.AddMonths(-2)));
    }
}

Note, doing something like this require a fairly interesting infrastructure, to make sure that the results are consistent and each rule isn’t stepping on its other toes.

Problem solved?

Not… really.

Take a look at the SQL that we have above. This query is hitting 8 tables, and please remember that I explicitly stated that I kept the number of things that make a prisoner dangerous limited, there are about forty rules relating to this, not half a dozen. But even with just half a dozen, trying to optimize the above query is going to be… problematic.

This is where we figure out something really interesting. In most systems, the number of reads far outweigh the number of writes. The most basic optimization that we can do is move work from the read phase of the application to the write phase, since it is going to execute so much less often.

It seems like it would be a very simple solution for the application to execute the rules on save and set the IsDangerous property, right? Well, it would, except that there are so many different places in the application that can need this. Remember, there are a lot of rules, and just about anything can change that. So we would need to execute the rules whenever we change something in the application.

Here, again, we see the importance of aggregates, because instead of spreading that logic all over the system, we can define Prisoner as the aggregate, and force all changes to it to occur through the entity. When saving a prisoner, then we can execute this.

However, there is another aspect, this can cause an unacceptable performance for saving, since we expect the number of rules to grow and only become more complex over time. This looks like a good opportunity to to shift work to a background process. There is also the matter that we have to process those rules against every day, to make sure that we reset any rules that depend on time.

I would build such a system using the following structure:

image

The web server would send a notification to the batch processing server whenever a prisoner aggregate root is updated. The batch server will then process all rules for the prisoner. Once a day, the batch server will re-evaluate the rules for all prisoners. This solves the performance problem that we have when updating a prisoner. It introduce a new one, a problem of consistency. Now, there is going to be some delay between updating a prisoner and updating the IsDangerous status. In practice, I would expect the window of inconsistency to be be in the order of the time it takes to process all the rules. I would be extremely surprised if it was more than a second, and probably a lot less.

In the duration, we can show in the UI that the prisoner’s danger status in unknown. The UI can then query the application periodically until the danger status became known again using Ajax. From the user experience point of view, there would be almost no change, but in cases where the evaluation takes long enough for the user to notice, it will be made explicit what is going on.

Yes, it is a more complex solution, but it is also one that neatly avoids all the complications that we have run into during this post.

* I am sorry, but this post is going to be choke full of jokes that only I will probably understand.

Out of process session state vs. explicit state management

The question came up in the alt.net mailing list, and I started replying there, before deciding that it would make a great post. I actually want to talk specifically about the notion that came of avoiding explicit state management in favor of out of process session state (on top of a database or memcached).

The problem is that those two are not interchangeable. Using a session make it easy to preserve the illusion of statefullness on the web, but it is an illusion. In general, you have to give me pretty good reasons before I will go with an abstraction layer. Session state make a lot of sense, but it carry with it its own set of problems.

On the face of it, explicit state management (making direct call to memcached or database) seems like it share the same problem, with the additional problem is that you don’t get the session abstraction to manage that for you.

That is true, but there are distinct advantages for that.

  • Different pieces of the state usually have different longelivity, refresh rate and scope scenarios. With a session, you get a single option (usually predetermined) for all of them.
  • You can pick and choose what you want to read, resulting in less data moving on the network. With the session, you have to get it all or none at all.
  • Reduced writes, because you only write on changes, the session has to flush itself (or do non trivial change tracking) every time.

There is also another issue, of concurrency. With a session, in order to maintain the sequential facade, you take a lock on the session for the duration of the request. That tend to create contention if you have concurrent request from the same user (which are quite frequent now, thanks to ajax and multi tabs), I now that we had to write code around that several times.

An allegory of an optimization story

I got a call today from a team mate about a piece of software that we wrote. We initially planned it to work with data sizes of hundreds to thousands of records, and considering that we wrote that one in a day, I considered it a great success. We didn’t pay one bit of attention to optimization, but the perf was great for what we needed.

However, a new requirement came up which require us to handle hundred thousand records, and our software was working… well under the new constraints. As a matter of fact, it would take about 0.4 seconds to compute a result when the data size was hundred thousand records. That was probably well within acceptable range, but it had me worried, because 0.4 seconds may be acceptable for this single computation, but it was a CPU bound computation, and it was probably going to kill us when we start hammering the server with multiple requests for that.

Moreover, adding 0.4 to the clients of the system would add an unacceptable delay. So we sat down with dotTrace and tried to figure out what we were doing that could be better.

We spent several hours on that. At some point, a simple extraction of a variable from a loop reduced 7% of the total execution time, but we weren’t really interesting in the small fish. We soon identified a O(N) operation in the code, which was taking about 20% of the total time of the calculation, and focused on that.

By changing the way that we worked with this, we changed this from an O(N) operation to O(1). We did it by introducing indexes and doing intersects of those indexes. The usual choice of time vs. memory has saved us again.

Current time for computing a result? 20 – 50 ms, and that include network time.

I think that I’ll find that acceptable.

It depends on what you are optimizing for…

Today I was writing this code:

public class FakeRandomValueGenerator : IRandomValueGenerator
{
    private readonly int valueToReturn;

    public FakeRandomValueGenerator(int valueToReturn)
    {
        this.valueToReturn = valueToReturn;
    }

    public int Next(int min, int max)
    {
        return valueToReturn;
    }
}

This caused some concern to my pair, who asked me why I was hand rolling a stub instead of using a mocking framework. My answer was simple.

It will take me less time to write the class than it would take me to bring up the Add Reference dialog.

Schema-less databases

This post about how Friend Feed is using schema-less storage for most of their work is fascinating. In the ALT.Net Seattle there was a session about that, which generated a lot of interest.

My next post will have more details about the actual implementation details of doing something like that in a manner easily accessible in .Net, but just reading the post is very interesting. Another item that I found that was an interesting read, although it is far harder to read is: http://highscalability.com/how-i-learned-stop-worrying-and-love-using-lot-disk-space-scale

Reducing the cost of getting a stack trace

image I am trying to find ways to reduce the cost of the stack trace used in NH Prof. The access to the stack trace is extremely valuable, but there is a significant cost of using it, so we need a better way of handling this. I decided to run a couple of experiment running this.

All experiments were run 5,000 times, on a stack trace of 7 levels.

  • new StackTrace(true) - ~600ms
  • new StackTrace(false) - ~150ms

So right there, we have a huge cost saving, but let us continue a bit.

  • throwing exception - ~400ms

That is not so good, I have to say.

Well, when in doubt, cheat!

Using reflector and some _really_ nasty stuff, I came up with this:

var stackFrameHelperType = typeof(object).Assembly.GetType("System.Diagnostics.StackFrameHelper");
var GetStackFramesInternal = Type.GetType("System.Diagnostics.StackTrace, mscorlib").GetMethod("GetStackFramesInternal",BindingFlags.Static|BindingFlags.NonPublic);
 
var method = new DynamicMethod("GetStackTraceFast",typeof(object),new Type[0],typeof(StackTrace),true);
 
var generator = method.GetILGenerator();
generator.DeclareLocal(stackFrameHelperType);
generator.Emit(OpCodes.Ldc_I4_0);
generator.Emit(OpCodes.Ldnull);
generator.Emit(OpCodes.Newobj, stackFrameHelperType.GetConstructor(new[] { typeof(bool), typeof(Thread) }));
generator.Emit(OpCodes.Stloc_0);
generator.Emit(OpCodes.Ldloc_0);
generator.Emit(OpCodes.Ldc_I4_0);
generator.Emit(OpCodes.Ldnull);
generator.Emit(OpCodes.Call, GetStackFramesInternal);
generator.Emit(OpCodes.Ldloc_0);
generator.Emit(OpCodes.Ret);
getTheStackTrace = (Func<object>)method.CreateDelegate(typeof(Func<object>));

Calling getTheStackTrace 5000 times with depth of 7 frames is… 54ms. And now that is a horse of a different color indeed.

And the best part is, I can use the StackFrameHelper as a key into cached stack traces.

And yes, I am aware that if anyone from the CLR team is reading this, a ninja team will be dispatched to… discuss with me the notion of supported operations vs. unsupported operation.

NH Prof: Performance implications for the profiled application

I got a couple of complaints about this, and I think that this is a really important issue. Using NH Prof slow down the profiled application. That is to be expected, since we require the application to do more work, but it is not as good as we can hope it would be.

Here is the test scenario, running against the Northwind database (thanks Davy for supplying it):

class Program
{
	static void Main(string[] args)
	{
		var sp = Stopwatch.StartNew();

		HibernatingRhinos.NHibernate.Profiler.Appender.NHibernateProfiler.Initialize();
		
		ISessionFactory sessionFactory = CreateSessionFactory();

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		using (var session = sessionFactory.OpenSession())
		{
			var products = session.CreateCriteria(typeof(Product))
				.SetCacheable(true).List<Product>();
		}

		Console.WriteLine(sp.ElapsedMilliseconds);
	}

	private static ISessionFactory CreateSessionFactory()
	{
		Configuration configuration = new Configuration()
			.Configure()
			.AddAssembly("nhprof_experiment");

		return configuration.BuildSessionFactory();
	}
}

With NH Prof active, this takes: 13 - 16 seconds to run.

Without NH Prof running, this takes: 12 seconds to run.

The reason for that difference is probably that NH Prof is doing some work when new events come in, so it steals some CPU time from the other process.

Disabling NH Prof (commenting the NHibernateProfiler.Initialize()), this takes 4 - 6 seconds to run. This is a huge difference.

I decided to see whatever this is something that I can actually do something about, and turned the NHibernateAppender into a no op. So the next run would measure just the time that it takes to NHibernate to generate the event that I care about, and not whatever processing I am doing on the profiled application side.

The result both surprised and thrilled me: 4.8 seconds!

Why was I happy about this?

Because it meant that I was actually at fault here. And if I was actually at fault, it means that I can fix things. I pulled up the trusty dotTrace Profiler, and quickly identified a hot spot:

image

Well, in this case, you need to understand a bit about how NH Prof get the data from the profiled application. We gather the event stream and batch send it to the actual profiler. Those sends are RPC calls, and it is expected that they would be slow. Except, they are not happening in the same thread as the actual application, so they should have little effect on the actual performance of the test scenario that we have above.

I decided to try it out again, this time, we will change FlushMEssagesToProfiler and SendStatsticInformation to no ops, and see what happens.

Running the application again, we got runtime of 13 seconds. That confirmed my original suspicion, that while those actions takes a long time, they do not contribute to the actual perf problem in the actual code.

I run the code again under the profiler, looking deeper. That led me to this discovery:

image

Well, I knew that stack trace was slow, but I sincerely thought that I was doing better job in reducing the number of times that I had to call it. Calling it 4,178 times is obviously wrong.

We track the stack trace for the following event sources:

  • NHibernate.SQL
  • NHibernate.Event.Default.DefaultLoadEventListener
  • NHibernate.Cache.StandardQueryCache
  • NHibernate.Transaction.AdoTransaction

The problem is that some things, like DefaultLoadEventListener, are generating a lot of events that I don't really care about. I started approaching the problem by applying filters, so I would only ever create the stack trace for the events I truly care about. That brought the cost down significantly, so now it took 6 - 7 seconds. We shaved 5 - 10 seconds of the total running time, awesome.

But I think that we can still do better than that. Let us see what dotTrace is saying:

image

dotTrace is saying that we now reduced the time this takes from 31% to 3.5%. We have better things to do than to look at this piece of code again, then.

Let us focus on something that will give us more benefit, then:

image

This is actually the initial RPC call between the appender and the profiler. It establish the TCP connection, among other things, so it takes a while to process. Almost 0.7 seconds is a lot of time. I changed that code to run on a background thread, further improving the startup time of the application.

Time to execute the code now? 5.8 seconds.

Time to go to dotTrace again, I think.

image

And we can see that what dominate the actual time for the test is the startup concerns of NHibernate. But I think that in a real scenario, the time for generating a stack trace would again become a meaningful for performance.

There are probably additional things that I can do to optimize this, but I think that this is a  great step. To remind you, the initial costs were:

  • No profiler: 4 - 6 seconds
  • With profiler: 13 - 16 seconds

After the changes, we are with: 5 - 7 seconds, which I consider entirely reasonable.

And for a few hours of performance tuning, reducing the time by 8-9 seconds and by over 60%, I think I did good.

The code pass all the tests and was committed, so in about ten minutes, you would be able to download it from the NH Prof site.

A WCF Perf Mystery

Anyone can tell me why this is taking a tad over 11 seconds?

   1: class Program
   2: {
   3:     static void Main(string[] args)
   4:     {
   5:         try
   6:         {
   7:             var sw = Stopwatch.StartNew();
   8:             var host = new ServiceHost(new Srv(), new Uri("net.tcp://localhost:5123"));
   9:             host.AddServiceEndpoint(typeof (ISrv), new NetTcpBinding(), new Uri("net.tcp://localhost:5123"));
  10:             host.Open();
  11:  
  12:             var srv = ChannelFactory<ISrv>.CreateChannel(new NetTcpBinding(),
  13:                                                              new EndpointAddress(new Uri("net.tcp://localhost:5123")));
  14:             srv.Test("hello"); // if I remove this, it finishes in 0.3s - 0.5s
  15:  
  16:             host.Close();
  17:  
  18:             Console.WriteLine(sw.Elapsed);
  19:         }
  20:         catch (Exception e)
  21:         {
  22:             Console.WriteLine(e);
  23:         }
  24:     }
  25: }
  26:  
  27: [ServiceContract]
  28: public interface ISrv
  29: {
  30:     [OperationContract]
  31:     int Test(string x);
  32: }
  33:  
  34: [ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Multiple)]
  35: public class Srv : ISrv
  36: {
  37:     public int Test(string x)
  38:     {
  39:         return x.GetHashCode();
  40:     }
  41: }

The reason that I care is that I am doing that in my tests, and this is significantly slow them down.

Is there anything that I am missing?

NH Prof: Getting to zero friction

Icon

Here is a new (passing) integration test in the NHibernate Profiler:

image

Just to give you an idea, here is the implementation of this rule, which has to be one of the more complex ones that we have at the moment, because the data for it comes from several sources, so we need to actually execute the logic in an event, instead of directly.

public class TooManyRowsReturnedPerQuery : IStatementProcessor
{
private ProfilerConfiguration configuration;

public TooManyRowsReturnedPerQuery(ProfilerConfiguration configuration)
{
this.configuration = configuration;
}

public void AfterAttachingToSession(SessionInformation sessionInformation, FormattedStatement statement,
OnNewAction newAction)
{
statement.ValuesRefreshed += delegate { if (statement.CountOfRows == null)
return;
if (statement.CountOfRows > configuration.MaxRowCountThatWouldGenerateWarning)
newAction(new ActionInformation
{
Severity = Severity.Warning,
Title = "This query returned an excessive number of rows."
});
else if (statement.CountOfRows > configuration.MaxRowCountThatWouldGenerateSuggestion)
newAction(new ActionInformation
{
Severity = Severity.Suggestion,
Title = "This query returned a large number of rows."
});
};
}
}

This is what I when I am talking about zero friction software. Assuming that you understand the architecture of the system, which isn't hard, you can just plug things in. Not only that, but the system follows the Open Close Principle. By far, most changes to the system involves adding new code, rather than modifying existing code.

Another interesting aspect, notice just how many things are handled for you by the infrastructure, and what the resulting code looks like.

NHProf: What is the role of the DBA?

IconI am now discussing what sort of reports we want to give the DBA from the NHibernate Profiler. My first thought was just to give the DBA a list of the statements that the application has executed and the number of times they were repeated.  That should allow him to get enough information to use his own tools to optimize the application physical data structure.

What do you think? Is this a good scenario?

What other scenarios can you see for NH Prof in the hands of the DBA?

NHProf: The stack is not as simple as you wish it to be

One of the nicest features that NH Prof has to offer is this, allowing you to go from the query issued to the database directly to the line of code that caused this query to be generated. Icon

image 

A few days ago I posted that you can either build something in one day, or in three months, but nothing much in the middle. The proof of concept that convinced me that I can build NH Prof was written during a single evening, along with two pints of Guinness. The overall concept that I have now is drastically different, but it is that evening that made sure that I would start it.

However, I wanted to use this opportunity to talk about some of the things that take a one day project and stretch it into three months. In this case, the query to line of code is a good feature, because it looks simple. After all, the only thing that I need to do, basically, is just to use the built in facility to get the stack trace, and extrapolate from there. Well, even for the normal case, that is not so easy. There are issues of user experience that play a deciding role in how we are going to deal with this.

But this isn't what was really challenging. The challenging part came from the law of least surprises. That is, if I am providing a stack trace for queries, I should provide them for all queries. As it turned out, there was one scenario in which I am not able to give a complete stack trace. When we are mixing queries generated as a result of a lazy proxy being instantiated. On the whole, this is a minor concern, but it really bugged me. You would get a stack trace that just ended in some weird type that you never seen before.

At that point, I felt like I had to solve this.

Let us go back and outline the problem. Stack traces relies on PDBs to do their work. NHibernate's lazy proxies are generated at runtime, have no PDBs and mask the actual call to the method / property that we are actually are interested in. The mix of the two wasn't encouraging, I must say.

Given that, I decided to pull out the big guns, and started messing with the PDB myself. Obviously, I don't want to do it on my own, that is why Cecil is so useful. But it was still a bit tricky to get it right. And the funny part is that I am pretty certain that this is one feature that absolutely no one will actually notice. If it wasn't there, however, people would notice.

Anyway, it 3:30 AM here, and for some reason I managed to do more work tonight than in the previous week(!). NH Prof is now in private beta, and I hope to be able to take it to public beta in a few weeks.