Ayende @ Rahien

How to lead a convoy to safety

I recently ran into a convoy situation in NH Prof. Under sustained heavy load (not a realistic scenario for NH Prof), something very annoying would happen.

Messages would stream in from the profiled application faster than NH Prof could process them.

The term that I use for this is Convoy. It is generally bad news. With NH Prof specifically, it meant that it would consume larger and larger amounts of memory, as messages waiting to be processed queued up faster than NH Prof could handle them.

NH Prof uses the following abstraction to handle queuing:

public interface IQueue<T>
{
    void Enqueue(T o);
    T Dequeue();
    bool IsEmpty { get; }
}

Now, there are a few things that we can do to avoid having a convoy. The simplest solution is to put some threshold on the queue and just start dropping messages once we reach it. NH Prof is actually designed to handle such things as an interrupted message stream, but I don't think that would be a nice thing to do.
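As a sketch of that first option (this is not NH Prof's actual code; the class name and threshold are made up for illustration), a bounded, dropping implementation matching the IQueue<T> abstraction above might look like this:

```csharp
using System.Collections.Generic;

// Mirrors the IQueue<T> abstraction above, but drops new messages
// once the backlog reaches a fixed threshold, so memory stays bounded.
public class BoundedDroppingQueue<T>
{
    private readonly Queue<T> inner = new Queue<T>();
    private readonly int threshold;

    public BoundedDroppingQueue(int threshold)
    {
        this.threshold = threshold;
    }

    public void Enqueue(T o)
    {
        // Drop the message instead of letting the queue grow without bound.
        if (inner.Count >= threshold)
            return;
        inner.Enqueue(o);
    }

    public T Dequeue()
    {
        return inner.Dequeue();
    }

    public bool IsEmpty
    {
        get { return inner.Count == 0; }
    }
}
```

The memory problem goes away, but so do the dropped messages, which is exactly the interrupted-stream situation described above.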

Another alternative would be to write everything to disk, so we don't have memory pressure and can handle much larger queue sizes. The problem, of course, is that this requires something very subtle: T must now be serializable, and not just T, but everything that T references.
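To make that subtlety concrete, here is a rough sketch of what a disk-backed queue could look like (again, not NH Prof's code; BinaryFormatter and the temp-file scheme are purely illustrative). The Serialize call will throw the moment T, or anything in the object graph T references, is not serializable:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Each message is serialized to its own temp file; only the file
// paths stay in memory. T and its whole graph must be [Serializable].
public class DiskBackedQueue<T>
{
    private readonly Queue<string> files = new Queue<string>();
    private readonly BinaryFormatter formatter = new BinaryFormatter();

    public void Enqueue(T o)
    {
        string path = Path.GetTempFileName();
        using (var stream = File.OpenWrite(path))
            formatter.Serialize(stream, o); // throws if any referenced object is not serializable
        files.Enqueue(path);
    }

    public T Dequeue()
    {
        string path = files.Dequeue();
        try
        {
            using (var stream = File.OpenRead(path))
                return (T)formatter.Deserialize(stream);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public bool IsEmpty
    {
        get { return files.Count == 0; }
    }
}
```

The queue itself is trivial; the hard part is the constraint it pushes onto every type that flows through it.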

Oh, Joy!

This is one of the cases where just providing the abstraction is not going to be enough; providing an alternative implementation means having to touch a lot of other code as well.

Comments

Tuna Toksoz
09/20/2009 05:33 AM

Use an object database :)

Richard Dingwall
09/20/2009 08:42 AM

"not a realistic scenario for NH Prof" <-- I think you overestimate your customers.

I can think of at least half a dozen pages in one web application I work on that take anything from 70-700 SQL/cache requests per hit (30-40 mapped classes, 500 tables, 30GB database). During this time NH Prof frequently becomes unresponsive, and often remains busy for a few secs after the session ended.

We know our code is not the best -- using domain models for building a report, automapper resolvers getting more details per item, recursive trees, leaning far too much on the cache etc. Even after lots of fetching/joins/caching tuning there is still lots of SELECT N+1.

So unfortunately overloading NH Prof is a very realistic scenario for us.

Rafal
09/20/2009 09:40 AM

Maybe you should add an option of offline profiling - some small component would write all the trace information to a log and NH Prof would then be used to analyze that log? Live profiling is a problem in production environment - if you have memory/performance problems and want to analyze that with a profiler, the profiler will add more load to the system and seriously worsen the situation.

Frank Quednau
09/20/2009 09:46 AM

My question would be...what questions regarding NH usage can NH Prof answer in a heavy load scenario that couldn't be answered when running the app under less heavy load?

In such a case it might be OK to have NH Prof "degrade" to processing only messages of severe importance until it catches up again...

Of course this falls down again if the application is so bitchy that all messages are severe...

Ayende Rahien
09/20/2009 09:47 AM

Richard,

I am sorry, but we have different definitions of what sustained heavy load means. When I am talking about this, I am talking about 30 minutes or so of non-stop activity. That is rarely the case.

Anyway, I already have a branch where I am taking care of this, and I'll publish it sometime this week.

Ayende Rahien
09/20/2009 09:49 AM

Rafal,

InitializeOfflineProfiling() - it is there. :-)

Ayende Rahien
09/20/2009 09:50 AM

Frank,

The problem isn't with showing the information; the problem is in processing it fast enough.

Frank Quednau
09/20/2009 10:28 AM

I didn't think UI was the problem...so I gather that the queuing of messages is absolutely "dumb" in that all possible messages are gathered, while I thought that there might be some form of "pre-processing". I suppose that isn't really possible, though, since defining whether a message is "severe" or not probably involves quite a bit of knowledge (= processor time).

Otoh, how expensive is RAM these days? If you're profiling an app with such throughput I'd hope that people could spare a few dollars on a couple of GBs.

Ayende Rahien
09/20/2009 10:43 AM

Frank,

It is possible that this would lead to an Out of Memory Exception.

And in general, it is better not to try walking that line.

Kyle Szklenski
09/20/2009 04:11 PM

Hm, I wonder if you could do a meta-analysis over a given number of messages knowing that some messages have been dropped. For example, if your profiler could run, say, 10 times on the same system with approximately the same load, you could average together the results, in a sense, to guarantee a stable conclusion. This would probably require some kind of ability to drop pseudo-random messages though, as you wouldn't be able to rely on just dropping when it starts to get overloaded - if you tried that, then you could very well be missing the exact thing which is causing the overload.

Differently, you could define certain messages (and that which they are dependent on) to be knowingly serializable, then only serialize those with a marker saying where they show up in the queue. This would probably end up creating a scheduling problem over the queue, though, so it's most likely not worth it.

Thomas Krause
09/21/2009 02:07 PM

Instead of dropping messages when you reach a threshold... why not simply block the host application, so it has to wait until it can write the next message to the queue?

Granted, this would reduce the performance of the host application, but if I want to debug/trace my application I usually would want to get all messages, even if it means that my application may run a bit slower while being traced...

Mike Rettig
09/21/2009 03:07 PM

Can you gain efficiency through batching? For instance, are you updating the screen on every update? With a slow resource such as a UI, file, or socket, batching can give you better throughput by merging updates and limiting the number of slow calls required.

For Example:

public void OnBatch(List<Update> updates)
{
    ApplyAll(updates);
    UpdateScreen();
}

This way updates are efficiently throttled and the Queue doesn't fall far behind.

Of course, this is something that Retlang does for you.

http://code.google.com/p/retlang/

Mike

Ayende Rahien
09/21/2009 05:42 PM

Thomas,

One of the design goals is to have as little impact as possible on the profiled application.

Stopping the profiled application is not an option.

Ayende Rahien
09/21/2009 05:43 PM

Mike,

You seem to be missing the point. It isn't the time to update the screen that is meaningful. It is the time to process the messages.

I'll have a separate post about it, but let us just say that the same problem exists with no UI as well.

Comments have been closed on this topic.