Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

How to lead a convoy to safety

time to read 2 min | 310 words

I recently ran into a convoy situation in NH Prof. Under sustained heavy load (not a realistic scenario for NH Prof), something very annoying would happen.

Messages would stream in from the profiled application faster than NH Prof could process them.

The term that I use for this is Convoy. It is generally bad news. With NH Prof specifically, it meant that it would consume larger and larger amounts of memory, as messages waiting to be processed queued up faster than NH Prof could handle them.

NH Prof uses the following abstraction to handle queuing:

public interface IQueue<T>
{
    void Enqueue(T o);
    T Dequeue();
    bool IsEmpty { get; }
}

Now, there are a few things that we can do to avoid having a convoy. The simplest solution is to put some threshold on the queue and just start dropping messages if we reach it. NH Prof is actually designed to handle things like an interrupted message stream, but I don't think that this would be a nice thing to do.
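A minimal sketch of that threshold approach (in Java rather than the C# used above; the class and its names are mine, not NH Prof's): once the backlog hits a capacity limit, new messages are silently dropped instead of growing the queue without bound.

```java
import java.util.ArrayDeque;

// Hypothetical bounded queue that drops new messages once a
// threshold is reached, instead of growing without limit.
class DroppingQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int threshold;
    private int dropped = 0;

    DroppingQueue(int threshold) { this.threshold = threshold; }

    synchronized void enqueue(T item) {
        if (items.size() >= threshold) {
            dropped++;   // message is lost; the stream is now interrupted
            return;
        }
        items.add(item);
    }

    synchronized T dequeue() { return items.poll(); } // null when empty

    synchronized boolean isEmpty() { return items.isEmpty(); }

    synchronized int droppedCount() { return dropped; }
}
```

The drop counter is there because a consumer that knows the stream was interrupted can at least report it, rather than silently presenting an incomplete picture.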

Another alternative would be to write everything to disk, so we don't have memory pressure and can handle much larger queue sizes. The problem is, of course, that this requires something very subtle: T now must be serializable, and not just T, but everything that T references.
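To make that constraint concrete, here is a sketch of the enqueue side of a disk-backed queue (again in Java; the actual NH Prof code is C#, and this class is hypothetical). The type bound is the whole point: T must be Serializable, and serialization will still fail at runtime if anything a T instance references is not.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical disk-backed queue: items are serialized to a file on
// enqueue, so the backlog lives on disk instead of in memory.
// T must be Serializable, and so must everything T references,
// or writeObject throws at runtime.
class DiskQueue<T extends Serializable> {
    private final File file;

    DiskQueue(File file) { this.file = file; }

    void enqueue(T item) throws IOException {
        // Sketch only: a real implementation would need a proper
        // append-friendly format and a matching dequeue side.
        try (ObjectOutputStream out = new ObjectOutputStream(
                new FileOutputStream(file, true))) {
            out.writeObject(item);
        }
    }
}
```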

Oh, Joy!

This is one of the cases where just providing the abstraction is not going to be enough, providing an alternative implementation means having to touch a lot of other code as well.


Tuna Toksoz

Use an object database :)

Richard Dingwall

"not a realistic scenario for NH Prof" <-- I think you overestimate your customers.

I can think of at least half a dozen pages in one web application I work on that take anything from 70-700 SQL/cache requests per hit (30-40 mapped classes, 500 tables, 30GB database). During this time NH Prof frequently becomes unresponsive, and often remains busy for a few secs after the session ended.

We know our code is not the best -- using domain models for building a report, automapper resolvers getting more details per item, recursive trees, leaning far too much on the cache etc. Even after lots of fetching/joins/caching tuning there is still lots of SELECT N+1.

So unfortunately overloading NH Prof is a very realistic scenario for us.


Maybe you should add an option of offline profiling - some small component would write all the trace information to a log and NH Prof would then be used to analyze that log? Live profiling is a problem in production environment - if you have memory/performance problems and want to analyze that with a profiler, the profiler will add more load to the system and seriously worsen the situation.

Frank Quednau

My question would be...what questions regarding NH usage can NH Prof answer in a heavy load scenario that couldn't be answered when running the app under less heavy load?

In such a case it might be OK to have NH Prof "degrade" to processing only messages of severe importance until it catches up again...

Of course this falls down again if the application is so bitchy that all messages are severe...

Ayende Rahien


I am sorry, but we have different definitions for what sustained heavy load means. When I am talking about this I am talking about doing this for 30 minutes or so of non stop activity. That is rarely the case.

Anyway, I already have a branch where I am taking care of this, and I'll publish it sometime this week.

Ayende Rahien


InitializeOfflineProfiling() - it is there. :-)

Ayende Rahien


The problem isn't with showing the information; the problem is in processing it fast enough.

Frank Quednau

I didn't think UI was the problem...so I gather that the queuing of messages is absolutely "dumb" in that all possible messages are gathered, while I thought that there might be some form of "pre-processing". I suppose that isn't really possible, though, since defining whether a message is "severe" or not probably involves quite a bit of knowledge (= processor time).

Otoh, how expensive is RAM these days? If you're profiling an app with such throughput I'd hope that people could spare a few dollars on a couple of GBs.

Ayende Rahien


It is possible that this would lead to an Out Of Memory Exception

And in general it is better not to try walking that line

Kyle Szklenski

Hm, I wonder if you could do a meta-analysis over a given number of messages knowing that some messages have been dropped. For example, if your profiler could run, say, 10 times on the same system with approximately the same load, you could average together the results, in a sense, to guarantee a stable conclusion. This would probably require some kind of ability to drop pseudo-random messages though, as you wouldn't be able to rely on just dropping when it starts to get overloaded - if you tried that, then you could very well be missing the exact thing which is causing the overload.

Differently, you could define certain messages (and that which they are dependent on) to be knowingly serializable, then only serialize those with a marker saying where they show up in the queue. This would probably end up creating a scheduling problem over the queue, though, so it's most likely not worth it.

Thomas Krause

Instead of dropping messages when you reach a threshold... why not simply block the host application, so it has to wait until it can write the next message to the queue?

Granted, this would reduce the performance of the host application, but if I want to debug/trace my application I usually would want to get all messages, even if it means that my application may run a bit slower while being traced...
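The "block the host application" idea is essentially a bounded blocking queue; in Java the standard library provides one directly (a sketch of the concept, not NH Prof's actual code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of backpressure via a bounded blocking queue: put() blocks
// the producer whenever the queue is full, so the consumer can catch up.
class BackpressureDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        queue.put("message");       // blocks when the queue is full
        String msg = queue.take();  // blocks when the queue is empty
        System.out.println(msg);
    }
}
```

The trade-off is exactly the one raised here: the profiler never falls behind, but it does so by slowing down the application being profiled.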

Mike Rettig

Can you gain efficiency through batching? For instance, are you updating the screen on every update? With a slow resource such as a UI, file, or socket, batching can give you better throughput by merging updates and limiting the number of slow calls required.

For Example:

public void OnBatch(List<Update> updates)
{
    // merge the pending updates and make a single slow call
}

This way updates are efficiently throttled and the queue doesn't fall far behind.

Of course, this is something that Retlang does for you.
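The batching idea above can be sketched like this (in Java; Retlang itself is a .NET library, and this helper is mine): block for the first message, then drain everything else that has piled up and hand the whole batch to the consumer in one call, so the slow downstream work is paid once per batch rather than once per message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical batching consumer: wait for at least one message,
// then grab whatever else is already queued without blocking.
class BatchingConsumer {
    static <T> List<T> nextBatch(BlockingQueue<T> queue)
            throws InterruptedException {
        List<T> batch = new ArrayList<>();
        batch.add(queue.take());  // block until one message arrives
        queue.drainTo(batch);     // drain the rest without blocking
        return batch;
    }
}
```

Under light load each batch holds one message and behaves like a plain queue; under heavy load batches grow automatically, which is where the throughput win comes from.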



Ayende Rahien


One of the design goals is to have as little impact as possible on the profiled application.

Stopping the profiled application is not an option.

Ayende Rahien


You seem to be missing the point. It isn't the time to update the screen that is meaningful. It is the time to process the messages.

I'll have a separate post about it, but let us just say that the same problem exists with no UI as well


Comments have been closed on this topic.
