Ayende @ Rahien

It's a girl

The cost of abstraction - Part III

Here is another thing to note, computers are fast. I can't tell you how fast because it would take too long. Thinking about micro performance is a losing proposition. Sasha has asked an important question:

Now assume you have a non-functional requirement saying that you must support 1,000,000 messages per second inserted into this queue. Would you still disregard the fact using an interface is a BAD decision?

My answer, yes. The reason for that? Let us take a look at the slowest method I could think of to do in process queue:

public static void Main(string[] args)
{
	Queue<string> queue = new Queue<string>();
	ThreadPool.QueueUserWorkItem(delegate(object state)
	{
		Stopwatch startNew = Stopwatch.StartNew();
		for (int i = 0; i < 100000000; i++)
		{
			lock(queue)
			{
				bool queueEmpty = queue.Count == 0;
				queue.Enqueue("test");
				if(queueEmpty)
					Monitor.Pulse(queue);
			}
		}
		Console.WriteLine("Done publishing in: " + startNew.ElapsedMilliseconds);
	});
	ThreadPool.QueueUserWorkItem(delegate(object state)
	{
		Stopwatch startNew = Stopwatch.StartNew();
		for (int i = 0; i < 100000000; i++)
		{
			lock (queue)
			{
				while (queue.Count == 0) 
					Monitor.Wait(queue);
                                queue.Dequeue();
			}
		}
		Console.WriteLine("Done reading in: " + startNew.ElapsedMilliseconds);
	});
	Console.ReadLine();
}

On my machine, this outputs:

Done publishing in: 23044 (ms)
Done reading in: 26866 (ms)

Which means that it processed one hundred million items in just 26 seconds or so.

This also puts us at close to 4 million messages per second, using the most trivial and slow performing approach possible.

In fact, using a LockFreeQueue, we get significantly worse performance (3 times as slow!). I am not quite sure why, but I don't care. Pure pumping of messages is quick & easy, and scaling to millions of messages a second is trivial.

Yes, dispatching the messages is costly, I'll admit. Changing the reader thread to use:

instance.Handle(queue.Dequeue());

Has dropped the performance to a measly 2 million messages a second. Changing the message handler to use late bound semantics, like this:

IHandler instance = Activator.CreateInstance<Handler>();
instance.Handle(queue.Dequeue());

Would drop the performance down to 300,000 per second. So it is far more interesting to see what you are doing with the message the the actual dispatching.

And yes, here, in the dispatching code, I would probably do additional work. Using GetUninitializedObject() and cached constructor I was able to raise that to 350,000 messages per second.

And yes, this is a flawed benchmark, you need to test this with multiple readers & writers to get a more real understanding what what would actually happen. But initial results are very promising, I think.

And again, computers are fast, don't look for microseconds.

Comments

Mendelt Siebenga
04/13/2008 09:13 AM by
Mendelt Siebenga

In my experience when you encounter more complicated examples than this the hardest part of making code more performant is understanding the code.

The slight performance penalty you get making the code more flexible will probably pay itself back as soon as you need to resolve a real performance bottleneck.

Mihailo
04/17/2008 05:32 PM by
Mihailo

Hi,

Just wanted to contribute to this interesting conversation. I'm from the "don't optimize prematurely" school of thought.

Premature optimization is waste of time - making right decisions upfront is not waste of time and should be practiced. Now it is not a rocket science to make good choices upfront, just follow widely adopted and well known good practices and you're good.

Someone said in earlier posts that he decides to use "optimized" algorithm upfront, seems as the safest example to hook on to. On contrary I tend to choose simpler and easier to understand then the state of the art rocket speed algorithm if it is more complex.

First of all if you make part of your code twice as fast but it is contributing with less then 1% in overall execution time figure yourself if you wasted weeks of effort (count in maintenance here, which we often forget). And if you followed all the best practices then you can easily swap algorithm if it becomes a bottle neck, right?

At the end, one thing easily forgotten is proficiency of your fellow developers. Not everyone is able to understand your highly optimized code. I would like to use in my work new languages which are more fit to purpose then what we use today, but it would be only me who knows how to use them: so I learned about this problem first hand, and it isn't to be left aside.

I'd have lots more to write on the subject but I'd say understandability before optimization is my choice.

Cheers,

Mihailo

Julien
04/22/2008 07:02 AM by
Julien

Following your blog post, I recently had the same kind of conversation with a co-worker about the readonly keyword and its performance impact. I had to made some micro-benchmarking myself to get numbers, but basically the conclusions are exactly the same as yours: don't use that kind of techniques to speed code execution! More infos on http://www.thedotnetfrog.com/2008/04/22/performance-impact-of-the-readonly-keyword/

Ayende Rahien
04/22/2008 08:44 AM by
Ayende Rahien

Julien,

In the readonly keyword case, it has meaning beyond the perf scenario, which are importnat

Greg Young
04/28/2008 05:36 PM by
Greg Young

A lock-free algorithm is slower than an uncontended monitor (uncontended monitors are really fast)?

that doesn't surprise me at all ... try it with 8 threads running on a dual quad core and you should see a difference.

generally lock-free structures will actually perform worse in situations where you run on a single processor.

Cheers,

Greg

Comments have been closed on this topic.