Ayende @ Rahien

Challenge: What killed the application?

I have been doing a lot of heavy performance testing on Raven, and I have run into a lot of very strange scenarios. I found a lot of interesting stuff (a runaway cache causing an OutOfMemoryException, unnecessary re-parsing, etc.). But one thing that I wasn't able to resolve was a concurrency issue.

In particular, Raven would slow down and crash under load. I scoured the code, trying to figure out what was going on, but I couldn’t figure it out. It seemed that after several minutes of executing, request times would grow longer and longer, until finally the server would start raising errors on most requests.

I am ashamed to say that it took me a while to figure out what was actually going on. Can you figure it out?

Here is the client code:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in single-threaded mode was 100 – 200 ms.
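For context, PostTo is not shown in the post; a minimal sketch of what such a helper might look like (the method name and endpoint come from the snippet above, everything else is an assumption) is:

```csharp
using System.IO;
using System.Net;

// Hypothetical sketch of the PostTo helper used above - the real
// implementation is not shown in the post. It issues a plain,
// synchronous HTTP POST with the file's contents as the body.
static void PostTo(string url, string file)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "POST";
    request.ContentType = "application/json";

    using (var requestStream = request.GetRequestStream())
    using (var fileStream = File.OpenRead(file))
    {
        fileStream.CopyTo(requestStream); // blocks until the body is sent
    }

    using (var response = (HttpWebResponse)request.GetResponse())
    {
        // Each call holds a connection (and a thread) until the
        // server replies - which matters under Parallel.ForEach.
    }
}
```

The key property of any such implementation is that each call blocks a thread for the full round trip.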

That should be enough information to figure out what is going on.

Why did the application crash?

Posted By: Ayende Rahien

Comments

04/28/2010 09:20 AM by Peter Ibbotson

Wild guess is that it ran out of IP source port numbers?

04/28/2010 09:23 AM by LS

Directory.GetFiles("Docs", "*.json") should be Directory.EnumerateFiles("Docs", "*.json") if you want to be parallel.

04/28/2010 10:09 AM by Ayende Rahien

Henning,

No, there is no association between the two.

04/28/2010 10:10 AM by Ayende Rahien

Peter,

No, we haven't hit that. But I have run into this before.

It usually only pops up with HTTPS or authenticated connections, though.

LS,

Actually, no, we parallelize the action, not the enumeration, but thanks for letting me know about the new API.

04/28/2010 10:25 AM by Henning Anderssen

Your test client is sending more requests than the server can handle; maybe you're using some sort of queue on the server which overflows.

Wild guessing from my side.

04/28/2010 10:27 AM by DK

Is it because the directory contains too many files?

04/28/2010 10:32 AM by Frank Quednau

Depending on how your test is set up, could it be that Parallel.ForEach and Raven DB are getting worker threads from the same thread pool?

04/28/2010 10:40 AM by manningj

Did it hit OOM because the server was buffering all the POSTed files? It has to get the whole request (including file contents) into memory before passing it along, AFAIK.

04/28/2010 11:09 AM by Rafal

What about the underlying database - maybe it had some concurrency problems (deadlocks, transaction timeouts), or it ran out of pooled connections?

04/28/2010 11:10 AM by Tim van der Weijde

A wild guess: does the Directory.GetFiles() method return a non-generic collection instead of a generic one? If so, you should cast it.

04/28/2010 11:15 AM by Paul

It effectively DoS'd the server by uploading too many files at one time (there were more parallel threads going on the client than the server could accept, so they started to time out).

04/28/2010 11:21 AM by Richard Dingwall

90,000 files @ 100-200 ms each, no limit on the degree of parallelism - lemme guess, you had around 8,000 threads active, each with a 1 MB stack allocated, and hit OOM?

04/28/2010 11:46 AM by Barry

Was it getting the same set of files?

04/28/2010 11:49 AM by Dan Finucane

Unless you modify the registry to increase the limit, WININET makes at most two distinct connections to the same remote host, so you are only going to benefit from two threads. The other threads are going to block waiting for one of the two connections, and if you have more than two processors in your system you are going to spin up more and more threads out of the .NET thread pool, all of them blocking and each taking up 1-2 MB of virtual address space.

04/28/2010 12:12 PM by tobi

The thread pool was spawning more and more threads (the default max is 250) because, from its perspective, the work was I/O bound (waiting on the POSTs). It tries to saturate the CPU by spawning more threads.
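If the thread pool growth described here is the culprit, the usual client-side mitigation is to cap the parallelism explicitly rather than let Parallel.ForEach decide. A sketch, reusing the client code from the post (the cap of 8 is an arbitrary choice, not a recommendation):

```csharp
using System.IO;
using System.Threading.Tasks;

// Sketch: capping the degree of parallelism so the thread pool
// cannot keep injecting new threads while requests block on I/O.
Parallel.ForEach(
    Directory.GetFiles("Docs", "*.json"),
    new ParallelOptions { MaxDegreeOfParallelism = 8 }, // arbitrary cap
    file => PostTo("http://localhost:9090/bulk_docs", file));
```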

04/28/2010 12:32 PM by Shaun

Is PostTo doing an async post? I can't imagine how Parallel.ForEach would be bogging down the server, since it limits the number of parallel tasks to the number of cores you have. So if you are doing synchronous POST requests, it is only going to be posting 2-4 requests at a time, which is obviously not a lot.

04/28/2010 12:35 PM by jonnii

Is it something to do with the fact that you're posting to the same uri over and over again?

I can imagine a scenario where at some point you decide to persist the documents by recursively walking the documents to be written, and because there are so many you end up blowing the stack somehow.

04/28/2010 01:35 PM by Dag

HttpWebRequest.KeepAlive was set to its default "true" value?

04/28/2010 02:21 PM by tobi

I am impressed by how many creative solutions have been posted. By coincidence, I faced the same issue five minutes ago. It was the thread pool. Breaking in the debugger and executing ThreadPool.SetMaxThreads in the immediate window helped, so I did not have to restart my long-running batch job.

04/28/2010 05:08 PM by Dan Finucane

The .NET thread pool does not create a thread unless there is a processor/core on your system that is doing nothing; if there are no processors available, the thread pool puts your request in a queue. You shouldn't need to mess with ThreadPool.SetMaxThreads. The problem is that a thread is created and it blocks immediately when WININET already has two connections to a given host. When it blocks, the processor it was running on is freed, and the thread pool takes a request out of its queue and schedules another thread. You end up with all these threads blocked, each taking up 1-2 MB of virtual address space, all waiting for the same WININET resource to become available.
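For what it's worth, in managed code the two-connections-per-host default is surfaced through ServicePointManager rather than WININET (HttpWebRequest does its own connection pooling and does not go through WININET). If that limit were the bottleneck, raising it on the client would look roughly like this (the value 96 is an arbitrary illustration):

```csharp
using System.Net;

// Sketch: raising the per-host connection limit, which defaults
// to 2 for client (non-ASP.NET) applications. HttpWebRequest
// enforces this via ServicePointManager, not WININET.
ServicePointManager.DefaultConnectionLimit = 96;
```

This must be set before the first request to a given host, since the limit is captured when the ServicePoint for that host is created.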

04/28/2010 08:59 PM by Felix

I don't know if Parallel.ForEach uses some sort of I/O completion port, but if so, I would think that the time spent blocking on the socket reply (the HTTP request) gets used to open other file handles, and you eventually run out of the maximum available file handles. If I remember well, the file handle count is capped at a not-so-large number, in order to prevent buggy/malicious software from harming the system.

04/29/2010 07:47 AM by Derek Fowler

Are you enumerating the entire contents of bulk_docs for every request to check your filename is unique?

04/30/2010 12:20 PM by Mark

Because you dumped 90,000 tasks into the Parallel Framework task scheduler?

04/30/2010 01:03 PM by Ayende Rahien

Actually, it handled that really nicely.

Comments have been closed on this topic.