Unbounded concurrency
Yesterday I talked about a problem that kept appearing, my application getting slower and slower and eventually dying. The code that I was testing it with was:
Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file => { PostTo("http://localhost:9090/bulk_docs", file); });
The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in a single threaded mode was 100 – 200 ms.
The problem was, at its root, an unbounded concurrency issue. Here is what happened.
Handling a single request is pretty fast, but it is still going to block for about a 100 ms. The Parallel ForEach implementation is getting notified about the blocked thread, and in order to increase parallelism, it is going to spin another thread and give it the next task. That thread is going to block as well, and after a while, another thread will be created, etc.
This code is big enough, and the response times are built just so to encourage the thread pool to create more and more threads. The server, at the same time, is having to deal with more and more work, causing request times to increase, which causes the client to create more threads, which causes the server to become more and more distressed, until eventually it fails and requests starts to error.
The fun part is that if you reduce the load on the server, it would resume operating normally. But the errors that it gave in the meanwhile was… cryptic. There is a reason why most servers limit their concurrency level, there are always enough clients to saturate a server, and it is better to place a client on a request queue or send him “busy now, come back later” error than to crash because of resource allocation issues somewhere in the middle of processing a request.
Comments
perhaps erlang or node.js is more suited than the c# threading model for massive concurrency
@junior programmer
This is not a problem that a node.js or an erlang solution is going to solve. The problem here is that the request itself is an expensive operation and as ayende has suggested there will always be a celling on the number of requests it can process at the same time. ASP.NET already has a 'request queue' behind the scenes which improves the situation but this has a limit as well and there is so many requests you can queue before it will to start throwing Timeout exceptions.
The easiest solution to develop is to throttle the requests on the client and perform them in batches. However I believe the most ideal solution in the long term is to convert the requests into async operations (if not already done so) which will remove the time-coupling on the client. This will allow you to store the requests into a persistent message quene which will allow you to process the request offline and at a load that the server can handle efficiently.
Hi,
as far as I unterstand Parallel.For (or ForEach in this case), each single iteration could be executed in parallel to each other iteration. That's why you could end up with this huge chunk of blocking threads.
Maybe using a Partitioner could help. In that case you would not support executing every iteration in parallel, but only batches of for e.g. 100.
I believe .Net 4.0 has some support for creating partions in Parallel.ForEach ... look at msdn.microsoft.com/en-us/library/dd560853.aspx
No need for a custom partitioner.
Just set ParallelOption.MaxDegreeOfParallelism to some reasonable value.
Parallel.ForEach(Directory.GetFiles("Docs","*.json"), new ParallelOptions { MaxDegreeOfParallelism = 10 }, file => ...);
You optimize one place to find the new bottleneck!
The Task Parallel Library tries to keep your CPU busy, at the expense of the overall performance.
I do wonder what the upper limit is for CouchDB with Erlang. Due to Erlang's concurrency model it can handle thousands of it's internal processes. Of course behind the scenes it is actually managing the queues, scheduling, and OS threads that make everything run.
I remembered reading about some Erlang webserver that didn't die until 80,000 concurrent connections while Apache died around 4,000. Of course questions of that webserver's sophistication and performance at those levels certainly comes up, but I believe it still demonstrates the abilities of Erlang.
Regardless I'm very excited about RavenDB and yes each request has a fair amount of overhead to it. It sounds like the problem is really just needing to find the maximum number of threads that can be handled concurrently. Even though Erlang is running thousands of "processes" it still limits the number of concurrent OS threads it's using. I would assume this is based on system specs when the runtime starts up.
Copenhas,
Couch has exactly one write thread.
The issue isn't with thread management, the issue is with any locks that you have to deal with.
It sounds like you should be using the I/O Thread Pool threading - check out blogs.msdn.com/.../9990424.aspx for a drop-in scheduler that will work with PFX.
CouchDB share the same number of threads that other high performance servers like Redis and Nginx has: 1. In this way, the way they achieve concurrency is just by being really really fast.
For maximum perf and scalability this is the correct number to have, although with the advent of multi-core computing the meaning has now changed to:
"The correct number of threads is, of course, equal to the number of cores on the box" -- The Moth
www.danielmoth.com/.../...arallel-Programming.aspx
Achieving this in say Redis is easy just run a new instance on a different port.
Although as @Anders Lybecker implies this just lets you achieve CPU efficiency, this does not solve the problem if the actual request itself is expensive. Maybe the request needs to write a lot of data to disk in which case IO now becomes your bottleneck and your new hard limit will be reached. Which is why I would suggest either partitioning / load balancing or to defer execution.
Yeah that would make sense that CouchDB has a single write thread. I'll have to take a look to see how CouchDB really takes advantage of Erlang. I suppose when updating indexes or on reads. Hmm..but is that a write thread in Erlang's sense or does CouchDB actually create and manage an OS level thread for writing... meh, this is getting off the point of the post.
Demis,
The 1 request handling works nicely if you don't do IO.
As you noted, if you do need it, you have a hard limit on what you can do. Even if you use async IO, it is pretty easy to get into situations where you can handle requests fast enough but you blow out the async IO queues.
Copenhas,
Index updates in Couch are blocking, only one thread can do it at a single time.
Couch concurrency really shines for reads, for writes, it is limited to a single thread for documents and one thread for each index (on first read)
Rahien, thanks for the info.
It seems to me that the PostTo operation on the client is IO bound* and not CPU bound. Therefore I think you may be better using the APM model instead of allocating threads to each task. From what I understand the APM model will leverage the Completion Port technology and therefore not steal so many threads....this leaves the threads available for the server (localhost) to the actually process the requests.
*I imagine the client code would spend most of its CPU time blocked while reading from disk and then blocked again while performing the network IO/waiting for a response.
I ran into a very similar issue the other day despite setting MaxDegreeOfParallelism to a reasonable number. I wonder if there is something else at play here?
@James, what value did you use for MaxDegreeOfParallelism?
If you want to see how fast RavenDB can take documents then this test won't do that well. You want your client to do at most 2 posts concurrently perhaps using MaxDegreeOfParallelism as suggested by James. 2 is the default max number of connections to one remote host that WININET will allow. When you allow more than the max WININET limit all the threads above 2 will block as soon as they try to connect.
In order to get a real stress test of RavenDB you will need to increase the WININET connection limit to match the number of cores in your system or you will need multiple client machines all doing 2 at a time. I think I would go for the multiple machine approach and use 4 or 5 all running two threads concurrently posting documents.
Dan,
I removed the 2 connection limit, as mentioned in the post
The thing is your client is working as hard as it can but your server may not be. Even if you remove that limit you would want to set the MaxDegreeOfParallelism to approximate the number of cores on your system. Better yet use ASYNC IO instead of Parallel.ForEach avoid all the blocking and use fewer threads. I use Jeffrey Richter's AsyncEnumerator for this and it's great.
I was under the impression that the TPL was for taking advantage of multiple cores for processor-intense operations. The processor is not your limiting factor here, it's the inherent latency in the PostTo operation which requires thread management. I would thing you'd be better off with something like ThreadPool to do that.
Comment preview