Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 18 | Comments: 63

filter by tags archive

ChallengeWhat killed the application?

time to read 2 min | 267 words

I have been doing a lot of heavy performance testing on Raven, and I run into a lot of very strange scenarios. I found a lot of interesting stuff (runaway cache causing OutOfMemoryException, unnecessary re-parsing, etc). But one thing that I wasn’t able to resolve was the concurrency issue.

In particular, Raven would slow down and crash under load. I scoured the code, trying to figure out what was going on, but I couldn’t figure it out. It seemed that after several minutes of executing, request times would grow longer and longer, until finally the server would start raising errors on most requests.

I am ashamed to say that it took me a while to figure out what was actually going on. Can you figure it out?

Here is the client code:

Parallel.ForEach(Directory.GetFiles("Docs","*.json"), file =>
{
    PostTo("http://localhost:9090/bulk_docs", file);
});

The Docs directory contains about 90,000 files, and there is no concurrent connection limit. Average processing time for each request when running in a single threaded mode was 100 – 200 ms.

That should be enough information to figure out what is going on.

Why did the application crash?

More posts in "Challenge" series:

  1. (28 Apr 2015) What is the meaning of this change?
  2. (26 Sep 2013) Spot the bug
  3. (27 May 2013) The problem of locking down tasks…
  4. (17 Oct 2011) Minimum number of round trips
  5. (23 Aug 2011) Recent Comments with Future Posts
  6. (02 Aug 2011) Modifying execution approaches
  7. (29 Apr 2011) Stop the leaks
  8. (23 Dec 2010) This code should never hit production
  9. (17 Dec 2010) Your own ThreadLocal
  10. (03 Dec 2010) Querying relative information with RavenDB
  11. (29 Jun 2010) Find the bug
  12. (23 Jun 2010) Dynamically dynamic
  13. (28 Apr 2010) What killed the application?
  14. (19 Mar 2010) What does this code do?
  15. (04 Mar 2010) Robust enumeration over external code
  16. (16 Feb 2010) Premature optimization, and all of that…
  17. (12 Feb 2010) Efficient querying
  18. (10 Feb 2010) Find the resource leak
  19. (21 Oct 2009) Can you spot the bug?
  20. (18 Oct 2009) Why is this wrong?
  21. (17 Oct 2009) Write the check in comment
  22. (15 Sep 2009) NH Prof Exporting Reports
  23. (02 Sep 2009) The lazy loaded inheritance many to one association OR/M conundrum
  24. (01 Sep 2009) Why isn’t select broken?
  25. (06 Aug 2009) Find the bug fixes
  26. (26 May 2009) Find the bug
  27. (14 May 2009) multi threaded test failure
  28. (11 May 2009) The regex that doesn’t match
  29. (24 Mar 2009) probability based selection
  30. (13 Mar 2009) C# Rewriting
  31. (18 Feb 2009) write a self extracting program
  32. (04 Sep 2008) Don't stop with the first DSL abstraction
  33. (02 Aug 2008) What is the problem?
  34. (28 Jul 2008) What does this code do?
  35. (26 Jul 2008) Find the bug fix
  36. (05 Jul 2008) Find the deadlock
  37. (03 Jul 2008) Find the bug
  38. (02 Jul 2008) What is wrong with this code
  39. (05 Jun 2008) why did the tests fail?
  40. (27 May 2008) Striving for better syntax
  41. (13 Apr 2008) calling generics without the generic type
  42. (12 Apr 2008) The directory tree
  43. (24 Mar 2008) Find the version
  44. (21 Jan 2008) Strongly typing weakly typed code
  45. (28 Jun 2007) Windsor Null Object Dependency Facility

Comments

Peter Ibbotson

Wild guess is that it ran out of IP source port numbers?

LS
LS

Directory.GetFiles("Docs",".json") should be Directory.EnumerateFiles("Docs",".json") if you want to be Parallel.

Ayende Rahien

Henning,

No, there is no association between the two.

Ayende Rahien

Peter,

No, we haven't got that. But I run into this before.

It usually only pop up using HTTPS, or authenticated connections, though.

LS,

Actually, no, we parallelize the action, not the enumeration, but thanks for letting me know about the new API

Henning Anderssen

Your testclient is sending more requests than the server can handle, maybe you're using some sort of queue on the server which overflows.

Wild guessing from my side.

DK
DK

Is it because the directory contains too many files?

Frank Quednau

Depending how your test is set up, could it be that Parallel ForEach and Raven DB are getting worker threads from the same thread pool?

manningj

hit OOM because the server was buffering all the post'ed files? It's gotta get the whole request (including file contents) into memory before passing it along AFAIK

Rafal

What about the underlying database - maybe it had some concurrency problems - deadlocks, transaction timeouts, or run out of pooled connections?

Tim van der Weijde

A wild guess, doesthe Directory.GetFiles() method return a non-generic collection instead of a generic one? If so, you should cast it.

Paul

It effectively DoS'd the server by uploading too many files at the one time (there were more parallel threads going on the client than the server could accept, so they started to timeout).

Richard Dingwall

90,000 files @ 100-200ms each, no limit on the degrees of parallelization - lemme guess you had around 8,000 threads active, with 1MB stack allocated each, and hit OOM?

Barry

Was it getting the same set of files ..

Dan Finucane

Unless you modify the registry to increase the limit WININET makes at most two distinct connections to the same remote host so you are only going to benefit from two threads. The other threads are going to block waiting for one of the two connections and if you have more than two processors in your system you are going to spin up more and more threads out of the .NET thread pool all of them blocking and taking up 1-2mb of virtual address space.

tobi

The thread-pool was spawning more and more threads (max by default is 250) because from its perspective the work was IO bound (waiting on the posts). It tries to saturate the CPU by spawning more threads.

Shaun

Is PostTo doing an async post? I can't imagine how Parallel.ForEach would be bogging down the server since it limits the number of parallel tasks to the number of cores that you have. So if you are doing synchronous POST requests, it is only going to be posting 2-4 requests at a time, which is obviously not a lot.

jonnii

Is it something to do with the fact that you're posting to the same uri over and over again?

I can imagine a scenario where at some point you decide to persist the documents, by recursively walking the documents to be written and because there are so many you end up blowing the stack somehow.

Dag
Dag

HttpWebRequest.KeepAlive was set to its default "true" value?

tobi

I am impressed because many creative solutions have been posted. By coincidence I faced the same issue 5min ago. It was the threadpool. Breaking in the debugger and executing ThreadPool.SetMaxThreads in the immediate window helped so I did not have to restart my long-running batch job.

Dan Finucane

The .NET thread pool does not create a thread unless there is a processor/core on your system that is doing nothing. If there are no processors available the thread pool puts your request in a queue. You shouldn't use with ThreadPool.SetMaxThreads. The problem is that a thread is created and it blocks immediately when WININET already has two connections to a given host. When it blocks the processor it was running on is freed and the thread pool takes a request out of its queue and schedules a thread. You end up with all these threads blocked each taking up 1-2mb of virtual address space and they are all waiting for the same WININET resource to become available.

Felix

Don't know if the Paralle.ForEach uses some sort of I/O port completion, but if so, I would think than blocking time waiting for socket reply ( the http request ) will be used, and do other file handles open and evantually run out of maximum file handles available. If I remember well, file handles are forced to some not so large count, in order to prevent buggy/malicious software to arm the system

Derek Fowler

Are you enumerating the entire contents of bulk_docs for every request to check your filename is unique?

Mark

Because you dumped 90,000 tasks into the Parallel Framework task scheduler?

Ayende Rahien

Actually, it handled that really nicely.

Comment preview

Comments have been closed on this topic.

FUTURE POSTS

  1. RavenDB 3.0 New Stable Release - 15 hours from now
  2. Production postmortem: The case of the lying configuration file - about one day from now
  3. Production postmortem: The industry at large - 3 days from now
  4. The insidious cost of allocations - 4 days from now
  5. Buffer allocation strategies: A possible solution - 7 days from now

And 4 more posts are pending...

There are posts all the way to Sep 11, 2015

RECENT SERIES

  1. Find the bug (5):
    20 Apr 2011 - Why do I get a Null Reference Exception?
  2. Production postmortem (10):
    31 Aug 2015 - The case of the memory eater and high load
  3. What is new in RavenDB 3.5 (7):
    12 Aug 2015 - Monitoring support
  4. Career planning (6):
    24 Jul 2015 - The immortal choices aren't
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats