Production postmortem: The case of the infected cluster

architecture (618) rss
bugs (451) rss
challanges (123) rss
community (381) rss
databases (481) rss
design (896) rss
development (647) rss
hibernating-practices (72) rss
miscellaneous (592) rss
performance (397) rss
programming (1093) rss
raven (1459) rss
ravendb.net (545) rss
reviews (184) rss

2025
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Couchbase vs RavenDB Performance at Rakuten Kobo Whitepaper

Jul 09 2015

Production postmortemThe case of the infected cluster

time to read 8 min | 1592 words

RavenDB is a critical system for many of our clients, that means that we get our share of “the sky is falling” stories, and we have a team of people we train to be able to diagnose and troubleshoot issues in production. This series of post is about the kind of issues that we run into, how they were resolved, and what the underlying issue was.

The scenario, a customer had a total system meltdown because RavenDB would stop responding to requests. Restarting the server would result in everything working, but in about 10 minutes, they would run into the same issue.

Complicating matters was the fact that the customer was running in a high availability cluster, with two quite powerful nodes that had failover setup. However, the problem would typically impact a single server, eventually leading to it stopping processing requests, while the second server would be humming along beautifully. When the first server would stop, the second server would continue for a few more minutes, then die in the same manner.

The issue was clearly related to the load on those server. The website is pretty busy, with millions of queries to RavenDB per hour, but we had no reason to think that the system was over capacity. Why is that?

The system has been running around for over a month on the same configuration, with moderate memory consumption and low CPU usage throughout that time.
The last upgrade to the RavenDB servers happened about a week ago, and no new traffic patterns should cause that load.
The last RavenDB client upgrade happened that morning, but it was a minor change and should reduce load on the server. But the problem would show up within 10 – 20 minutes, and the client upgrade happened several hours before it started happening.
The operations team swore that there were no changes made to either server recently.

The behavior was puzzling because we didn’t have high CPU, high memory or paging. Typical indications that the system is under too much of a load. Furthermore, I/O rates were quite good, and testing the disk speed and throughput showed that everything is all right.

We tried to revert all recent changes, to both clients and servers. We monitored the servers to see what happens when the problem recurred, and eventually we figured out what is the probable issue. The affected server would slowly show a growth pattern in the number of threads being used. Slowly going to upward of 700 threads. The non impacted server would sit on around 100 threads or so, until the first server would fail, in which case all load would hit the okay server, and it would be fine, for a while, then the thread count would start climbing up, the request timing would increase, and eventually things would stop.

One thing that was very interesting was that the problem was that RavenDB would still be responding (albeit a bit badly) under those conditions, and there was no indication that there are any issues with it as far as it was concerned. But under this failure condition we did see that we start getting a rather strange error.

The server is too busy, could not acquire transactional access

This rather poorly worded message is an indication that we hit an internal limit inside RavenDB. By default, we have a limit of 512 concurrent requests inside the server. That is done to make sure that we aren’t swamping the server with too many requests that would take it down. The good news was that this is a configurable value, and we were able to set Raven/MaxConcurrentServerRequests to a higher value to get a little bit more breathing space. Once we have done that, the system would still become slower, but it would stabilize at some point, even though it would still be much slower than what we come to expect of it.

Intermediary conclusions were:

Something is causing requests to take a lot longer than they should.
Because they take longer than they should, we see a convoy of requests.

We can’t handle requests as fast as they come, so they start to pile up.
The thread pool (which dispatches requests) is handling this by slowly increasing the number of threads.
But we still aren’t able to catch up with the number of requests coming in.
At some point, the number of concurrent requests hits the limit, and we start rejecting some requests

We would see major slow downs because the thread pool would increase the number of threads in the pool only on a slow and measured pace.
We don’t have any idea what is causing us to slow down.

Remember, this is a system that has been in production for a while now. In fact, the current traffic numbers are actually only about one third of the peek that was hit shortly after the system was released, and we handled that load with no issue.

We came back to the drawing board, and started look at pretty much anything suspicious. Memory was reasonable, and we have few dozens GB still free. CPU utilization was moderate, with over half of the system capacity unused. I/O rates from the disk according to performance monitor was excellent. Everything was good. So we took a dump of the process and proceeded to try to figure out what is going on from that.

We saw something like this (spread over 150+ threads and in various modes, but this gets the point):

95% of the threads were stuck in various I/O related methods. Like NtCreateFile, NtCloseFile, etc. Other threads were hanging in .NET related code that was also doing file I/O. So we can pretty conclusively state that the problem is I/O related.

But what? We tested the disk I/O, it was good. We run I/O tests and they were fine. The measurements we got from the monitoring tools all said that the I/O is fine.

And yet we could see that I/O was the culprit. What the hell?!

At this point, we start suspecting that the issue was with something in the system. Now, each of the servers was capable to handling the entire load of the server (and then some) on its own. But both machine exhibit the same signs. That led us to believe that this isn’t some sort of hardware issue. It is something that both machine had.

A common trouble maker in this area is any kind of tool that hooks into sys calls. For example, anti virus software. In fact, that was one of the very first questions that we asked. Does this machine have anti virus on it? We got a negative answer early on. But after seeing those results, we went in and double checked. The machine did have an anti virus installed, and that is a huge red flag in those circumstances.

Anti virus software like to hook into all kind of system calls, and it is commonly associated with problems because of that. We asked the operations team to disable the anti virus on those servers to see if it would have an impact on the issue. Indeed, they stopped the anti virus and restarted the RavenDB servers, and then we waited. 10 minutes later, we saw no increase in the number of threads, and 25 minutes later, we were holding steady on the same performance that we came to expect.

Today, the customer sent us their monitoring system reports and there is a huge drop from 30% – 50% CPU usage to single digits on the day in question, and then it is flat graph for all the rest of the week.

Our hypothesis is that the anti virus either auto updated itself, or was actually updated by ops as part of routine maintenance task, and for some reason it has taken a liking to the kind of I/O RavenDB was generating. So it inspected us. And as it did so, it added overhead for each and every I/O call we made. And considering that we are a database, I/O is kinda of important for us. That additional overhead would cause us to respond much more slowly, which would mean that we would have additional request queue up, all the way until we start rejecting them. And we didn’t see those at the I/O metrics for the system because the overhead happen after the I/O is already completed.

I also suspect that this runs for all I/O operations, including things that should be buffered (therefor fast), so it has a big impact on our assumptions (this I/O call should be buffered, so it is fast, but it is checked by the AV, so it is not, leading to the convoy).

Because the Anti Virus was installed on both servers, they were both impacted, and the more load we put on them, the more the AV had to do, the more it slowed us down, the more requests we had queued, the bigger the trouble.

In other words, be very aware what is actually going on your system. As part of the post mortem analysis, we realized that we should have probably run fltmc.exe tool to check what are the file system filters that are installed and probably could have figured out that this is an Anti Virus issue much faster.

Tweet Share Share 14 comments

Tags:

Comments

09 Jul 2015
09:28 AM

tobi

What's your opinion about AV on servers in general? I'm asking because you must have collected quite some experience with that.

09 Jul 2015
10:42 AM

dariol

AV is for end users. End users doesn't work on servers.

09 Jul 2015
11:52 AM

Dominic Zukiewicz

For servers, do run AV! Ensure that there are known exclusions in place for the applications and services under operation.

For web servers, exclude any CDN resource directories, temporary caches & pagefile locations. If you site accepts uploads, this folder should be scanned on-demand. Then run the full scans over all folders and files out-of-hours to keep it in check.

For application servers, the binaries / static resources served for any web services are good place to exclude, pagefile locations. Again, run the full scans over all folders and files (included exclusions) out-of-hours to keep it in check.

For database servers, exclude the data directory of the Transaction Logs and Data Files including Temp DBs - or even the exact files themselves if at all possible. Again, run the full scans over all folders and files (included exclusions) out-of-hours to keep it in check.

Raise these exclusions and mitigations with your SecOps teams and I believe it would make a good balance of both worlds.

09 Jul 2015
13:22 PM

Miki

Was it NOD32 ? I've had similar incidents with my applications and had to prove to the client each time that it was their AV was killing the performance and not my app.

09 Jul 2015
15:16 PM

Oren Eini

Tobi, In general, I hate AV on servers. Dominic, below, seems to make a lot of sense with regards to how to setup and configure AV on servers, but just about any time that I have run into AV on servers it was part of some production issue that was really hard to narrow down.

09 Jul 2015
16:48 PM

Alois Kraus

Such things are pretty easy and fast to find if you use ETW since you can see all call stacks including the AV Software in the kernel and not just in your process. I know you are no fan of ETW because it is too complex to use. The price for this is that you will need more guessing what apart from yourself in your own process could be slowing you down.

09 Jul 2015
17:21 PM

Oren Eini

Alois, Given a production system, how do you use ETW to get stack traces? And can you use it for timings as well?

09 Jul 2015
18:05 PM

Alois Kraus

The biggest advantage of ETW are stack traces and exact timings. If the virus scanner will put your thread to a halt you can see it via the context switch events where you can see for every single WaitForSingleObject call where you block for anything how long it did block AND which thread did unblock you. With ReadyThread events you can even get the call stack who did release your thread (e.g. call ReleaseMutex, PulseEvent, ...). For starters you can use WPRUI which is part of the Windows Performance Toolkit will give you all of these Details with an easy to use GUI. Normally you want to fine tune the data collection depending on what you are after (anything, GC related issues, long term profiling, ...) which is complex but once you have it you can track down virtually anything. E.g. you can see network hickups (TCP Retransmission) by enabling the TCP ETW Diagnostics Provider where each retransmission event relates to a delay of ca. 400ms. Since that stuff is in place I can track down the hardest distributed problems at customer sites. I can tell you if it is the network, the client PC or the server which has an issue. By profiling both machines at once (https://etwcontroler.codeplex.com/) where you can combine it with WireShark to see all network stuff too you can record so much Information that you can tell for every hickup what the root cause is.

10 Jul 2015
07:11 AM

Oren Eini

Alois, Are there any good guides to get into this?

10 Jul 2015
07:27 AM

Steve Wagner

Wouldnt it a good idea to use the Antivirus integration in Windows API to detect that and give a fat warning when an Antivirus Software is installed?

10 Jul 2015
07:29 AM

Oren Eini

Steve, Sure, because ops love it when you tell them that you don't want to run with an AV installed. And just having an AV isn't bad, in fact, they had the AV all along, but for some mysterious reason it decided to do extra checks on us all of a sudden

10 Jul 2015
07:35 AM

Steve Wagner

That was not what I had in mind. I would raise a warning when an AV gets online that it can interrupt performance when not properly configured. Users can dismiss it, and everything is fine.

10 Jul 2015
07:43 AM

Oren Eini

Steve, That doesn't actually do anything for us. It is just a way for us to CYA.

What we'll do (and thanks for the suggestion) is use this to detect that in the debug information we send, so we can narrow down on this more easily

10 Jul 2015
16:59 PM

Alois Kraus

Sure here are good links: https://randomascii.wordpress.com/2012/05/11/the-lost-xperf-documentationcpu-scheduling/ Here is a .NET sample which is largely obsolete unless you want to Register the ETW Provider manually but it Shows a rudimentary script to collect the data by yourself. http://geekswithblogs.net/akraus1/archive/2013/06/02/153041.aspx And here a typical case study where an database did become very slow and how I could track it down to the root cause. http://geekswithblogs.net/akraus1/archive/2013/08/03/153594.aspx

Besides this there are plent y of tutorials on Defrag Tools about WPA and xperf. MS Virtual Academy has also a very good set of tutorials: http://www.microsoftvirtualacademy.com/training-courses/windows-performance-jump-start

One thing to note is that if you are running on Win7 (Server 2008 R2) x64 you should consider to use NGen to get correct call stacks. Dynamically generated (JITed) code will not give you good call stacks due to limitations of the Windows Kernel Stackwalker. If you do not have an excessive amount of JITed code in your process you should get pretty far with NGenned binaries. On Win8 all of these issues are fixed.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

Production postmortemThe case of the infected cluster

More posts in "Production postmortem" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

Related posts that you may find interesting:

More posts in "Production postmortem" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication