This was an interesting support call, about six months ago, if I recall correctly. It started with a customer calling because his database would sometimes stop responding for minutes (up to 30 minutes in some cases), and only when there was a high load on the system.
We started looking at this, and it was really strange. They could reliably reproduce this on their production system, but only there, and only under pretty high load. Simulating the same load on a non production machine would have no impact whatsoever. Tripling the load on a non production machine had no impact either.
We asked the customer to send us the debug logs, and we found an interesting tidbit of information in there. The issue always happened when the server was loading a database. The sequence of events went something like this:
- Lots of users on the site, using the (multiple) databases on the server.
- For some reason (lunch, meeting, whatever) there is a period of an hour or so when there is no activity on one part of the system, which meant that there are no requests for a particular database.
- RavenDB would unload the database and release all resources associated with it.
- After that period of time, and while the server is still in high load, requests would resume for that database. Usually in a big burst of requests for that database.
- RavenDB would start loading the database.
- All requests would start hanging.
- About 15 – 25 minutes later, RavenDB would start the database.
- All timing information in the logs indicated that there was no long operation.
- Everything return to normal.
To be honest, this looked a look like someone suspended the entire process somehow. CPU during this time was effectively zero, memory usage was fine and there was no I/O.
The quick workaround was to increase the database idle time to account for the breaks in activity. With the database no longer loading and unloading under load, there was no issue.
Now it was the time to try to figure out what the hell was going on. Going over the logs, we could see that there was a pretty big gaps in the initialization order of the database. To be exact, the initialization of the indexes took almost all the time. But why? Looking at each individual index, it took very little time to initialize, and even all of them together didn’t take a long time. But the time from starting the index initialization until the first index started to initialize was very high.
A few years ago we sped up the initialization times for large RavenDB databases by parallelizing the initialization of all indexes. That was the first clue. Something in this parallel initialization wasn’t working.
And there was also the load factor to consider. The problem only surfaced under load, and to be more precise, under load when we had multiple databases involved, but the issue affected the entire server.
When RavenDB get a request for an unloaded database, it starts loading it in a background thread and the request waits until the database is loaded.
Combine all of those together, and you might start seeing the answer.
RavenDB has an unloaded database, and we suddenly get a high number of requests for that database, all at the same time. RavenDB will start a background task loading the database, and wait until it is loaded to complete the request. However, because we have a lot of requests to this database, we have a lot of threads that are now waiting for it to complete.
Meanwhile, the database is loading in a background thread, and needs to initialize its indexes. It does this in parallel, using the thread pool…
The same thread pool that we serve requests from…
The same thread pool where we have a lot of threads that are currently waiting for the database to load…
The same database that is now waiting for tasks to complete on that busy thread pool…
If you had things going just right, you would get into a situation where the database couldn’t load, because all the thread pool threads it needed to start were busy waiting for the database to load.
Now, the thread pool would slowly increase the number of threads to take care of the amount of work it had. But those threads would usually be kept busy with more requests to the loading database.
Eventually, the index initialization would go through, the database would complete loading, all the database requests would complete, and the system would go back to normal. But in the meantime… ouch!
The temporary workaround, as I mentioned, was increasing the database idle time to prevent this situation from happening. The long term solution was to limit the amount of time that a thread would wait for a request to load, as well as limit the number of threads that could wait for a database to load. The first five requests for an unloaded database would wait up to ten seconds for that database to load, any additional requests would return with a service unavailable code immediately.
We also made sure that waiting for a database to load can be done in an asynchronous manner and in 3.5 we have a dedicated thread pool for internal RavenDB tasks. But I’ll have another post about that.
More posts in "Production postmortem" series:
- (24 Jul 2023) The dog ate my request
- (03 Jul 2023) ENOMEM when trying to free memory
- (27 Jan 2023) The server ate all my memory
- (23 Jan 2023) The big server that couldn’t handle the load
- (16 Jan 2023) The heisenbug server
- (03 Oct 2022) Do you trust this server?
- (15 Sep 2022) The missed indexing reference
- (05 Aug 2022) The allocating query
- (22 Jul 2022) Efficiency all the way to Out of Memory error
- (18 Jul 2022) Broken networks and compressed streams
- (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
- (12 Jul 2022) The data corruption in the node.js stack
- (11 Jul 2022) Out of memory on a clear sky
- (29 Apr 2022) Deduplicating replication speed
- (25 Apr 2022) The network latency and the I/O spikes
- (22 Apr 2022) The encrypted database that was too big to replicate
- (20 Apr 2022) Misleading security and other production snafus
- (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
- (13 Dec 2021) The memory leak that only happened on Linux
- (17 Sep 2021) The Guinness record for page faults & high CPU
- (07 Jan 2021) The file system limitation
- (23 Mar 2020) high CPU when there is little work to be done
- (21 Feb 2020) The self signed certificate that couldn’t
- (31 Jan 2020) The slow slowdown of large systems
- (07 Jun 2019) Printer out of paper and the RavenDB hang
- (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
- (25 Dec 2018) Handled errors and the curse of recursive error handling
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster