Production postmortem: The case of the slow index rebuild


A customer called to complain that the indexing times they were seeing on an index rebuild were very high, and that caused them issues. The customer was kind enough to actually provide us with a duplicate machine of their system, including duplicate data, which made the whole process so much easier. Unlike most scenarios, where we have to poke at the logs and the debug endpoints and try to figure out what is going on in a production system that we can’t really touch without causing downtime, here we had complete freedom of action during the investigation.

The database in question is many tens of GB in size, and like most production databases, it has its own… gravity, shall we say? Unlike a test data set, where you can run something over the entire set and get an immediate result, the problem here was that to reproduce the issue we had to start the action and then wait ten or twenty minutes for it to pick up steam and actually start exhibiting the problem. But being able to run those tests repeatedly was very valuable, both in narrowing down exactly what was going on and in figuring out how to resolve it.

The problem boiled down to an issue with how we were handling document prefetching. Before I get down into the details of that, let me explain what prefetching is.

Quite a lot of RavenDB code is concerned with reducing the time a request has to spend waiting for I/O. In particular, creating a new index requires us to read all the documents in the database so we can index them. On large databases, that can mean that we need to read tens of GB (and on very large databases, running an index that covers half a TB is very likely) from disk, index the documents, then write the index results to disk again.

Initially (as in, five or six years ago), we wrote the indexing code like so:

  • while (there are documents to index):
    • Load a batch of documents
    • Index those documents
    • Write them to disk
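
In rough Python-flavored pseudocode (the helper names here are illustrative stand-ins, not RavenDB’s actual internals), that loop looks something like this:

    # A minimal sketch of the original, strictly sequential indexing loop.
    # load_batch, index_documents and write_results stand in for the real code.
    last_indexed_etag = 0
    while True:
        batch = load_batch(after=last_indexed_etag)   # blocks on disk reads
        if not batch:
            break
        results = index_documents(batch)              # CPU-bound work
        write_results(results)                        # blocks on disk writes
        last_indexed_etag = batch[-1].etag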

The problem is that this kind of code is very simple and easy to understand, but it also results in spending a lot of time doing:

  • Wait to load documents (no CPU usage)
  • Index documents (CPU usage)
  • Wait to write to disk (no CPU usage)

So a lot of the time was spent just waiting for I/O. Time that could have been much better spent doing something useful.

Because of that, we introduced the idea of prefetching. Basically, whenever we finish loading stuff from disk, we also immediately start a background task that will read the next batch of documents to memory. The idea is that while we are indexing / writing the index results to disk, we’ll load the next batch of documents to memory, and we’ll have them immediately available to the indexing code. That way we have fewer waits, and we get the benefit of parallel I/O and execution.
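
A minimal sketch of that idea, assuming the same hypothetical helpers as above and a single background loader thread to overlap the next read with the current indexing work:

    from concurrent.futures import ThreadPoolExecutor

    def index_with_prefetch(load_batch, index_documents, write_results):
        # One background worker reads the next batch while we index the current one.
        with ThreadPoolExecutor(max_workers=1) as prefetcher:
            batch = load_batch(after=0)
            while batch:
                # Kick off the next read before we start the CPU-heavy work.
                next_batch = prefetcher.submit(load_batch, after=batch[-1].etag)
                results = index_documents(batch)      # overlaps with the read above
                write_results(results)
                batch = next_batch.result()           # ideally already in memory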

This is a really high level overview of what is going on there, of course, and we need to balance quite a few competing concerns (memory, I/O pipeline size, I/O speed, other work being done, CPU utilization, etc, etc). But that is a pretty good description.

The problem in this case was that the customer in question had the following pattern of documents:

[Image: the customer’s document size distribution, a run of small documents followed by much larger ones such as msgs/4]

Our code mostly assumes that you have a roughly uniform distribution of document sizes. Given the distribution above, assume we have a batch size of 2.

We’ll read the first two documents (taking 25Kb) and then start indexing them. At the same time, we start loading the next two documents. But the msgs/4 document is large, so it takes time to load, which means that indexing is now stalled on I/O.

To make things worse, the problem was exacerbated by the fact that the bigger documents tended to be toward the end (later documents tend to be bigger), which means that our heuristics about the data kept misleading us. On top of that, we actually do care about the size of the documents that we load, so instead of indexing the documents in big batches, those big documents would cause I/O stalls and then cause us to send much smaller batches to the indexes. That means that we have a lot more indexing batches, and a lot more I/O stalls.
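
To see how skewed sizes shrink the batches, here is a toy illustration (the numbers and the batching rule are made up for the example; this is not RavenDB’s actual batching logic): with a per-batch size budget, each large document ends up in a batch by itself, and the batch count balloons.

    def batches_by_size(doc_sizes_kb, budget_kb):
        # Group documents into batches without exceeding a size budget per batch.
        batch, used = [], 0
        for size in doc_sizes_kb:
            if batch and used + size > budget_kb:
                yield batch
                batch, used = [], 0
            batch.append(size)
            used += size
        if batch:
            yield batch

    sizes = [12, 13, 15, 4_000, 9, 4_500, 11, 5_000]   # mostly small, a few huge
    print([len(b) for b in batches_by_size(sizes, budget_kb=64)])
    # -> [3, 1, 1, 1, 1, 1]: the big documents force many tiny batches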

The solution was to allow the prefetching code to give the indexes “whatever I have on hand”, and then continue with prefetching the additional documents while the indexes are working. It means more batches, but far less time waiting for the documents to be loaded from disk.
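
A sketch of the “whatever I have on hand” behavior, assuming the prefetcher pushes loaded documents into a queue (again, illustrative names, not the actual RavenDB code): the indexing side takes everything that has already been loaded and never blocks waiting for the rest of the batch.

    import queue

    def take_whatever_is_ready(prefetch_queue):
        # Block only for the first document, then drain whatever else is ready.
        docs = [prefetch_queue.get()]
        while True:
            try:
                docs.append(prefetch_queue.get_nowait())
            except queue.Empty:
                break          # don't stall; the prefetcher keeps loading meanwhile
        return docs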

Another change we made was to parallelize the I/O further. When we notice that we get into this kind of situation, instead of firing off a single background task to load the next document batch, we are actually going to spin off multiple prefetching tasks to load the next few batches in parallel. That means that we put more load on the I/O system, but especially on cloud machines, that is actually a good thing (they tend to have a shallow but wide I/O behavior).
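
Conceptually, that looks something like the following sketch, where several loader tasks are started at once and their results are handed over in submission order (load_range and the skip/take style of paging are assumptions for the example, not the real API):

    from concurrent.futures import ThreadPoolExecutor

    def prefetch_in_parallel(load_range, start, batch_size, parallelism=4):
        # Fire off several reads at once; wide-but-shallow I/O suits cloud disks.
        with ThreadPoolExecutor(max_workers=parallelism) as pool:
            futures = [
                pool.submit(load_range, skip=start + i * batch_size, take=batch_size)
                for i in range(parallelism)
            ]
            for future in futures:            # hand batches over in submission order
                yield future.result()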

Here the ability to actually test those changes on a real system was invaluable, because our initial attempt was a bit… too aggressive and placed serious I/O strain on the system: we would try to make a lot of parallel reads for a lot of data at the same time. The implementation that we ended up with knows to scale the amount of pressure we put on the I/O system based on the actual system we run on, the (current) I/O throughput we see, the document sizes in recent history, etc.
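
The exact heuristics are internal to RavenDB, but the general shape is a feedback loop along these lines (a hand-wavy sketch with made-up thresholds, not the actual implementation):

    def adjust_parallelism(current, throughput_now, throughput_before,
                           max_parallelism=8):
        # Hill-climb on observed throughput: push harder while it helps,
        # back off when the extra parallel reads stop paying for themselves.
        if throughput_now > throughput_before * 1.1:
            return min(max_parallelism, current + 1)
        if throughput_now < throughput_before * 0.9:
            return max(1, current - 1)
        return current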

The end result is that we were able to shave about 20% – 25% off the indexing time under those conditions, and keep the system alive and functioning while doing so.

We also introduced the customer to side-by-side index deployment, which allows them to deploy indexes in production without any interruption in service while the index is rebuilding.
