Production postmortemThe encrypted database that was too big to replicate

time to read 3 min | 592 words

A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Typical reasons for replication to stall usually involve connectivity issues, but in this case, we could see that there was no such sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t be the case, however. If we have nothing to send, there shouldn’t be a batch sent over the wire. That was the only hint of something wrong.

We also looked into what information RavenDB could tell us about the system, and noticed that we have a performance hint about big documents. Some of them exceeded 32MB in size, which is… quite a lot.  That doesn’t really relate so much to replication, however. It would surely slow it down, but it should work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, that was… strange. Cross referencing the log statement with the source code revealed that RavenDB decided that it is sending too much in the batch and aborted it. But… it isn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted in both disk and memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold that in locked memory (so it wouldn’t be paged to disk). As a result of that, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate and send them to the other side.

There is a small caveat here, each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data, but the plain text. Of course, the communication itself is encrypted, so nothing can peek into the data in the middle.

By default, we’ll stop a replication batch in an encrypted database after we locked more than 64 MB of memory. A replication batch of 64 MB is plenty big enough, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check to close the replication batch very early in the process.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero length batch

Rinse and repeat, since we can’t make any forward progress.

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.

More posts in "Production postmortem" series:

  1. (12 Dec 2023) The Spawn of Denial of Service
  2. (24 Jul 2023) The dog ate my request
  3. (03 Jul 2023) ENOMEM when trying to free memory
  4. (27 Jan 2023) The server ate all my memory
  5. (23 Jan 2023) The big server that couldn’t handle the load
  6. (16 Jan 2023) The heisenbug server
  7. (03 Oct 2022) Do you trust this server?
  8. (15 Sep 2022) The missed indexing reference
  9. (05 Aug 2022) The allocating query
  10. (22 Jul 2022) Efficiency all the way to Out of Memory error
  11. (18 Jul 2022) Broken networks and compressed streams
  12. (13 Jul 2022) Your math is wrong, recursion doesn’t work this way
  13. (12 Jul 2022) The data corruption in the node.js stack
  14. (11 Jul 2022) Out of memory on a clear sky
  15. (29 Apr 2022) Deduplicating replication speed
  16. (25 Apr 2022) The network latency and the I/O spikes
  17. (22 Apr 2022) The encrypted database that was too big to replicate
  18. (20 Apr 2022) Misleading security and other production snafus
  19. (03 Jan 2022) An error on the first act will lead to data corruption on the second act…
  20. (13 Dec 2021) The memory leak that only happened on Linux
  21. (17 Sep 2021) The Guinness record for page faults & high CPU
  22. (07 Jan 2021) The file system limitation
  23. (23 Mar 2020) high CPU when there is little work to be done
  24. (21 Feb 2020) The self signed certificate that couldn’t
  25. (31 Jan 2020) The slow slowdown of large systems
  26. (07 Jun 2019) Printer out of paper and the RavenDB hang
  27. (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
  28. (25 Dec 2018) Handled errors and the curse of recursive error handling
  29. (23 Nov 2018) The ARM is killing me
  30. (22 Feb 2018) The unavailable Linux server
  31. (06 Dec 2017) data corruption, a view from INSIDE the sausage
  32. (01 Dec 2017) The random high CPU
  33. (07 Aug 2017) 30% boost with a single line change
  34. (04 Aug 2017) The case of 99.99% percentile
  35. (02 Aug 2017) The lightly loaded trashing server
  36. (23 Aug 2016) The insidious cost of managed memory
  37. (05 Feb 2016) A null reference in our abstraction
  38. (27 Jan 2016) The Razor Suicide
  39. (13 Nov 2015) The case of the “it is slow on that machine (only)”
  40. (21 Oct 2015) The case of the slow index rebuild
  41. (22 Sep 2015) The case of the Unicode Poo
  42. (03 Sep 2015) The industry at large
  43. (01 Sep 2015) The case of the lying configuration file
  44. (31 Aug 2015) The case of the memory eater and high load
  45. (14 Aug 2015) The case of the man in the middle
  46. (05 Aug 2015) Reading the errors
  47. (29 Jul 2015) The evil licensing code
  48. (23 Jul 2015) The case of the native memory leak
  49. (16 Jul 2015) The case of the intransigent new database
  50. (13 Jul 2015) The case of the hung over server
  51. (09 Jul 2015) The case of the infected cluster