Production postmortem: The encrypted database that was too big to replicate


A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Replication usually stalls because of connectivity issues, but in this case we saw no sign of that. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t happen: if we have nothing to send, no batch should go over the wire. That was the only hint that something was wrong.

We also looked into what RavenDB could tell us about the system, and noticed a performance hint about big documents. Some of them exceeded 32 MB in size, which is… quite a lot. That doesn’t have much to do with replication, however. Big documents would surely slow it down, but it should still work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, which was… strange. Cross-referencing the log statement with the source code revealed that RavenDB had decided that it was sending too much in the batch and aborted it. But… it wasn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted both on disk and in memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold the decrypted data in locked memory (so it won’t be paged to disk). As a result, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate, and send them to the other side.

There is a small caveat here: each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data but the plain text. Of course, the communication channel itself is encrypted, so nothing can peek at the data in transit.

By default, we’ll stop a replication batch in an encrypted database after we have locked more than 64 MB of memory. A replication batch of 64 MB is plenty big, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB in locked memory. And we put the check that closes the replication batch very early in the process.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero-length batch

Rinse and repeat, since we can’t make any forward progress.
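The loop above can be modeled in a few lines. This is only a sketch in Python, not RavenDB's actual code: the names `Doc`, `locked_bytes`, `build_batch_early_check` and `replicate` are made up for illustration, and the sizes are invented. The one thing it tries to capture is the ordering described above, where the locked-memory check runs right after a document is loaded and before it is added to the batch.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class Doc:
    doc_id: str
    locked_bytes: int  # hypothetical: memory locked while the document is decrypted


def build_batch_early_check(docs, start, max_locked_bytes):
    """Model of the problematic ordering: check the limit before adding the document."""
    batch = []
    locked = 0
    for doc in docs[start:]:
        locked += doc.locked_bytes  # loading the document locks memory
        if locked > max_locked_bytes:
            break  # close the batch; the document just loaded is NOT included
        batch.append(doc)
    return batch


def replicate(docs, max_locked_bytes, max_rounds=5):
    """Send batches until done or until max_rounds is reached (to avoid looping forever)."""
    sent = 0
    batch_sizes = []
    for _ in range(max_rounds):
        if sent >= len(docs):
            break
        batch = build_batch_early_check(docs, sent, max_locked_bytes)
        batch_sizes.append(len(batch))  # may be zero, and the batch is still "sent"
        sent += len(batch)
    return sent, batch_sizes


# Invented data: the second document alone needs more than the 64 MB limit.
docs = [Doc("a", 10 * MB), Doc("b", 70 * MB), Doc("c", 5 * MB)]
sent, batch_sizes = replicate(docs, 64 * MB)
print(sent, batch_sizes)
```

In this model the first batch carries one document, and every batch after that is empty, because the oversized document trips the check before it can ever be added.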

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.
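To see why raising the limit restores forward progress, here is a small Python model of the batching logic. It is a sketch under assumptions, not RavenDB's implementation: `rounds_to_replicate` and the document sizes are hypothetical, and the `max_size_to_send_mb` parameter merely stands in for the value of the configuration option.

```python
MB = 1024 * 1024


def rounds_to_replicate(locked_sizes, max_size_to_send_mb, max_rounds=100):
    """Return how many batches it takes to send everything, or None if we stall."""
    limit = max_size_to_send_mb * MB
    sent = 0
    rounds = 0
    while sent < len(locked_sizes):
        if rounds >= max_rounds:
            return None
        locked = 0
        taken = 0
        for size in locked_sizes[sent:]:
            locked += size  # loading a document locks memory
            if locked > limit:
                break  # early check: the document just loaded is left out
            taken += 1
        rounds += 1
        if taken == 0:
            return None  # empty batch, the next round would be identical
        sent += taken
    return rounds


# Invented sizes: the second document needs 70 MB of locked memory on its own.
sizes = [10 * MB, 70 * MB, 5 * MB]
print(rounds_to_replicate(sizes, 64))   # stalls on the second document
print(rounds_to_replicate(sizes, 72))   # progresses, one document per batch
print(rounds_to_replicate(sizes, 128))  # everything fits in a single batch
```

Any limit at or above the largest single document's locked size is enough to get moving again in this model; a more generous value simply means fewer batches.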

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.
