Production postmortem: The encrypted database that was too big to replicate


A customer called the support hotline with a serious problem. They had a large database and wanted to add another replica node to it. This is a fairly standard thing to do, of course. The problem was that somewhere around the 70% mark, the replication process stalled. All the metrics were green, the mentor node and the new node had perfect connectivity, and there were no errors in the logs.

Replication usually stalls because of connectivity issues, but we saw no sign of that here. In fact, the mentor node kept sending (empty) batches to the destination node. That shouldn’t happen: if we have nothing to send, no batch should go over the wire. That was the only hint that something was wrong.

We also looked at what RavenDB itself could tell us about the system and noticed a performance hint about big documents. Some of them exceeded 32 MB in size, which is… quite a lot. That isn’t directly related to replication, however. Documents that large would surely slow it down, but it should still work.

Looking into the logs, we could see that the mentor node was attempting to send a batch, but it was sending zero documents. Digging deeper, we saw an entry about skipping documents, which was… strange. Cross-referencing the log statement with the source code revealed that RavenDB had decided it was sending too much in the batch and aborted it. But… it wasn’t sending anything in the batch.

What is actually going on is that the database in question is an encrypted one. Encrypted databases in RavenDB are encrypted both on disk and in memory. The only time that we decrypt a document is when there is an active transaction reading it. During that time, we hold the decrypted data in locked memory (so it won’t be paged to disk). As a result, we try to limit the size of transactions in encrypted databases. When we replicate data between nodes, we open a read transaction on the source node, read the documents that we need to replicate, and send them to the other side.

There is a small caveat here: each node in an encrypted database can use a different encryption key, so we aren’t sending the encrypted data but the plain text. Of course, the communication channel itself is encrypted, so nobody can peek at the data in transit.

By default, we’ll stop a replication batch in an encrypted database once we have locked more than 64 MB of memory. A replication batch of 64 MB is plenty big, after all. However… we didn’t take into account a scenario where a single document may cause us to consume more than 64 MB of locked memory. And we put the check that closes the replication batch very early in the process, before the document is added to the batch.

The sequence of operations was then:

  • Start a replication batch
  • Load the first document to send
  • Realize that we locked too much memory and close the batch
  • Send a zero length batch

Rinse and repeat, since we can’t make any forward progress.
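The loop above can be modeled in a few lines of Python. This is only a toy sketch of the behavior described here, not RavenDB's code: the function name `build_batch`, the document sizes, and the idea of tracking locked memory as a simple sum are all assumptions made for illustration.

```python
from typing import List, Tuple


def build_batch(doc_sizes_mb: List[int], start: int,
                max_locked_mb: int) -> Tuple[List[int], int]:
    """Model one replication batch.

    Returns the sizes of the documents placed in the batch and the index
    of the next document that still has to be sent.
    """
    batch: List[int] = []
    locked_mb = 0
    i = start
    while i < len(doc_sizes_mb):
        # Loading (decrypting) a document locks memory for its plain text.
        locked_mb += doc_sizes_mb[i]
        # The limit is checked before the document joins the batch, so one
        # oversized document closes the batch while it is still empty.
        if locked_mb > max_locked_mb:
            break
        batch.append(doc_sizes_mb[i])
        i += 1
    return batch, i


# Hypothetical locked-memory sizes in MB; the third one exceeds the limit.
docs = [10, 20, 70, 5]
first, pos = build_batch(docs, 0, 64)      # two documents go out
second, pos2 = build_batch(docs, pos, 64)  # empty batch, position unchanged
print(first, pos, second, pos2)
```

The second call returns an empty batch and the same position it started from, so every later call does exactly the same thing. That is the stall: batches keep flowing, but none of them carries a document.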

The actual solution was to set the “Replication.MaxSizeToSendInMb” configuration option to a higher value, enough to send even the biggest documents the customer has. At that point, there was forward progress again in the system and the replication was completed successfully.

We still consider this a bug, and we’ll fix it so there won’t be a hang in the system, but I’m happy to see that we were able to do a configuration change and get everything up to speed so quickly.
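One way such a hang could be avoided is to apply the limit only once the batch already holds a document, so that every batch moves forward by at least one. The sketch below illustrates that idea under the same toy model; it is an assumption about a possible guard, not the fix that actually went into RavenDB, and the names `build_batch_guarded` and `replicate_all` are hypothetical.

```python
from typing import List, Tuple


def build_batch_guarded(doc_sizes_mb: List[int], start: int,
                        max_locked_mb: int) -> Tuple[List[int], int]:
    """Model one batch that always carries at least one document."""
    batch: List[int] = []
    locked_mb = 0
    i = start
    while i < len(doc_sizes_mb):
        locked_mb += doc_sizes_mb[i]
        # Close the batch on the limit only if it is not empty; an
        # oversized first document is sent on its own instead.
        if batch and locked_mb > max_locked_mb:
            break
        batch.append(doc_sizes_mb[i])
        i += 1
    return batch, i


def replicate_all(doc_sizes_mb: List[int],
                  max_locked_mb: int) -> List[List[int]]:
    """Run batches until every document has been sent."""
    batches: List[List[int]] = []
    pos = 0
    while pos < len(doc_sizes_mb):
        batch, pos = build_batch_guarded(doc_sizes_mb, pos, max_locked_mb)
        batches.append(batch)
    return batches


# Hypothetical locked-memory sizes in MB, with a 64 MB limit.
batches = replicate_all([10, 20, 70, 5], 64)
print(batches)
```

The trade-off in this sketch is that an oversized document briefly locks more memory than the configured limit, but it does so once, in a batch of its own, and replication keeps moving.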
