I’m certain that the archangel that is responsible for statistical grouping has a strange fondness for jokes. If that isn’t the case, I can’t really explain how we routinely get clear clustering of bug reports on specific issues. We have noticed that several years ago, when we suddenly started to get a lot of reports of a particular issue. There was no recent release and no reason for all of these correlated errors to happen all at once. But they did. Luckily, the critical mass of reports gave us more than enough information to figure out what the bug was and fix it. When we checked, the bug was there, dormant, for 20 months.
The current issue we are handling is a spate of reports about hard errors in RavenDB. The kind of errors that use the terms invalid data, in particular. These kind of errors bring development to a screeching halt, as we direct all available resources to figure out what is going on. We didn’t have a reproduction, but we did have a few cases of people reporting it and the stack trace we got told us that the problem wasn’t isolated to a single location but seemed to be fairly wide spread. This is a Good News / Bad News kind of situation. It is good because it means that the problem is in a low level component, it is bad because it is going to be hard to narrow it down exactly.
The common theme around this issue popped up when the system was already in a bad state. We were able to see that when we run out of disk space or when the memory on the system was very low. The common thread to these was that in both cases these are expected and handled scenarios. What is more, these are the kind of errors that we want to keep on shouldering through.
In general, the error strategy we use in most of RavenDB is fairly simple. Abort the transaction and give the user an error. If the error is considered catastrophic (I/O failure when writing to disk, for example), we also force a restart of the database (which does full recovery). As it turned out, there were three completely separate issues that we found when investigating this issue.
First, and the most critical one was cleanup code that wasn’t aware of the state of the system. For example, consider:
Let’s assume that we have an error during the WriteData. For example, we run out of space on disk. We abort the transaction and throw an exception. The FlushIndex() method, however, will go into the catch clause and try to cleanup its resources. When doing so, it will access the transaction (that was just aborted) and try to use it, still.
The problem is that at this point the state of the transaction is known bad, the only thing you are allowed to do is to dispose it, you cannot access anything on it. The code in question, however, is meant to be used on non transactional resource and require compensating actions to revert to previous state.
For that matter, the code above has another problem, however. Do you see the using statement here? It seemed like we have quite a bit of code that does stuff in the Dispose method. And if that stuff also touch the transaction, you may get what is called in the profession: “hilarity ensued”.
The problem was that our transaction code was too trusting and assumed that the calling code will not call it if it is in invalid state. Indeed, the cases where we did such a thing were rare, but when they happened.
The second problem was when we had code like this:
Except that the real code is a lot more complex and it isn’t as each to see what is going on.
What you see here is that we run commands, and as we process them, we may get (expected) exceptions, which we should report to the caller. The problem is that we mixed expected exceptions with unexpected ones. And these unexpected exceptions could leave us in… a funny state. At which point we would continue executing future commands and even commit the transaction. As you can imagine, that isn’t a good place to be at.
We have gone over our error handling and injected faults and errors at any later of the stack that we could conceive of. We have been able to fix a lot of issues, most of them have probably never been triggered, but it was a sobering moment to look at some of those changes. The most likely cause for the errors was a change that was made (by me, as it turns out) over two years ago. And in all that time, we never seen neither hide nor hair of it. Until suddenly we got several simultaneous cases.
The third and final problem, by the way, was related to not crashing. By default, any severe enough error should cause us to shut down the database and restart it. In the process, we re-apply the journal and ensure that we are in a consistent state. The problem is that we don’t want to do that for certain expected errors. And the issue with staying up was that while Voron (at the storage layer) handled the error properly, the higher level code did not. At that point we had a divergence of the in memory state vs. the on disk state. That usually led to either NRE that would remain until the server was restarted or really scary messages that would typically go away when we recovered the database and reloaded the in memory state from the on disk state.
In short, handling errors in the error handling code is tough.
The primary changes we made were on the transaction side, we made it validate its own state when called, so code that erroneously try to use a transaction that has already failed will error early. We have also added additional validation of operations in several key points. That would allow us to catch errors much more quickly and allow us to pinpoint exactly where things are breaking apart sooner.
The current state is that we pushed these changes to our production system and are running the usual battery of tests to prove that the changes are valid. We’ll also be adding more faults into the process to ensure that we exercise our error handling a lot more often.
More posts in "Production postmortem" series:
- (18 Feb 2019) This data corruption bug requires 3 simultaneous race conditions
- (25 Dec 2018) Handled errors and the curse of recursive error handling
- (23 Nov 2018) The ARM is killing me
- (22 Feb 2018) The unavailable Linux server
- (06 Dec 2017) data corruption, a view from INSIDE the sausage
- (01 Dec 2017) The random high CPU
- (07 Aug 2017) 30% boost with a single line change
- (04 Aug 2017) The case of 99.99% percentile
- (02 Aug 2017) The lightly loaded trashing server
- (23 Aug 2016) The insidious cost of managed memory
- (05 Feb 2016) A null reference in our abstraction
- (27 Jan 2016) The Razor Suicide
- (13 Nov 2015) The case of the “it is slow on that machine (only)”
- (21 Oct 2015) The case of the slow index rebuild
- (22 Sep 2015) The case of the Unicode Poo
- (03 Sep 2015) The industry at large
- (01 Sep 2015) The case of the lying configuration file
- (31 Aug 2015) The case of the memory eater and high load
- (14 Aug 2015) The case of the man in the middle
- (05 Aug 2015) Reading the errors
- (29 Jul 2015) The evil licensing code
- (23 Jul 2015) The case of the native memory leak
- (16 Jul 2015) The case of the intransigent new database
- (13 Jul 2015) The case of the hung over server
- (09 Jul 2015) The case of the infected cluster