I’m currently deep in the process of modifying the internals of Voron, trying to eke more performance out of the system. I’m making great progress, but I’m also touching parts of the code that haven’t been looked at in a long time.
In other words, I’m mucking about with the most stable and most critical portions of the storage engine. It’s a lot of fun, and I’m actually seeing some great results, but it is also nerve-wracking.
We have enough tests that I’m confident we would catch any actual stability issues, but the drive back toward a fully green build has been a slog.
The process is straightforward:
- Change something.
- Verify that it works better than before.
- Run the entire test suite (upward of 30K tests) to see if there are any breaks.
The last part can be frustrating, because running a test suite of that size takes a while. That would be bad enough, but some of the changes I made were things like marking a piece of memory that used to be read/write as read-only. Now any attempt to write to that memory results in an access violation.
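Here is a minimal Python sketch of that behavior (illustrative only, nothing to do with Voron’s actual code). Python’s mmap guards the write at the API level and raises an exception; in native code, the same write is a hard access violation that takes the whole process down:

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"\x00" * 4096)
    f.flush()

    rw = mmap.mmap(f.fileno(), 4096)   # read/write mapping
    rw[0:5] = b"hello"                 # writing works as expected

    ro = mmap.mmap(f.fileno(), 4096, access=mmap.ACCESS_READ)  # read-only mapping
    ro[0:5] = b"junk"  # raises TypeError: mmap can't modify a readonly memory map
```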
I fixed those in the code, of course, but we have a lot of tests, including some tests that intentionally corrupt data to verify that RavenDB behaves properly under those conditions.
One such test writes garbage to the RavenDB file, using read-write memory. The idea is to verify that RavenDB notices the checksum mismatch on read and aborts early. Because that test directly modifies what is now read-only memory, it generates a memory access violation. That doesn’t just result in a test failure, it takes the whole process down.
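The shape of that safeguard is roughly this (an illustrative Python sketch; I’m using CRC32 as a stand-in, RavenDB’s actual page checksum is a different hash):

```python
import struct
import zlib

def write_page(payload: bytes) -> bytes:
    # Prepend a checksum of the payload (CRC32 purely for illustration).
    return struct.pack("<I", zlib.crc32(payload)) + payload

def read_page(page: bytes) -> bytes:
    stored, payload = struct.unpack("<I", page[:4])[0], page[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("page checksum mismatch, aborting the read early")
    return payload

page = write_page(b"some record data")
corrupted = page[:4] + b"garbage!" + page[12:]  # the test scribbling over the file
read_page(corrupted)                            # raises IOError
```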
I’ve gotten pretty good at debugging those sorts of issues (--blame-crash is fantastic) and was able to knock quite a few of them down and get them fixed.
And then there was this test, which uses encryption-at-rest. That test started to fail after my changes, and I was pretty confused about what exactly was going on. When trying to read data from disk, it would follow a pointer to an invalid location. That is not supposed to happen, obviously.
Looks like I have a little data corruption issue on my hands. The problem is that this shouldn’t be possible. Remember how we validate the checksum on read? When using encryption-at-rest, we are using a mechanism called AEAD (Authenticated Encryption with Associated Data). That means that in order to successfully decrypt a page of data from disk, it must have been cryptographically verified to be valid.
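You can see that property in action with a few lines of Python, using ChaCha20-Poly1305 from the cryptography package (a stand-in, not RavenDB’s actual implementation; this is just to illustrate how AEAD behaves). Flip a single bit of the ciphertext, and decryption refuses outright instead of returning garbage:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = ChaCha20Poly1305.generate_key()
aead = ChaCha20Poly1305(key)
nonce = os.urandom(12)

ct = aead.encrypt(nonce, b"page contents", b"page header")  # ciphertext + 16-byte tag
assert aead.decrypt(nonce, ct, b"page header") == b"page contents"

tampered = bytes([ct[0] ^ 1]) + ct[1:]         # flip one bit
aead.decrypt(nonce, tampered, b"page header")  # raises InvalidTag
```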
My test results showed, pretty conclusively, that I was generating valid data and then encrypting it. The next stage was to decrypt the data (verifying that it was valid), at which point I ended up with complete garbage.
RavenDB trusts that since the data was properly decrypted, it is valid and tries to use it. Because the data is garbage, that leads to… excitement. Once I realized what was going on, I was really confused. I’m pretty sure that I didn’t break 256-bit encryption, but I had a very clear chain of steps that led to valid data being decrypted (successfully!) to garbage.
It was also quite frustrating to track down, because any small-scale test that I wrote would return the expected results. It was only when I ran the entire system and stressed it that I got this weird scenario.
I started practicing my Fields Medal acceptance speech while digging deeper. Something here had to be wrong. It took me a while to figure out what was going on, but eventually I tracked it down to registering for the TransactionCommit event when we open a new file.
The idea is that when we commit the transaction, we’ll encrypt all the data buffers and then write them to the file. We register for an event to handle that, and we used to do that on a per-file basis. My changes, among other things, moved that logic to apply globally.
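A toy model of that mismatch looks like this (hypothetical names, not Voron’s actual API): the registration still happens once per file, but the event it registers against is now global.

```python
class StorageEnvironment:
    def __init__(self):
        self.transaction_commit = []            # now a single, global event

    def open_file(self, name):
        # Registration is still per file, as in the old design...
        self.transaction_commit.append(lambda: print("encrypting buffers"))

    def commit(self):
        # ...so every open file adds another encryption pass to each commit.
        for handler in self.transaction_commit:
            handler()

env = StorageEnvironment()
env.open_file("0001.journal")
env.commit()                   # prints once: everything works
env.open_file("0002.journal")
env.commit()                   # prints twice: the buffers get encrypted twice
```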
As long as we were writing to a single file, everything just worked. When we had enough workload to need a second file, we would encrypt the data twice and then write it to the file. Upon decryption, we would successfully decrypt the outer layer but end up with still-encrypted data (which looks like random fluff).
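The cryptography was doing exactly what it promises, which is what made this so confusing. A minimal demonstration of the effect (ChaCha20-Poly1305 as a stand-in again): encrypt twice, decrypt once, and authentication succeeds while the output is still ciphertext.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

aead = ChaCha20Poly1305(ChaCha20Poly1305.generate_key())
n1, n2 = os.urandom(12), os.urandom(12)

once = aead.encrypt(n1, b"valid page data", None)   # what should reach the disk
twice = aead.encrypt(n2, once, None)                # the buggy second pass

recovered = aead.decrypt(n2, twice, None)  # the tag checks out, no error raised...
assert recovered == once                   # ...but this is ciphertext, not the page
assert recovered != b"valid page data"
```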
The fix was simply moving the event registration to the transaction level, not the file level; the corrected shape is sketched below. I committed my changes and went back to the unexciting life of bug-fixing, rather than encryption-breaking and math-defying hacks.
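Continuing the toy model from above (still hypothetical names): the handler hangs off the transaction itself, so the number of open files no longer matters.

```python
class Transaction:
    def __init__(self):
        self.on_commit = []                  # scoped to this transaction

    def commit(self, buffers):
        for handler in self.on_commit:
            buffers = handler(buffers)
        return buffers

def begin_transaction():
    tx = Transaction()
    # Registered exactly once per transaction, however many files are open.
    tx.on_commit.append(lambda bufs: [b"<enc>" + b for b in bufs])
    return tx

print(begin_transaction().commit([b"page-1", b"page-2"]))  # each buffer encrypted once
```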