One of the measures that we don’t care much about is the startup time of RavenDB. Whether it takes 5 seconds or 15 seconds is of little concern to us. Whether it takes 15 seconds or 3 minutes, however, is something that we most certainly want to pay attention to.
One of our customers has an interesting use case. They are running on Azure machines and take full advantage of the multiple storage options that they have available there. In particular, their journals are on a premium storage disk, but their data resides on a large (and slow) disk. This is because they have quite a lot of data. One of their indexes just exceeded the 256GB mark, for example.
In their case, the startup time for RavenDB wasn’t acceptable. We investigated the issue and it turned out that the root of the problem was that RavenDB was running recovery on the database, re-applying recent transactions to make sure that we are consistent. This is expected and, in most cases, shouldn’t cause you to spend too much time at startup. By default, journals are going to be about 256MB if you are heavily loaded. But due to the customer’s access patterns, we saw transactions that included multiple GBs. We compress the transaction data before writing it to disk, so a single transaction (which cannot be split across multiple journal files) that takes multiple GBs compressed has likely written 10+ GB to the data file. We can tell that we don’t need to apply a transaction if it was already applied, but we need to read and analyze it first.
Multiply that by the number of databases and the number of indexes per database and you can see that restarting RavenDB begins to be something that you plan for. That is not where we want to be, obviously. Now, if we just had a crash, there is really no good way to avoid reapplying these transactions, but the problem was that we saw the same behavior without a crash. We saw it after a normal shutdown.
The basic problem was that RavenDB doesn’t track the location in the journal file that we know has been safely synced to disk. We only track things at the journal level. That means that on startup, we need to read through the entire journal file and figure out whether we need to apply each of the transactions inside it. We could track the last synced transaction location, of course. That would mean changing the on-disk format at a very low level, something that we have the facilities to do, but it is probably going to be awkward and cause compatibility concerns that I would rather not get into.
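To make the cost concrete, here is a minimal sketch of a recovery pass that only knows things at the journal level. It is not Voron’s actual code: `JournalTx`, `recover` and `last_applied_tx_id` are hypothetical names, and comparing transaction ids is just an assumed stand-in for however the real engine decides that a transaction was already applied.

```python
from dataclasses import dataclass


@dataclass
class JournalTx:
    tx_id: int
    compressed_size: int  # bytes that must be read from the journal file


def recover(journal, last_applied_tx_id):
    """Scan a whole journal; return (bytes_read, ids of transactions to apply).

    Hypothetical model: without a "synced up to here" marker inside the
    journal, every transaction has to be read and analyzed, even the ones
    that turn out to be already applied.
    """
    bytes_read = 0
    to_apply = []
    for tx in journal:
        bytes_read += tx.compressed_size  # the read cost is paid regardless
        if tx.tx_id > last_applied_tx_id:
            to_apply.append(tx.tx_id)
    return bytes_read, to_apply


GB = 1024 ** 3
journal = [JournalTx(tx_id=1, compressed_size=2 * GB)]

# Clean shutdown: transaction 1 is already in the data file, so nothing is
# applied, yet the full 2 GB still had to be read to find that out.
print(recover(journal, last_applied_tx_id=1) == (2 * GB, []))  # True
```

In this model the amount of data read depends only on what is sitting in the journal, not on how much of it still needs to be applied, which is why a normal shutdown did not help.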
We also looked into changing the runtime behavior so we’ll be more likely to move to a new journal file after we synced the data in the previous one if it is too large. I was looking at this today and figured out something silly. Whenever we have a large transaction (where large means bigger than the max journal size) we need to ensure that we have enough space for the transaction. We do that by allocating a big enough file on disk. However, the way we did that was interesting.
The logic goes like this: if the current journal size is smaller than the minimum required size, we make sure to increase it. And because we want to avoid making too many file allocation calls, we round the size up to the next power of two, so the journal file is big enough that it can be used for the next transaction as well. Now, consider the common scenario where the current journal size is 256MB (which is the default journal file limit) and the transaction size is 1.56 GB.
What will happen then is that we’ll get a journal size of 2GB, of which only 1.56GB is used. This is fine, and we’ll use the rest of the space, if we can. However, if the next transaction is too large (let’s say, 800MB), we’ll need to create a new file, whose size will be 1GB, etc.
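The sizing behavior described above can be sketched roughly as follows. This is a reconstruction from the description, not the actual RavenDB code; `next_power_of_two` and `journal_size_before_fix` are names made up for the illustration.

```python
MB = 1024 * 1024
GB = 1024 * MB


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def journal_size_before_fix(current_size: int, min_required: int) -> int:
    # Hypothetical reconstruction: grow only when the transaction does not
    # fit, and round up so the file can hopefully hold the next one too.
    if current_size < min_required:
        return next_power_of_two(min_required)
    return current_size


tx = int(1.56 * GB)
size = journal_size_before_fix(256 * MB, tx)
print(size // GB)                     # 2
print(round((size - tx) / GB, 2))     # 0.44 (GB left unused in the journal)
print(next_power_of_two(800 * MB) // GB)  # 1
```

The leftover 0.44 GB is the interesting part: it is too small for the 800MB transaction, but it is enough to keep the 2GB journal alive.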
It is when we sync the data to disk that we really hit the bad behavior. We just synced the data to disk, so we should be able to get rid of the journal file. But there are still 440MB of disk space allocated to the journal file, so we keep the journal around for the next transaction. And if we restart at that point, we’ll have to go through the entire 2 GB journal file to make sure that we haven’t missed anything. The fix, in this case, was stupidly easy:
All we need to do is to ensure that if the power-of-two size of the write to the journal is bigger than the max journal size, we’ll use the size of the write to the journal instead. That will create a journal that has just a single transaction in it. Most importantly, that means that once the data is synced to disk, there is no more space available in that journal file and Voron will immediately know that it can clear it. No big journal sticking around, no need to restructure our on-disk data or to go into tricky changes of behavior. I really love this change because it is succinct, simple and does the job.
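As a sketch of the rule just described (again, an illustration under assumptions rather than the real Voron code: the function names are hypothetical, and the actual implementation likely also deals with details such as page alignment):

```python
MB = 1024 * 1024
GB = 1024 * MB


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def journal_size_after_fix(current_size: int, min_required: int,
                           max_journal_size: int) -> int:
    if current_size >= min_required:
        return current_size  # the transaction already fits
    rounded = next_power_of_two(min_required)
    if rounded > max_journal_size:
        # Oversized write: allocate exactly what it needs, so the journal
        # holds a single transaction and has no free space left over.
        return min_required
    return rounded


tx = int(1.56 * GB)
size = journal_size_after_fix(256 * MB, tx, max_journal_size=256 * MB)
print(size == tx)  # True

# Small writes still round up to a power of two, as before.
print(journal_size_after_fix(64 * MB, 100 * MB, 256 * MB) // MB)  # 128
```

With no free space remaining after the oversized transaction, the journal stops being a candidate for further writes the moment its data is synced, so it can be cleared right away.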