Corax, Lucene, Benchmarks and lies!

When we started working on Corax (10 years ago!), we had a pretty simple mission statement for that: “Lucene, but 10 times faster for our use case”. When we actually started implementing this in code (early 2020), we had a few more rules about the direction we wanted to take.

Corax had to be faster than Lucene in all scenarios, and 10 times faster for common indexing and querying scenarios. Corax’s design is meant for online indexing, rather than being batch-oriented like Lucene’s. We favor moving work to indexing time and ensuring that our data structures on disk can be used with no additional processing time.

Lucene was created at a time when data size was much smaller and disks were far more expensive. It shows in the overall design in many ways, but one of the critical aspects is that the file design for Lucene is compressed, meaning that you need to read the data, decode that into the in-memory data structure, and then process it.

For RavenDB’s use case, that turned out to be a serious problem. The issue of cold queries, where you query the database for the first time and have to pay the initialization cost, was particularly difficult. Now, cold queries aren’t really that interesting from a benchmark perspective, since you have to warm things up in any software (caches are everywhere, from your disk to your CPU). I like to say that even memory has caches (yes, plural) because it is so slow (L1, L2, L3 caches).

With Lucene’s design, however, whenever it runs an indexing batch, it creates a new file, and querying after that means a “cold start” for that file. Usually, those files are small, but every now and then Lucene needs to merge several files together, and then we have to pay the cold start price for a large amount of data.

The issue is that this sometimes introduces a high latency spike (hitting us in the P999 targets), which is really hard to smooth over. We spent a lot of time and engineering resources ensuring that this doesn’t have a big impact on our users.

One of the design goals for Corax was to ensure that this doesn’t happen. That we are able to get consistent performance from the system without periodic maintenance tasks. That led us to a very different internal design. The persistent data structures that we use are meant to be used as is, without initial processing.

Everything has a cost, and in this case, it means that the size of Corax on disk is typically somewhat larger than Lucene. The big advantage is that the amount of memory being used by Corax tends to be significantly lower. And in today’s world, disks are far cheaper than memory. Corax’s cold start time is orders of magnitude faster than Lucene’s cold start time.
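To make the trade-off concrete, here is a minimal sketch of the two approaches. It is not Lucene’s or Corax’s actual file format. The encodings and the function names (`encode_compact`, `decode_compact`, `encode_fixed`, `contains_fixed`) are assumptions made up for illustration. The compact form is smaller but has to be decoded into a list before you can search it, while the fixed-width form is larger but can be searched directly on the stored bytes.

```python
import struct

def encode_compact(sorted_ids):
    """Delta + varint encoding: small on disk, must be decoded before use."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        delta = value - prev
        prev = value
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def decode_compact(buf):
    """Rebuild the full in-memory list; this is the up-front 'cold' cost."""
    ids, current, shift, delta = [], 0, 0, 0
    for byte in buf:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += delta
            ids.append(current)
            shift, delta = 0, 0
    return ids

def encode_fixed(sorted_ids):
    """Fixed-width little-endian 64-bit integers: larger, usable as is."""
    return struct.pack(f"<{len(sorted_ids)}q", *sorted_ids)

def contains_fixed(buf, target):
    """Binary search directly over the stored bytes, with no decode step."""
    count = len(buf) // 8
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        (value,) = struct.unpack_from("<q", buf, mid * 8)
        if value < target:
            lo = mid + 1
        else:
            hi = mid
    if lo == count:
        return False
    return struct.unpack_from("<q", buf, lo * 8)[0] == target

ids = list(range(0, 3000, 3))   # 1,000 hypothetical document ids
compact = encode_compact(ids)   # 1,000 bytes, needs decode_compact() first
fixed = encode_fixed(ids)       # 8,000 bytes, searchable in place
```

In this toy example the fixed-width form is eight times larger, which exaggerates the real difference, but a lookup on it reads only a handful of 8-byte slots instead of walking the whole buffer first.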

It turns out that there is a huge impact in another scenario as well, completely unexpected. We continuously run performance tests on our system, and we got some ridiculous results when testing query performance using encrypted databases.

When you use encryption at rest, RavenDB ensures that the only time that your data is decrypted is when there is an active transaction using the data. In other words, even in-memory buffers are encrypted. That applies to documents as well as indexes. It does not apply to the in-memory data that Lucene holds in its cache, though. For Corax, however, all of its state is encrypted.

When we ran our benchmark on encrypted database queries, we expected to see either roughly the same performance between Corax and Lucene or Lucene edging out Corax in this scenario, since it can use its cache without paying decryption costs.

Instead, we got really puzzling results. I tried showing them in bar chart format, but I literally couldn’t make the data fit in a reasonable size. The scenario is testing queries on an encrypted database, using an m5.xlarge instance on AWS. We are hitting the server with 500 queries/second, and testing for the 99.99th percentile performance.
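For a sense of what that percentile means at this load, here is a small sketch using the nearest-rank definition. The `percentile` helper and the latency values are made up for illustration and are not our benchmark harness or its data. At 500 queries/second, the 99.99th percentile allows one query in 10,000 to be slower, which is one query every 20 seconds.

```python
import math
from fractions import Fraction

def percentile(samples, pct):
    """Nearest-rank percentile. pct is a string such as "99.99" so the
    rank is computed exactly, without floating point rounding."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(Fraction(pct) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

QUERIES_PER_SECOND = 500  # load used in the scenario

# One query in 10,000 may exceed the 99.99th percentile value.
seconds_per_outlier = 10_000 / QUERIES_PER_SECOND

# Hypothetical latencies in ms: 9,998 fast queries and 2 slow ones.
latencies = [2] * 9_998 + [4_000] * 2
p99 = percentile(latencies, "99")       # does not see the slow queries
p9999 = percentile(latencies, "99.99")  # dominated by the slow queries
```

Two slow queries out of 10,000 are invisible at the 99th percentile but define the 99.99th, which is why rare latency spikes show up so strongly in this kind of test.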

Indexing Engine | 99.99% percentile (ms) | 99.99% percentile (seconds)

Take a look at those numbers! Somehow Corax is absolutely eating Lucene’s lunch. And I was quite surprised about that. I mean, I’m happy, I guess, that the indexing engine we spent so much time on is doing this well, but any time we see a performance number that we cannot explain, we need to figure out what is going on.

Here is the profiler output for this benchmark, using Lucene.

As you can see, the vast majority of the time is spent decrypting pages. And we are decrypting pages belonging to a stream. Those are the Lucene files, stored (encrypted in this case) inside of Voron. The issue is that the access pattern that Lucene is using forces us to touch large parts of the file. It usually reads a very small portion each time, but in various locations. Given that the data is encrypted, we have to decrypt each of those locations.

Corax, on the other hand, lays out its persistent data structures in such a way that we need to access only specific pages. That means that in terms of the number of pages touched for this particular scenario, Lucene touches a lot more than Corax. You’ll usually not notice that, since Voron (our storage engine) is memory mapped and those accesses are cheap. When using encrypted storage, however, we need to decrypt the data first, so the difference was very noticeable.
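Here is a rough sketch of why the access pattern matters so much once every page has to be decrypted. The page size, the read offsets, and the `pages_touched` helper are assumptions for illustration, not measurements from the benchmark. Both patterns read the same number of bytes, but they touch a very different number of pages.

```python
PAGE_SIZE = 8192  # assumed page size in bytes, for illustration only

def pages_touched(reads, page_size=PAGE_SIZE):
    """Return the set of page numbers covered by (offset, length) reads."""
    touched = set()
    for offset, length in reads:
        if length <= 0:
            continue
        first = offset // page_size
        last = (offset + length - 1) // page_size
        touched.update(range(first, last + 1))
    return touched

# Hypothetical access patterns: 100 reads of 16 bytes each (1,600 bytes total).
scattered = [(i * 50_000, 16) for i in range(100)]  # tiny reads all over a big file
clustered = [(i * 16, 16) for i in range(100)]      # the same bytes packed together

scattered_pages = len(pages_touched(scattered))  # one page per read
clustered_pages = len(pages_touched(clustered))  # everything on a single page
```

If each touched page has to be decrypted before it can be read, the scattered pattern pays for 100 decryptions where the clustered one pays for 1, even though the amount of data actually used is identical.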

It’s interesting to note that this also applies to instances where there is memory pressure involved. Corax tends to touch a lot less memory and have a smaller working set, while Lucene generates more page faults.

Really interesting results, and I’m both happy and amused that totally different design decisions have led to such a big impact in this scenario. In short, Corax is fast, really fast, and in many more scenarios than we initially thought.