I’m trying to compare indexing speed of Corax vs. Lucene. Here is an interesting result:
We have two copies of the same index, running in parallel on the same data. And we can clearly see that Lucene is faster. Not by a lot, but enough to warrant investigation.
Here is the core of the work for Lucene:
And here it is for Corax:
If you look at the results, you’ll see something really interesting.
For the Corax version, the MapItems.Execute() is almost 5% slower than the Lucene version.
And that really pisses me off. That is just flat out unreasonable to see.
And the reason for that is that the MapItems.Execute() is identical in both cases. The exact same code, and there isn’t any Corax or Lucene code there. But it is slower.
Let’s dig deeper, and we can see this interesting result. This is the Lucene version, and the highlighted portion is where we are reading documents for the indexing function to run:
And here is the Corax version:
And here it is two thirds more costly? Are you kidding me? That is the same freaking code and is utterly unrelated to the indexing.
Let’s dig deeper, shall we? Here is the cost breakdown for Lucene, with the important bits highlighted:
And here is the cost breakdown for Corax:
I have to explain a bit about what is going on here. RavenDB doesn’t trust the disk and will validate the data it reads from it the first time it loads a page.
That is what the UnlikelyValidatePage is doing.
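To make the "validate on first load" idea concrete, here is a minimal sketch in Python. It is not RavenDB's code: the `PageCache` class, the `get_page` method and the use of CRC32 as the checksum are all assumptions made for illustration. The only thing taken from the behavior described above is that a page is verified the first time it is loaded and never again.

```python
import zlib


class PageCache:
    """Toy pager (hypothetical): checks a page's checksum only on first load."""

    def __init__(self, pages):
        # pages: dict of page number -> (payload bytes, stored CRC32)
        self._pages = pages
        self._validated = set()
        self.validations = 0  # how many times the validation cost was paid

    def get_page(self, page_number):
        payload, stored_crc = self._pages[page_number]
        if page_number not in self._validated:
            # Slow path: taken at most once per page while the cache lives.
            self.validations += 1
            if zlib.crc32(payload) != stored_crc:
                raise IOError(f"page {page_number} failed checksum validation")
            self._validated.add(page_number)
        return payload


def make_pages(count):
    """Build `count` fake pages, each with a matching checksum."""
    pages = {}
    for n in range(count):
        payload = f"page-{n}".encode()
        pages[n] = (payload, zlib.crc32(payload))
    return pages


cache = PageCache(make_pages(4))
for _ in range(3):
    for n in range(4):
        cache.get_page(n)
print(cache.validations)  # 4: twelve reads, each page validated once
```

The point of the sketch is that whoever touches a page first pays for the checksum, and every later reader gets it for free.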
What we are seeing in the profiler results is that both Corax and Lucene are calling GetPageInternal() a total of 3.69 million times, but Corax is actually paying the cost of page validation for the vast majority of them.
Corax validated over 3 million pages while Lucene validated only 650 thousand pages. The question is why?
And the answer is that Corax is faster than Lucene, so it is able to race ahead. When it races ahead, it encounters pages first and has to validate them. When Lucene comes around and tries to index those documents, the pages they live on were already validated.
Basically, Lucene is surfing all the way forward on the wavefront of Corax’s work, and ends up doing a lot less work as a result.
What this means, however, is that we need to test both scenarios separately, on cold boot. Otherwise they will mess with each other’s results.
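The effect is easy to reproduce in a toy simulation. The sketch below is only an illustration under stated assumptions: the `run` function, the names `fast` and `slow` (standing in for Corax and Lucene) and the speeds of 3 and 2 pages per tick are all made up, and the real interleaving is messier than a round-robin loop. It just counts who pays for first-touch validation when two indexers share one set of validated pages, compared with each one running alone from a cold start.

```python
def run(indexers, page_count):
    """Count the validations each indexer pays for.

    indexers: dict of name -> pages processed per tick (hypothetical speeds).
    All indexers in a single call share one set of already-validated pages.
    """
    validated = set()
    position = {name: 0 for name in indexers}
    paid = {name: 0 for name in indexers}
    while any(pos < page_count for pos in position.values()):
        for name, speed in indexers.items():
            for _ in range(speed):
                page = position[name]
                if page >= page_count:
                    break
                if page not in validated:
                    # First reader of this page pays the validation cost.
                    validated.add(page)
                    paid[name] += 1
                position[name] += 1
    return paid


# Both indexers running in parallel over the same 1,000 pages.
shared = run({"fast": 3, "slow": 2}, page_count=1000)

# Each indexer on its own, starting with nothing validated (cold boot).
fast_alone = run({"fast": 3}, page_count=1000)
slow_alone = run({"slow": 2}, page_count=1000)

print(shared, fast_alone, slow_alone)
```

In the shared run the faster indexer ends up paying for the bulk of the validations, while in the isolated runs each indexer pays for every page itself, which is the only setup where the two numbers are comparable.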