What Lucene does, a look under the hood
Lucene is a search engine library, which is great. But as it turns out, there is a lot going on in there. After working with it for several years, I can say with confidence that it is a pretty awesome library. But surprisingly, a lot of the effort that went into it doesn’t seem to be talked about, or even visible, to people who aren’t trawling through the code. I think that this is a pretty good testament to how successful it is. That, and the fact that it is now the baseline against which all other search libraries & engines are compared.
What I wanted to talk about today is the kind of things that Lucene does that don’t seem to get much publicity. I think that Spolsky said it best:
Back to that two page function. Yes, I know, it's just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I'll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn't have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.
I remember just how much impact that article made on me at the time. And Lucene’s codebase bears those words out. Lucene is a search engine library, which basically means that it does indexing, querying, and the background work that keeps both of them running.
One of the major areas of maturity in Lucene is how it optimizes indexing. You can see it in the code. For example, Lucene goes to a great deal of trouble to avoid allocating memory willy-nilly. Instead, pretty much everything there is done via object pools. This helps reduce the memory pressure when doing a lot of indexing and can save a lot of GC cycles.
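To make the pooling idea concrete, here is a minimal sketch in Python. This is not Lucene’s code: BufferPool, acquire and release are names I made up for illustration. The only point is that a buffer handed back to the pool gets reused instead of being allocated again.

```python
class BufferPool:
    """Toy pool of fixed-size byte buffers (hypothetical, not Lucene's API)."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self._free = []          # buffers waiting to be reused
        self.allocations = 0     # how many buffers were actually created

    def acquire(self):
        # Reuse a free buffer when possible, allocate only as a last resort.
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf):
        # Wipe the contents so the next user starts from a clean buffer.
        buf[:] = bytes(self.buffer_size)
        self._free.append(buf)


pool = BufferPool(buffer_size=4096)
for _ in range(1000):
    buf = pool.acquire()
    buf[0:5] = b"hello"      # pretend to do some indexing work
    pool.release(buf)
```

After the loop, pool.allocations is 1: a thousand units of work shared a single buffer, which is the whole trick.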
Another is the concept of multiple threads for indexing. A lot of Lucene is built around this idea, and it keeps a lot of per-thread state that is meant to ensure that you don’t have to deal with concurrency yourself. The idea is that you can take an IndexWriter and write to it concurrently, then call commit. A lot of the work involved in indexing is CPU intensive, so that makes a lot of sense, and Lucene nicely isolates you from all of that work. There is DocumentsWriterPerThread, so you can see really nice scaling effects as you throw more threads & hardware at the problem.
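Here is a rough sketch of the per-thread state idea, again in Python and again not Lucene’s implementation. ToyIndexWriter and index_range are hypothetical names. Each thread fills a private buffer without taking a lock on the hot path, and commit folds all the buffers into one index. In CPython the GIL means this won’t actually scale across cores, so treat it as showing the shape of the design only.

```python
import threading
from collections import defaultdict


class ToyIndexWriter:
    """Hypothetical writer: each thread fills its own buffer, commit() merges them."""

    def __init__(self):
        self._local = threading.local()
        self._buffers = []                 # one buffer per indexing thread
        self._lock = threading.Lock()      # taken only when a new thread shows up
        self.index = defaultdict(set)      # term -> ids of documents containing it

    def _buffer(self):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = []
            with self._lock:
                self._buffers.append(buf)
        return buf

    def add_document(self, doc_id, text):
        # The tokenizing work touches only this thread's private buffer.
        terms = set(text.lower().split())
        self._buffer().append((doc_id, terms))

    def commit(self):
        # Call after the indexing threads are done; folds every buffer into the index.
        for buf in self._buffers:
            for doc_id, terms in buf:
                for term in terms:
                    self.index[term].add(doc_id)
            buf.clear()


def index_range(writer, start, stop):
    for doc_id in range(start, stop):
        writer.add_document(doc_id, f"document number {doc_id}")


writer = ToyIndexWriter()
threads = [threading.Thread(target=index_range, args=(writer, i * 100, (i + 1) * 100))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
writer.commit()
```

The calling code never sees a lock, which is the part that matters: the writer owns the concurrency story, and the caller just adds documents and commits.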
Usually, when people start messing with Lucene, they do that by writing analyzers, and there you are sort of exposed to the memory constraints by being encouraged to use ReusableTokenStream, etc. It also has a nice pipeline architecture for doing the indexing work with filters.
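The pipeline part is easy to picture with generators. The sketch below is a toy in Python, not a Lucene analyzer: tokenize, lowercase_filter, stop_filter and analyze are illustrative names, and the stop word list is made up. It shows only the chaining of a tokenizer and filters, not the stream reuse.

```python
import re


def tokenize(text):
    # Source of the pipeline: yields raw word tokens one at a time.
    for match in re.finditer(r"\w+", text):
        yield match.group()


def lowercase_filter(tokens):
    for token in tokens:
        yield token.lower()


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "is", "of"})):
    for token in tokens:
        if token not in stop_words:
            yield token


def analyze(text):
    # Each stage wraps the previous one, so tokens stream through without
    # building intermediate lists.
    return stop_filter(lowercase_filter(tokenize(text)))


tokens = list(analyze("The Quick brown fox is a friend of the Dog"))
# tokens == ["quick", "brown", "fox", "friend", "dog"]
```

Adding another step (stemming, synonyms, whatever) is just one more wrapper in analyze, which is why this is such a common extension point.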
On the querying side, Lucene does a lot of work to ensure that things just work. It has a Boolean Model for searches, and a Vector Space Model for ranking. Writing your own Query classes is pretty easy too, once you understand how things work, and again, this is another common place for people to extend Lucene. But there is a lot going on behind the scenes. Lucene does a lot of caching on a per-segment basis, and that works quite nicely: since segments are immutable, a cached entry never goes stale, so you can get pretty good usage out of it.
That gives it a lot of its speed, and it means that over time, things are actually going to be faster, because more parts of the segments are in memory and cached.
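Why immutability makes caching so pleasant is easier to see in code. This is a toy in Python under my own assumptions, not how Lucene structures its caches: Segment, CachingSearcher, reopen and the segment names are all invented for the example. The cache is keyed by segment, so adding a new segment leaves everything cached for the old ones intact.

```python
class Segment:
    """Immutable chunk of the index: term -> sorted tuple of doc ids."""

    def __init__(self, name, postings):
        self.name = name
        self._postings = {t: tuple(sorted(ids)) for t, ids in postings.items()}

    def lookup(self, term):
        return self._postings.get(term, ())


class CachingSearcher:
    def __init__(self, segments):
        self.segments = list(segments)
        self._cache = {}       # (segment name, term) -> matching doc ids
        self.misses = 0

    def search(self, term):
        hits = []
        for seg in self.segments:
            key = (seg.name, term)
            if key not in self._cache:
                # Safe to cache forever: the segment never changes.
                self.misses += 1
                self._cache[key] = seg.lookup(term)
            hits.extend(self._cache[key])
        return hits

    def reopen(self, segments):
        # New segments start cold; entries for segments that are gone are dropped.
        self.segments = list(segments)
        live = {seg.name for seg in self.segments}
        self._cache = {k: v for k, v in self._cache.items() if k[0] in live}


seg_a = Segment("_a", {"lucene": [1, 3], "search": [2]})
seg_b = Segment("_b", {"lucene": [7]})
searcher = CachingSearcher([seg_a, seg_b])

first = searcher.search("lucene")    # computes both segments
second = searcher.search("lucene")   # served entirely from the cache

seg_c = Segment("_c", {"lucene": [9]})
searcher.reopen([seg_a, seg_b, seg_c])
third = searcher.search("lucene")    # only the new segment is a miss
```

The first search costs two misses, the second costs none, and after the reopen the third search costs exactly one, for the new segment. There is no invalidation logic anywhere, because nothing cached can ever become wrong.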
Finally, we have all of the other work that Lucene does. In practice, it means things like merging segments (hopefully in the background), and keeping the overall system humming along. Unfortunately, that is also one of the first places people start tinkering with when they run into perf problems. That is anything but trivial, and optimizing it is something that requires a lot of expertise and understanding of the specific scenario you have.
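For a feel of what a merge does, here is a deliberately naive sketch in Python. It is not Lucene’s merge policy: merge_segments, maybe_merge and the merge_factor knob are hypothetical, and a segment here is just a dict of term to doc ids. The toy policy waits until enough segments pile up, then merges the smallest ones, dropping deleted documents along the way.

```python
def merge_segments(segments, deleted=frozenset()):
    """Combine several term -> doc-id postings dicts into one, dropping deleted docs."""
    merged = {}
    for postings in segments:
        for term, doc_ids in postings.items():
            live = [d for d in doc_ids if d not in deleted]
            if live:
                merged.setdefault(term, []).extend(live)
    return {term: sorted(ids) for term, ids in merged.items()}


def maybe_merge(segments, merge_factor=3, deleted=frozenset()):
    """Toy policy: once merge_factor segments pile up, merge the smallest ones."""
    if len(segments) < merge_factor:
        return segments
    by_size = sorted(segments, key=lambda s: sum(len(ids) for ids in s.values()))
    to_merge, rest = by_size[:merge_factor], by_size[merge_factor:]
    return rest + [merge_segments(to_merge, deleted)]


segments = [
    {"lucene": [1, 2], "index": [2]},
    {"lucene": [5], "merge": [5]},
    {"index": [8], "merge": [9]},
    {"lucene": [10, 11, 12, 13], "index": [10, 11]},
]
segments = maybe_merge(segments, merge_factor=3, deleted=frozenset({2}))
```

Even in this toy you can see the knobs that make tuning hard: how many segments to wait for, which ones to pick, and how much I/O you are willing to pay at once. The real thing has far more of them, which is why changing them blindly tends to end badly.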
And on top of that, you have everything else that already works on top of Lucene, which is quite a lot.
As I said earlier, that is a very impressive piece of technology. That doesn’t mean that it doesn’t have its own set of problems, but that is something that I’ll discuss in detail in my next post.