Black box reverse engineering speculation

time to read 3 min | 548 words

Terrance has pointed me to some really interesting feature in Solr, called Facets. After reading the documentation, I am going to try and guess how this is implemented, based on my understanding of how Lucene works.

But first, let me explain what Facets are, Facets are a way to break down a search result in a way that would give the user more meaningful results. I haven’t looked at the code, and haven’t read any further than that single link, but i think that I can safely extrapolate from that. I mean, the worst case that could happen is that I would look stupid.

Anyway, Lucene, the underpinning of Solr, is a document indexing engine. It has no ability to do any sort of aggregation, and I am pretty sure that Solr didn’t sneak in something relational when no one was looking. So how can it do these sort of things?

Well, let us look at a simple example: ?q=camera&facet=true&facet.field=manu, which will give us the following results:

<!-- search results snipped -->
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>

Remember what we said about Lucene being an indexing engine? You can query the index itself very efficiently, and these sort of results are something that Lucene can provide you instantly.

More over, when we start talking about facets prices, which looks something like this;

?q=camera&facet=true&facet.query=price:[* TO 100]
    &facet.query=price:[100 TO 200];&facet.query=[price:200 TO 300]
    &facet.query=price:[300 TO 400];&facet.query=[price:400 TO 500]
    &facet.query=price:[500 TO *]

It gets even nicer. If I would have that problem (which I actually do, but that is a story for another day), I would resolve this using individual multiple Lucnene searches. Something like:

  • type:camera –> get docs
  • type:camera price:[* TO 100] –> but just get count
  • type:camera price:[100 TO 200] –> but just get count

In essence, Solr functions as a query batching mechanism to Lucene, and then message the data to a form that is easy to consume by the front end. That is quite impressive.

By doing this aggregation, Solr can provide some really impressive capabilities, on top of a really simple concept. I am certainly going to attempt something similar for Raven.

Of course, I may have headed in the completely wrong direction, in which case I am busy wiping egg of my face.