Black box reverse engineering speculation
Terrance has pointed me to a really interesting feature in Solr, called Facets. After reading the documentation, I am going to try and guess how this is implemented, based on my understanding of how Lucene works.
But first, let me explain what Facets are. Facets are a way to break down a search result so that it gives the user more meaningful results. I haven’t looked at the code, and haven’t read any further than that single link, but I think that I can safely extrapolate from that. I mean, the worst that could happen is that I would look stupid.
Anyway, Lucene, the underpinning of Solr, is a document indexing engine. It has no ability to do any sort of aggregation, and I am pretty sure that Solr didn’t sneak in something relational when no one was looking. So how can it do this sort of thing?
Well, let us look at a simple example: ?q=camera&facet=true&facet.field=manu, which will give us the following results:
<!-- search results snipped -->
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>
Remember what we said about Lucene being an indexing engine? You can query the index itself very efficiently, and these sorts of results are something that Lucene can provide you instantly.
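To make that guess concrete, here is a minimal sketch of counting facet values straight off an inverted index. It is a toy in-memory index with made-up data, not Lucene's or Solr's actual API; the names build_index and facet_counts are hypothetical, and I am only assuming that the real thing works along these lines.

```python
from collections import defaultdict

# Toy documents: (doc id, searchable text, manufacturer). Made-up data.
DOCS = [
    (0, "digital camera", "Canon USA"),
    (1, "compact camera", "Sony"),
    (2, "camera bag", "Canon USA"),
    (3, "laptop", "Sony"),
    (4, "camera lens", "Nikon"),
]

def build_index(docs):
    """Map each (field, term) pair to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text, manu in docs:
        for word in text.split():
            index[("text", word)].add(doc_id)
        index[("manu", manu)].add(doc_id)
    return index

def facet_counts(index, hits, field):
    """For every term of `field`, count how many of the hits carry it."""
    counts = {}
    for (term_field, term), postings in index.items():
        if term_field == field:
            matching = len(postings & hits)
            if matching:
                counts[term] = matching
    # Highest count first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

index = build_index(DOCS)
hits = index[("text", "camera")]          # the equivalent of q=camera
print(facet_counts(index, hits, "manu"))  # the equivalent of facet.field=manu
```

The point is that no aggregation engine is needed: the index already knows which documents hold each manufacturer, so a facet count is just the size of an intersection with the search hits.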
Moreover, when we start talking about faceting on prices, the query looks something like this:
?q=camera&facet=true&facet.query=price:[* TO 100]&facet.query=price:[100 TO 200]&facet.query=price:[200 TO 300]&facet.query=price:[300 TO 400]&facet.query=price:[400 TO 500]&facet.query=price:[500 TO *]
It gets even nicer. If I had that problem (which I actually do, but that is a story for another day), I would solve it using multiple individual Lucene searches. Something like:
- type:camera –> get docs
- type:camera price:[* TO 100] –> but just get count
- type:camera price:[100 TO 200] –> but just get count
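The steps above can be sketched as one full search followed by a batch of count-only searches. This is a toy over an in-memory list with made-up data, not Solr's implementation; search, count and faceted_search are hypothetical names. The one thing I am not guessing at is that [a TO b] ranges are inclusive on both ends, so a camera priced at exactly 100 shows up in two buckets.

```python
# Toy catalogue: (type, price). Made-up data.
DOCS = [
    ("camera", 80), ("camera", 100), ("camera", 150),
    ("camera", 450), ("camera", 900), ("laptop", 120),
]

# (low, high) bounds; None stands for the open-ended * in the query syntax.
RANGES = [(None, 100), (100, 200), (200, 300),
          (300, 400), (400, 500), (500, None)]

def in_range(price, low, high):
    """Inclusive on both ends, like price:[low TO high]."""
    return (low is None or price >= low) and (high is None or price <= high)

def search(doc_type):
    """type:camera -> get docs."""
    return [doc for doc in DOCS if doc[0] == doc_type]

def count(doc_type, low, high):
    """type:camera price:[low TO high] -> but just get the count."""
    return sum(1 for kind, price in DOCS
               if kind == doc_type and in_range(price, low, high))

def bound(value):
    return "*" if value is None else str(value)

def faceted_search(doc_type):
    """Batch the main search with one count-only search per price range."""
    docs = search(doc_type)
    counts = {
        f"price:[{bound(low)} TO {bound(high)}]": count(doc_type, low, high)
        for low, high in RANGES
    }
    return docs, counts

docs, counts = faceted_search("camera")
for label, n in counts.items():
    print(label, n)
```

Each range costs one extra query, which is fine for six price buckets but would add up quickly for a field with thousands of distinct values.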
In essence, Solr functions as a query batching mechanism for Lucene, and then massages the data into a form that is easy for the front end to consume. That is quite impressive.
By doing this aggregation, Solr can provide some really impressive capabilities, on top of a really simple concept. I am certainly going to attempt something similar for Raven.
Of course, I may have headed in completely the wrong direction, in which case I am busy wiping egg off my face.
Comments
I think that linq2lucene can give you a better way to create Lucene queries out of the box.
Not sure how Solr does it, but facets are very easy to implement in Lucene.Net using BitArrays - www.devatwork.nl/.../faceted-search-and-drill-d...
Hi,
Solr has quite a few amazing features that take some of the basics of Lucene and really crank them up a notch or two. We used Solr extensively to deliver http://www.fancydressoutfitters.co.uk/ to drive all the search and faceted navigation. It's blisteringly fast and scales very well.
To integrate with .NET we used the wonderful SolrNet project:
http://code.google.com/p/solrnet/
which even has NHibernate integration to synchronise db changes to the Solr Index:
code.google.com/.../NHibernateIntegration
I also heartily recommend Mauricio's blog, which has a wealth of Solr / SolrNet info:
http://bugsquash.blogspot.com/search/label/solrnet
and if you really want to get into the guts of Solr the recently released "Solr - the enterprise search server" book is a godsend:
www.packtpub.com/solr-1-4-enterprise-search-server
Howard
You might also find it useful to check out the free complete reference guide to Solr offered by Lucid Imagination:
www.lucidimagination.com/.../Reference-Guide
Faceting is described here:
www.lucidimagination.com/.../CDRG_ch07_7.11
Be aware that the faceting feature in Solr is implemented at a very low level. Redoing the query with some filters could work well with a small set of facets, but if you have a big set, the performance hit could be noteworthy. I have made a port of the Solr algorithm to Lucene.Net, and it also suffers from some differences between the bitset collections in Java and .NET.
Ayende,
Sorry to take you off task :) I really do enjoy SOLR (see my blog shutupandcode.net). In the future I have hopes to integrate with it more completely.
As for performance I have loaded 5 million "records/documents" representing a wide array of "nouns" such as client, account, product and am effectively able to slice and dice this data on any number of attributes which gives me what is effectively MDX queries (ala a datawarehouse). Think .003 seconds timeline to perform the query and another .05 seconds to linq over documents returned.
Granted I had to use memcached to keep some of the frequently used data-structures in memory for quick lookup and retrieval (for example pricing all data for a particular facet). Still it is very effective and performant without using a DB as the main "query point" (thus no disk).
A database is still being used, but more at the file-system level.
Been using memcached and MSMQ serving the core business services allowing for potential of 100% uptime (if I had enough servers). The query points for dimensional hyperplanes are coming from SOLR.