Excerpts from the RavenDB Performance team reportFacets of information, Part II
In my previous post, I outlined the performance problem with facets, in this one, I want to discuss how we solved it.
For the purpose of discussion, we’ll deal with the following data model:
{ "Id": "phones/1", "Unlocked": false, "Brand": "Apple", ... } { "Id": "phones/2", "Unlocked": true, "Brand": "LG", ... } { "Id": "phones/3", "Unlocked": false, "Brand": "Samsung", ... } { "Id": "phones/4", "Unlocked": true, "Brand": "Apple", ... } { "Id": "phones/5", "Unlocked": false, "Brand": "Samsung", ... }
In order to handle that, we have to understand how Lucene actually work behind the scenes. Everything with Lucene is based on the notion of a document id. The results of a query is a set of document ids, and the internal structures refer to document ids on a regular basis.
So, if the result of a query is actually a set of document ids (Int32), we can represent them as a simple HashSet<int>. So far, so good. Now, we want to do a facet by the brand. How is Lucene actually storing that information?
There are two data structures of interest. The TermsEnum, which looks like this:
- Brand:Apple – 2
- Brand:LG –1
- Brand:Samsung -2
This shows the number of documents per term. This is almost exactly what we would have wanted, except that this is global information, relevant for all the data. It our query is for unlocked phones only, we cannot use this data to answer the query.
For that, we need to use the TermsDocs, which looks like this:
- Brand:Apple – [1,4]
- Brand:LG – [2]
- Brand:Samsung – [3,5]
So we have the term, and the document ids that contained that term.
Note that this is a very high level view of how this actually works, and it ignores a lot of stuff that isn’t actually relevant for what we need to understand here.
The important bit is that we have good way to get all the document matches for a particular term. Because we can do that, it gives us an interesting option. Instead of trying to calculate things on a case by case basis, we changed things around. We used the information inside the Lucene index to create the following structure:
{ "Brand": { "Apple": [1,4], "LG": [2], "Samsung": [3,5] } }
There is caching and other obvious stuff also involved, naturally, but those aren’t important for the purpose of this talk.
Now, let us get back to our query and the required facets. We want:
Give me all the unlocked phones, and show me the facet for Brand.
A query in Lucene is going to return the following results:
[1,3,5]
Looking at the data like that, I think you can figure out where we are going with this. Here is the general algorithm that we have for answering facets.
var docsIds = ExecuteQuery(); var results = {}; foreach(var facet in facets) { var termsForFacet = terms[facet]; foreach(var term in termsForFacet){ results[facet] = CountIntersection(docIds, term.DoccumentsForTerm) } }
The actual cost for calculating the facet is down to a mere set intersection, so the end result is that the performance of facets in the scenarios we test had improved by more than an order of magnitude.
I’m skipping over a lot of stuff, obviously. With facets we deal with distinct queries, ranges, dynamic aggregation, etc. But we were able to shove all of them into this style of work, and the results certainly justify the work.
Oh, and the normal case that we worked so hard to optimize normally, we are seeing 50% improvement there as well, so I’m pretty happy.
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
Interesting to see how Facets is implemented under the covers. It explains why we need to make several calls to support a UI with facets when multiple filters are selected at the same time. When did this improvement go into the code base?
Piers, Why do you need to make multiple queries?
We are emulating the functionality you can see on many sites. For example:
http://www.johnlewis.com/electricals/televisions/all-tvs/c800005013
If you look at their use of facets, for facets in which you make a choice, the other choices in the same facet are still visible and their facet count does not go down. The counts in other facets do go down.
For example: given the facets "Screen Type", "Screen Size" and "Features", if I select "47 to 50" in the Screen Size, the "LED" count in the "Screen Type" facet drops from 93 to 19. But the "55 to 60" Screen Size count remains the same.
To get this type of functionality in Raven, I believe we need to make one faceted search with a WHERE ScreenSize >= 47 AND ScreenSize <= 50 to retrieve the facet counts for "Screen Type" and "Features", then a second faceted search without the WHERE clause to get the "Screen Size" counts.
If the user then clicks on "3D Ready" in the Features facet, the numbers drop in both the "Screen Type" and "Screen Size" facets, but the counts in the "Features" facet remain the same. This I believe requires 3 searches:
1) One for all the facets that haven’t yet got a choice selected. This would have a WHERE clause along the lines of WHERE ScreenSize >= 47 AND ScreenSize <= 5 AND 3DReady = true
2) One for just the "Screen Size" facet with WHERE 3DReady = true
3) One for just the "Features" facet with WHERE ScreenSize >= 47 AND ScreenSize <= 5
I guess the requirement is that every facet shows its counts taking into account the choices made in other facets, but ignoring the choices within itself.
I believe we can batch these up into a single call to the database by using .Lazily
We'd love to hear if there is a different way to get this functionality, but I suspect it is asking too much of a database to do this breakdown generically.
Piers
Interesting, that is a different from how we usually see facets used, usually when you select any sort of facet, it will also limit all the other ones.
Did you see the MultiFacet option?
https://github.com/ayende/ravendb/blob/master/Raven.Client.Lightweight/Connection/IDatabaseCommands.cs#L271
That looks interesting, but at the moment we are still on 2.5 and that looks like a 3.0 feature. We're looking to migrate soon so we could come back to that.
Long time ago (2007!) I worked with facets in lucene.net, This approach works its better when the query matches a big percentage of the index. I even have to port some code from Solr to efficiently count the number of set bits in an array because .net didn't have equivalent bitset cardinality method.
http://markmail.org/message/zrew4dimoktd6vex
Anyway, I suppose that you already do that, but if not a quick look to how solr handle this kinds of things could be good. Also in solr you can choose the algorithm to use when faceting, provided that you know what you want it's good to have the choice. https://wiki.apache.org/solr/SimpleFacetParameters#facet.method
Comment preview