Excerpts from the RavenDB Performance team reportFacets of information, Part I
Facets are annoying. Mostly because they tend to be used relatively rarely, but when they are used, they are used a lot, and in many interesting ways. Because they are fairly rarely used, let me explain what they are first.
We want to show the users all the recent phones. We can do this using the following query:
from phone in session.Query<Phone>()
orderby phone.ReleaseDate descending
select phone;
So far, so good. But we want to do better than just show a potentially big list to the user. So we start faceting the results. We pull some interesting pieces of data from the results, and allow the user to quickly narrow down the details. Typical facets for a phone query would be:
- Brand
- Screen Size
- Camera
- Cost
Now, we can certainly give the option to search for them individually, but then you end up with something like this:
That is neither friendly nor very useful. so UX designers came up with this idea, instead:
This gives the user a much easier way to look at the large data sets, and quickly narrow down the search to just the things they want. It also gives the user some indication about the actual data. I can see how changing the facet selection would likely change the number of results I have. This is a great way to offer data exploration capabilities. And in any kind of data repository, this is crucial. Shopping sites are a natural choice, but content/document management systems, job sites, information centers are all commonly making use of this feature.
I’m sorry for the long intro, but I feel that this feature needs to be placed in context, to explain its history and the various iterations it has gone through.
We first got this feature in Sep 2011. It was a pull request from Matt Warren, apparently based on some code that I spiked. I intentionally goes there, because that is the first time we see this code, and it was implemented in the simplest possible manner that could possibly work, to showcase this feature and to make sure that it works end to end. The underlying implementation looked something like this:
private void HandleTermsFacet(string index, Facet facet, IndexQuery indexQuery, IndexSearcher currentIndexSearcher, Dictionary<string, IEnumerable<FacetValue>> results) { var terms = database.ExecuteGetTermsQuery(index, facet.Name,null, database.Configuration.MaxPageSize); var termResults = new List<FacetValue>(); foreach (var term in terms) { var baseQuery = database.IndexStorage.GetLuceneQuery(index, indexQuery); var termQuery = new TermQuery(new Term(facet.Name, term)); var joinedQuery = new BooleanQuery(); joinedQuery.Add(baseQuery, BooleanClause.Occur.MUST); joinedQuery.Add(termQuery, BooleanClause.Occur.MUST); var topDocs = currentIndexSearcher.Search(joinedQuery, 1); if(topDocs.totalHits > 0) termResults.Add(new FacetValue { Count = topDocs.totalHits, Range = term }); } results[facet.Name] = termResults; }
In other words, the overall logic was something like this:
- Get a faceted query, with a list of facets.
- For each facet:
- Get all the terms for the field (if this is Operating System, then that would be: Android, iOS, BlackBerry, Windows Phone)
- For each of those terms, executed the original query and add a clause that limit the results to the current term and value.
As you can imagine, this has some.. performance issues. The more facets we had, and the more terms per facets, the costlier this got. Surprisingly enough, this code lasted almost unmodified for about 10 months, and the next major change was simply to process all of this in parallel.
Shortly thereafter, we have more changes. Moving from this trivial model to a much more efficient one, where we are executing the query once (PR from Michael Weber, by the way) per facet, instead of once per facet/term. Dynamic aggregation in RavenDB is also done through facets, and that added a much more common code path, so more changes started happening to the facets code. In particular, we started handling caching, we ended up reducing the time to run by a lot by restructuring how we were handling things so we would only executed a single query, then use low level index access to get all the other information we needed in an optimized fashion.
We got more customer usage sample, from the guys who had 70,000 facet queries to the people who wanted facets over multiple millions of results. All of that meant that we put a lot of thought into how we can reduce the performance of faceted queries, and over time, things because more and more complex, but much faster. In particular, by ordering the sequence of operations for computing the query, adding prefetching support and in general greatly reduced the amount of work we needed to do to just get the query working.
The end result was that the facet code looked something like this:
IndexedTerms.ReadEntriesForFields(currentState, fieldsToRead, allCollector.Documents, (term, docId) => { List<Facet> facets; if (!sortedFacets.TryGetValue(term.Field, out facets)) return; foreach (var facet in facets) { switch (facet.Mode) { case FacetMode.Default: var facetValues = facetsByName.GetOrAdd(facet.DisplayName); FacetValue existing; if (facetValues.TryGetValue(term.Text, out existing) == false) { existing = new FacetValue { Range = GetRangeName(term) }; facetValues[term.Text] = existing; } ApplyFacetValueHit(existing, facet, docId, null, currentState, fieldsCrc); break; case FacetMode.Ranges: // removed for brevity break; default: throw new ArgumentOutOfRangeException(); } } });
});
There is a lot of stuff going on around this code, but this is where we actually computed the facets. Note that this basically calls the lambda once per every matching term for the specified fields for every document in the query.
And the performance was pretty good for our test cases, we run this through several customers who were heavy facet users, and they saw a marked improvement in performance of queries. But we recently got a customer that mentioned in passing that facet queries were expensive. As in, 5 – 10 seconds range per query. That raised some red flags, we consider requests that takes more than 100ms to be suspicious, so we asked for more information, and we realized that he was using facets slightly differently from most of our customers.
Usually, facets are used to… well, facet the data on a search. Typical performance issues with facets in the past were with large number of facets on a search result (I wasn’t kidding with the 70,000 facets). Because most of the customer feedback we got was around that, we focused on this issue primarily, to the point were even stuff like that was well behaving. This customer, however, was using facets not for searching, but as a way to manage the data itself. That meant that they were using a lot of facets on the entire data set.
So instead of having many facets on a medium size number of documents, we were using a small number of facets, but on the entire data set. Profiling this resulted in a few calls being highlighted:
- GetOrAdd
- TryGetValue
They are very cheap calls, on their own, but this piece of code was called literally tens of millions of times for this particular scenario. We spent some time with a profiler and started playing with the code, trying to see if we can remove the costly calls and optimize this code path.
The answers was… no. In fact, everything we did was in the range of 5% – 7% change (in both directions, mind).
This sucked, because going back to the user and saying: “that is the way it is” didn’t sit well with us. We got together, closed the shutters, turned off the light, and solely by the light of the monitors, we did a major hacking session. We completely changed the way we were handling facets, to avoid not just the problem with the calls to TryGetValue or GetOrAdd but to change direction to a much more efficient mode.
This post is getting pretty long, so I’ll discuss this in my next post. For now, can you suggest options for improvements?
More posts in "Excerpts from the RavenDB Performance team report" series:
- (20 Feb 2015) Optimizing Compare – The circle of life (a post-mortem)
- (18 Feb 2015) JSON & Structs in Voron
- (13 Feb 2015) Facets of information, Part II
- (12 Feb 2015) Facets of information, Part I
- (06 Feb 2015) Do you copy that?
- (05 Feb 2015) Optimizing Compare – Conclusions
- (04 Feb 2015) Comparing Branch Tables
- (03 Feb 2015) Optimizers, Assemble!
- (30 Jan 2015) Optimizing Compare, Don’t you shake that branch at me!
- (29 Jan 2015) Optimizing Memory Comparisons, size does matter
- (28 Jan 2015) Optimizing Memory Comparisons, Digging into the IL
- (27 Jan 2015) Optimizing Memory Comparisons
- (26 Jan 2015) Optimizing Memory Compare/Copy Costs
- (23 Jan 2015) Expensive headers, and cache effects
- (22 Jan 2015) The long tale of a lambda
- (21 Jan 2015) Dates take a lot of time
- (20 Jan 2015) Etags and evil code, part II
- (19 Jan 2015) Etags and evil code, Part I
- (16 Jan 2015) Voron vs. Esent
- (15 Jan 2015) Routing
Comments
I'm I correctly assuming you will use Map Reduce with the optimized Intersect method on Lucene indexes in order to speed up faceted search?
Google does it like this so it must be good technique.
Thanks for the helpful explanation of facets. That's one area of RavenDB that I've felt I've never had a good handle on, never quite understood the purpose of. This post helped.
Comment preview