Death by 70,000 facets
Just sit right back and you'll hear a tale, a tale of a fateful bug. That started from a simple request, about a feature that was just a bit too snug.
Okay, leaving aside my attempts at humor, this story is about a customer reporting an issue: “Most of the time we have RavenDB running really fast, but sometimes we have high latency requests.”
After a while, we managed to narrow it down to the following scenario:
- We have multiple concurrent requests.
- Those requests contain a Lazy request that has a facet query.
- The concurrent requests appear to all halt and then complete together.
In other words, it looked like all of those requests were waiting on a lock, and when it was released, all of them were free to return. That made sense: there is a cache lock in the facet code that should behave in exactly this manner. But when we looked at it, we could see that it didn’t really behave in the way we expected it to.
Eventually we got to test this out on the client’s data, and that is when we were able to pinpoint the issue.
Usually, you have facets like this:
The one on the left is from searching Amazon for HD, the one on the right is from searching Amazon for TV.
In RavenDB, you typically express this sort of query using:
session.Query<Product>().Search("Search", query).ToFacets(new List<Facet> { new Facet { Name = "Brand" } });
And we expect the number of facets in a query to be on the order of a few dozen.
However, the client in question did something a bit different. I think this is because they brought the system over from a relational database. Each product in the system had a list of facets associated with it. It looked something like this:
"Facets": [13124, 87324, 32812, 65743]
Obviously, this means that this product belongs to the “2,000 – 3,500” price range facet (electronics), the “Red” color facet (electronics, handheld), etc.
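To make the shape of the data concrete, here is a minimal sketch of what such a product might look like as a C# class. The class and property names are illustrative, not the customer’s actual model:

using System.Collections.Generic;

public class Product
{
    public string Id { get; set; }
    public string Name { get; set; }

    // Each entry is the id of a facet the product belongs to, e.g. 13124
    // might be the "2,000 - 3,500" price range facet for electronics.
    public List<int> Facets { get; set; }
}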
In total, we had over 70,000 facets in the database, and that is just something we never really expected. Because we didn’t expect it, we reacted… poorly when we had to deal with it. With that many distinct facets, there wasn’t enough sharing between queries to ever get any benefit from the cache, so we always had to do the work anyway; it was effectively worse than having no cache at all. And because we took locks on the cache to make sure we didn’t get into a giant mess… well, you can figure out how it went from there.
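The problematic pattern looked roughly like the sketch below. This is heavily simplified and not the actual RavenDB code, but it shows the failure mode: every request takes the same lock, checks a cache that almost never has the entry it needs, and then does the expensive work while still holding the lock, so concurrent requests simply queue up behind one another.

using System;
using System.Collections.Generic;

public class SimpleLockingCache<TKey, TValue>
{
    private readonly object _sync = new object();
    private readonly Dictionary<TKey, TValue> _cache = new Dictionary<TKey, TValue>();

    public TValue GetOrCompute(TKey key, Func<TValue> compute)
    {
        lock (_sync)
        {
            TValue value;
            if (_cache.TryGetValue(key, out value))
                return value;

            // With 70,000+ facets, keys rarely repeat, so this expensive
            // branch runs on almost every request while everyone else waits.
            value = compute();
            _cache[key] = value;
            return value;
        }
    }
}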
The fix was to devolve the code into a simpler approach. Instead of trying to be smart and figure out exactly what we needed to compute for this query, we are aggressive and load everything we might need up front. All subsequent requests then incur no wait time, because the data is already there. The code became much simpler.
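Here is a minimal sketch of that approach, assuming a hypothetical loadAllFacetValues delegate that reads everything relevant in one pass; this is not the actual RavenDB implementation. The expensive work happens at most once, and every later request just reads the already materialized data:

using System;
using System.Collections.Generic;

// Placeholder for whatever per-facet data the query actually needs.
public class FacetValues
{
    public Dictionary<string, int> CountsPerValue { get; set; }
}

public class EagerFacetStore
{
    // Lazy<T> with thread safety enabled runs the loader exactly once;
    // after that, readers get the cached dictionary without blocking.
    private readonly Lazy<Dictionary<int, FacetValues>> _allFacets;

    public EagerFacetStore(Func<Dictionary<int, FacetValues>> loadAllFacetValues)
    {
        _allFacets = new Lazy<Dictionary<int, FacetValues>>(loadAllFacetValues, isThreadSafe: true);
    }

    public FacetValues Get(int facetId)
    {
        FacetValues values;
        return _allFacets.Value.TryGetValue(facetId, out values) ? values : null;
    }
}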
Oh, and we got it deployed to production and saw the average request time drop by a factor of four, with no more sudden stalls when requests piled up.