
Death by 70,000 facets

Just sit right back and you'll hear a tale, a tale of a fateful bug. It started from a simple request, about a feature that was just a bit too snug.

Okay, leaving aside my attempts at humor, this story is about a customer reporting an issue: “Most of the time we have RavenDB running really fast, but sometimes we have high latency requests.”

After a while, we managed to narrow it down to the following scenario:

  • We have multiple concurrent requests.
  • Those requests contain a Lazy request that has a facet query.
  • The concurrent requests appear to all halt, then complete together.

In other words, it looked like all those requests were waiting on a lock, and when it was released, all of them were free to return. This made sense: there is a cache lock in the facet code that should behave in this manner. But when we looked at it, we could see that it didn’t really behave the way we expected it to.

Eventually we got to test this out on the client’s data, and that is when we were able to pinpoint the issue.

Usually, you have facets like this:

The one on the left is from searching Amazon for HD, the one on the right is from searching Amazon for TV.

[Screenshots: Amazon facet sidebars for the “HD” and “TV” searches]

In RavenDB, you typically express this sort of query using:

session.Query<Product>().Search("Search", query).ToFacets(new[] { new Facet { Name = "Brand" } });

And we expect the number of facets that you have in a query to be on the order of a few dozen.

However, the client in question did something a bit different, I think because they brought the system over from a relational database. Each product in the system had a list of facets associated with it. It looked something like this:

"Facets": [13124, 87324, 32812, 65743]

Obviously, this means that this product belongs to the “2,000 – 3,500” price range facet (electronics), the “Red” color facet (electronics, handheld), etc.
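
To make that shape concrete, here is a minimal sketch of such a model. The class and property names are mine, not the customer’s:

    using System.Collections.Generic;

    // Hypothetical model mirroring the customer's relational-style design.
    public class Product
    {
        public string Id { get; set; }
        public string Name { get; set; }
        public List<int> Facets { get; set; }  // opaque facet ids, e.g. 13124
    }

    // Each id points at a facet definition that lives elsewhere.
    public class FacetDefinition
    {
        public int Id { get; set; }            // 13124
        public string Category { get; set; }   // "Electronics"
        public string Name { get; set; }       // "2,000 - 3,500" price range
    }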

In total, we had over 70,000 facets in the database, and that is just something we never really expected. Because we didn’t expect it, we reacted… poorly when we had to deal with it. That pattern of usage meant we were effectively worse off than having no cache at all: we would always have to do the work, and never really gain any benefit from it, because there wasn’t enough sharing between queries to actually trigger the benefits of the cache. And because we took locks on the cache to ensure that we didn’t get into a giant mess… well, you can figure out how it went from there.
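
To illustrate the failure mode, here is a sketch of the general pattern, not RavenDB’s actual internals. A cache keyed per query and guarded by a single lock is worse than no cache when the keys almost never repeat: every request still does the full computation, and it does so while every other request waits on the lock:

    using System;
    using System.Collections.Generic;

    // Illustrative only: a per-query facet cache behind one lock.
    // With 70,000 distinct facets, cache keys rarely repeat, so nearly
    // every request misses, computes under the lock, and blocks the rest.
    public class NaiveFacetCache
    {
        private readonly object _sync = new object();
        private readonly Dictionary<string, long[]> _cache =
            new Dictionary<string, long[]>();

        public long[] GetFacetCounts(string cacheKey, Func<long[]> compute)
        {
            lock (_sync) // all concurrent requests serialize here
            {
                long[] counts;
                if (_cache.TryGetValue(cacheKey, out counts))
                    return counts; // almost never hit in this scenario

                counts = compute(); // expensive work, done holding the lock
                _cache[cacheKey] = counts;
                return counts;
            }
        }
    }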

The fix was to devolve the code into a simpler method. Instead of trying to be smart and figure out exactly what we needed to compute for a given query, we are now aggressive and load everything we need up front. All subsequent requests incur no wait time, because the data is already there. The code became much simpler.
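
A minimal sketch of that approach, assuming a per-index cache; the names and structure are mine, not the actual RavenDB code. The first request pays to load everything, and every subsequent request reads the already computed value without blocking:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;

    // Illustrative only: load all the facet data for an index once, up front.
    public class EagerFacetCache
    {
        private readonly ConcurrentDictionary<string, Lazy<Dictionary<string, long>>> _byIndex =
            new ConcurrentDictionary<string, Lazy<Dictionary<string, long>>>();

        public Dictionary<string, long> GetAllFacetCounts(
            string indexName, Func<Dictionary<string, long>> loadEverything)
        {
            // ConcurrentDictionary + Lazy guarantees loadEverything runs
            // exactly once per index, even under concurrent first requests.
            var lazy = _byIndex.GetOrAdd(indexName,
                _ => new Lazy<Dictionary<string, long>>(loadEverything));
            return lazy.Value;
        }
    }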

Oh, and we got it deployed to production and saw a 75% decrease in the average request time, and no more sudden waits when we had requests piling up.


Comments

Rafal, 02/10/2014 10:11 AM

A 400% decrease in response time?

Paul Turner, 02/10/2014 10:16 AM

I must be really stupid, even for a Monday: I can't understand what you mean by a "400% decrease" in request time. I read it as if a 50% decrease would mean "half the time". Need more Red Bull.

I assume the method which was doing poorly in this scenario had been written with some kind of optimization around the "in the order of a few dozens" scenario which didn't work for this 70,000 facets scenario. Did your simplification of the behaviour negatively or positively affect the "a few dozens" scenario? I'm curious whether it was straight-up enhancement or a compromise.

hilton smith, 02/10/2014 10:54 AM

A 400% decrease just sounds weird. If you decrease by more than 100% you end up in the negative.

Tom Derisk, 02/10/2014 11:43 AM

it's a predictive system! LOL

Ayende Rahien, 02/10/2014 12:31 PM

Guys, I fixed the post. It was meant to be the reverse of a 400% increase.

hilton smith, 02/10/2014 12:52 PM

I figured that’s what you meant, but for some reason it just became a focal point of the article…

When a customer has a large number of facets like they do here, is it not worth restructuring the data in some way?

Ayende Rahien, 02/10/2014 12:59 PM

Hilton, if we can, sure. If we can’t, we see if there is anything we can do to help.

Ryan Heath, 02/10/2014 04:49 PM

Was this a case of premature optimization, since the code devolved into something simpler?

// Ryan

Ayende Rahien, 02/10/2014 07:44 PM

Ryan, pretty much. We tried to read less data, but it actually turned out that just reading it all was cheaper.

Andrew, 02/10/2014 09:45 PM

Which build did this happen in? Please include the relevant builds in posts like these, Oren.
