PostmortemAccidentally quadratic indexing output
Tracking down a customer’s performance issue, we eventually tracked things down to a single document modification that would grind the entire server to a halt. The actual save was working fine, it was when indexing time came around that we saw the issues. The entire system would spike in terms of memory usage and disk I/O, but CPU utilization wasn’t too bad.
We finally tracked it down to a fairly big document. Let’s assume that we have the following document:
Note that this can be big. As in, multiple megabyte range in some cases, with thousands of reviews. The case we looked at, the document was over 5MB in size and had over 1,500 reviews.
That isn’t ideal, and RavenDB will issue an performance hint when dealing with such documents, but certainly workable.
The problem was with the index, which looked like this:
This index is also setup to store all the fields being indexed. Take a look at the index, and read it a few times. Can you see what the problem is?
This is a fanout index, which I’m not a fan of, but that is certainly something that we can live with. 1,500 results from a single index isn’t even in the top order of magnitude that we have seen. And yet this index will cause RavenDB to consume a lot of resources, even if we have just a single document to index.
What is going on here?
Here is the faulty issue:
Give it a moment to sink in, please.
We are indexing the entire document here, once for each of the reviews that you have in the index. When RavenDB encounters a complex value as part of the indexing process, it will index that as a JSON value. There are some scenarios that call for that, but in this case, what this meant is that we would, for each of the reviews in the document:
- Serialize the entire document to JSON
- Store that in the index
5MB times 1,500 reviews gives us a single document costing us nearly 8GB in storage space alone. And will allocate close to 100GB (!) of memory during its operation (won’t hold 100GB, just allocate it). Committing such an index to disk requires us to temporarily use about 22GB of RAM and force us to do a single durable write that exceed the 7GB mark. Then there is the background work to clean all of that.
The customer probably meant to index book_id, but got this by mistake, and then we ended up with extreme resource utilization every time that document was modified. Removing this line meant that indexing the document went from ~8GB to 2MB. That is three orders of magnitude difference.
We are going to be adding some additional performance hints to make it clear that something is wrong in such a scenario. We had a few notices, but it was hard to figure out exactly what was going on there.
More posts in "Postmortem" series:
- (27 Sep 2021) Partial RavenDB Cloud outage
- (23 Jul 2021) Accidentally quadratic indexing output
- (13 Jul 2021) The case of the pauses, the fault and the GC
When you say "performance hints" where/how do you output these? To the end user in log somewhere?
Clearly remember encountering the same issue although slightly different outcomes https://ayende.com/blog/191650-B/production-post-mortem-too-much-ram-and-the-unstable-cluster and would save a ton of time for both you guys and the Customer if Raven can do some measures to detect and warn about early on. Maybe if Raven sees a complex object it throws a warning out X many times for an index before going mute on the alerts screen.
RavenDB can raise alerts that point out several common issues.
For example, query without a limit clause, very large docs, etc. We'll add more details on that specific issue.
That isn't in the log (no one reads that), it is user visible.
Yes, that is a very similar situation. However, in this case, the issue was that the overall size was so high. The previous scenario was about the number of allocations causing an issue, which we solved. The issue here was the overall output size.
perhaps a solution for this would be, apart from a warning, is to index the book object, but recognize that object and subsequent indexing of each review merely point to the single already-indexed book.
That would just save on the serialization work, we'll still need to write the whole thing to disk.
Perhaps docs should be able to share attached documents, in that case the attachment only gets written once? Then this could be treated similarly