RavenDB Retrospective: Unbounded result sets
We spent some time recently looking into a lot of our old design decisions. Some of them make very little sense today (JSON vs. blittable is a good example), but they made perfect sense at the time and were essential to actually getting the product out.
Some of those design decisions, however, are still something that I very firmly believe in. This series of posts is going to explore those decisions, their background and how they played out in the real world. So, without further ado, let us talk about unbounded result sets.
The design of RavenDB was heavily influenced by my experience as That NHibernate Guy (I got started with NHibernate over a decade ago, if you can believe that), where I saw the same types of errors repeated over and over again. Then I read Release It! and suddenly discovered that I wasn't alone in fighting those kinds of demons. When I designed RavenDB, I set out explicitly to prevent as many of them as I possibly could.
One of the major issues that I wanted to address was Unbounded Result Sets. Simply put, this is when you have:
SELECT * FROM OrderLines WHERE OrderID = 1555
And you don’t realize that this order has three million line items (or, worse, that most of your orders have a few thousand line items each, so you are generating a lot of load on the database only to throw most of them away).
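The same trap is just as easy to fall into from application code. Here is a minimal C# sketch of the pattern (the OrderLine class, its OrderId property, and the open session are illustrative assumptions, not from the post):

// Assumption: 'session' is an open document/ORM session and OrderLine
// is an illustrative entity with an OrderId property.
var lines = session.Query<OrderLine>()
    .Where(line => line.OrderId == 1555)
    .ToList(); // no limit anywhere - if the order has three million
               // line items, all of them are pulled into memory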
In order to prevent this type of issue, RavenDB has the notion of mandatory page sizes (sketched in code after the list).
- On the client side, if you don’t specify a limit, we’ll implicitly add one (by default set to 128).
- On the server side, there is a database wide maximum page size (by default set to 1024). The server will trim all page sizes to the max if they are larger.
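In client code, the two limits play out roughly like this (a sketch; 128 and 1024 are the defaults mentioned above, and both are configurable):

// No Take() supplied: the client implicitly adds one, so this returns
// at most 128 results by default instead of the whole collection.
var implicitPage = session.Query<Order>().ToList();

// Explicit Take() supplied: honored up to the server-wide maximum
// page size (1024 by default); anything above it is trimmed down.
var explicitPage = session.Query<Order>().Take(500).ToList();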
I think that this is one of the more controversial decisions in RavenDB's design, and one that generated a lot of heated discussion. But I still think it is a good idea, because I have seen what happens when you don't do it. The arguments were mostly along the lines of “RavenDB should trust developers to know what they are doing.” One particularly irate guy called me while I was out shopping to complain about how I broke the sacred contract of LINQ, under which “queries should return everything by default, even if that is ten billion results.” I pointed out that this is actually configurable, and that he could set the default to any size he wanted, but apparently it is supposed to be a “shoot my own foot first, then think” kind of deal.
Even though I still think this is a really good idea, we have added some features over the years to make it easy for people to access the entire dataset when they need it. Streaming has been around since 2.5 or so, giving you a dedicated API for unbounded results. Streams were built to make it efficient to process large sets of data, and they allow both client and server to process the data in parallel, instead of batching huge responses on the server and then consuming ridiculous amounts of memory on the client before handing you the full result set. Instead, you get each result as soon as it arrives from the server, and you can process it and pass it onward.
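For reference, here is roughly what that looks like with the 3.x client's streaming API (a sketch, assuming an open session and an illustrative Order class):

// Stream the results instead of materializing them all at once.
var query = session.Query<Order>();
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        var order = enumerator.Current.Document;
        // Process each order as it arrives from the server; memory
        // usage stays flat because nothing is batched up client-side.
    }
}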
In 4.0, we are going to change the behavior of the paging limits as follows (see the sketch after the list):
- If you don’t specify a limit, we’ll supply a limit clause of 25 items. If there are more than 25 items, we’ll throw an exception (unless you have configured otherwise in the conventions).
- If you supply a limit explicitly, it will work as expected and page through the data.
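A sketch of how the two cases behave (25 is the default mentioned above; the exact exception and convention names are not specified here):

// Case 1 - no limit: behaves as if Take(25) was supplied, and throws
// if the query matches more than 25 results (unless the conventions
// were configured to allow it).
var firstFew = session.Query<Order>().ToList();

// Case 2 - explicit limit: works as expected and pages through the data.
int pageSize = 50, pageNumber = 2;
var page = session.Query<Order>()
    .Skip(pageNumber * pageSize)
    .Take(pageSize)
    .ToList();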
The idea is that we want to reduce the surprise for users by failing early, giving them experience to draw upon from the start. Another thing that we'll do is make sure that the operations folks can also change that limit, likely with an environment variable or something like that. If you need to modify the conventions on the fly, you usually have a hard time deploying a new version, and immediate action is needed.
In this manner, we can help users avoid expensive requests to the server, and they can be explicit about what they need to do.
More posts in "RavenDB Retrospective" series:
- (17 Oct 2016) The governors
- (14 Oct 2016) Explicit indexes & auto indexes
- (12 Oct 2016) BASE Indexes
- (28 Sep 2016) Unbounded result sets
Comments
Silent swallowing of the rest of the data is what bothered me for a long time. I think exceptions are a much better solution here, due to the fail-fast principle.
“RavenDB should trust developers to know what they are doing” :) yeah about that ....
Still, I would very much prefer a model where RavenDB only throws in Debug mode, in Release mode and on production servers it should just log a Warning. But that's just my preference.
Pop, There are problems here: a) no one ever looks at the logs; b) that is what you have the conventions for.
Ayende, it's true no one looks at logs. Period.
However, when you present the information elsewhere, like a dashboard (e.g., "Performance Alerts (5)", in red), then people start paying attention.
I know of an example where a company was gathering in excess of 300 GB of logs per week and did nothing with them for over 6 years. One day, when a simple dashboard was created with the Top 5 slowest queries, Top 5 slowest pages, Top 5 most common errors, pages taking over 5 seconds to load, etc., people started jumping out of their chairs to fix the issues, even though the issues had been there for years and no one had bothered checking.
Pop Catalin, We are actually doing that, but the problem is that this is stuff generated by the client; there isn't a dashboard that I can generate there.
So what's keeping sysadmins from setting the page limit to 999,999 records in order to eliminate the errors from production systems? That's what a typical sysadmin would do as soon as they receive complaints and find the "solution" online.
Pop Catalin, Absolutely nothing, but then they accept the responsibility for this setting.
@Pop Catalin, taking responsibility for it means that it will be marked in red (a "this is freaking unsafe, use at your own risk" kind of warning), and when a support engineer comes and looks at the unsafe endpoint because you are saying RavenDB is consuming X GB of memory, he can say: "Solved, you have a very unsafe parameter there; it is working as designed."
Pop Catalin, are you really saying you prefer auto-truncation of result sets (or, as Vlad calls it, "silent swallowing") as the right thing to do in production? Honest question.
@njy, I prefer just a warning in production, and neither truncation nor raising errors is acceptable.
If there's an honest mistake somewhere, I prefer a spike in resource usage or a slowdown over, let's say, incomplete data leading to wrong results, or a service disruption due to errors. There are far worse things than a slow-running query from a business perspective: a) not running at all, b) giving wrong results.
Pop Catalin oh, I see, so you meant a warning without truncation, ok.
I just don't get why the fix should be to configure the server limit to a higher value. It's so easy to do paging in RavenDB; all our web services just provide paging, full stop! Also, on a UI level it just doesn't make sense to have too many items. No unbounded .ToList() crap.
I think Pop Catalin hit the nail on the head - "There are far worse things than a slow running query from a business perspective: a) not running at all, b) giving wrong results."
Wrong results end up costing (a) someone noticing they're wrong, (b) troubleshooting time, and (c) fixing time. All on an emergency schedule.
Not running at all ends up costing (c) fixing time. Again, on an emergency schedule. Better, but you're still forcing emergencies.
A Debug.Assert and production logging seems the best compromise. Let customers fix their slow queries according to their own business needs, not your support costs.
Or just change the API, and require an explicit limit. Principle of Least Surprise and all that....
Mark, A debug assert won't actually do much, and no one is looking at the logs.
Take a look at this issue: http://indexoutofrange.com/Debugging-high-memory-usage.Part-2-DotNetMemoryProfiler/
That is the kind of thing that takes a very long time to figure out, and it is a silly oversight that is causing this.
That is the whole reason why we have those limits in place.
And it isn't about trying to save my support costs (take into account that if this were a support issue, it would be trivial); I'm charging for support, so easy support issues are basically profit.
The reason to prevent it is to keep users from getting into problems.
Mark, The most commonly used API is LINQ, which doesn't require it. Also, the default approach by users is to put in int.MaxValue.
Apologies for going long, but this is actually something that's bothered me for quite a while with RavenDB. I love the ideas (and engineering) behind it, but I could never quite get on board with some of the more opinionated (and controversial) design choices.
For me, the relevant expectations for good .NET libraries are:
Previously, unbounded results in RavenDB would break all three of those; the change to throwing an exception makes it better, but still breaks the latter two.
No one proactively looks at logs, but they do when troubleshooting. Or at least they should, well before they start doing offline memory dumps.
I see your point with the Debug.Assert; with the change to throw an exception at 25, that's no longer really relevant unless you also require an explicit limit (and then could be used for backwards compatibility).
Unless you're aiming for portability, can't RavenDB require it regardless? I've never built a LINQ provider, but you obviously have enough info to know if a limit was supplied. Sure, it's a little leaky - but pretty much all the LINQ providers are (I mean, you're competing with EF... it's not like you can do much worse). Even for portability, an explicit flag to
DisableUnboundedResultChecking
for those porting apps would be explicit enough, I think. As for int.MaxValue... just throw on that too. Hell, set a max size of 25 or whatever and throw on anything larger (regardless of actual result length). That again moves the problem to dev, where it's cheap to fix.
I haven't had this issue in production, but my thought process would be that the "fix" everyone will do is (1) increase the limit in config/code/whatever, (2) create a backlog item to do paging the right way, and (3) move on to work that adds business value. Maybe, just maybe, the business runs out of valuable ideas and the devs go back to actually working on that backlog item... but hopefully (for the sake of their jobs) not. As both a developer and a manager, I have to admit that's what I would probably do.
All that said - I applaud the good fight, and think the change to throwing an exception is a welcome one. I don't like the fact that it only occurs at the magic 25 boundary, but I've never been a big convention-over-configuration guy either. I prefer my limits to kick me in the face while writing code, not to wake me up at 2am wondering what magic config setting I need to tweak now and looking like an amateur for releasing code that's so fragile as to break at 26 records.
Mark, For common queries, requiring Take() will seriously disrupt the workflow.
Moving this configuration to the server side also gives admins a "quick fix" and devs a way to catch this early (set the implicit take limit to 1).