What am I so hung up on Unbounded Result Set?
Originally posted at 3/10/2011
You might have noticed that I am routinely pointing out that there are issues, drastic issues, with any piece of code that issues a query without setting a limit to the number of results.
I got several replies saying that I am worrying too much about this issue. I don’t really understand that sort of thinking. Leaving aside the possibility of literally killing our application (as a result of Out of Memory Exception), unbounded result sets are dangerous. The main problem is that they aren’t just a theoretical problem, it is a problem that happens with regular intervals in production systems.
The real issue is that it is not just System Down issues, this might be the best scenario, actually. In most production deployment, you are actually paying for the amount of data that you are passing around. When you start dealing with unbounded result set, you are literally writing an open check and handing it to strangers.
I don’t know many people who can do something like this with equanimity.
It doesn’t take a lot to get to the point where this sort of thing really hurts. Moreover, there are truly very few cases where you actually need to have access to the entire result set. For the most part, when I see developers doing that, it is usually out of sheer laziness.
Comments
One place I am using an unbounded result set is for populating drop-down lists, e.g. I have a search filter that can be filtered by organisation and I am fetching all the organisations from the database. Is the recommended way to handle this a text box with auto-complete or is an unbounded result set okay while the system is still small?
Oren, in which situation would you really call it "dangerous"? Isn't it YAGNI if I always limit the results, just because I load the user-table of a small web-app?
The other thing is, that if one limits the result set, then he also wants to implement a paging mechanism into his application. Otherwise it would be sarcastic. Pretty much overkill i think.
Paul,
Think about the scenario from the user perspective, if you have a drop down list with 75 items, it is unusable from a UX perspective.
An auto complete / search is much better
Daniel,
Think about something like Azure, where you are actually charged for those sort of things. You just gave ANONYMOUS_USER a credit line on your account.
And the main problem is that what you assume to be small won't always be small
Very good point. I was reading this post blog.serverfault.com/.../views-of-the-same-prob... which comes at the problem from a different angle.
In this situation a large result set, ~3000 rows, on the StackOverflow site. Now this query was bounded but the boundary was large. During operation it saturated the webs servers network connection. The post has a good explanation of how transferring large amounts of data is compounded by the way the data is broken apart for transmission. In this case it was the TCP packet size which compounded the problem. Any unbounded result set will suffer the same potential problems.
The post is through and includes some good links to tips which can help find the queries which are returning large datasets.
This isn't always an issue.
What if the data table is only editable by admins and is a system setting that would normally have 5-10 values? If they edit it obviously it would use more bandwidth but you would normally need all the possible values and not just a subset of them...
My main concern with these unbounded result sets is not so much the exploitation..
Having someone exploit that usually means someone actually cares about your app (and 99% of the people don't)..
My main beef with that is that it's usually the kind of thing that comes back to bite me after 2 years in production. With users calling you telling you "the system is slow as hell and we have no clue why" ..
Throw in a little N+1 somewhere and your system performance is degrading gradually over time as the system grows.. Although none of the queries are a problem, the size of data you are pulling around will make the system crawl..
And I hate debugging that kind of problems ;)
Thank god I have NHProf, so at least my unbounded result sets that are in the App are intentional
Configurator,
Then put a limit of 50 on it.
That way, if it ever grows too big, it is a bug in the program, and not a crashing emergency fix issue
"My main beef with that is that it's usually the kind of thing that comes back to bite me after 2 years in production"
^^^^ THIS.
While most of my problems with unbounded result sets are more philosophical in nature, I have been bitten TOO MANY TIMES by production code somebody else wrote that has this problem. I cannot count on two hands and two feet how many times this has come up for me even in the past year. And some of this code was written by "good" programmers... good programmers who didn't think ahead or thought "YAGNI".
The worst one actually brought a production server to its knees because someone requested did tried to do a search for "everything since inception" in one of our legacy applications. It OOMed the w3wp.exe instance that was running the app, crashing several other applications and causing the entire server's performance to go off a cliff. That led to me having to dig through logs, dust off my WinDBG skills, and spend hours figuring out what the cause was. All that instead of actually innovating and solving real business problems.
I'm hear to say that, from my experience, You ARE Going to Need It.
Wow, and my spelling and grammar both totally suck at 1:30 in the morning :)
@Nick Aceves
good and funny too.
YAGNI is a great concept. But people also use it as an excuse to be lazy. You build up loads of technical debt and then forget what you owe. And then you pay back the debt and the time to find out it takes "who" you owe the debt too.
I've always seen this as more of a library issue, back when I had the energy bother, I'd wrap all NHibernate calls to force a maxresults if the user hadn't attempted to set one themselves (with the option of limiting to -1 for unbounded.)
I agree that it's something that always has to be set, but I don't really like it when my libraries set their default behavior to something outright dangerous.
I get this for free by using Rhino ESB. It prevents you from submitting a message with a collection with more than 256 items in it. And the message always contains a business key of some kind. So no one can POST an unbounded result set. If I make sure all my GETs use the business key when querying I can be fairly certain an actual unbounded result set won't happen.
Joao,
That is because the same set of reasoning guided me when I built it
How do you deal with one to many association collections? Those are basically unbounded result sets too.
I think NHibernate has some method for setting a bound on the collection. What happens when such a "bounded" collection only loads a subset of the association and the association is subsequently modified? Do the items that were not loaded get deleted from the database?
Ayende,
The problem is that we can't always figure out the good boundaries in the data access layer : for instance I have an online booking system for hotel.
I select the first 10 hotels filling the user's request
I select theirs 20 first rooms
I select the 40 pricing strategies
I select the 100 first available dates
-> I get 0 results because the first 10 hotels want only 7 days stay and the user asked for 1 day.
Problem :
an hotel with availability might not be shown
a room with availability might not be shown
an available date might not be shown
Maybe it's a very stupid question, but how do you deal with that kind of problems ? (I don't think here the problematic is linked to nhibernate or any other ORM)
Remi,
Start with selecting the hotels, then move from there. It is actually something that I would push into a service where they can answer those results much easier by building a specialized service just for that.
You mean that I should select my hotels more finely (ie : select only those who have available rooms at the right date) or that I should not apply boundaries to my request ? Frankly I didn't get your answer.
Remi,
I'll have a full blog post about the topic.
But yes, the first part is to select only the ones who are actually available. Then you can fetch the rest of the required information.
The second part is to make this fast, you do that by creating a service dedicate to answering availability questions, and optimizing that in isolation.
OK, looking forward to your article.
But the thing is, the business rule to say that an hotel is available might be really hard to explain with only SQL (or this might slow down the process), or even to cache in a temporary table is too complex. So I can't do a query that'll return 50 available hotel.
Comment preview