Ayende @ Rahien


Reviewing OSS Project: Whiteboard Chat–Unbounded Result Sets and Denial of Service Attacks

Originally posted at 3/8/2011

As a reminder, I am going over the problems that I found while reviewing the Whiteboard Chat project during one of my NHibernate courses. Here is the method:

[Transaction]
[Authorize]
[HttpPost]
public ActionResult GetLatestPost(int boardId, string lastPost)
{
    DateTime lastPostDateTime = DateTime.Parse(lastPost);
    
    IList<Post> posts = 
        _postRepository
        .FindAll(new GetPostLastestForBoardById(lastPostDateTime, boardId))
        .OrderBy(x=>x.Id).ToList();

    //update the latest known post
    string lastKnownPost = posts.Count > 0 ? 
        posts.Max(x => x.Time).ToString()
        : lastPost; //no updates
    
    Mapper.CreateMap<Post, PostViewModel>()
        .ForMember(dest => dest.Time, opt => opt.MapFrom(src => src.Time.ToString()))
        .ForMember(dest => dest.Owner, opt => opt.MapFrom(src => src.Owner.Name));

    UpdatePostViewModel update = new UpdatePostViewModel();
    update.Time = lastKnownPost; 
    Mapper.Map(posts, update.Posts);

    return Json(update);
}

In this post, I want to focus on a very common issue that I see over and over again. The problem is that people usually don't notice this sort of issue.

The problem is quite simple: there is no limit on the amount of information that we can request from this method. What this means is that we can send it 1900-01-01 as the date and force the application to load all the posts in the board.
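
Any authenticated user (the action is only behind [Authorize]) can trigger this with a trivial request. Something along these lines would do it; this is just a sketch, and the URL is my guess at the default MVC route for the action above, not something taken from the project:

// Hypothetical reproduction of the problem request; only the route is guessed,
// the form fields match the action's parameters (authentication cookie omitted).
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class UnboundedRequest
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["boardId"] = "1",
                ["lastPost"] = "1900-01-01"   // "give me every post since forever"
            });
            var response = await client.PostAsync("http://example.com/Board/GetLatestPost", form);
            Console.WriteLine(response.StatusCode);
        }
    }
}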

Even on a relatively busy board, we are talking about tens or hundreds of thousands of posts being loaded. That is going to put a lot of pressure on the database, on the server's memory, and on the amount of money that you'll pay at the end of the month for network bandwidth.

There is a reason why I strongly recommend always using a limit, especially in cases like this, where the design is practically shouting at you that the number of items can be very large.
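
To make it concrete, here is roughly what a capped version of that query could look like if you drop down to the NHibernate ISession. This is only a sketch; the property paths are assumptions on my part, and the real fix would go into GetPostLastestForBoardById (or into a FindAll overload that accepts a maximum), not into the controller:

// Requires the NHibernate and NHibernate.Criterion namespaces.
// MaxPostsPerRequest is a number I picked for illustration; the point is that
// the cap is enforced inside the query, not after the rows are already in memory.
const int MaxPostsPerRequest = 256;

IList<Post> posts = session.CreateCriteria<Post>()
    .Add(Restrictions.Eq("Board.Id", boardId))      // assumes Post has a Board reference
    .Add(Restrictions.Gt("Time", lastPostDateTime))
    .AddOrder(Order.Asc("Id"))
    .SetMaxResults(MaxPostsPerRequest)              // the server decides the ceiling
    .List<Post>();

Since the action already hands the newest timestamp back to the caller, a capped page simply means that the client catches up over a few polls instead of getting one enormous response.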

In the next post, we will analyze the SELECT N+1 issue that I found in this method (so far, my record is a 100% success rate in finding this type of issue in every application that I have reviewed)…

Comments

03/14/2011 02:13 PM by Dmitry

Is the Owner property lazy loaded?

03/14/2011 02:18 PM by jdn

"There is a reason why I strongly recommend to always use a limit"

And, once again, this is a bad recommendation. Getting all the posts in the board (or the equivalent in other apps) is quite often a legitimate use case.

Use limits when they make sense, sure. Blanket recommendations (or crippling a query engine silently) like this are bad.

03/14/2011 03:38 PM by Karg

@jdn The point is not that the app shouldn't be able to show all posts. It is that the service call shouldn't be returning them all back in a single response.

If you do have a legitimate need for all of the board's posts (and let's just assume that you do for the sake of argument) then there are better back end implementation details for getting that data to the client than a single response with all of the data.
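
For example (just a sketch, every name here is made up): let the client walk the board a page at a time with a cursor, instead of getting one giant response.

// Hypothetical paged endpoint; the cap lives on the server regardless of what the caller asks for.
// Assumes the Post -> PostViewModel map is already configured, and that Post.Id is an int.
public class PostPageViewModel
{
    public List<PostViewModel> Posts { get; set; }
    public int? NextAfterId { get; set; }   // null = nothing left to fetch
}

[Authorize]
[HttpGet]
public ActionResult GetPostsPage(int boardId, int? afterId, int pageSize = 100)
{
    pageSize = Math.Min(Math.Max(pageSize, 1), 500);

    // GetPostsForBoardAfterId is a made-up specification; ideally the page size
    // is applied inside the query as well, rather than in memory.
    List<Post> page = _postRepository
        .FindAll(new GetPostsForBoardAfterId(boardId, afterId))
        .OrderBy(x => x.Id)
        .Take(pageSize)
        .ToList();

    return Json(new PostPageViewModel
    {
        Posts = Mapper.Map<List<Post>, List<PostViewModel>>(page),
        NextAfterId = page.Count == pageSize ? (int?)page.Last().Id : null
    }, JsonRequestBehavior.AllowGet);
}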

03/14/2011 03:39 PM by Scooletz

@jdn

Look at Twitter and other really big services. You provide a "more" button, which can easily be clicked several times, rather than providing a "kill my db by querying for everything" button. I hope you don't have the export/import scenario in mind here ;)

03/14/2011 05:50 PM by Daniel

jdn - When would it be a legitimate use case?

Can you provide an example?

03/14/2011 06:40 PM by Dmitry

It would only be a legitimate case when you know the size of the collection in advance. The number of countries in the world or of employees in a company is not likely to increase dramatically one day.

03/14/2011 11:38 PM by jdn

@karg

What if I want them all in one call?

@Scooletz

I will grant you that Twitter is probably not a legitimate use case. That's rather the extreme case though.

@daniel

If I want all of the open orders for a trade group that I support, that might return 15 records, or it might return 100,000+ (that's an accurate range, btw, not making it up). Regardless, when I query for all open orders, I want ALL open orders. And not paged either (if I want them paged, I'll page them explicitly).

Ayende said 'always' and I think he means it; that's why he made RavenDB safe/crippled by default (he calls it one, I call it the other, I'll let you guess which... LOL). And he's been open about the fact that he did it for marketing reasons as well as technical ones.

And he's still wrong. YMMV.

03/15/2011 06:04 AM by Ayende Rahien

Jdn,

What are you going to do with 100,000+ orders?

03/15/2011 01:20 PM by jdn

I am going to process them and send them to a 3rd party vended application that expects them in one shot (I've never really thought about it, but I don't think they even have paging in their API).

03/15/2011 01:25 PM by Karg

@jdn

What if I want to make a blocking network call on my UI thread so that I lock up my UI? Wanting to do that would just make me a bad developer.

You seem to be stuck on the idea that since you have a desire to show a large amount of data to the user at once that it must be returned in a single service call.

Why is it that you specifically want it returned in one service call?

03/15/2011 01:57 PM by jdn

Ayende:

I can guarantee you that the 3rd party vendor expects it in one shot.

I am familiar with Udi's article and scenario, and it is irrelevant.

Which is part of the point. You don't know my scenario or the vendor, I do.

Now, if you want to ask me whether I think the vendor is doing things correctly, we may come to a different conclusion.

03/15/2011 01:59 PM by jdn

Karg:

Who said anything about showing a large amount of data?

It is true that Ayende's specific example is an ActionResult, but again, he said "always" and believes it (and built/crippled RavenDB around the concept).

03/15/2011 02:11 PM by jdn

Karg:

Sorry, missed the question.

To use my specific example, the end client wants everything to be sent to them in one shot. Since our systems are perfectly capable of handling 100,000 open orders in one shot, there is no reason not to get it in one call.

It is an interesting fact that even good developers appear not to consider the possibility of 'unbounded' result sets. I've discussed it with Ayende, and I don't dispute that even good developers write bad code.

It still isn't right to make code un-self-documenting.

If I write:

myCollection.Skip(472).Take(14393)

I mean, "skip 472 records and then take the next 14393". I do not mean, "skip 472 records and then take whatever Ayende thinks is the correct silent formerly poorly documented number of records that he thinks you should take because he doesn't want bad performance to reflect badly on RavenDB."

03/16/2011 11:07 PM by Luke

@jdn

1) He never said it had to be a silent limit. There's nothing really wrong with letting the caller specify the limit, as long as it's kept reasonable.

2) If you have service A that calls service B to get data, then sends it to service C (where service C needs it in one shot), why can't service B have limits? Service A can call it multiple times, aggregate, and then send to C. When you write service D that presents service B's results on screen, it's already paginated.
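
Roughly like this (a sketch; the service contracts are invented for the sake of the example):

// Service A pulls everything from B in bounded pages, then hands the full set to C in one shot.
// (uses System.Collections.Generic)
public class OpenOrder { /* whatever the order looks like */ }

public class OrderPage
{
    public List<OpenOrder> Orders { get; set; }
    public long? NextAfterId { get; set; }   // null when B has nothing more
}

public interface IOrderServiceB   // hypothetical client for service B
{
    OrderPage GetOpenOrdersPage(string tradeGroup, long? afterId, int pageSize);
}

public static List<OpenOrder> FetchAllOpenOrders(IOrderServiceB serviceB, string tradeGroup)
{
    var all = new List<OpenOrder>();
    long? after = null;
    do
    {
        OrderPage page = serviceB.GetOpenOrdersPage(tradeGroup, after, pageSize: 1000);
        all.AddRange(page.Orders);
        after = page.NextAfterId;
    } while (after != null);
    return all;   // service C still gets its single call with everything in it
}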

03/17/2011 02:35 AM by Hendry Luk

I am with jdn on this case. If the 3rd party vendor requires the whole set of records anyway, how does splitting the request into multiple paginated calls (as a few people have suggested above) make it any cheaper? You're still querying for the same set of data; splitting it into multiple round trips does nothing but make it more costly. The laws of physics ensure that.

This is a perfect example of enforcing blanket guidance (i.e. pagination) just for the sake of it, backed by no legitimate reason, where you actually end up causing the exact problem the guidance is meant to overcome.

03/17/2011 07:32 AM by Ayende Rahien

Hendry,

It means that Service B doesn't need to worry about Out Of Memory exceptions, for one thing.

Since Service A is explicitly doing something out of the ordinary, you take care of that in that one place only, and you don't have to worry about it in multiple systems.

03/17/2011 10:54 PM by Hendry Luk

I thought we already had streaming of large data in (web) services? And similarly in NH. An unbounded query might not necessarily be all bad, but holding unbounded data in memory definitely is.
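
Something along these lines would keep the read effectively unbounded while memory stays flat (just a sketch, assuming an int Id and a Board reference on Post; needs the NHibernate and NHibernate.Criterion namespaces):

// Pulls bounded pages under the covers and yields posts one at a time,
// so the caller can stream them out without ever holding the whole set.
public static IEnumerable<Post> StreamPosts(ISession session, int boardId, int pageSize = 500)
{
    int lastId = 0;
    while (true)
    {
        IList<Post> batch = session.CreateCriteria<Post>()
            .Add(Restrictions.Eq("Board.Id", boardId))
            .Add(Restrictions.Gt("Id", lastId))
            .AddOrder(Order.Asc("Id"))
            .SetMaxResults(pageSize)
            .List<Post>();

        if (batch.Count == 0)
            yield break;

        foreach (Post post in batch)
            yield return post;

        lastId = batch[batch.Count - 1].Id;
        session.Clear();   // drop the first-level cache so it doesn't grow with the stream
    }
}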

03/18/2011 08:21 AM by Nick Aceves

@jdn

So you're saying that because you happen to have an (allegedly) legitimate case where you need ALL of the data, it's a bad idea in general to limit it by default?

I don't know about you, but I would much rather have a system that, by default, caters to the 99% case (i.e., when requesting all of the data at once is a mistake) and requires some minor tweaking to make that 1% case work when I'm really sure I need to shoot myself in the foot.

In RavenDB you can override the max page size on the server. Problem solved.

In the case of this post, we're talking about a UI. I don't care what business function your app performs: displaying 100,000 records all at once to a user, without paging, is a bad idea for a whole host of reasons.

03/20/2011 02:01 AM by jdn

@Nick

No one is arguing that you should pull 100,000 records to display in a UI.

I completely disagree that it is a 99% case that it is a mistake to request all data at once. My (allegedly) legitimate case is a very legitimate case, and I can come up with many more.

When I write code that pulls data, it is up to me to decide how much data is going to be pulled and whether it needs to be paged. I absolutely don't want some silent global variable deciding that for me.

It is unfortunate that too many developers are too lazy to figure out the performance impact of the code that they write. Treating the symptom by crippling by default is an anti-practice, not a best practice.

Comments have been closed on this topic.