
If you are way off in the deep end, there is only so much that tooling can do for you

I get a lot of requests for what I term the regex problem. Why the regex problem?

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.
-- Jamie Zawinski, in comp.lang.emacs

A case in point, which comes up repeatedly, is this question:

Can you show us an example of loading collections of collections?
How would you write a query and avoid a Cartesian product multiple levels deep?

In this case, we have someone who wants to load a blog, all its posts, and all its comments, and do it in the most efficient manner possible. At the same time, they want to have the tool handle that for them.
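For reference, here is roughly the object model in question. This is only a sketch, not the actual classes behind the post; the property lists are inferred from the column names in the generated SQL below.

using System;
using System.Collections.Generic;

public class Blog
{
    public virtual int Id { get; set; }
    public virtual string Title { get; set; }
    public virtual string Subtitle { get; set; }
    public virtual bool AllowsComments { get; set; }
    public virtual DateTime CreatedAt { get; set; }
    public virtual IList<Post> Posts { get; set; }
}

public class Post
{
    public virtual int Id { get; set; }
    public virtual string Title { get; set; }
    public virtual string Text { get; set; }
    public virtual DateTime PostedAt { get; set; }
    public virtual Blog Blog { get; set; }                // maps to Posts.BlogId (the Posts.UserId column suggests a User reference, omitted here)
    public virtual IList<Comment> Comments { get; set; }
}

public class Comment
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual string Email { get; set; }
    public virtual string HomePage { get; set; }
    public virtual string Ip { get; set; }
    public virtual string Text { get; set; }
    public virtual Post Post { get; set; }                // maps to Comments.PostId
}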

Let us take a look at how two different OR/Ms handle this task, then discuss what an optimal solution is.

First, Entity Framework, using this code:

db.Blogs
    .Include("Posts")
    .Include("Posts.Comments")
    .Where(x => x.Id == 1)
    .ToList();

This code will generate:

SELECT   [Project2].[Id]             AS [Id],
         [Project2].[Title]          AS [Title],
         [Project2].[Subtitle]       AS [Subtitle],
         [Project2].[AllowsComments] AS [AllowsComments],
         [Project2].[CreatedAt]      AS [CreatedAt],
         [Project2].[C1]             AS [C1],
         [Project2].[C4]             AS [C2],
         [Project2].[Id1]            AS [Id1],
         [Project2].[Title1]         AS [Title1],
         [Project2].[Text]           AS [Text],
         [Project2].[PostedAt]       AS [PostedAt],
         [Project2].[BlogId]         AS [BlogId],
         [Project2].[UserId]         AS [UserId],
         [Project2].[C3]             AS [C3],
         [Project2].[C2]             AS [C4],
         [Project2].[Id2]            AS [Id2],
         [Project2].[Name]           AS [Name],
         [Project2].[Email]          AS [Email],
         [Project2].[HomePage]       AS [HomePage],
         [Project2].[Ip]             AS [Ip],
         [Project2].[Text1]          AS [Text1],
         [Project2].[PostId]         AS [PostId]
FROM     (SELECT [Extent1].[Id]             AS [Id],
                 [Extent1].[Title]          AS [Title],
                 [Extent1].[Subtitle]       AS [Subtitle],
                 [Extent1].[AllowsComments] AS [AllowsComments],
                 [Extent1].[CreatedAt]      AS [CreatedAt],
                 1                          AS [C1],
                 [Project1].[Id]            AS [Id1],
                 [Project1].[Title]         AS [Title1],
                 [Project1].[Text]          AS [Text],
                 [Project1].[PostedAt]      AS [PostedAt],
                 [Project1].[BlogId]        AS [BlogId],
                 [Project1].[UserId]        AS [UserId],
                 [Project1].[Id1]           AS [Id2],
                 [Project1].[Name]          AS [Name],
                 [Project1].[Email]         AS [Email],
                 [Project1].[HomePage]      AS [HomePage],
                 [Project1].[Ip]            AS [Ip],
                 [Project1].[Text1]         AS [Text1],
                 [Project1].[PostId]        AS [PostId],
                 CASE 
                   WHEN ([Project1].[C1] IS NULL) THEN CAST(NULL AS int)
                   ELSE CASE 
                          WHEN ([Project1].[Id1] IS NULL) THEN CAST(NULL AS int)
                          ELSE 1
                        END
                 END AS [C2],
                 CASE 
                   WHEN ([Project1].[C1] IS NULL) THEN CAST(NULL AS int)
                   ELSE CASE 
                          WHEN ([Project1].[Id1] IS NULL) THEN CAST(NULL AS int)
                          ELSE 1
                        END
                 END AS [C3],
                 [Project1].[C1]            AS [C4]
          FROM   [dbo].[Blogs] AS [Extent1]
                 LEFT OUTER JOIN (SELECT [Extent2].[Id]       AS [Id],
                                         [Extent2].[Title]    AS [Title],
                                         [Extent2].[Text]     AS [Text],
                                         [Extent2].[PostedAt] AS [PostedAt],
                                         [Extent2].[BlogId]   AS [BlogId],
                                         [Extent2].[UserId]   AS [UserId],
                                         [Extent3].[Id]       AS [Id1],
                                         [Extent3].[Name]     AS [Name],
                                         [Extent3].[Email]    AS [Email],
                                         [Extent3].[HomePage] AS [HomePage],
                                         [Extent3].[Ip]       AS [Ip],
                                         [Extent3].[Text]     AS [Text1],
                                         [Extent3].[PostId]   AS [PostId],
                                         1                    AS [C1]
                                  FROM   [dbo].[Posts] AS [Extent2]
                                         LEFT OUTER JOIN [dbo].[Comments] AS [Extent3]
                                           ON [Extent2].[Id] = [Extent3].[PostId]) AS [Project1]
                   ON [Extent1].[Id] = [Project1].[BlogId]
          WHERE  1 = [Extent1].[Id]) AS [Project2]
ORDER BY [Project2].[Id] ASC,
         [Project2].[C4] ASC,
         [Project2].[Id1] ASC,
         [Project2].[C3] ASC

If you look closely, you’ll see that it generates a join between Blogs, Posts and Comments, essentially creating a Cartesian product between all three.

What about NHibernate? The following code:

var blogs = s.CreateQuery(
    @"from Blog b 
        left join fetch b.Posts p 
        left join fetch p.Comments 
    where b.Id = :id")
    .SetParameter("id", 1)
    .List<Blog>();

Will generate a much saner statement:

select blog0_.Id             as Id7_0_,
       posts1_.Id            as Id0_1_,
       comments2_.Id         as Id2_2_,
       blog0_.Title          as Title7_0_,
       blog0_.Subtitle       as Subtitle7_0_,
       blog0_.AllowsComments as AllowsCo4_7_0_,
       blog0_.CreatedAt      as CreatedAt7_0_,
       posts1_.Title         as Title0_1_,
       posts1_.Text          as Text0_1_,
       posts1_.PostedAt      as PostedAt0_1_,
       posts1_.BlogId        as BlogId0_1_,
       posts1_.UserId        as UserId0_1_,
       posts1_.BlogId        as BlogId0__,
       posts1_.Id            as Id0__,
       comments2_.Name       as Name2_2_,
       comments2_.Email      as Email2_2_,
       comments2_.HomePage   as HomePage2_2_,
       comments2_.Ip         as Ip2_2_,
       comments2_.Text       as Text2_2_,
       comments2_.PostId     as PostId2_2_,
       comments2_.PostId     as PostId1__,
       comments2_.Id         as Id1__
from   Blogs blog0_
       left outer join Posts posts1_
         on blog0_.Id = posts1_.BlogId
       left outer join Comments comments2_
         on posts1_.Id = comments2_.PostId
where  blog0_.Id = 1 /* @p0 */

While this is a saner statement, it will also generate a Cartesian product: for a single blog with 10 posts and 10 comments per post, the result set contains 100 rows, each of which repeats the blog and post columns. There are no two ways about it, this is bad, bad, bad, bad.

The way to do this efficiently is quite simple: don’t try to do it in a single query. Instead, break it up into multiple queries, each loading just a part of the graph, and rely on the Identity Map implementation to stitch the graph together. You can read the post about it here. Doing this may require more work on your part, but it will end up being much faster, and it will also be much easier to write, maintain and work with.
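As a minimal sketch of the idea, assuming the Blog/Post/Comment model sketched above and NHibernate's Future queries (this is my illustration, not the exact code from the linked post), it looks something like this:

// Each query eagerly fetches exactly one collection. Both queries are batched
// into a single round trip, and the session's identity map wires the posts and
// comments onto the Blog instance that is already loaded.
var blogs = s.CreateQuery(
    @"from Blog b 
        left join fetch b.Posts 
    where b.Id = :id")
    .SetParameter("id", 1)
    .Future<Blog>();

s.CreateQuery(
    @"from Post p 
        left join fetch p.Comments 
    where p.Blog.Id = :id")                 // assumes a Post.Blog reference, as in the sketch above
    .SetParameter("id", 1)
    .Future<Post>();

var blog = blogs.ToList().First();          // enumerating the first future sends the whole batch

Each statement still fetches a single collection, so a row repeats at most one level of parent data, instead of repeating the blog and post columns for every single comment.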

Comments

liviu, 02/05/2010 11:43 AM

hi.

You ignore the LINQ to SQL generated statements, which are THE MOST EFFICIENT and do not require changes to the semantics.

Why?

Ayende Rahien, 02/05/2010 11:49 AM

Liviu,

Because the L2S version fails very frequently in anything but the most common cases.

Thilak Nathen, 02/05/2010 11:51 AM

Your NH examples seem to always use the query API with HQL. While that is not the point of this post (or most of your other ones), it scares first-time NH users, who may not know about the Criteria API, the new QueryOver stuff or the LINQ provider... and it adds to the arguments of the NH haters.

This is especially so when you put NH and EF query syntaxes side by side in a single post... using the most verbose way to query NH and the most elegant way to query EF.

Frans Bouma, 02/05/2010 12:24 PM

LLBLGen Pro (using LINQ):

var q = metaData.Customer
    .WithPath(p => p.Prefetch<OrderEntity>(c => c.Orders)
        .SubPath(op => op.Prefetch(o => o.OrderDetails)));

3 queries: 1 for Customer, 1 for Order with a filter on CustomerId based on the query on Customer, and 1 for OrderDetails with a filter on OrderId based on the query on Order. This tunes itself at runtime: if there are n Customers fetched and n is below a threshold, it will replace the subquery filter on Order with an IN predicate on Order.CustomerID using the PK values of the fetched Customers.

You can extend the Prefetch<> elements above with order by, filter, exclude fields etc.
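The runtime tuning described above (switching between a subquery filter and an IN predicate depending on how many parents were fetched) boils down to a decision like the following. This is only an illustration of the idea, not LLBLGen Pro's actual code; the Customer/Order types and the threshold value are assumptions.

using System.Collections.Generic;
using System.Linq;

public class Customer { public int Id { get; set; } }
public class Order { public int Id { get; set; } public int CustomerId { get; set; } }

public static class PrefetchTuning
{
    public static IQueryable<Order> OrdersFor(
        IQueryable<Customer> customerQuery,
        ICollection<Customer> fetchedCustomers,
        IQueryable<Order> orders,
        int parameterThreshold = 50)                     // hypothetical threshold
    {
        if (fetchedCustomers.Count <= parameterThreshold)
        {
            // few parents fetched: ... WHERE CustomerId IN (@p1, @p2, ...)
            var ids = fetchedCustomers.Select(c => c.Id).ToList();
            return orders.Where(o => ids.Contains(o.CustomerId));
        }

        // many parents fetched: keep the parent query as a subquery filter
        return orders.Where(o => customerQuery.Any(c => c.Id == o.CustomerId));
    }
}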

Rafal, 02/05/2010 12:56 PM

Ayende, are you trying to say that using prefetch is harmful when loading a single master record with many dependencies?

Let's see: prefetch is bad when loading a single record, as shown above. And it's equally bad when loading multiple records, because you usually don't need the collections related to the main object (you are showing a list, not going into each record's details). So prefetching is only useful if we have a 1:1 mapping, or when the number of related records is very low?

Ayende Rahien, 02/05/2010 01:00 PM

Rafal,

I am saying that you need to be aware of what is actually going on.

And in general, you want to load a single collection per query.

david, 02/05/2010 03:19 PM

From my understanding, the SQL generated by NH is not a Cartesian product. Left outer joins do not generate Cartesian products like this. A Cartesian product would have the effect of returning (number of blog entries) * (number of post entries) * (number of comment entries); the format would be like so:

Select * --brevity

from blog, post, comment

thus omitting the join conditions, effectively using CROSS JOIN statements. What the NH SQL does do is populate the entire graph of objects, and while that is in most cases not what you intend, it is not nearly as bad as a Cartesian product.

I did not look in detail at what EF generated, as that is more than my poor brain can handle without running the SQL.

Ayende Rahien, 02/05/2010 03:25 PM

David,

I am probably using the Cartesian product term a bit loosely.

What I meant here is that you are going to load a lot of information unnecessarily.

Graeme Hill, 02/05/2010 03:55 PM

Yeah, I thought you would have to do a cross join to get a Cartesian product. So the problem is just that the result set generated by one query actually has more cells than the sum of the three record sets returned by separate queries?

If that is the case, then it raises the question: which has more overhead, sending separate queries (will they be in a single round trip?), or just lumping it into one query and returning a little more data?

Mark Gellings, 02/05/2010 03:56 PM

The NHibernate SQL is not a Cartesian product. You are getting a flattened result set combining the various tables. A Cartesian product, from what I understand, would be formed if there were no joins between the tables, in which case all possible combinations are produced.

JeroenH, 02/05/2010 04:09 PM

@Ayende, I have to second the other commenters (and your own acknowledgement) that you should avoid the term "Cartesian Product" in this case. It's confusing at first...

tobi, 02/05/2010 05:38 PM

The discussion about the "cartesian product" is not the point here. What is important to all of us is that Ayende has recognized this phenomenon as an important issue.

@Ayende, you did not discuss an optimal solution to this problem. I propose:

select b.id as id, 0 as tag, ... from blogs b
union all select p.blogid as id, 1 as tag, ... from posts p
union all select c.blogid as blogid, 2 as tag, ... from comments c
order by id, tag

I learned this pattern in SQL Books Online, as an example of how to construct a hierarchical XML result set.

configurator, 02/06/2010 02:40 AM

@tobi: Wow. I'm speechless.

@Ayende: Am I right that, for 2 blogs with 2 posts each and 2 comments each, the result here would be:

blog1, blog1data, post1, post1data, comment1, comment1data

blog1, blog1data, post1, post1data, comment2, comment2data

blog1, blog1data, post2, post2data, comment1, comment1data

blog1, blog1data, post2, post2data, comment2, comment2data

blog2, blog2data, post1, post1data, comment1, comment1data

blog2, blog2data, post1, post1data, comment2, comment2data

blog2, blog2data, post2, post2data, comment1, comment1data

blog2, blog2data, post2, post2data, comment2, comment2data

And is this duplication of data what you mean when you say Cartesian product?

Since this is blogs we're talking about, the data would be quite large. In fact, I think that even getting the post data with its comments would be a mistake, since the blog contents would be duplicated for each comment. Am I missing something here?

Ayende Rahien, 02/06/2010 10:24 AM

Configurator,

No, that is the problem that I am pointing out.

Udi Dahan, 02/06/2010 02:28 PM

That's why CQRS points to a different data structure for those queries :-)

configurator, 02/06/2010 03:48 PM

So wouldn't the query

var blogs = s.CreateQuery(
    @"from Posts p 
        left join fetch p.Comments 
    where p.Id = :id")
    .SetParameter("id", 1)
    .List<Post>();

also be quite bad, since the post data is duplicated?

It seems to me that the best SQL for my shorter example would be

SELECT * FROM Posts WHERE ID = @ID

SELECT * FROM Comments WHERE PostID = @ID

Stefan Wenig, 02/06/2010 04:31 PM

Like LLBLGen, our own ORM, re-store, automatically splits this into three separate queries:

var query = from Blog b in QueryFactory.Linq() where b.Id == 1 select b;
query = query.Fetch(b => b.Posts).ThenFetch(p => p.Comments);

This will create two additional queries, each including the previous query's conditions but only one set of SELECT clauses.

Now the funny thing is that we talked about that feature, because it is in re-linq, which eventually got used for LINQ2NH too, and you said you'd like a Cartesian product better. I didn't bother to ask for a reason back then, but I kept wondering why you would, in the general case, want to buy the advantage of a single query execution for the price of that much data overhead. Now it seems we're not that far off, are we? (mail from Feb 2, 2009, if you care)

That said, I would like to have the option to switch to Cartesian products explicitly, because in some cases the redundant data may be negligible compared to the effort of having to find the initial set of records (blogs, in your example) multiple times. Like, in queries that can't quite use the index...

Frans Bouma, 02/07/2010 10:21 AM

Tobi, that has several drawbacks:

1) your query result will get very wide, as every column of every subresult has to be present.

2) with every 1:n relationship traversal, you'll get an explosion of rows

3) multiple paths originating from the same parent gives problems (explosion of columns)

4) a lot of null-ed values are reported back.

5) your entity fetch engine has to be adjusted to consume these odd sets, while if you do it the way LLBLGen or re-linq do it, you can build it ON TOP of existing fetch pipelines, so it's really an easy job, and IMHO more efficient.

Mohamed Meligy, 02/08/2010 11:04 AM

@Ayende

Is there a way to make NHibernate apply this in 3 queries?

Then it'd use the identity map to connect the results together. Having multiple result sets supported on the ADO.NET connection created under the hood is one feature that could be taken advantage of here.

Ayende Rahien, 02/08/2010 11:07 AM

Mohamed,

Take a look at the linked post.

Mohamed Meligy, 02/08/2010 11:54 AM

@Ayende,

Ah.. sorry that I didn't notice it when the question came to mind.

Actually, since I'm an old user of LLBLGen Pro and have read a lot about how it does fetching, I thought this was the default approach for ORM makers. A very big false assumption, as it seems.

I thought of futures already. Thanks for showing the complete example in the other post.

Is there an easier way than futures in the case of more than 2 fetch paths? I have been in situations where 3+ paths were required for a really huge screen. Deciding whether that decision was right would need a lot of domain explanation, but at least it wasn't very bad, because it was agreed upon by 4 senior developers. So, it's possible to need 3+ fetch paths. Using LLBLGen Pro it's as simple as PreFetchPath.SubPath.Add(....childPrefetchPath). I thought NH eager fetch works the same...

Ayende Rahien, 02/08/2010 12:15 PM

Mohamed,

NH's eager fetching will work nicely as long as you don't try to load more than a single collection eagerly per query.

I suggest loading all single associations in the first query, and then a single collection association per each of the next queries.

Stefan Wenig, 02/09/2010 09:26 AM

Ayende,

If I understand correctly, the only way in NH to get this n-query behavior is to manually write several queries and throw away all but one result set, and there's no way at all to prefetch collections of collections. If you want that, it's all in re-linq. There's a thread on the NH users list where I posted some details for a similar problem: groups.google.com/.../c05d2c7ea7233340

(Only in this case it's easier, because that's just what re-linq's fetch mechanism is made for. You just need to add some extension methods to your namespace to turn it on.)

There's one disadvantage: all of the created LINQ queries have to be converted to HQL and then to SQL from scratch; a native NH solution could do better. However, that's what you'd get from manually built fetching queries too, and Steve's provider does enough caching to make this a negligible problem. Also, you can still replace it with an optimized implementation later if it should turn out to be desirable.
