If you are way off in the deep end, there is only so much that tooling can do for you
I get a lot of requests for what I term the regex problem. Why the regex problem?
Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. — Jamie Zawinski in comp.lang.emacs.
A case in point, which comes up repeatedly, is this question:
Can you show us an example of loading collections of collections?
How would you write a query and avoid a Cartesian product multiple levels deep?
In this case, we have someone who wants to load a blog, all its posts, and all its comments, and do it in the most efficient manner possible. At the same time, they want to have the tool handle that for them.
Let us take a look at how two different OR/Ms handle this task, then discuss what an optimal solution is.
First, Entity Framework, using this code:
db.Blogs
    .Include("Posts")
    .Include("Posts.Comments")
    .Where(x => x.Id == 1)
    .ToList();
This code will generate:
SELECT [Project2].[Id] AS [Id], [Project2].[Title] AS [Title], [Project2].[Subtitle] AS [Subtitle],
       [Project2].[AllowsComments] AS [AllowsComments], [Project2].[CreatedAt] AS [CreatedAt],
       [Project2].[C1] AS [C1], [Project2].[C4] AS [C2], [Project2].[Id1] AS [Id1],
       [Project2].[Title1] AS [Title1], [Project2].[Text] AS [Text], [Project2].[PostedAt] AS [PostedAt],
       [Project2].[BlogId] AS [BlogId], [Project2].[UserId] AS [UserId], [Project2].[C3] AS [C3],
       [Project2].[C2] AS [C4], [Project2].[Id2] AS [Id2], [Project2].[Name] AS [Name],
       [Project2].[Email] AS [Email], [Project2].[HomePage] AS [HomePage], [Project2].[Ip] AS [Ip],
       [Project2].[Text1] AS [Text1], [Project2].[PostId] AS [PostId]
FROM (SELECT [Extent1].[Id] AS [Id], [Extent1].[Title] AS [Title], [Extent1].[Subtitle] AS [Subtitle],
             [Extent1].[AllowsComments] AS [AllowsComments], [Extent1].[CreatedAt] AS [CreatedAt],
             1 AS [C1], [Project1].[Id] AS [Id1], [Project1].[Title] AS [Title1],
             [Project1].[Text] AS [Text], [Project1].[PostedAt] AS [PostedAt],
             [Project1].[BlogId] AS [BlogId], [Project1].[UserId] AS [UserId],
             [Project1].[Id1] AS [Id2], [Project1].[Name] AS [Name], [Project1].[Email] AS [Email],
             [Project1].[HomePage] AS [HomePage], [Project1].[Ip] AS [Ip],
             [Project1].[Text1] AS [Text1], [Project1].[PostId] AS [PostId],
             CASE WHEN ([Project1].[C1] IS NULL) THEN CAST(NULL AS int)
                  ELSE CASE WHEN ([Project1].[Id1] IS NULL) THEN CAST(NULL AS int) ELSE 1 END
             END AS [C2],
             CASE WHEN ([Project1].[C1] IS NULL) THEN CAST(NULL AS int)
                  ELSE CASE WHEN ([Project1].[Id1] IS NULL) THEN CAST(NULL AS int) ELSE 1 END
             END AS [C3],
             [Project1].[C1] AS [C4]
      FROM [dbo].[Blogs] AS [Extent1]
      LEFT OUTER JOIN (SELECT [Extent2].[Id] AS [Id], [Extent2].[Title] AS [Title],
                              [Extent2].[Text] AS [Text], [Extent2].[PostedAt] AS [PostedAt],
                              [Extent2].[BlogId] AS [BlogId], [Extent2].[UserId] AS [UserId],
                              [Extent3].[Id] AS [Id1], [Extent3].[Name] AS [Name],
                              [Extent3].[Email] AS [Email], [Extent3].[HomePage] AS [HomePage],
                              [Extent3].[Ip] AS [Ip], [Extent3].[Text] AS [Text1],
                              [Extent3].[PostId] AS [PostId], 1 AS [C1]
                       FROM [dbo].[Posts] AS [Extent2]
                       LEFT OUTER JOIN [dbo].[Comments] AS [Extent3]
                           ON [Extent2].[Id] = [Extent3].[PostId]) AS [Project1]
          ON [Extent1].[Id] = [Project1].[BlogId]
      WHERE 1 = [Extent1].[Id]) AS [Project2]
ORDER BY [Project2].[Id] ASC, [Project2].[C4] ASC, [Project2].[Id1] ASC, [Project2].[C3] ASC
If you’ll look closely, you’ll see that it generates a join between Blogs, Posts and Comments, essentially creating a Cartesian product between all three.
What about NHibernate? The following code:
var blogs = s.CreateQuery(
        @"from Blog b
          left join fetch b.Posts p
          left join fetch p.Comments
          where b.Id = :id")
    .SetParameter("id", 1)
    .List<Blog>();
Will generate a much saner statement:
select blog0_.Id as Id7_0_, posts1_.Id as Id0_1_, comments2_.Id as Id2_2_,
       blog0_.Title as Title7_0_, blog0_.Subtitle as Subtitle7_0_,
       blog0_.AllowsComments as AllowsCo4_7_0_, blog0_.CreatedAt as CreatedAt7_0_,
       posts1_.Title as Title0_1_, posts1_.Text as Text0_1_, posts1_.PostedAt as PostedAt0_1_,
       posts1_.BlogId as BlogId0_1_, posts1_.UserId as UserId0_1_,
       posts1_.BlogId as BlogId0__, posts1_.Id as Id0__,
       comments2_.Name as Name2_2_, comments2_.Email as Email2_2_,
       comments2_.HomePage as HomePage2_2_, comments2_.Ip as Ip2_2_,
       comments2_.Text as Text2_2_, comments2_.PostId as PostId2_2_,
       comments2_.PostId as PostId1__, comments2_.Id as Id1__
from Blogs blog0_
left outer join Posts posts1_ on blog0_.Id = posts1_.BlogId
left outer join Comments comments2_ on posts1_.Id = comments2_.PostId
where blog0_.Id = 1 /* @p0 */
While this is a saner statement, it will also generate a Cartesian product. With one blog, 20 posts, and 10 comments per post, the result set contains 200 rows, each of which repeats the blog's columns and its post's columns. There are no two ways about it, this is bad bad bad bad.
The way to avoid it is quite simple: don't try to do it in a single query. Instead, we can break it up into multiple queries, each loading just a part of the graph, and rely on the Identity Map implementation to stitch the graph together. You can read the post about it here. Doing this may require more work on your part, but it will end up being much faster, and it is also something that is much easier to write, maintain and work with.
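A minimal sketch of what that looks like in NHibernate, using the session from the example above and the Future API to batch both statements into a single round trip (the Post.Blog many-to-one used in the second query is an assumption about the model):

// Each query eagerly loads exactly one collection. Future() defers
// execution and batches both queries into a single round trip; the
// session's identity map guarantees that the posts loaded by the second
// query are the same instances referenced by the blog from the first,
// so the graph is stitched together in memory.
var blogs = s.CreateQuery(
        @"from Blog b left join fetch b.Posts where b.Id = :id")
    .SetParameter("id", 1)
    .Future<Blog>();

s.CreateQuery(
        @"from Post p left join fetch p.Comments where p.Blog.Id = :id")
    .SetParameter("id", 1)
    .Future<Post>();

// Enumerating the first future executes the whole batch.
var blog = blogs.First();

Each statement joins at most two tables, so the blog's columns are no longer repeated on every comment row.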
Comments
Hi.
You ignore the LINQ to SQL generated statements, which are THE MOST EFFICIENT and do not require changes of semantics.
Why?
Liviu,
Because the L2S version fails very frequently in anything but the most common cases.
Your NH examples seem to always use the query API with HQL. While that is not the point of this post (or most of your other ones), it scares first-time NH users, who may not know about the Criteria API, the new QueryOver stuff, or the LINQ provider... and it adds to the arguments of NH haters.
This is especially so when you put NH and EF query syntaxes side by side in a single post... using the most verbose way to query NH and the most elegant way to query EF.
LLBLGen Pro (using LINQ):
var q = metaData.Customer
    .WithPath(customerPath => customerPath
        .Prefetch<OrderEntity>(c => c.Orders)
        .SubPath(orderPath => orderPath
            .Prefetch<OrderDetailsEntity>(o => o.OrderDetails)));
3 queries: 1 for Customer, 1 for Order with a filter on CustomerID based on the query on Customer, and 1 for OrderDetails with a filter on OrderID based on the query on Order. This tunes itself at runtime: if there are n Customers fetched and n is below a threshold, it will replace the subquery filter on Order with an IN predicate on Order.CustomerID containing the PK values of the fetched customers.
You can extend the Prefetch<> elements above with order by, filter, exclude fields, etc.
Ayende, are you trying to say that using prefetch is harmful when loading a single master record with many dependencies?
Let's see: prefetch is bad when loading a single record, as shown above. And it's equally bad when loading multiple records, because you usually don't need the collections related to the main object (you are showing a list, not going into each record's details). So prefetching is only useful if we have a 1:1 mapping, or the number of related records is very low?
Rafal,
I am saying that you need to be aware of what is actually going on.
And in general, you want to load a single collection per query.
From my understanding, the SQL generated by NH is not a Cartesian product. Left outer joins do not generate Cartesian products like this. A Cartesian product would have the effect of returning (number of blog entries) * (number of post entries) * (number of comment entries); the format would be like so:
Select * --brevity
from blog, post, comment
thus omitting the join conditions, effectively using CROSS JOIN statements. What the NH SQL does do is populate the entire graph of objects, and while that is in most cases not what you want, it is not nearly as bad as a Cartesian product.
I did not look in detail at what EF generated, as that is more than my poor brain can handle without running the SQL.
David,
I am probably using the Cartesian product term a bit loosely.
What I meant here is that you are going to load a lot of information unnecessarily.
Yeah, I thought you would have to do a cross join to get a Cartesian product. So the problem is just that the result set generated by one query actually has more cells than the sum of the three record sets returned by separate queries?
If that is the case, then it raises the question: which has more overhead, sending separate queries (will they be in a single round trip?) or just lumping it into one query and returning a little more data?
The NHibernate SQL is not a Cartesian product. You are getting a flattened resultset combining the various tables. A Cartesian product, from what I understand, would be formed if there were no joins between the tables, in which case all possible combinations are produced.
@Ayende, I have to second the other commenters (and your own acknowledgement) that you should avoid the term "Cartesian Product" in this case. It's confusing at first...
The discussion about the "cartesian product" is not the point here. What is important to all of us is that Ayende has recognized this phenomenon as an important issue.
@Ayende, you did not discuss an optimal solution to this problem. I propose:
select b.id as id, 0 as tag ... from blogs b
union all select p.blogid as id, 1 as tag, ... from posts p
union all select c.blogid as blogid, 2 as tag, ... from comments c
order by id, tag
I learned this pattern in SQL Books Online as an example of how to construct a hierarchical XML result set.
@tobi: Wow. I'm speechless.
@Ayende: Am I right that the response here, for 2 blogs with 2 posts each and 2 comments each, would be:
blog1, blog1data, post1, post1data, comment1, comment1data
blog1, blog1data, post1, post1data, comment2, comment2data
blog1, blog1data, post2, post2data, comment1, comment1data
blog1, blog1data, post2, post2data, comment2, comment2data
blog2, blog2data, post1, post1data, comment1, comment1data
blog2, blog2data, post1, post1data, comment2, comment2data
blog2, blog2data, post2, post2data, comment1, comment1data
blog2, blog2data, post2, post2data, comment2, comment2data
And is this duplication of data what you mean when you say Cartesian product?
Since this is blogs we're talking about, the data would be quite large. In fact, I think that even getting the post data with its comments would be a mistake, since the blog contents would be duplicated for each comment. Am I missing something here?
Configurator,
No, you aren't missing anything; that duplication is exactly the problem that I am pointing out.
That's why CQRS points to a different data structure for those queries :-)
So wouldn't the query
var blogs = s.CreateQuery(
        @"from Post p left join fetch p.Comments where p.Id = :id")
    .SetParameter("id", 1)
    .List<Post>();
Also be quite bad since the post data is duplicated?
It seems to me that the best SQL for my shorter example would be
SELECT * FROM Posts WHERE ID = @ID
SELECT * FROM Comments WHERE PostID = @ID
Like LLBLGen, our own ORM re-store automatically splits this into three separate queries:
var query = from b in QueryFactory.Linq<Blog>() where b.Id == 1 select b;
query = query.Fetch(b => b.Posts).ThenFetch(p => p.Comments);
This will create two additional queries, each including the previous query's conditions but only one set of SELECT clauses.
Now the funny thing is that we talked about that feature, because it is in re-linq, which eventually got used for LINQ2NH too, and you said you'd like a Cartesian product better. I didn't bother to ask for a reason back then, but I kept wondering why you would - in the general case - want to buy the advantage of a single query execution for the price of that much data overhead. Now it seems we're not that far off, are we? (mail from Feb 2 2009 if you care)
That said, I would like to have the option to switch to Cartesian products explicitly, because in some cases the redundant data may be negligible compared to the effort of having to find the initial set of records (blogs in your example) multiple times. Like, in queries that can't quite use the index...
Tobi, that has several drawbacks:
1) your query result will get very wide, as every column of every subresult has to be present.
2) with every 1:n relationship traversal, you'll get an explosion of rows
3) multiple paths originating from the same parent gives problems (explosion of columns)
4) a lot of null-ed values are reported back.
5) your entity fetch engine has to be adjusted to consume these odd sets, while if you do it the way llblgen or re-linq do it, you can build it ON TOP of existing fetch pipelines, so it's really an easy job, and IMHO more efficient.
@Ayende
Is there a way to make NHibernate apply this in 3 queries?
Then it'd use the identity map to connect the results together. Support for multiple result sets on the ADO.NET connection created under the hood is one feature that could be taken advantage of here.
Mohamed,
Take a look at the linked post
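For the curious, futures are built on top of NHibernate's MultiQuery, which does roughly the batching Mohamed describes: several statements in one command, returning multiple result sets. A hedged sketch, reusing the model from the post:

// Both queries travel to the server in one command; it returns two
// result sets, and the identity map merges them into one object graph.
var results = s.CreateMultiQuery()
    .Add(s.CreateQuery(
            @"from Blog b left join fetch b.Posts where b.Id = :id")
        .SetParameter("id", 1))
    .Add(s.CreateQuery(
            @"from Post p left join fetch p.Comments where p.Blog.Id = :id")
        .SetParameter("id", 1))
    .List();
var blog = (Blog)((IList)results[0])[0]; // first result set, first row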
@Ayende,
Ah.. sorry that I didn't notice it when the question came to mind.
Actually, since I'm an old user of LLBLGen Pro and have read a lot about how it does fetching, I thought this was the default approach for ORM makers. A very big false assumption, as it seems.
I thought of futures already. Thanks for showing the complete example in the other post.
Is there an easier way than futures in the case of more than 2 fetch paths? I have been in situations where 3+ paths were required in a really huge screen. Deciding whether that decision was right would need a lot of domain explanation, but at least it wasn't very bad, because it was agreed upon by 4 senior developers. So, it's possible to need 3+ fetch paths. Using LLBLGen Pro it's as simple as PreFetchPath.SubPath.Add(....childPrefetchPath). I thought NH eager fetch worked the same...
Mohamed,
NH's eager fetching will work nicely as long as you don't try to load more than a single collection eagerly per query.
I suggest loading all single associations in the first query, and then a single collection association in each of the next queries.
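For illustration, a rough sketch of that advice with futures; the Author many-to-one and the Subscribers collection are hypothetical stand-ins for whatever associations your screen actually needs:

// First query: the root entity plus its single (many-to-one)
// associations, which join without duplicating rows.
var blogs = s.CreateQuery(
        @"from Blog b left join fetch b.Author where b.Id = :id")
    .SetParameter("id", 1)
    .Future<Blog>();

// Then exactly one collection association per additional future query;
// the identity map attaches each collection to the same blog instance.
s.CreateQuery(@"from Blog b left join fetch b.Posts where b.Id = :id")
    .SetParameter("id", 1)
    .Future<Blog>();

s.CreateQuery(@"from Blog b left join fetch b.Subscribers where b.Id = :id")
    .SetParameter("id", 1)
    .Future<Blog>();

// One batched round trip executes all three statements.
var blog = blogs.First();

This scales to 3+ fetch paths by adding one future query per collection, without ever joining two collections into the same result set.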
Ayende,
if I understand correctly, the only way in NH to have this n-query behavior is to manually write several queries and throw away all but one result set, and there's no way at all to prefetch collections of collections. If you want that, it's all in re-linq. There's a thread on the NH users list where I posted some details for a similar problem: groups.google.com/.../c05d2c7ea7233340
(only in this case it's easier, because that's just what re-linq's fetch mechanism is made for. You just need to add some extension methods to your namespace to turn it on.)
There's one disadvantage: all of the created LINQ queries have to be converted to HQL and then SQL from scratch - a native NH solution could do better. However, that's what you'd get from manually built fetching queries too, and Steve's provider does enough caching to make this a negligible problem. Also, you can still replace it with an optimized implementation later if it should turn out desirable.