The cost of inappropriate linqing
I read this post with interest, apparently Linq for SQL is doing something odd, because I can't quite believe the results that this guy is getting. From the post:
So I dug into the call graph a bit and found out the code causing by far the most damage was the creation of the LINQ query object for every call! The actual round trip to the database paled in comparison
I can't imagine what they are doing there to cause this performance characteristics.
From my own experiments with Linq, it should definitely not produce this amount of efforts.
I've just experienced this myself with a client while developing a solution - not using Linq to SQL - but using something home-brewed to work with their in-house ORM... caused by a naive implementation of the Linq expression parser, and a costly metadata-querying layer we ended up with very simple queries having the same characteristic - the Linq cost was dwarfing the sql execution cost. for one-off compilation.. Easily fixed with the aid of a profiler, but it took me by surprise - I'm so used to sql being the bottle neck.
The main problem is the expression tree. As it can have any shape or form, you end up with 'visitors' (not really visitors, the visitor pattern isn't the same) which traverse the tree, transform a couple of nodes and move to the next phase.
The downside of this is that a node can be visited a lot of times. In simple Linq providers, which only do where and select for example, it's not that much of a problem. The things start getting serious when you want to handle group joins, nested aggregates, selects on one or both sides of a join etc. You then have to do some serious expression tree transformations which aren't doable in 1 pass: you don't know what's up ahead at a given location in the code, as you for example just see a single node, you don't have a lookahead.
So if you need info about the tree up-ahead, you use a visitor to check what's up ahead. For example if you need an object from the subtree you can use a visitor to get that for you.
I'm now at 70% or so of implementing a Linq provider (groupby/orderby are still left) and I have just a fraction of the amount of visitors MS has in their code (according to reflector). I think they went a little overboard in the visitor stuff. For example: if you pass a tree once in full, you can collect a lot of data from it, e.g. assign aliases, see scopes etc. You can store all that data in easy access dictionaries and other datastructures. This frees you from traversing the tree again and again just to get some element out of it.
They do have a problem others like you and I don't have: they have to transform the expression tree COMPLETELY to sql. We don't have to do that: we just transform it to our own query api and from there it's no longer transformed by visitors. This is one area where they lose a lot of time I think, because they have to transform the tree a lot of times. (I have currently 5 passes, they a truckload more).
But the cost of the visitors is offset by query API cost of generating itself.
Frankly, they should do their caching for this.
I've just posted a suggestion on connect. I was thinking about doing this for quite some time, but I never expected this to be too relevant for SQL queries, merely for stuff like index4objects.
here's the link: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=315491
This would allow caching of transformation results, so the LINQ provider only has to do it once. In fact, if you take this out of context, it is absolutely ridiculous that such a massive computation should be done on every call, just because some argument values change.
Vote for it!
I agree. I don't see any reason why there shouldn't be some sort of a simple switch on the IQueryable or Expression to enable caching of all but the parameters. I would prefer, opt-out as well. The memory overhead would be, I think (not that I've done any memory profiling on the expressions), pretty minimal for most apps.
JD, if you think of all the metadata overhead every single projection (anonymous type) or closure creates, I don't think that optimizing the compiler output for memory usage is very high on the C# team's priority list anyway.
The almost recursive visitation of some nodes is what was costing me, in my case I've implemented a blackboard for collecting all the information about the expression as it's visited by different visitors (well agents I guess they should be called), and function-level memoization to avoid repeated visits and metadata lookup, which definitely made it a lot faster.
For many ORM's I think you could probably write a single parser, that produced a much easier to parse query AST, which could then be easily visited to map it into the underlying ORM's query API - I definitely get the feeling that a large amount of developers are all working hard solving the same problems in different ways, with varying levels of success when they could be adding new features to their products and leveraging some common implementation provided by a 3rd party to do the heavy lifting.
Alex, that's exactly what we're doing here right now. We've just started to develop it, and it's not a full-time task right now, so it's probably gonna take us a few weeks before we're ready for the first release. We're planning to release it under the LGPL.
Anybody who's interested can email me at stefan dot wenig at rubicon-it dot com, so I can notify you as soon as anything useful is online. (Ayende, sorry that I'm using your blog for announcements, but mine's not set up yet.)