Linq for Entities: Abstractions
Jeremy has a great post summarizing the MVP summit, and he includes this list:
- Web service calls (a remote call - extranet/internet)
- Non relational databases
- Hierarchical data stores
- Relational databases
Now consider the following query:
(from c in customers where c.Name == "Ayende" select c).First();
- If the web service exposes a GetCustomerByName(), this would be a good candidate; if not, the implementation would need to call GetAllCustomers() and filter in memory (see the sketch after this list).
- For non relational databases, I am aware of object databases, flat files, temporal and hierarchical (covered separately) - each of which has its own tradeoffs. I am not familiar enough with object databases to say what the tradeoffs are here, but for a flat file, it is going to be a linear scan. The query is not even a valid one for a temporal database (unless there is an implicit CurrentTime).
- For hierarchical data stores, this query would need to iterate over all the customers and compare each name to the query.
- A relational database would think that this is too easy, and quit.
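To make the web service case concrete, here is a minimal sketch of the two code paths such a provider could end up taking. The ICustomerService contract, the SupportsLookupByName flag and the method names are assumptions made for illustration, not a real API.

using System.Collections.Generic;
using System.Linq;

// Hypothetical service contract - purely for illustration.
public class Customer
{
    public string Name { get; set; }
}

public interface ICustomerService
{
    bool SupportsLookupByName { get; }
    Customer GetCustomerByName(string name);    // server-side lookup
    IEnumerable<Customer> GetAllCustomers();    // full fetch over the wire
}

public static class CustomerQueries
{
    public static Customer FindByName(ICustomerService service, string name)
    {
        if (service.SupportsLookupByName)
        {
            // Best case: the query translates into a single remote call.
            return service.GetCustomerByName(name);
        }

        // Fallback: pull every customer across the wire, then filter in memory.
        return service.GetAllCustomers()
                      .First(c => c.Name == name);
    }
}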
And this is just about the simplest query possible. I can't guess what would happen if I happened to want a join between customers and orders.
I get the feeling that I am missing something here, because it sure isn't heterogeneous to me.
Comments
The vision for ADO.NET entities (as described by Jeremy) is exactly the same as my vision for Data2.0.
Base4 (despite some horrible patterns and mixing of concerns in the code) already allows for some of this silo integration, in effect creating an EDM or enterprise data model.
Entities promises to be cool stuff.
Ayende, you perfectly understood the shortcomings of this 'feature'. Consider 10,000 customers and 1 million order rows, with each set in a different store. Fetching all customers from the US with an order in May 2006 will burn the server down to the ground due to the in-memory filtering.
This could be optimized by first filtering the customers, sending the result to the order store, and joining there (or something like that). Though even that is unclear, as you don't know how many orders there are.
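To make that concrete, here is a rough sketch of the two strategies; the store contracts and method names below are invented for illustration and do not come from any real API.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types and store contracts - illustration only.
public class Customer { public int Id; public string Country; }
public class Order { public int CustomerId; public DateTime Date; }

public interface ICustomerStore
{
    IEnumerable<Customer> GetAllCustomers();
    IEnumerable<Customer> GetCustomersByCountry(string country);
}

public interface IOrderStore
{
    IEnumerable<Order> GetAllOrders();
    // Lets the order store do the heavy lifting: only ids and a date range cross the wire.
    IEnumerable<int> GetCustomerIdsWithOrdersBetween(ICollection<int> customerIds, DateTime from, DateTime to);
}

public static class CrossStoreQuery
{
    // Naive plan: drag 10,000 customers and 1,000,000 orders into memory and join there.
    public static IEnumerable<Customer> Naive(ICustomerStore customers, IOrderStore orders)
    {
        var start = new DateTime(2006, 5, 1);
        var end = new DateTime(2006, 6, 1);
        var query = from c in customers.GetAllCustomers()
                    join o in orders.GetAllOrders() on c.Id equals o.CustomerId
                    where c.Country == "US" && o.Date >= start && o.Date < end
                    select c;
        return query.Distinct();
    }

    // Filter-first plan: reduce the customer set in its own store, then ship
    // only the surviving ids to the order store and let it filter and join.
    public static IEnumerable<Customer> FilterFirst(ICustomerStore customers, IOrderStore orders)
    {
        var usCustomers = customers.GetCustomersByCountry("US").ToList();
        var idsWithMayOrders = new HashSet<int>(
            orders.GetCustomerIdsWithOrdersBetween(
                usCustomers.Select(c => c.Id).ToList(),
                new DateTime(2006, 5, 1),
                new DateTime(2006, 6, 1)));
        return usCustomers.Where(c => idsWithMayOrders.Contains(c.Id));
    }
}

The second plan only moves the filtered customer ids across the wire, which is exactly the kind of decision you currently have to make by hand.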
So I asked about this, and whether it would be great to have hints in Linq, or other optimizations, to get rid of this. The answer I got was basically: "No, that's not going to happen. In these scenarios you simply have to know what you're doing".
My conclusion is that this feature looks great on paper and on feature charts ("Look what we can do!"), but in practice no one with a sane mind will use it, because it is completely unusable in enterprise systems: you never know how big your sets will grow over time, so the filter will become too expensive.
You could get smart and first query the stores for their counts. But this gets brittle very fast, IMO.
I think that this is a great feature, but it shouldn't be touted as One Query To Conquer Them All.
Trying to go this way would bring all the Distributed Objects pain back.
You have to know what you are querying, and you have to know what is going on under the hood in order to optimize it.
In the critical performance paths, you need to get as close as you can to what is really happening.
As you point out, the critical performance problems are not always obvious before the system has been in production for a while.
I had a SELECT N+1 in a system of mine that took 9 months to become a problem that someone noticed, for instance.
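For anyone who hasn't run into it, here is a minimal sketch of the SELECT N+1 shape; the repository and types are invented just for the example.

using System;
using System.Collections.Generic;

// Hypothetical repository and types, invented to show the shape of the problem.
public class Customer { public int Id; public string Name; }
public class Order { public int CustomerId; }

public interface ICustomerRepository
{
    IList<Customer> GetAllCustomers();            // query #1
    IList<Order> GetOrdersFor(int customerId);    // one more query per customer
}

public static class OrderReport
{
    // SELECT N+1: one query to load the customers, then N additional queries,
    // one per customer, to load that customer's orders.
    public static void PrintOrderCounts(ICustomerRepository repository)
    {
        foreach (var customer in repository.GetAllCustomers())       // query #1
        {
            var orders = repository.GetOrdersFor(customer.Id);       // queries #2..#N+1
            Console.WriteLine("{0}: {1} orders", customer.Name, orders.Count);
        }
    }
}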
I've had a chat with some people at MS about what I see as the need to extend IQueryable<T> to include the ability to do costing at the provider boundaries. And they are definitely thinking about it.
The idea is that a generalised query implementation can come up with an appropriate plan that spans data silos, with the aim of minimizing what Frans refers to as 'in-memory filtering'.
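As a purely hypothetical sketch of what costing at the provider boundaries might look like (nothing like this exists on IQueryable<T> today, and every name here is invented):

using System.Linq;
using System.Linq.Expressions;

// Invented extension of the provider model - the interface and method are assumptions.
public interface ICostedQueryProvider : IQueryProvider
{
    // A rough cost estimate (say, expected rows returned or bytes moved) for
    // running the given sub-expression entirely inside this provider's store.
    double EstimateCost(Expression expression);
}

public static class CrossSiloPlanner
{
    // Toy planning rule: push a filter down into the provider only when the
    // provider claims it can run it more cheaply than filtering in memory.
    public static bool ShouldPushDown(ICostedQueryProvider provider,
                                      Expression filter,
                                      double inMemoryCost)
    {
        return provider.EstimateCost(filter) < inMemoryCost;
    }
}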
As both of you point out, this stuff is not easy. I know that very smart people like Jim Gray (who will be missed) were thinking a lot about how to make this stuff work.