Linq for Entities: Abstractions
Jeremy has a great post summarizing the MVP summit, and he includes this list:
- Web service calls (a remote call - extranet/internet)
- Non relational databases
- Hierarchical data stores
- Relational databases
Now consider the following query:
(from c in customers where c.Name == "Ayende" select c).First();
- If the web service exposes a GetCustomerByName(), this would be a good candidate; if not, the implementation would need to call GetAllCustomers() and filter in memory (see the sketch after this list).
- For non relational databases, I am aware of object databases, flat files, temporal and hierarchical (covered separately) - each of which has its own tradeoffs. I am not familiar enough with object databases to say what the tradeoffs are here, but for a flat file, it is going to be a linear scan. The query is not even a valid one for a temporal database (unless there is an implicit CurrentTime).
- For hierarchical data stores, this query would need to iterate over all the customers and compare each name to the query.
- A relational database would think that this is too easy, and quit.
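To make the web service case concrete, here is a minimal sketch of the two code paths such a provider could end up taking. The ICustomerService contract, the SupportsLookupByName flag and the method names are assumptions made for illustration, not a real API.

using System.Collections.Generic;
using System.Linq;

// Hypothetical service contract - purely for illustration.
public class Customer
{
    public string Name { get; set; }
}

public interface ICustomerService
{
    bool SupportsLookupByName { get; }
    Customer GetCustomerByName(string name);    // server-side lookup
    IEnumerable<Customer> GetAllCustomers();    // full fetch over the wire
}

public static class CustomerQueries
{
    public static Customer FindByName(ICustomerService service, string name)
    {
        if (service.SupportsLookupByName)
        {
            // Best case: the query translates into a single remote call.
            return service.GetCustomerByName(name);
        }

        // Fallback: pull every customer across the wire, then filter in memory.
        return service.GetAllCustomers()
                      .First(c => c.Name == name);
    }
}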
And this is just about the simplest query possible. I can't guess what would happen if I happened to want a join between customers and orders.
I get the feeling that I am missing something here, because it sure isn't heterogeneous to me.
Comments
The vision for ADO.NET entities (as described by Jeremy) is exactly the same as my vision for Data2.0.
Base4 (despite some horrible patterns and mixing of concerns in the code) already allows for some of this silo integration, in effect creating an EDM or enterprise data model.
Entities promises to be cool stuff.
Ayende, you perfectly understood the shortcomings of this 'feature'. Consider 10,000 customers and 1 million order rows, with each set in a different store. Fetching all customers from the US with an order in May 2006 will burn the server down to the ground due to the in-memory filtering.
This could be optimized by first filtering the customers, sending the result to the order store, and joining there (or something like that). Though even that is unclear, as you don't know how many orders there are.
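To make that concrete, here is a rough sketch of the two strategies; the store contracts and method names below are invented for illustration and do not come from any real API.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types and store contracts - illustration only.
public class Customer { public int Id; public string Country; }
public class Order { public int CustomerId; public DateTime Date; }

public interface ICustomerStore
{
    IEnumerable<Customer> GetAllCustomers();
    IEnumerable<Customer> GetCustomersByCountry(string country);
}

public interface IOrderStore
{
    IEnumerable<Order> GetAllOrders();
    // Lets the order store do the heavy lifting: only ids and a date range cross the wire.
    IEnumerable<int> GetCustomerIdsWithOrdersBetween(ICollection<int> customerIds, DateTime from, DateTime to);
}

public static class CrossStoreQuery
{
    // Naive plan: drag 10,000 customers and 1,000,000 orders into memory and join there.
    public static IEnumerable<Customer> Naive(ICustomerStore customers, IOrderStore orders)
    {
        var start = new DateTime(2006, 5, 1);
        var end = new DateTime(2006, 6, 1);
        var query = from c in customers.GetAllCustomers()
                    join o in orders.GetAllOrders() on c.Id equals o.CustomerId
                    where c.Country == "US" && o.Date >= start && o.Date < end
                    select c;
        return query.Distinct();
    }

    // Filter-first plan: reduce the customer set in its own store, then ship
    // only the surviving ids to the order store and let it filter and join.
    public static IEnumerable<Customer> FilterFirst(ICustomerStore customers, IOrderStore orders)
    {
        var usCustomers = customers.GetCustomersByCountry("US").ToList();
        var idsWithMayOrders = new HashSet<int>(
            orders.GetCustomerIdsWithOrdersBetween(
                usCustomers.Select(c => c.Id).ToList(),
                new DateTime(2006, 5, 1),
                new DateTime(2006, 6, 1)));
        return usCustomers.Where(c => idsWithMayOrders.Contains(c.Id));
    }
}

The second plan only moves the filtered customer ids across the wire, which is exactly the kind of decision you currently have to make by hand.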
So I asked about this, and whether it would be great to have hints in Linq, or other optimizations, to get rid of this. The answer I got was basically: "No, that's not going to happen. In these scenarios you simply have to know what you're doing".
My conclusion is that this feature looks great on paper and on feature charts ("Look what we can do!"), but in practice no one with a sane mind will use it, because it is completely unusable in enterprise systems: you never know how big your sets will grow over time, so the filter will become too expensive.
You could get smart and first query the stores for their counts. But this gets brittle very fast, IMO.
I think that this is a great feature, but it shouldn't be touted as One Query To Conquer Them All.
Trying to go this way would bring all the Distributed Objects pain back.
You have to know what you are querying, and you have to know what is going on under the hood in order to optimize it.
In the critical performance paths, you need to get as close as you can to what is really happening.
As you point out, the critical performance problems are not always obvious before the system has been in production for a while.
I had a SELECT N+1 in a system of mine that took 9 months to become a problem that someone noticed, for instance.
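For anyone who hasn't run into it, here is a minimal sketch of the SELECT N+1 shape; the repository and types are invented just for the example.

using System;
using System.Collections.Generic;

// Hypothetical repository and types, invented to show the shape of the problem.
public class Customer { public int Id; public string Name; }
public class Order { public int CustomerId; }

public interface ICustomerRepository
{
    IList<Customer> GetAllCustomers();            // query #1
    IList<Order> GetOrdersFor(int customerId);    // one more query per customer
}

public static class OrderReport
{
    // SELECT N+1: one query to load the customers, then N additional queries,
    // one per customer, to load that customer's orders.
    public static void PrintOrderCounts(ICustomerRepository repository)
    {
        foreach (var customer in repository.GetAllCustomers())       // query #1
        {
            var orders = repository.GetOrdersFor(customer.Id);       // queries #2..#N+1
            Console.WriteLine("{0}: {1} orders", customer.Name, orders.Count);
        }
    }
}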
I've had a chat with some people at MS about what I see as the need to extend IQueryable<T> to include the ability to do costing at the provider boundaries. And they are definitely thinking about it.
The idea is that a generalised query implementation can come up with an appropriate plan that spans data silos, with the aim of minimizing what Frans refers to as 'in-memory filtering'.
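As a purely hypothetical sketch of what costing at the provider boundaries might look like (nothing like this exists on IQueryable<T> today, and every name here is invented):

using System.Linq;
using System.Linq.Expressions;

// Invented extension of the provider model - the interface and method are assumptions.
public interface ICostedQueryProvider : IQueryProvider
{
    // A rough cost estimate (say, expected rows returned or bytes moved) for
    // running the given sub-expression entirely inside this provider's store.
    double EstimateCost(Expression expression);
}

public static class CrossSiloPlanner
{
    // Toy planning rule: push a filter down into the provider only when the
    // provider claims it can run it more cheaply than filtering in memory.
    public static bool ShouldPushDown(ICostedQueryProvider provider,
                                      Expression filter,
                                      double inMemoryCost)
    {
        return provider.EstimateCost(filter) < inMemoryCost;
    }
}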
As both of you point out, this stuff is not easy. I know that very smart people like Jim Gray (who will be missed) were thinking a lot about how to make this stuff work.