The fallacy of IRepository
I haven't had a controversial title in a few days :-)
The cause for this post is a post by Rob Conery in which he suggests removing the RDBMS from the equation, at least during development. In particular, he suggests using an OODB during development and switching to an RDBMS late in the game, when going to production. The reason for doing so is to reduce the friction of having to maintain a database during development.
One thing that I feel I have to point out: the DB is a friction point only under certain conditions. If your tool doesn't support your rapid changes, that is usually an indication of a problem with the tool. I can tell you that on my most recent project, there wasn't a day in which the DB schema of the project wasn't changed in some way, and several times I had to make significant changes (rearranging the entire model). NHibernate takes away a lot of the pain of using a DB, because you don't really care about what is going on. And using Active Record attributes or Fluent NHibernate makes it an even easier task. I don't know what the state of convention-based configuration is for Fluent NHibernate, but that is a very promising direction.
Anyway, that is not the point of this post.
I agree with a lot of the points that Rob is making, and I'll expand on them in an additional post, but right now I wanted to actually address a comment I made on Rob's post, which I feel wasn't clear enough.
One big problem is that for most applications, trying to change OODB to RDBMS would not work without a LOT of work.
There are a couple of things that I tried to put into a very terse comment. The first is that you should practice the way you play, and that includes putting any constraints that you have for production into the development environment. But even this isn't the point of this post.
If you look at the title, you'll see that I am decrying the fallacy of IRepository. In particular, this is what I disagree with:
Hi Oren - if you implemented IRepository<T> as I've done here, how would this not work? Can you be specific in terms of "a lot of work" and what that means?
In this case, the problem is that the interface for IRepository contains a lot of unspoken assumptions about the way you deal with persistent storage. Let us take the example of moving an IRepository between an OODB and an RDBMS. OODB query access patterns are completely different from the ones that you would use for an RDBMS. A trivial difference that has profound implications is getting a Blog with all its Posts and all their Comments. The only way of doing this with an RDBMS is using joins (in a single statement), which is going to cause a Cartesian product, which is expensive in the DB and has to be dealt with in the app layer. In the case of an OODB, you just let the OODB handle that and move on. It is not using relational algebra, and it can handle this specific scenario pretty well.
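To make the join scenario concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the RDBMS. The table layout, column names, and sample data are assumptions made up for illustration; they are not taken from Rob's code or from any NHibernate mapping. It shows the single-statement join returning one row per comment, with the blog and post columns repeated on every row, and the app-layer code that has to fold those rows back into an object graph.

```python
import sqlite3

# In-memory database with a hypothetical Blog -> Post -> Comment schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Blog    (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Post    (Id INTEGER PRIMARY KEY, BlogId INTEGER, Title TEXT);
    CREATE TABLE Comment (Id INTEGER PRIMARY KEY, PostId INTEGER, Body TEXT);
""")
conn.execute("INSERT INTO Blog VALUES (1, 'My blog')")
conn.executemany("INSERT INTO Post VALUES (?, ?, ?)",
                 [(1, 1, 'First'), (2, 1, 'Second')])
conn.executemany("INSERT INTO Comment VALUES (?, ?, ?)",
                 [(1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c'),
                  (4, 2, 'd'), (5, 2, 'e'), (6, 2, 'f')])

# One statement that loads the whole graph via joins.
rows = conn.execute("""
    SELECT Blog.Id, Blog.Title, Post.Id, Post.Title, Comment.Id, Comment.Body
    FROM Blog
    JOIN Post    ON Post.BlogId = Blog.Id
    JOIN Comment ON Comment.PostId = Post.Id
    ORDER BY Post.Id, Comment.Id
""").fetchall()

# The app layer has to de-duplicate the repeated blog and post columns.
blog = None
posts_by_id = {}
for blog_id, blog_title, post_id, post_title, comment_id, body in rows:
    if blog is None:
        blog = {"id": blog_id, "title": blog_title, "posts": []}
    post = posts_by_id.get(post_id)
    if post is None:
        post = {"id": post_id, "title": post_title, "comments": []}
        posts_by_id[post_id] = post
        blog["posts"].append(post)
    post["comments"].append(body)

print(len(rows))            # one blog, but 6 rows come back
print(len(blog["posts"]))   # 2 posts after folding the rows together
```

With 1 blog, 2 posts, and 3 comments per post, the statement returns 6 rows, and the blog's columns travel over the wire 6 times. An OODB client would typically just navigate blog to posts to comments without this folding step, which is the kind of unspoken difference an IRepository interface does not surface.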
Let us look at it from the other direction now. All my recent IRepository implementations have been using the future pattern, in which they return an IEnumerable<T> implementation, which is aggregated with all the queries for the request and then sent to the DB as a single remote call. That works really well. But what is going to happen if the OODB doesn't support this notion? (A cursory search didn't reveal anything enlightening, so I am assuming it is not supported for now.)
Your code previously assumed 1 remote call for N queries, but now you are faced with N remote calls for N queries. Even assuming that each query time is constant, the performance difference between the two is significant and crippling.
IRepository is a good way of decoupling you from the nitty-gritty details of how things work, but it doesn't decouple you from the abstract notions. Not for any real world implementation, at least.
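The difference between 1 and N remote calls can be sketched as follows. This is a rough illustration of the future pattern in Python, not NHibernate's actual API: `FakeRemoteDb`, `BatchingSession`, `Future`, and the string "queries" are all hypothetical names invented for the example, and the fake database only counts round trips.

```python
class FakeRemoteDb:
    """Hypothetical stand-in for a remote database; counts round trips."""

    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def execute_batch(self, queries):
        # Every call to this method models one remote call.
        self.round_trips += 1
        return [list(self.data.get(q, [])) for q in queries]


class Future:
    """Lazy result: nothing is sent until it is first enumerated."""

    def __init__(self, session, index):
        self._session = session
        self._index = index

    def __iter__(self):
        return iter(self._session.result(self._index))


class BatchingSession:
    """Collects queries and sends all pending ones in a single batch."""

    def __init__(self, db):
        self._db = db
        self._pending = []   # every query registered so far, in order
        self._results = []   # results for the queries already executed

    def future(self, query):
        self._pending.append(query)
        return Future(self, len(self._pending) - 1)

    def result(self, index):
        executed = len(self._results)
        if executed < len(self._pending):
            # Flush every query that has not been executed yet in one call.
            self._results.extend(
                self._db.execute_batch(self._pending[executed:]))
        return self._results[index]


data = {"blogs": ["b1"], "posts": ["p1", "p2"], "comments": ["c1"]}

# Future pattern: three queries registered, one remote call.
db = FakeRemoteDb(data)
session = BatchingSession(db)
blogs = session.future("blogs")
posts = session.future("posts")
comments = session.future("comments")
blog_list = list(blogs)        # first enumeration triggers the batch
post_list = list(posts)        # already loaded, no extra call
comment_list = list(comments)  # already loaded, no extra call
batched_trips = db.round_trips   # 1

# Store without batching support: same repository calls, one call each.
naive_db = FakeRemoteDb(data)
for query in ("blogs", "posts", "comments"):
    naive_db.execute_batch([query])
naive_trips = naive_db.round_trips   # 3
```

The calling code looks identical in both cases, which is the point: the repository interface hides whether the store batches, but the cost of calling it does not stay hidden.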
Comments
Linq would make this transition easier, I think. Linq provides another level of abstraction so that you don't have to do joins _manually_. Even so, I have to admit that this assumption is usually not a good one, because not all the providers are perfect, and one has to work around problems in one provider and not in the other. I mean, db4o may not fail for one specific expression but Linq 2 SQL would (or yield a perf problem such as N+1), and this would hide the problem until the transition to SQL occurs.
Correction, I think a Cartesian product is where you don't have a join?
"The only way of doing this with RDBMS is using joins (in a single statement), which is going to cause Cartesian product, which is expensive in the DB and have to be dealt with in the app layer"
Not true. Batched nested result sets is another approach.
Why not stick with the OODB? It depends on your needs, of course, but colleagues have been very happy with Cache on a large project, and it was faster than Oracle or SQL Server. And that can of course be down to bad query design :)
-Mark
You mention that your current IRepository implementations uses the future pattern. Have you got an example of it up anywhere I could read, please?
Using db4o remotely appears to be a relatively rare scenario; batching becomes much less useful when everything is in process.
The repository interface I have chosen where I intend to take db4o into production is narrower than Rob's - my generic base only supports fetches of aggregates by their identifier. It's just too dangerous to expose its pretty immature linq support (it's quite possible to confuse it with even simple queries that involve generics) outside of the repository, as well as it lacking index support in some scenarios. It can't construct an index on object types for one thing, and I have an EAV scenario where the types of some things cannot be known at compilation time.
I am building a Lucene index on commit and using that for complex queries, as well as supporting projections from stored index fields to avoid activations in situations where I do not want to have to drill into a large number of returned aggregates.
Why not just prove it - create a sample application supporting first an OODBMS and then switch it to an RDBMS... It doesn't even have to be UI based, just a little bit of core going beyond a two table/object setup, and some real world queries as test cases.
Lot of warm air would be spared...
I guess it would depend on the way you work. Personally, for every service I develop, I first start with building the domain model, i.e. a set of POCO classes that represents the domain of the application I'm creating, with all the data and relations it needs to maintain. The end result is the 'ideal domain model' without any considerations for an RDBMS back end or UI front end. To persist the domain objects to an RDBMS I would translate the model to a set of nHibernate data classes (which are mapped 1:1 with a db table) and save that. Basically, my nHibernate data classes do not represent my domain model; they are just relational data objects I use to persist to an RDBMS.
To move to an OODB I just do away with the translator classes and just persist the domain model directly.
DB4O is a very functional OODB, with transactional support and optional client/server access. I use it for all my transactional and 'process services' (i.e. transient data) though for my repository/catalogue db's I still opt to use a RDBMS as I still like the security/ future proofing/full-text searching that a mature RDBMS can provide.
Simon,
ayende.com/.../Future-Query-Of-implemented.aspx
Check this and also the implementation in that link.
Tobin,
Huh?
select * from Blog join Post on Blog.Id = Post.BlogId
will result in Cartesian product of each blog being repeated for each of its posts.
Andrew,
That is not a single statement.
You can see that I referred to that later in the post.
Ryan,
You just made my point.
In particular, the part about using the DB locally vs. remotely has huge implications on the application.
Bunter,
something like that would take _time_. I don't have much of that.
Thanks Ayende
I thought that a query with join conditions isn't a cartesian product? Basically, I thought the cartesian product would be this:
select * from Blog, Post
@Tobin,
It's because he's referring to an Outer Join.
@Ayende,
Sure, but it is a single _call_. And, one that returns much less redundant data and can span a wider/deeper load graph.
@Andrew
Thanks. Wouldn't an outer join just ensure rows are returned even if a join condition is not satisfied? It wouldn't result in a cartesian product, would it? Maybe I need to hit the SQL books again!...
Heard the news about Microsoft btw, congrats :)
Tobin,
Yeah, this could just be some loose terminology.
Extremely jealous of your guitar setup!
"The only way of doing this with RDBMS is using joins (in a single statement), which is going to cause Cartesian product, which is expensive in the DB and have to be dealt with in the app layer."
Why does it "have to be dealt with in the app layer" ? If you have either .SelectMany() from Repository or some way to say that you want to preload all posts/comments (considering that you do want to preload them and lazy load does not work well enough), why should the app layer care about how the Repository does it?