Ayende @ Rahien

Ask Ayende: Repository for abstracting multiple data sources?

With regards to my recommendation to not use repositories, Remyvd asks:

… if you have several kind of data sources in different technologies, then it would be nice if you have one kind of interface. Also when an object (like Customer) is combined from data out of different data sources, the repository is for me a good place to initialize the object and return it. How would you solve this cases?

My answer is: System.ArgumentException: Question assumes invalid state.

More fully, this is one of those times where, in order to actually answer the question, we have to correct the question. Why do I say that?

Well, the question makes the assumption that actually combining the customer entity out of different data stores is desirable. Having made that assumption, it proceeds to ask what the best way to do that is. I am not going to recommend a way to do that, because the underlying assumption is wrong.

If your Customer information is stored in multiple data stores, you have to ask yourself, is it actually the same thing in all places? For example, we may have a Customer entity in our main database, Customer Billing History in the billing database, a Customer credit report accessible over a web service, etc. Note what happens when we start actually drilling down into the entity design. It suddenly becomes clear that the information is in different data stores for a reason.

"Those aren’t the druids you are looking for" might be a good quote here. The fact that the information is split usually means that there is a reason for it: the information is handled differently, usually by different teams and applications, it deals with different aspects of the entity, etc.

Trying to abstract that away behind a repository layer loses that very important distinction. It also forces us to do a lot of additional work, because we have to load the customer entity from all of the different data stores every time we need it, even if most of that data is not relevant for the operation at hand.

It would be much easier, simpler and more maintainable to actually expose the idea of the multiple data stores to the application at large. You don’t end up with a leaky abstraction, and it is easy to see when and how you actually need to combine the different data stores, and what the implications of that are for the specific scenarios that require it.
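Here is a minimal sketch of what exposing the separate stores might look like, written in Python for brevity. Every name in it (`CustomerStore`, `BillingHistoryStore`, `CreditReportService`, `rename_customer`, `approve_credit_line`) is hypothetical, and the in-memory dictionaries merely stand in for the main database, the billing database and the credit report web service. The only point is that each operation asks for the sources it actually needs, so the cost of touching each one is visible at the call site.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    id: int
    name: str


class CustomerStore:
    """Stands in for the main database (hypothetical)."""

    def __init__(self, rows):
        self._rows = rows
        self.calls = 0  # how many times this store was hit

    def load(self, customer_id):
        self.calls += 1
        return self._rows[customer_id]


class BillingHistoryStore:
    """Stands in for the billing database (hypothetical)."""

    def __init__(self, invoices):
        self._invoices = invoices  # customer_id -> list of (amount, paid)
        self.calls = 0

    def unpaid_total(self, customer_id):
        self.calls += 1
        invoices = self._invoices.get(customer_id, [])
        return sum(amount for amount, paid in invoices if not paid)


class CreditReportService:
    """Stands in for the remote credit report web service (hypothetical)."""

    def __init__(self, scores):
        self._scores = scores
        self.calls = 0

    def score(self, customer_id):
        self.calls += 1
        return self._scores[customer_id]


def rename_customer(customers, customer_id, new_name):
    # Touches only the main database; billing and credit data are irrelevant here.
    customer = customers.load(customer_id)
    customer.name = new_name
    return customer


def approve_credit_line(billing, credit, customer_id, min_score=650):
    # This scenario genuinely needs two stores, and the signature says so.
    no_debt = billing.unpaid_total(customer_id) == 0
    return no_debt and credit.score(customer_id) >= min_score


customers = CustomerStore({1: Customer(1, "Acme")})
billing = BillingHistoryStore({1: [(100.0, True), (250.0, True)]})
credit = CreditReportService({1: 700})

renamed = rename_customer(customers, 1, "Acme Ltd")
approved = approve_credit_line(billing, credit, 1)
```

Renaming the customer never touches the billing database or the web service, while the credit decision names both of its dependencies explicitly. A single repository that assembled one big Customer would have paid for all three on every call.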

Posted By: Ayende Rahien

Comments

Frank
01/17/2012 10:04 AM by
Frank

I guess you mean droids instead of druids? ;)

tobi
01/17/2012 10:13 AM by
tobi

All of this is true.

Also, it is not possible to hide the mechanics of different database types like SQL and key-value. Of course it is possible, but it causes terrible performance and abstraction problems.

Frank Quednau
01/17/2012 10:47 AM by
Frank Quednau

Indeed, Ben Kenobi was protecting droids, not druids.

Invalidating the question is a simple trick, but you borrow it from philosophy. In the depths of IT departments philosophy is a meek progeny of our glorious brains.

A business may certainly want to see the bigger picture of "A customer" as, philosophically speaking, slicing a system into modules may blur the sight on essential features of the entity in question. In fact, if the question assumes invalid state, it can rightfully do so, since especially in times of change (i.e. always) invalid state may be your only companion in your despair, as you try to adapt the slices to the forces of management.

I would expect that you would advise the business to gather the data into a DWH?

Then comes the question of acting upon the new, integrated view of your customer. Now you need some piece of IT that allows somebody to act against the customer based on the integral view that you have provided.

What now?

satish
01/17/2012 10:50 AM by
satish

Shared database is one type of distributed architecture. Do you mean to say that doing it that way is wrong?

tarwn
01/17/2012 11:46 AM by
tarwn

I think, like any pattern, there are places where Repository is overkill, but it can also be used to provide direct benefit or mitigate risk. I've had several cases where a system needed to transparently fail over from a local store to a web service, or vice versa, to retrieve an entity. I've had one case where an interface was fed to a strategy object for processing: 3 implementations loaded filesystem data and one loaded database data, but all appeared to be simple repositories (get all records, get specific record, delete record). I've also had cases where things were up in the air enough that using repository patterns ended up reducing a lot of the impact of later changes without requiring us to try to make infinitely flexible code up front (the abstraction allowed us to rip out and change the earlier implementation with minimal impact on the rest of the app).

If you need the ability to choose between lazily and eagerly loaded data, then coding a repository call that returns a lazy set, and a second that calls the first and returns an eager one (if you in fact need both), is no different from coding two queries directly in your business logic, one with a ToList() on the end. The difference is that, while you make the decision in your business logic, the implementation isn't embedded in it, so it has only slightly more knowledge of the data than it would otherwise (although still much less than if it had the query or mapping coded directly in it).

I think in many cases the business logic doesn't need to know that it got Customer from one data store and Billing from another. Yes, I would make them separate entities with separate behaviors, but my application doesn't need to know they have separate homes, that's what I have a data access layer for. Let the Data Access Layer figure out the proper way to save and load data, the business logic needs all the clarity it can get.
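The lazy versus eager point in this comment can be sketched as follows. This is only an illustration in Python, not tarwn's code: `OrderRepository`, `iter_orders` and `list_orders` are hypothetical names, a generator plays the role of the lazy set, and `list(...)` plays the role of `ToList()`.

```python
class OrderRepository:
    """Hypothetical repository over an in-memory list standing in for a data store."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0  # counts how many rows have actually been pulled

    def iter_orders(self, customer_id):
        # Lazy: a generator, so no row is read until the caller iterates.
        for row in self._rows:
            self.rows_read += 1
            if row["customer_id"] == customer_id:
                yield row

    def list_orders(self, customer_id):
        # Eager: reuses the lazy call and materializes it, like ToList().
        return list(self.iter_orders(customer_id))


repo = OrderRepository([
    {"id": 1, "customer_id": 7, "total": 40},
    {"id": 2, "customer_id": 8, "total": 15},
    {"id": 3, "customer_id": 7, "total": 25},
])

lazy_orders = repo.iter_orders(7)          # nothing read yet
reads_before_iteration = repo.rows_read
eager_orders = repo.list_orders(7)         # reads every row now
reads_after_eager = repo.rows_read
```

The caller still decides between lazy and eager, but only by picking a method name; how the rows are fetched stays inside the repository.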

RichB
01/17/2012 12:02 PM by
RichB

Let's say I want to migrate my 100-table system away from Oracle to RavenDB. There are two choices:

1) Big Bang 2) Gradual

c.f. Joel Spolsky for the reasons against Big Bang.

So we're left with Gradual. At what level would you provide the abstraction which makes the migration easy? via Repositories?

Alireza
01/17/2012 12:15 PM by
Alireza

I've been following your code review posts for a while now, but I still haven't found out how you mock the data layer for unit tests without using any abstraction?

James McKay
01/17/2012 12:24 PM by
James McKay

@RichB: You're planning for something that isn't likely to happen, and even if it does, it's still going to be a Big Bang no matter how many layers of abstraction you plaster over it, simply because the two are based on very different architectures and the abstraction will be as leaky as anything.

Either that or you'll come up with an architecture so restrictive that you end up with all sorts of hideous performance problems that you can't do anything about.

__Joker
01/17/2012 12:30 PM by
__Joker

@Alireza, AFAIK, one argument of Ayende's that I remember applies when you are writing your repositories on top of NHibernate's ISession. Ayende asks: why not mock ISession if you want to test any dependencies which depend on ISession?

Phillip
01/17/2012 01:39 PM by
Phillip

@Alireza - he mentioned it in the comments of his last post: an in-memory DB for NHibernate.

Victor Kornov
01/17/2012 02:44 PM by
Victor Kornov

What Oren is discussing here is called Bounded Context (http://lmgtfy.com/?q=bounded+context). Simply put, there is no such thing as a unified Customer entity. All that data (Billing History, credit report, etc.) has indeed been placed into different data stores for a reason: all those aspects are independent bounded contexts.

Mark
01/17/2012 03:58 PM by
Mark

Excellent posts lately Oren!

Chris
01/17/2012 04:43 PM by
Chris

@Ayende, I agree on the fact that dropping the repository in favor of using a pure O/RM simplifies the app, but on the other hand, doesn't it violate the law of demeter? My example would be an asp.net mvc controller for handling operations on a Foo object with the pure session/context passed in - allows you to do what you want, but why should a Foo controller have access to a Bar collection which is exposed in the session/context (assuming that it doesn't want to do anything with it)?

njy
01/17/2012 04:55 PM by
njy

Yes. And no. Your advice to pay more attention to the question itself is good. Your advice that usually it is not the right track, even if it seems so, is good too. The fact that you seem not to consider a case where this may be applicable is, well, simply wrong.

There's value in being able to unify the data access strategy among heterogeneous datasources (if used carefully, and i repeat: IF USED CAREFULLY).

Ayende Rahien
01/17/2012 05:41 PM by
Ayende Rahien

Frank, A data warehouse is typically a good thing to do for reporting concerns, yes. For OLTP / operational ones? Not so much. Trying to have a unified data model is almost always a mistake, for versioning, concurrency, management and coordination reasons.

Ayende Rahien
01/17/2012 05:42 PM by
Ayende Rahien

Satish, Depending on what you are doing, yes. While I wouldn't categorically say it is wrong, I dislike it. That is, Shared Database is usually going to lead you into problems down the road. Your database becomes your contract, and you can't make changes there, you have to coordinate with many people on anything that happens, you have to worry about other people messing you up, etc. I would much rather have isolated databases and a shared reporting DB.

Ayende Rahien
01/17/2012 05:43 PM by
Ayende Rahien

Alireza, That is because I am not doing that. I am testing my code against an in-memory database.
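As a rough illustration of that testing style, here is a sketch that uses Python's built-in `sqlite3` module with a `:memory:` database. It is not the NHibernate setup mentioned in the thread. The `customers` table and the functions `add_customer`, `find_customers_by_prefix` and `make_test_connection` are assumptions made up for the example. The data access code runs real SQL against a real (if throwaway) database, so nothing has to be mocked.

```python
import sqlite3


def add_customer(conn, name):
    # Code under test: talks to the database directly, no repository in between.
    cur = conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))
    return cur.lastrowid


def find_customers_by_prefix(conn, prefix):
    rows = conn.execute(
        "SELECT name FROM customers WHERE name LIKE ? ORDER BY name",
        (prefix + "%",),
    )
    return [name for (name,) in rows]


def make_test_connection():
    # ":memory:" creates a private database that disappears when closed.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    return conn


conn = make_test_connection()
add_customer(conn, "Acme")
add_customer(conn, "Acorn")
add_customer(conn, "Zenith")
found = find_customers_by_prefix(conn, "Ac")
```

Each test can build a fresh connection this way, so tests stay isolated from one another without a mock of the data layer.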

Ayende Rahien
01/17/2012 05:45 PM by
Ayende Rahien

Chris, The law of demeter is nice, except that it ignores a lot of other factors.

For example, with the RavenDB session, we moved a lot of operations into the session.Advanced property, just because it made the intellisense easier.

Fluent interfaces are also in violation of Demeter, etc, etc.

Ayende Rahien
01/17/2012 05:48 PM by
Ayende Rahien

Njy, You make a wrong assumption here, you are starting out by assuming that the rare scenario is in effect here. I am assuming that this is the common scenario, and only if I get additional data, then I'll have answers for the rare scenario.

Trying to say, "you do it only if" will cause people to turn their head off for the if and just go ahead and do the thing that you don't want.

In the same sense that you say "never hitch a ride with a stranger", emphasis on the never. But there are a lot of exceptions for that. For example, if you are in a burning field and the only way out is to hitch a ride on a car driven by someone you don't know or be burned alive... But that isn't what you focus on.

Chris
01/17/2012 05:51 PM by
Chris

@Ayende, fair point. Let me rephrase that - do you consider the fact of passing 'too much' along with the session a problem? If so, how would you fix that?

Ayende Rahien
01/17/2012 05:53 PM by
Ayende Rahien

Chris, I don't understand the question, can you expand on that?

njy
01/17/2012 06:00 PM by
njy

Again, yes and no, imho.

Trying to say, "you do it only if" will cause people to turn their head off for the if and just go ahead and do the thing that you don't want.

Yep Ayende, stupid people are everywhere :-) !

I mean, the same could be said for "don't use RDBMS if..." or "don't use RavenDB if...".

What i'm saying is that you are actually quite right but, since you seem to not consider the possibility of a case where repos may make sense, i just pointed that out (that is, in fact, there are cases where it could make sense).

Case in point: you seem to take for granted that you can model the data storage and its infrastructure: in one of my previous projects (last year) i had to work with external data, provided by third parties, and available only as a custom xml feed from a set of remote jsp webservices. Being able to let the team work with that data in a standard and consolidated way (even before the data was available, using fake in-memory data) was a huge win for us.

In the end, i repeat: you are correct for almost all of the common cases, but consider other cases and situations, they happen too.

Whaddaya think?

Ayende Rahien
01/17/2012 06:04 PM by
Ayende Rahien

Njy, I do consider those rare cases, but unless I actually have a reason to speak about those scenarios, why do so? Again, "never hitchhike".

Chris
01/17/2012 06:06 PM by
Chris

Let's take the example I had earlier - so a controller with capabilities for doing some operations on objects of a single type. If you have a dependency in it for a session or a db context then it's not possible to reuse that controller elsewhere (e.g in a different application). If on the other hand, the controller would have a dependency on a repository interface, then it would be reusable, provided that the session (or whatever is used to wrap the session) implements that particular interface.

So my question is how would you achieve this type of re-usability?

Ayende Rahien
01/17/2012 06:10 PM by
Ayende Rahien

Chris, Can you really think about common scenarios where you actually reuse a controller? Controllers are really part of the application, they are about what the application is doing. Moving them to another app makes very little sense.

njy
01/17/2012 06:18 PM by
njy

In general, i tend not to use words like "always" or "never", unless accompanied by "except in rare cases".

Because, i mean, people may actually decide to NEVER or ALWAYS do this or that thing. Saying NEVER use repos, when there are cases where it may make sense, is missing something. I know it may sound like nitpicking, it's just that it has happened to me a lot of times to hear people say stuff like "in university, they teach us to NEVER denormalize data!" or "he said to NEVER use a nosql" or, like in this case, repos.

Just that, cheers and keep up the good work with Raven & co.

njy
01/17/2012 06:21 PM by
njy

I'll add: you originally quoted the phrase "... if you have several kind of data sources in different technologies ..." and that is exactly my situation 1 year ago.

I've used that thin repo abstraction, it worked wonderfully, and everything went fine. And there was no better alternative that i can speak of.

So, i give to you guys my own experience with that, saying that sometimes it may be useful.

Last thing, just to be clear: it probably hurts more to not use a repo when you actually can, than to use it when you should not.

njy
01/17/2012 06:25 PM by
njy

Sorry, my last phrase should be read the opposite way. This is what i intended:

"Last thing, just to be clear: it probably hurts more to use a repo when you should not, than to not use it when you actually might have used it."

Sorry for the rapid triplette.

Chris
01/17/2012 06:28 PM by
Chris

Sure - embedding blogging, e-commerce or live chat capabilities into a custom web app. This is actually something that I had to do recently.

Ayende Rahien
01/17/2012 09:08 PM by
Ayende Rahien

Chris, Huh? Why are you building this into the app? Those are separate apps that need to integrate at the UI level. A module based approach would be much better. Not to mention that something as simple as different user definitions would require you to do a LOT of work even if you could try to abstract things away.

Chris
01/17/2012 09:55 PM by
Chris

Not sure that I understand you clearly. I'm building this into the app because these are the requirements. But my guess is that we have slightly different definitions of 'building into'. For me it means 'providing the functionality'.

User definitions are the same here (yes I know that this is a strong assumption, but I had the comfort of creating this from scratch so no collision here)

Modular approach - I totally agree, but again, I have a feeling that we have different things in mind. Can you elaborate on that?

Chris
01/17/2012 10:05 PM by
Chris

on a related note, when is macto going to see the day light? I'm really anxious to see the code.

Ryan
01/18/2012 12:26 AM by
Ryan

Chris - that doesn't make much sense. There are dozens of plugins you can embed in your site to get chat, ecommerce, and blogging. This has no relation to choosing to use the repository pattern. It's purely a UI concern. Sure you might need integrated authentication, but your main app should have no idea that these other services exist. They are aspects of the web experience.

Riccardo
01/18/2012 10:05 AM by
Riccardo

If I'm a developer responsible for the database, I'm not happy if UI developers can interact freely with a model exposed from a context. I'd like UI developers to interact with a fixed and easy to control interface, as a repository is.

Frank Quednau
01/18/2012 10:30 AM by
Frank Quednau

Ayende, I didn't want to insinuate that accessing a DWH in an OLTP context is a good idea.

What I am trying to say is that your bounded contexts may shift for various forces of change. Short of doing great migration projects a (maybe interim) solution may be to have UIs accessing several systems.

With regard to the modular approach of UIs - I think there is some handwaving done at that end of the system. If you look at an iPad, that's certainly a modular UI with pretty weak integration between the different modules. If you imagine some process spanning several bounded contexts, the UX experience would be rather disgusting.

RichB
01/18/2012 12:24 PM by
RichB

@JamesMcKay I fully believe in YAGNI. But also know that sometimes a problem changes and you do need it. Case in point: my current system uses an ORM because of a mandated change from Oracle to SQL. Previously, it was SP based due to some misguided architecture from the original devs 8 years ago.

This system was migrated without a big bang approach. At the point of SQL GoLive, the system could run on either database.

Ayende Rahien
01/18/2012 06:06 PM by
Ayende Rahien

Riccardo, Again, the problem is in the problem statement. I don't believe in UI and DB developers. The very fact that you structured your team this way means that you already structured your project this way. See Conway's Law. You shouldn't do things like that up front; first choose an architecture, then build the team structure.

And I would strongly recommend vertical division, not horizontal division.

Ayende Rahien
01/18/2012 06:09 PM by
Ayende Rahien

Frank, Bounded contexts shift over time, but very very slowly, and usually as a result of shifting responsibilities inside the org itself. Bounded contexts represent how the org views itself, a reflection of how it behaves. If you have a system whose contexts are drastically different from the internal organization structure, it is a bad system for the org, and likely to be causing problems.

You are giving the wrong example with the iPad, that isn't a single system, it is an OS with a shell, no commonality except UI guidelines. When talking about bounded contexts, we always talk about them inside a system / application, not generically.

Mark Nuttall
01/19/2012 12:52 AM by
Mark Nuttall

Some examples of "mixing" data sources:

http://www.jboss.org/teiid - "Teiid is a data virtualization system that allows applications to use data from multiple, heterogenous data stores"

On this next link, look at "Chapter 22. Cross-store persistence" for storing part of the data in a Graph Database and the rest in an RDBMS: http://static.springsource.org/spring-data/data-graph/snapshot-site/reference/pdf/spring-data-neo4j-reference.pdf

Additionally, you can store your data in an RDBMS (or non-RDBMS) and use an index like Lucene as a datastore. In fact, the above example does all 3.

Another example would be to store some things in a content repository (for you Java people - JCR) with metadata and the rest in an RDBMS (or pick your datastore).

Also take a look at http://www.jboss.org/modeshape

Ayende Rahien
01/19/2012 12:59 AM by
Ayende Rahien

Mark, The Neo4j reference and the Lucene example aren't about multiple data stores, not logically. They are simply replicas of the information in a way that is easier to consume in some manner. No different from creating a lookup table or denormalization in an RDBMS.

The first example I am not familiar with, but just from the description, it frightens me. Different data sources have different behaviors, abstracting that away leads to trouble.

Mark Nuttall
01/19/2012 01:37 AM by
Mark Nuttall

On the Lucene/Neo4j - Correct. Well maybe, because they might not contain all the same data. But they do work "seamlessly". I was just pointing that out, so don't get caught up in that detail. Look at the bigger picture. The Neo4j/RDBMS example is about multiple datastores. Think about the context and don't nitpick it so much. I was just throwing things out there.

As for the second example, I understand your fear. Is it right for all things? No. Does abstracting away lead to trouble? Not always. Isn't NHibernate abstracting the RDBMS? Additionally the Repository and DAO patterns abstract the data store. FYI, JBoss (aka Red Hat) is the company that manages Hibernate...

I am not surprised you are not familiar with it. Then again, I am sort of surprised you are not. What I mean is that you are a pretty smart guy (so i would think you would keep up on things), but you seem to be in the .NET box (i.e. not looking at anything not happening in the Microsoft world).

Ayende Rahien
01/19/2012 03:48 AM by
Ayende Rahien

Mark, There is a huge difference in having a secondary index in an external data store and actually having the relevant data in a separate data store. Note the terms I am using, secondary vs. separate, index vs. data.

I am quite familiar with both Red Hat and JBoss, and you can safely assume that I have some knowledge about Hibernate.

And while I will freely admit to being focused on the .NET stack, I am most certainly not "not looking at anything not happening in the Microsoft world".

Abstraction is something that you have to approach carefully. It has its own set of costs, and you have to be aware of leaky abstractions.
