More on Data Layer Componentization

time to read 11 min | 2026 words

A while ago I commented on Alex's data-layer componentization idea. Alex replied, and Diego as well, but I didn't get around to answering them until now.

To remind you, the basic idea is:

A data-layer component simply allows for information to be shared between different applications.

By supporting re-use, data layer componentization allows you to break the duplication habit. You can still have silos, but the natural tendency to duplicate data disappears. Each new silo will instead contain only new types of data; where old types are required, simply pointers, or cross-silo foreign keys.

I am going to get to answering their points in a bit, but one thing that I do agree with is the idea of a DataServer; to quote Alex:

I like to think of this as a DataServer as opposed to a Database. The idea being that a DataServer is the place you go to get data in the shape that makes sense for you (i.e. the conceptual model), and is responsible for getting that data from one or more data-sources or databases on your behalf.

Except that I see this in an entirely different light. I don't like the idea of distributing the data over many sources; I would much rather pull all that information into my own DB, and shift the problem from a downed application to stale data. That can be a problem, but it depends on the application at hand. See the end of the post for my ideas about this; for now, I want to get to Alex's points.

I mentioned that taking dependencies on several databases causes a severe decrease in the reliability of the system; any of them going down will cause the entire system to go down:

Of course systems with multiple dependencies are less robust, but the alternative opens you up to relying on out of date or incorrect data, which hardly seems any better?

It really depends on the application at hand. Most business applications can work with stale data without much problem, and recovering from that isn't really an issue. It is far less common to find a scenario where a downed application is preferable to a possibly stale answer.

Another point that I raised was that changing a single application can ripple to all the other applications in the organization.

This is essentially the dual schema problem, n'est-ce pas? So you handle it the same way you always handle the dual schema problem: with an intermediary conceptual model. Incidentally, there is no reason you have to have only one conceptual model for any given data model either. If one application changes and another doesn't, create a different conceptual model.

That is actually the N'th schema problem. I am nervous about making changes to one application just because another has changed, much less N other applications. That is Shotgun Surgery on a grand scale. Even if we assume that we merely need to update the mapping layer, I have had enough problems with that to stop me from calling it simple or painless. And if we are talking about a change that affects the conceptual model as well, this gets even more interesting.

Another concern was about the business logic that is required to actually do something with the data. Using this data server approach, you need the business logic to make sense of the data, and if that business logic changes in the original application, what happens to the rest of the applications?

I’m not claiming that the ability to share data-layers suddenly negates the need for a business layer! In fact by sharing data-layers it finally becomes possible to share business layers. That is what Base4 tried to do, it was after all a client/server ORM.

I am not seeing how you can share a business layer from multiple applications in a single application. You can build a web service that returns the data already processed, but that is not sharing the business layer of another application; that is asking the other application for the data. I mentioned security as well: row and column level security that is bound by business rules. Alex gave much the same answer for this, and I have the same reply: I can't see it working after the first change.

Cross-database FKs:

Until relational databases matured I am sure the management of standard FKs was tricky too. We need to recognize that we are in the age of distribution; all the data necessary often can't be in one place. It is an interconnected world and we need to understand this. This is why we need data-servers, not just databases.

Nevertheless, this doesn't solve the problem. And trying to do cross-DB FKs is a painful performance issue, with no really good answers. That leads well into the problem of performance when you need to pull data from disparate sources:

Yes, doing joins across databases is likely to be slow, but using this performance issue to say it shouldn't be done is a little like saying downloading a movie from the internet takes a long time, so we should never do it. I don't care if it takes a long time, I want to see the damn movie!

That is all fine and good, but saying that you don't care about performance stops when you realize that you have to send the result back to the client in 500ms or less. There were few to no movie downloads in the 90s, because download speed and cost made them impractical in the extreme. Only when it got to the point where you could download a movie fairly easily did they become really popular.

There are two kinds of performance problems: the small ones that you don't care about in advance (calling the DB in a loop is a favorite of mine), and the architectural ones that you really want to at least consider ahead of time. Calling external resources is right at the top of that list as far as I am concerned. If, in order to serve a request, I need to make 5 remote requests by design, then I am going to think really hard before I go on. It is simply not the place I want to start from.
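The "calling the DB in a loop" problem can be sketched in a few lines. All the names here are made up for illustration; the fake database just counts round trips so the difference between the looped and the batched version is visible:

```python
# Sketch of the DB-in-a-loop (N+1 query) problem. FakeDb is hypothetical;
# it only counts round trips to make the cost difference visible.
class FakeDb:
    def __init__(self, orders):
        self.orders = orders
        self.round_trips = 0

    def get_order(self, order_id):
        self.round_trips += 1  # one round trip per call
        return self.orders[order_id]

    def get_orders(self, order_ids):
        self.round_trips += 1  # one round trip for the whole batch
        return [self.orders[oid] for oid in order_ids]

db = FakeDb({1: "a", 2: "b", 3: "c"})

# N+1 style: one round trip per order.
looped = [db.get_order(oid) for oid in [1, 2, 3]]
loop_trips = db.round_trips

db.round_trips = 0
# Batched: a single round trip for the same result.
batched = db.get_orders([1, 2, 3])
print(loop_trips, db.round_trips)  # prints: 3 1
```

With real remote calls the same shape applies, only each round trip costs milliseconds instead of a counter increment.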

Just because something has issues, doesn’t mean it isn’t useful. I am not claiming that data layer componentization solves every problem; rather that it is a very nice trick to have in the arsenal.

No argument here. I certainly think that Alex's idea has a lot of merit; I am just trying to point out the other side. I don't think that this is something that is quite relevant yet.

Alex then went on to point out the flaws in my own preferred approach, bringing all the data into the application's DB using some sort of an ETL process:

For example Ayende's preferred route of using an ETL process to create local copies.

  1. More often than not developers don't actually know that there is another database that has the information they need or simply don't bother with an ETL process anyway.
  2. Often times every database thinks it is the master, and the natural result is no database can be trusted.
  3. Sometimes it is just not practical to duplicate all the data: sometimes there is just too much, or the cost of transfer is too high, or it takes too long, or you can't tolerate dirty data, or you have too many places to copy that data to.
  4. In order to avoid (2), unwieldy and error prone ‘human’ processes need to develop, people need to remember to go into 2, 3, 6, 15 (?) systems to keep everything in sync.

To answer the points in order:

  1. If they don't know there is another DB, how would they access the data in the first place? I find it hard to believe that a developer will think "Hm, I guess I need the customer data, let us just invent one".
  2. Defining who is the master of the data is often critical, absolutely. Master-Master merging can be a PITA of a high order.
  3. I can usually copy parts of the data, or do diffs, to decrease the cost of moving the data. The point about stale data is a very good one; I'll answer it at the end of this post.
  4. I don't think so, in order to avoid the master/master problem you need to define who the master is, and how you go about updating that piece of the data.
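The "copy parts of the data, or do diffs" idea from point 3 can be sketched as an incremental sync: instead of re-copying the whole table, only rows changed since the last sync are pulled, and the designated master always wins. The row shapes and the timestamp scheme here are hypothetical:

```python
# Sketch of diff-based incremental copying (hypothetical row shapes).
# Only rows updated after the last sync point are moved; the source is
# the defined master, so its version always overwrites the target's.
def incremental_copy(source_rows, target, last_sync):
    copied = 0
    for row in source_rows:
        if row["updated_at"] > last_sync:
            target[row["id"]] = row  # master wins
            copied += 1
    return copied

source = [
    {"id": 1, "updated_at": 5, "name": "unchanged"},
    {"id": 2, "updated_at": 12, "name": "changed"},
]
target = {}
print(incremental_copy(source, target, last_sync=10))  # prints: 1
```

The important part is not the loop but the contract: a single, declared master and a recorded sync point, which is exactly what answers points 2 and 4 above.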

Diego's post is a more thorough analysis of why taking the ETL approach to the extreme is not a good idea. He focuses on the cost of making a change to the system after it was deployed; specifically, what happens when you have N ETL processes that need to be modified as a result of a schema change?

The cheeky answer to that: have the ETL work off the conceptual schema, at which point you have a single place to go and modify the ETL processes.

But no, I don't think that I would recommend this approach for all the systems in an organization. Assuming N systems that each need the others' information, you would have on the order of NxN processes, and that is problematic.
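The arithmetic behind that claim is worth making concrete: with N systems each pulling from every other system, point-to-point ETL gives N * (N - 1) one-way processes, which grows quadratically:

```python
# Point-to-point ETL between N systems: each of the N systems pulls from
# the other N - 1, so the process count grows quadratically with N.
def etl_processes(n):
    return n * (n - 1)

for n in (3, 5, 10):
    print(n, "systems ->", etl_processes(n), "ETL processes")
# prints: 3 systems -> 6, 5 systems -> 20, 10 systems -> 90
```

At three or four systems this is tolerable; at ten it is ninety processes to update for every schema change, which is the heart of Diego's objection.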

So, what is my view of the data server? Well, let us start by removing the data, and then the server.

I have said it before: an application's database is an implementation detail; you don't want outsiders to start pawing over your data. The way I see it, you define a set of web services for each application that allows external applications to use that application's data. If you need BA reports, ETL the data to a reporting database; you are likely to want to do that anyway.
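The shape of such a service, stripped of any transport details, might look like the sketch below. The names (CustomerService, the row layout) are entirely hypothetical; the point is that consumers get a message-shaped answer, never the internal rows:

```python
# Sketch: each application exposes its data only through a service facade,
# keeping the database an implementation detail. All names are invented.
class CustomerService:
    def __init__(self, internal_store):
        self._store = internal_store  # private; never exposed to callers

    def get_customer(self, customer_id):
        row = self._store.get(customer_id)
        if row is None:
            return None
        # Return a message-shaped dict, not the raw row: the internal
        # schema can change without breaking external consumers.
        return {"id": customer_id, "name": row["name"]}

svc = CustomerService({42: {"name": "Acme", "internal_flag": True}})
print(svc.get_customer(42))  # prints: {'id': 42, 'name': 'Acme'}
```

Note that `internal_flag` never leaves the service; that boundary is what makes the later versioning argument possible.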

No, I didn't drink the WS koolaid, and I am not really happy that I am suggesting a WS as a "solution", but that is just my bias against buzzwords. At its heart, this is the same idea, but with web services instead of raw data access, and with the conceptual schema captured as XML messages.

The major difference is that I have more tools at my disposal for working with web services than with databases. Versioning, for one, isn't trivial by any means, but it is something that web services do much better than databases. Most of the other concerns are also mitigated by this approach. Business level decisions, be they logic or security, are enforced by the service, which is part of the application, so a change there immediately applies to everything else.

But didn't I just say that I want to minimize the number of external dependencies that I have? Doesn't this count as the same thing?

Yes and no. I wouldn't want my application to talk to a dozen services any more than I would want it to talk to a dozen databases. That is where service aggregators come into play. They aggregate related services into a single place. This is where I would apply caching, for instance, but not where I would put business logic. This is also a good place to decide what to do about a downed service: I can use a cached value, retry a number of times, try an alternate method, or fail the whole operation.
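A minimal sketch of that aggregator idea, with one of the listed policies (fall back to a cached value when a service is down) wired in. Everything here is hypothetical; a real aggregator would also cover retries and alternate methods:

```python
# Sketch of a service aggregator: one place that fronts several services,
# caches answers, and falls back to the cache when a service is down.
class ServiceAggregator:
    def __init__(self, services):
        self.services = services  # service name -> callable
        self.cache = {}

    def get(self, name, key):
        try:
            value = self.services[name](key)
            self.cache[(name, key)] = value  # refresh the cache on success
            return value
        except ConnectionError:
            if (name, key) in self.cache:
                return self.cache[(name, key)]  # serve stale data
            raise  # nothing cached: fail the whole operation

state = {"up": True}

def customers_service(key):
    if not state["up"]:
        raise ConnectionError("customers service is down")
    return {"id": key, "name": "Acme"}

agg = ServiceAggregator({"customers": customers_service})
live = agg.get("customers", 1)   # live answer, now cached
state["up"] = False              # simulate the service going down
stale = agg.get("customers", 1)  # served from the cache instead of failing
print(live == stale)  # prints: True
```

This is exactly the stale-data trade-off from earlier in the post, only now it is a per-call policy decision instead of an architectural constant.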

If this looks like an ESB, I am sorry about that; it is not the intention. I view the service aggregator as a way to aggregate information from disparate sources into a single place. I can certainly see a way to create some framework for this and roll out an aggregator per application, to meet that application's needs. Although I think that several core services or core aggregators should do for most of what the applications need.

And yes, that is not my ideal, but I can't really see a way to work with disparate sources without handling this disparity somewhere, so I don't see a way around it.