A vision of enterprise platform: A database that you don't hide in the attic
For some reason, the moment people start working on Enterprise Solutions, there is a... tendency to assume that because we are going to build a big and complex application, we can afford to ignore proven best practices.
That is just wrong. It applies in many ways to many parts of the application, but nowhere does it apply as obviously as it does to the database.
This is just the result of some very short research:
- The metadata driven approach, AKA the 5 tables DB
- The Blob
- The 5th normalized form DB
- The 5th denormalized form DB
- The table a day DB
- The obfuscated database - create table tbl15 (fld_a int, fld_b int, fld_g nvarchar(255) )
And those are just about the database structure; we haven't even arrived at the database behavior yet. Here we have interesting approaches such as rolling your own "Transactions" table, requiring 40 joins to get a simple value, using nvarchar as the ultimate extensibility mechanism, etc.
So, how do we approach building a database that we can actually show in public? Let us start with thinking about the constraints that we have for the application. The database needs to support...
- OLTP for the application itself.
- Reports.
- Performance.
- ETL processes to take data in and out of the system.
- Large amounts of data.
- A large number of entities.
Notice how we don't have any suggestion about integrating with the application at the database level. That is a fairly common antipattern.
We are going to try to keep a one to one mapping between the entity and the table structure, because that will make it significantly easier to work with the system. One to one means that the Account Name translates to Account.Name in the entity model, and Accounts.Name in the database model.
We probably want the usual auditing suspects, such as Id, CreatedAt, ModifiedAt, CreatedBy, ModifiedBy, Owner, OrganizationUnit. Maybe a few more, if it makes sense. Probably not too many, because that can get to be a problem.
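To make the one-to-one mapping and the audit columns concrete, here is a minimal sketch, in Python with SQLite purely for illustration (the Name column, and anything beyond the audit columns listed above, are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table per entity: the Account entity maps directly to the Accounts table.
conn.execute("""
    CREATE TABLE Accounts (
        Id               INTEGER PRIMARY KEY,
        Name             TEXT NOT NULL,
        -- the usual auditing suspects
        CreatedAt        TEXT NOT NULL,
        ModifiedAt       TEXT NOT NULL,
        CreatedBy        TEXT NOT NULL,
        ModifiedBy       TEXT NOT NULL,
        Owner            TEXT NOT NULL,
        OrganizationUnit TEXT NOT NULL
    )
""")
# Column names come straight back from the database, one-to-one with the entity.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Accounts)")]
print(columns)
```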
Accessing and manipulating this in the application layer is going to be trivial, so I am not really going to cover that in depth.
What I do want to talk about is how we expose it to the outside world. And by that I mean reports and ETL processes.
We have several options for doing that. We can just let the reports and ETL processes read from the database tables directly. This is the simplest approach, I think.
Other options include views, stored procedures, and other DB thingies that we can use. I have seen systems where an entity was composed of several tables, and the report was done off of a view that joined it all together.
The underlying problem is that I have versioning to consider. I am going to dedicate a full post to the problem of building upgradable, user-customizable applications, so I'll delay it until then, but let us just say that radically changing the database schema between versions will be painful for the users. The usual way to handle that is to only make promises for the views, and not for the tables themselves. That is a good way to handle it, in my opinion.
I would suggest putting those in a separate schema, to make it clearer that the separation is important. This also gives you the ability to later do a view-to-table refactoring, and maintain that with triggers. This is usually a bad idea, but for performance reasons, it may very well be a good solution.
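A minimal sketch of the idea, in Python with SQLite for illustration (SQLite has no schemas, so a Reports_ prefix stands in for the separate Reports schema; all table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The application's table: free to change shape between versions.
conn.execute("CREATE TABLE Accounts (Id INTEGER PRIMARY KEY, Name TEXT, Balance REAL)")
conn.execute("INSERT INTO Accounts VALUES (1, 'Northwind', 100.0)")
# The view is the versioned public contract that reports and ETL read from;
# only the view's shape is promised, never the underlying table's.
conn.execute("""
    CREATE VIEW Reports_Accounts AS
    SELECT Id, Name, Balance FROM Accounts
""")
rows = conn.execute("SELECT Name, Balance FROM Reports_Accounts").fetchall()
print(rows)  # -> [('Northwind', 100.0)]
```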
ETL processes can use the same mechanism that reports use to read data from the database efficiently, but nothing writes to the database except the application. At one point I wrote a change-DB-password util that ran every two hours; it would change the database password and update the relevant config files.
I think that you can guess what the driving force behind that was, no?
Getting data into the database can be done through application services (not necessarily web services, btw). A simple example would be an API similar to this one:
void Insert(params Account[] accounts);
void Update(params Account[] accounts);
void Delete(params Account[] accounts);
This API explicitly allows for bulk operations, so it can be very nice to work with, instead of having to do things on a per-row basis, which basically kills performance.
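A sketch of such a bulk-friendly application service, in Python with SQLite as a stand-in for the real stack (all names are invented):

```python
import sqlite3

class AccountService:
    """Application service that owns all writes to the database."""
    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS Accounts (Id INTEGER PRIMARY KEY, Name TEXT)")

    def insert(self, *accounts):
        # One statement for the whole batch, instead of one call per row.
        self.conn.executemany("INSERT INTO Accounts (Id, Name) VALUES (?, ?)", accounts)

    def delete(self, *ids):
        self.conn.executemany("DELETE FROM Accounts WHERE Id = ?", [(i,) for i in ids])

service = AccountService(sqlite3.connect(":memory:"))
service.insert((1, "Acme"), (2, "Initech"), (3, "Globex"))
service.delete(2)
count = service.conn.execute("SELECT COUNT(*) FROM Accounts").fetchone()[0]
print(count)  # -> 2
```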
How to get good performance from this system is another hard question. In this case, I would usually recommend getting a good DBA to look at the performance characteristics of the application and optimize the database structure if needed. But a much easier solution to performance problems is to not hit the DB server at all, and use caching instead. Distributed caching solutions, like Memcached, NCache, etc., are a really good way of handling that.
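The cache-aside idea can be sketched like this (a plain dict stands in for Memcached/NCache, and everything here is invented for illustration):

```python
# Cache-aside: check the cache first, and hit the DB only on a miss.
db_calls = 0

def load_account_from_db(account_id):
    # Stand-in for the real database round trip.
    global db_calls
    db_calls += 1
    return {"id": account_id, "name": f"account-{account_id}"}

cache = {}  # stand-in for a distributed cache

def get_account(account_id):
    if account_id not in cache:
        cache[account_id] = load_account_from_db(account_id)
    return cache[account_id]

get_account(7)
get_account(7)  # served from the cache, no second DB hit
print(db_calls)  # -> 1
```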
No business logic in the DB! This is important: if you put business logic in the DB, you have to get to the DB in order to execute the business logic. This kills scalability, hurts the ability to understand the solution, and in general makes life miserable all around.
Reports are an interesting problem. How do you deal with security, for instance? Consider the security infrastructure that I already presented. This security infrastructure should also come with a database function that you can use like this:
SELECT * FROM Accounts
WHERE IsAllowed(Accounts.EntitySecurityKey, @UserId, 'Account.View')
Or this:
SELECT * FROM Accounts
WHERE Accounts.EntitySecurityKey IN (
    SELECT EntitySecurityId FROM GetAllowed(@UserId, 'Account.View')
)
Both of which provide a really easy way to get security for the reports. If we wanted to enforce that, we could force the report writer to write something like this:
SELECT * FROM GetAllowedAccounts(@UserId, 'Account.View')
We can probably get away with assuming that 'Account.View' is the default operation, so it is even shorter. Among other things, this actually has valid performance characteristics.
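To illustrate the IsAllowed approach, here is a toy version in Python with SQLite, where the security check is registered as a database function callable from a report's WHERE clause (the permission store and all names are invented):

```python
import sqlite3

# A toy permission store: (security key, user id, operation) triples.
GRANTS = {(10, 1, "Account.View"), (20, 2, "Account.View")}

def is_allowed(security_key, user_id, operation):
    return 1 if (security_key, user_id, operation) in GRANTS else 0

conn = sqlite3.connect(":memory:")
# Register the security check as a database function, so report writers
# can call IsAllowed(...) straight from SQL.
conn.create_function("IsAllowed", 3, is_allowed)
conn.execute("CREATE TABLE Accounts (Id INTEGER PRIMARY KEY, Name TEXT, EntitySecurityKey INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?, ?)",
                 [(1, "Acme", 10), (2, "Initech", 20)])
rows = conn.execute(
    "SELECT Name FROM Accounts WHERE IsAllowed(EntitySecurityKey, ?, 'Account.View')",
    (1,)).fetchall()
print(rows)  # -> [('Acme',)]
```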
This post is turning out to be a "don't make stupid mistakes" post, because I don't think that I am writing anything new here. As for how to avoid making stupid mistakes, that is fairly simple as well. Get a good DBA (that is left as an exercise for the reader), give him/her a big stick, and encourage good design through superior firepower.
More posts in "A vision of enterprise platform" series:
- (29 Nov 2007) A database that you don't hide in the attic
- (24 Nov 2007) Hot & Distributed Deployment
- (17 Nov 2007) Security Infrastructure
Comments
I think there is a pattern for reports that is broadly applicable, but hardly ever implemented. The pattern is to split the 5 aspects of reports, which are Options (parameters), Provider (runs the report), Data Adapter (fetches the data), Data (represents the data) and Presentation.
The point I want to make is this: When you talk of SQL Views for reports, you are suggesting that the Data Adapter logic be encapsulated in the database - the very thing you just recommended against doing for other types of logic.
Your report-security sidetrack became necessary because your data logic was in the database. If you move your data logic out of the database, then security is much easier to apply.
Only in recent years has this become possible, because many reporting tools now allow reporting off of non-database representations of data (xml, datasets, etc).
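The five aspects could be sketched like this (Python, all names invented; the plain list of dicts plays the role of the Data aspect):

```python
class Options:                      # parameters for the report
    def __init__(self, user_id):
        self.user_id = user_id

class DataAdapter:                  # fetches the data (DB, XML, a dataset...)
    def fetch(self, options):
        # Stand-in for the real data source; returns the Data aspect.
        return [{"name": "Acme", "owner": options.user_id}]

class Presentation:                 # renders the data
    def render(self, data):
        return "\n".join(row["name"] for row in data)

class Provider:                     # runs the report, wiring the aspects together
    def __init__(self, adapter, presentation):
        self.adapter, self.presentation = adapter, presentation
    def run(self, options):
        return self.presentation.render(self.adapter.fetch(options))

report = Provider(DataAdapter(), Presentation()).run(Options(user_id=1))
print(report)  # -> Acme
```

Because the Data Adapter is its own object, it can be swapped for one that applies security, or one that reads from a non-database source, without touching the rest of the report.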
Oren, I think your posts are very informative; it's nice to see smart people writing about what should and should not be done when programming, and also trying to find better ways to do things.
One thing we have in common is the ability to look at a database schema design, classify it into a familiar pattern, and identify its strengths and weaknesses.
I have found that the best way to design a database schema is direct from the use cases. The use cases define how data is loaded and persisted. The more frequent use cases have the highest priority, but the database ultimately must satisfy all of the use cases in a simple manner.
I have made the 5-tables mistake before, and I have made the blob mistake before. Heck, I have even attempted to push all of my business logic into stored procedures before (talk about a refactoring nightmare!)
I like to think that my skill has been earned simply from doing things the wrong way so many times before!
My point is, your post is a "don't make stupid mistakes" post, but it comes from hard-earned knowledge... I sometimes wonder if everyone must make these stupid mistakes to appreciate the "best practices"
Oren,
There was a post of yours quite some time ago where you pointed out an object model for looking at objects that change over time (you had an .At(time) method on them or something like that). Would it be reasonable to use that rather than just the created / modified time / user?
In my experience almost all business logic (BL) changes some data, so in the majority of cases you must access the database anyway. If the BL is in the middle tier you will probably round-trip to the database many times (many UPDATEs, INSERTs and DELETEs). If you put your BL in the DB, the number of round trips will be lower, possibly just 1. And your business logic gets enforced even if you access the DB directly and not through your middle tier. Don't forget that customer data will live a great deal longer than your application.
Let's not forget that at least some part of the BL is always in the DB: constraints. And those come in two flavors: declarative and procedural. Declarative constraints should obviously be put in the DB. Procedural constraints, well, you could probably put those in the DB or in the middle tier.
Don't get me wrong. I am not advocating that we should always put BL in the DB (support for many DBMSs is an example of when we should minimize BL in the DB). But to say that we should never do that seems equally wrong to me. It is an "it depends" decision, I guess.
Seems to me that a DBMS is more scalable than some custom middle tier or IIS.
Why? Is PL/SQL harder to understand than C#?
"""We probably want the usual auditing suspects, such as Id, CreatedAt, ModifiedAt, CreatedBy, ModifiedBy, Owner, OrganizationUnit. Maybe a few more, if this makes sense. Probably not too much, because that can get to be a problem."""
I strongly disagree here. Auditing info is IMO not something that should be part of the entity itself, unless it's an integral part of the model's business needs.
My solution here is different: I have separate changelog and changelogdetails tables. Any changes to entities that need to be monitored are logged to those tables by using the proper ActiveRecord/NHibernate hooks. This allows auditing entities by simply changing their base class instead of adding and maintaining a bunch of attributes on the entities themselves.
And yes, even for account entities, auditing attributes are not an integral part of the business model, but they are the best example of why auditing changes is required in enterprise apps.
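The changelog-via-hooks approach described above could be sketched like this (Python; the save hook stands in for an ActiveRecord/NHibernate interceptor, and all names are invented):

```python
# Audit info lives in a changelog, not on the entity itself.
changelog = []

class Auditable:
    """Base class whose save hook records every change in the changelog."""
    def save(self, user):
        # Stand-in for an ORM save interceptor firing on persist.
        changelog.append({"entity": type(self).__name__,
                          "id": self.id, "by": user})

class Account(Auditable):
    # The entity carries only business attributes, no audit columns.
    def __init__(self, id, name):
        self.id, self.name = id, name

Account(1, "Acme").save(user="alice")
Account(2, "Initech").save(user="bob")
print(len(changelog))  # -> 2
```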
"I have made the 5-tables mistake before, I"
What is that 5-table mistake? How can I know whether I have made it?
Steve,
You are correct in your approach, certainly.
The difference is that SQL offers a lot of rich tools for doing reports, while any other method means that I have to write it from scratch.
The basic idea is that you can offload that to the end user, and tell them that it is a way they can write their own stuff easily.
Jose, thanks.
Peter,
Sometimes you can get it the easy way, via mentoring, but you have to have some way to feel the pain.
A DB that has the following tables:
- Schemas
- Tables
- Columns
- Rows
- Values
I almost wept when I saw a column called OrganizationUnit. We'd be lucky to get something as readable as ORG_UNT out of our DBAs. :-) And making column names match Java field names ... wouldn't that open the risk of using an ORM tool instead of writing millions of lines of code?
Seriously, good post. I was exposed to the all metadata all the time horror only once and the others not at all, I think, but they are certainly out there.
Peter,
Please read this post; it hits upon my thinking very well.
http://www.evanhoff.com/archive/2007/06/05/19.aspx
SQL has very poor support for abstraction, encapsulation, reuse, etc.