Re: Saving to Blob
Mats (NPersist) is talking about Saving to Blob, basically suggesting that you can turn your persistence strategy into Table(Id, Blob):
An alternative would be to serialize the non-identity fields into a blob of some kind and save that to just one column in the database. The database table would then have primary key columns matching the identity fields and then just one more blob column for the serialized non-identity fields of the object.
Since this basically prevents any useful filtering on the data, he suggests that you can extract columns outside the blob, much in the way we currently do with indexes.
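Concretely, the two shapes under discussion look something like this (a sketch only; the table and column names are invented for illustration):

```csharp
// Sketch of the two table shapes under discussion, as DDL strings we
// might execute from .NET. "Orders" and its columns are hypothetical.
static class BlobSchemas
{
    // The degenerate form: identity column(s) plus one opaque blob.
    public const string Degenerate = @"
        CREATE TABLE Orders (
            Id   INT PRIMARY KEY,
            Data VARBINARY(MAX)    -- all non-identity fields, serialized
        )";

    // The hybrid form Mats suggests: fields you filter or sort on are
    // extracted into real columns (much like an index); the rest stays opaque.
    public const string Hybrid = @"
        CREATE TABLE Orders (
            Id           INT PRIMARY KEY,
            CustomerName NVARCHAR(200),   -- extracted: searchable
            OrderDate    DATETIME,        -- extracted: sortable
            Data         VARBINARY(MAX)   -- everything else, serialized
        )";
}
```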
As far as I am concerned, this is the worst possible persistence strategy. It means that you have taken some of the great strengths of the database and turned it into a stupid file server, for no real benefit. Mats suggests that this can have positive performance implications, but I doubt that this is the case. Just consider the cost of doing something as simple as providing a sortable, paged grid using this approach. If the user has chosen to sort on a column that is inside the blob, then you would need to pull the entire table into memory, sort it, and then hand the application just the page it is interested in.
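To make that cost concrete, here is a minimal sketch of the only query plan available when the sort column lives inside the blob (Order, Order.FromBlob, and the schema are hypothetical):

```csharp
using System.Collections.Generic;
using System.Data;
using System.Linq;

// The database cannot filter or sort on blob contents, so every row must
// come back, be deserialized, and be sorted in memory - just for one page.
static List<Order> PageSortedByBlobField(IDbConnection conn, int page, int pageSize)
{
    var all = new List<Order>();
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "SELECT Data FROM Orders";  // no WHERE, no ORDER BY possible
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                all.Add(Order.FromBlob((byte[])reader[0]));  // deserialize every row
    }
    return all.OrderBy(o => o.CustomerName)   // sort the whole table in memory
              .Skip(page * pageSize)
              .Take(pageSize)                 // ...and hand back a single page
              .ToList();
}
```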
It also completely destroys the possibility of working with the database using database-oriented tools. I can go into my DBs and look at the tabular data, but 0x20495bacfca12312 doesn't have much meaning to me. ETL tools, reporting, etc. are going to become useless. Not to mention that now you are in versioning hell. Serialization is notoriously difficult to handle correctly across versions, and having all the data in the DB in a format that is sensitive to serialization is quite a problem.
Handling associations between objects is another issue. I can use the DB's FK capabilities if I am using the standard mode, but a blob can have no FK enforcement by the DB. In fact, I can't really think of a good reason to do this, and I can think of plenty of reasons not to.
As far as I know, there were "Object <-> Blob Mappers" on the Java side at one time, and they are one of the points that always comes up whenever an argument about OR/M starts, because that is a really bad way to handle this requirement.
Note: There is a really good article on this issue with Java serialized objects & reporting that I read a few years ago on an Oracle site, but I can't find it now.
Comments
I must disagree. At my company we've used this approach for years, and it can work quite well. Quite often there is data that you know needs to be searchable, and data that doesn't. Maintaining complex relational schemas (and dealing with the upgrade issues that go along with evolving those schemas) can be quite a maintenance burden, and you can sidestep a lot of that with this approach.
There absolutely can be performance advantages as well. Persisting a complex, nested data structure into a pure relational schema could involve dozens of tables, but can be collapsed down to one table with this approach. The serialization format issue is manageable, and often necessary anyway.
Obviously this isn't an architecture for every application, but for the right ones, it can work extremely well.
Kevin,
How do you handle reporting, ETL, etc?
Are you using any OR/M tool, or are you doing everything by hand?
How are you handling schema changes in a blob?
Re: Kevin
Why use a database in the first place? You'd be better off with a filesystem.
And serialized objects infamously don't version well. You risk ending up with an unreadable data silo once you switch runtime versions or refactor your classes.
Still, I don't doubt it worked well in your case. But the generalization of that approach gives me shivers.
@Ayende,
I should be clear that we don't generally use the degenerate case of Table(ID, Blob). We use the hybrid approach, where the blob stores the entity, and properties of the entity that need to be searchable are externalized to relational columns. For applications with very rich data structures but constrained searching requirements, this can make for a far more manageable database schema than relational-only.
Reporting is generally handled either against the externalized properties, against separate history tables, or against a set of OLAP cubes. Our app is a BPM system, so often the reporting is focused as much on the transitions of the data as the data itself, and an all-relational schema doesn't give you that for free anyway. Actually, Microsoft's Workflow Foundation uses a similar approach - blob storage for the process instance, separate tracking tables for reporting. We don't use an OR/M - this is an application that has existed for years, before OR/Ms were a glint in some programmer's eye. ;)
Sorry, I'm not sure I understand your question about handling schema changes in the blob. The structure of the blob data can be evolved independently of the relational schema.
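A sketch of the save path Kevin is describing, assuming a hypothetical ProcessInstance entity and Serialize helper - the whole entity goes into the blob, and the searchable properties are copied into ordinary columns beside it:

```csharp
using System.Data;

static void Save(IDbConnection conn, ProcessInstance instance)
{
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText =
            @"UPDATE ProcessInstances
                 SET Data = @data, Status = @status, StartedAt = @startedAt
               WHERE Id = @id";
        Add(cmd, "@id", instance.Id);
        Add(cmd, "@data", Serialize(instance));      // the serialized entity
        Add(cmd, "@status", instance.Status);        // externalized: searchable
        Add(cmd, "@startedAt", instance.StartedAt);  // externalized: searchable
        cmd.ExecuteNonQuery();
    }
}

// Small helper, since IDbCommand has no AddWithValue.
static void Add(IDbCommand cmd, string name, object value)
{
    var p = cmd.CreateParameter();
    p.ParameterName = name;
    p.Value = value;
    cmd.Parameters.Add(p);
}
```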
@Yann,
There are some really good reasons to use a database rather than just a file system. Since we need SOME data in the database, keeping it all there simplifies backup and recovery enormously. Also, you get transactional integrity.
As for serialized object formats...well, this application has transitioned from LISP (you read that right) to COM to .NET, so apparently it's manageable. Avoiding platform-specific binary serialization formats is one key factor.
Again, I'm certainly not saying that this is a good approach for all applications. But it absolutely can be a good approach for some.
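For illustration, one way to get the platform-neutral format Kevin mentions is XML rather than a runtime-specific binary serializer; XmlSerializer skips elements it no longer recognizes, which is part of what buys version tolerance (a sketch, not Kevin's actual code):

```csharp
using System.IO;
using System.Xml.Serialization;

static byte[] ToBlob<T>(T entity)
{
    var serializer = new XmlSerializer(typeof(T));
    using (var ms = new MemoryStream())
    {
        serializer.Serialize(ms, entity);   // platform-neutral XML, not runtime-bound binary
        return ms.ToArray();
    }
}

static T FromBlob<T>(byte[] blob)
{
    var serializer = new XmlSerializer(typeof(T));
    using (var ms = new MemoryStream(blob))
        return (T)serializer.Deserialize(ms);  // unknown elements are skipped, not fatal
}
```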
I'm not sure if it was clear enough in my post, but what I am talking about is the hybrid approach that Kevin describes - when you have Table(Id, Blob, SearchableField1, SearchableField2...)
/Mats
just make sure blob is last column in table ;)
Hoya,
That makes a difference? :-O
And assuming it does... the RDBMS doesn't automatically put it last then? :-O
/Mats
There are definitely times that while an application needs a "database", it doesn't necessarily need a "relational database" (with SQL overhead and friends). In the non-relational database world, key-value is the way to go. And why must my indexed column exist outside the blob in order to create an index on it? Your current vendor may not currently support it, but isn't an index just a pointer into a data structure anyway? ;-)
BerkeleyDB is the first thing that popped into my head when I read it. When I talked to one of the engineers at Yahoo last year, they were using some form of it for the search engine.
I have been looking at using some form of non-relational (i.e., key-blob) database for an application I have on the side. It involves a Flickr-like site where photos are key-values in a non-relational DB and membership/user data is stored in SQL Server.
Now, one thing I wouldn't try is the key-blob thing inside an RDBMS. You already have the SQL overhead. You might as well put it to work.
Two questions for the BLOB faction:
1) Isn't a CLOB with XML persistence data more "agile"? It can at least be full-text-indexed and searched even without searchable columns, in case you need a special property only very rarely. Or is the performance loss too significant in that case?
2) How do you handle collections and n:1 relations with shared items? Are those referenced items stored multiple times as part of the serialized content, or are they referenced? The latter calls for homegrown serializers, doesn't it?
-Markus
@Markus
1) I don't know about agile, but indeed if you identify useful text searches over your serialized data, by all means run them :-)
2) Well, we're talking about the hybrid approach where you put some fields in the blob column and others into their own columns, right? As has been observed, fields that contain relationships and so could use a foreign key constraint in the database should probably get their own columns. So, on the reference side, the reference field will have an FK column in the database. On the collection side... well, the collection property would normally not be persisted to the database /anyway/ since it is the "slave" property - in a bi-directional property pair, normally only one of the properties is actually persisted (since they both map to the same table and columns in the database), and in an n:1 relationship it will usually be the collection property that is not persisted.
For a m:n property you would still use a many-many association table in the db and your collection properties would be mapping to that table in the normal way.
So, there's normally no need to put a serialized version of the collection property into the blob.
However, if we're talking about a non-hybrid approach, where everything goes into the blob, you have a very different scenario. In that case you would, instead, always have to persist both sides of the relationship (both properties in a bi-dir prop pair) into their respective blobs. That goes for reference as well as collection properties.
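In schema terms, the hybrid arrangement Mats describes might look like this (hypothetical tables: the reference side keeps a real FK column, the m:n relation keeps its association table, and only non-queried state goes into the blob):

```csharp
const string RelationSchema = @"
    CREATE TABLE Orders (
        Id         INT PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES Customers(Id),  -- n:1 reference: real FK column
        Data       VARBINARY(MAX)                          -- non-queried fields, serialized
    );

    CREATE TABLE OrderTags (   -- m:n stays a normal association table
        OrderId INT NOT NULL REFERENCES Orders(Id),
        TagId   INT NOT NULL REFERENCES Tags(Id),
        PRIMARY KEY (OrderId, TagId)
    );";
```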
@Evan,
"And why must my indexed column exist outside the blob in order to create an index on it. Your current vendor may not currently support it, but isn't an index just a pointer into a data structure anyway. ;-)"
In my post that Ayende replies to, I pointed out exactly that. But even though the db could potentially understand any serialization mechanism, I was explicitly talking about the case when it didn't happen to understand the blob format - either because it couldn't, or simply because we didn't care to inform it, since we decided we didn't need to do filtering in the database for those fields.
It ties into this:
"Now, one thing I wouldn't try is the key-blob thing inside an RDBMS. You already have the SQL overhead. You might as well put it to work. "
The point of the discussion is to draw attention to the fact that sometimes, you'll have columns in the database that you never exploit the structure of (do queries on) in the database - and when you don't exploit the structure in the database, letting the RDBMS even /know/ about the structure for /those/ fields is actually just overhead.
Say that in a table with many columns I have two datetime columns that I never filter on in the database. Well, putting the database to work, ensuring that all values in those fields are valid dates, can be more expensive (for the database) than using just a single blob column for the dates. The work of dealing with the structure for those values can then be offloaded to the app servers.
/Mats
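A toy illustration of that datetime example - two dates the app never filters on, packed into a single blob column, with date validation now living in the app server rather than the database (hypothetical code):

```csharp
using System;

static byte[] PackDates(DateTime created, DateTime lastSeen)
{
    var buffer = new byte[16];
    BitConverter.GetBytes(created.Ticks).CopyTo(buffer, 0);   // 8 bytes
    BitConverter.GetBytes(lastSeen.Ticks).CopyTo(buffer, 8);  // 8 bytes
    return buffer;   // the database sees only an opaque 16-byte value
}

static (DateTime Created, DateTime LastSeen) UnpackDates(byte[] blob)
{
    return (new DateTime(BitConverter.ToInt64(blob, 0)),
            new DateTime(BitConverter.ToInt64(blob, 8)));
}
```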
You are aware of the general sentiment toward MySQL's non-validation of data types in the past, right?
Referential integrity and type validations are two things that I want the DB to do, because letting someone else do it means that it is very easy to get corrupt values.
If the argument is performance then I would like to see benchmarks that would show that there is a significant benefit to using this approach, and frankly, I doubt it.
It basically makes life much harder for anyone who wants to deal with the database.
This is possibly the stupidest suggestion I have heard since a friend mentioned the great "1 table database solution" a PhD grad in his company had come up with.
The whole application had 1 table with "ultimate flexibility". You store a unique id, data, metadata about the data you stored, and a "related to" field.
Don't people realise there was a reason for inventing databases?
@Mats,
"The point of the discussion is to draw attention to the fact that sometimes, you'll have columns in the database that you never exploit the structure of (do queries on) in the database - and when you don't exploit the structure in the database, letting the RDBMS even /know/ about the structure for /those/ fields is actually just overhead. "
Until your BA comes over and asks you to run an ad-hoc query on them for x purpose. The key-blob thing works really well in the non-RDBMS world because you basically give up ad-hoc querying (most of the time).
In the RDBMS world, I would bet money that dropping referential integrity would give you a much, much bigger performance improvement than combining fields. None of us would likely argue that point unless we were in some crazy weird edge case. And let's be honest, the cost of verifying a particular type (such as a DateTime) is likely negligible in the face of a query parsing and execution engine (i.e., there might be a 0.0001% improvement on UPDATEs and INSERTs).
The key-blob thing works really well in the non-RDBMS world. It's also the unchallenged performance king when compared to RDBMS. I think it's probably best left to that world though. Especially if we are trying to "embrace change".
@Ayende,
Yes I am aware of that. Again, I'm talking about a hybrid approach, using the rdbms for what it is good at but not burdening it when it isn't needed.
This isn't anything weird I am suggesting - the same argument applies any time you are storing something in a blob! Storing a JPEG picture in a blob? Well, it has structure - structure that the image reader can understand but that the RDBMS doesn't know about. You may have more data structures that would be best saved as blobs - you probably do. What I am saying is that, as an advanced type of optimization, you can sometimes apply this thinking to /parts/ of your objects' internal structure - namely, the parts you won't have to query on.
Here's a recent and concrete example: I just wrote a simple chess program (blogged about it here http://www.matshelander.com/wordpress/?p=68 ).
Now, it doesn't have persistence to a database yet, but I plan to add it. My ChessBoard class has several properties, like booleans showing which castlings are possible, whose turn it is, and so on. It also contains a collection of BoardCell objects, and every BoardCell may or may not contain a ChessPiece object. This whole structure I serialize to a simple string that shows the state of the board. Data structures for representing chess boards are plentiful; I didn't choose a particularly sophisticated one, but it turns my whole DOM into a fairly compact little string. When saving a board to the database, I'll be saving that string. I will want to be able to find the row with that state, so I want to be able to run an equality query on that column, but I don't need to do further querying on subparts of that structure, so I'm not going to normalize it. However, I'm probably going to add other columns to the table for things I do want to search on in a structured way, such as the evaluation score for the board.
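For illustration, the table Mats sketches might look like this - a serialized board column that only ever sees equality lookups, plus a plain column for the score (names invented; FEN is one well-known string encoding for chess positions, though Mats says he rolled his own):

```csharp
const string BoardSchema = @"
    CREATE TABLE Boards (
        BoardState NVARCHAR(100) PRIMARY KEY,  -- serialized board; '=' lookups only
        Score      FLOAT NOT NULL              -- extracted: we DO filter and sort on this
    )";

// The only query the serialized column has to support is an exact match:
const string FindBoard = "SELECT Score FROM Boards WHERE BoardState = @state";
```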
@Dries,
With all due respect, I think the one table solution is far more stupid than what I am suggesting :-)
@Evan,
Remember that I haven't advocated putting stuff that you may need to query on in the blob.
The first rule of optimization, "Don't Do It".
Until I am shown some benchmark that shows me some significant advantages of this approach, I am not going to consider this as an option. The sheer cost of handling that is fairly big, and I don't see the value in that.
Just to point out, I fairly often run ad-hoc queries that I would never run in the application itself.
Your chess sample is an example of a case where you already did serialization, so you can benefit from that, but for such small stuff, the question of perf is really meaningless to me.
I could literally slap attributes all over the place, gen the DB, and be done with it, never once taking a look at how the DB will look.
@Ayende
I agree with your attitude. Don't do something like this because you heard some guy say it on the web and think you have just written an ultra-scalable application.
But,
"Until I am shown some benchmark that shows me some significant advantages of this approach, I am not going to consider this as an option"
You could turn that around and say: if you have an application where scalability is still a concern after you have tried every trick known to you - would you be prepared to try out yet another trick? Doing all the appropriate measurements in the actual app, of course, to see if it helped in your particular situation.
A chess database isn't "small stuff" (even though my toy app is)... it is potentially a very huge thing. That's one type of scalability concern - when your data is extremely large.
But to be clear: I have not tried to suggest that using a hybrid approach would be something that could help you with /performance/ ! It is something that can help you with /scalability/ !
P.S. To make the discussion even more interesting, you can mix/hybridize even more by using more than one blob per table. In my chess database example, I will most likely have an additional blob where I put a serialization of all the legal moves from the board state. Why not put that in the same blob? Because I wanted to be able to search for exact matches on the blob with the board state! :-)
@Mats,
I hate to call this out, but performance and scalability are two completely different things. The more you focus on performance, the worse scalability becomes (usually). An application written in assembler is a pretty good extreme example of the difference. Cutting 0.0001% off INSERTs and UPDATEs isn't going to give you squat in scalability. You are likely just wasting your time (is that 0.0001% really your bottleneck?). Again, I'm way more likely to drop referential integrity or transactions (and that's not happening anytime soon).
My experience tells me that as soon as I make the assumption that field X does not need to be queried, the BA asks for a query on it right after the design goes into production.
I would argue that for your chessboard example, you don't necessarily need an RDBMS (unless it's just for ease of development). You might even get away with serializing state to the filesystem.
Conceptually, I could make the same argument about porting an application from c to c++. You might say that the application is already written using function libraries and structs. Do you really need that "class" thing? Maybe not, but I won't be recommending that to other OOP designers in c++. Ignoring the power of the platform is only for specific edge cases.
One other thing to remember: there are different classes of applications. A chess program and a shopping cart are in different classes of applications. A web server is in a third category. I'm not likely to use an RDBMS inside a web server (i.e., build IIS using SQL Server). Usually, businesses require the power of an RDBMS. Other applications, not necessarily.
Reading through my last comment, I didn't make a clear distinction between performance and scalability.
A performant app uses its resources efficiently.
A scalable app can grow with additional resources.
The two are orthogonal.
In terms of software architecture, we usually make performance worse in order to make scalability better (we make latency worse so we can scale).
Squeezing out 0.001 ms of performance has nothing to do with scalability.