That No SQL Thing: Why do I need that again?
During the course of this series, I got a lot of people declaring: "But you can do that with an RDBMS if you use XYZ". The problem inherent in this statement is that it ignores the fact that if you really want to, you can use a hammer to drive screws. It isn't nice or efficient, but you can do it.
I ran across this post by Dare Obasanjo, which explains the actual problem in beautiful terms:
What tends to happen once you've built a partitioned/sharded SQL database architecture is that you tend to notice that you've given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. In the end, all you have left is that your data is Durable (i.e. it is persistently stored), which isn't much better than what you get from a dumb file system. Well, actually you also get to use SQL as your programming model, which is nicer than performing direct file I/O operations.
It is unsurprising that after being at this point for years, some people in our industry have wondered whether it doesn't make more sense to use data stores that are optimized for the usage patterns of large scale websites instead of gloriously misusing relational databases. A good example of the tradeoffs is the blog post from the Digg team on why they switched to Cassandra. The database was already sharded, which made performing joins to calculate the results of queries such as "which of my friends Dugg this item?" infeasible. So instead they had to perform two reads from SQL (all Diggs on an item and all of the user's friends) then perform the intersection operation in the PHP front end code.
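The pattern Dare describes, two independent reads followed by an application-side intersection, can be sketched as follows (in Python rather than Digg's PHP; the accessor functions and the in-memory `store` are hypothetical stand-ins for the two sharded SQL queries):

```python
# Sketch of the application-side join described above: two reads,
# then an intersection done in application code instead of the database.

def get_diggers_for_item(item_id, store):
    # "all Diggs on an item" - one read against the item's shard
    return set(store["diggs"].get(item_id, []))

def get_friends_of(user_id, store):
    # "all of the user's friends" - one read against the user's shard
    return set(store["friends"].get(user_id, []))

def friends_who_dugg(user_id, item_id, store):
    # The join the sharded database can no longer do: intersect in app code.
    return get_diggers_for_item(item_id, store) & get_friends_of(user_id, store)

# Toy stand-in for the two sharded data sets
store = {
    "diggs": {"item1": ["alice", "bob", "carol"]},
    "friends": {"dave": ["bob", "carol", "erin"]},
}

print(sorted(friends_who_dugg("dave", "item1", store)))  # ['bob', 'carol']
```

The cost is that the intersection now runs in the application tier, which is exactly the kind of work a single-box relational database would have done in one join.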
I can't use most of the traditional relational database advantages anyway. Not the moment that I step outside a single machine boundary. At that point, I have to re-evaluate what I am doing to see if it makes sense to keep dealing with the traditional relational database disadvantages. In many cases, the answer is no, and a solution that fits the problem better can be found.
This is where the NoSQL databases started. I think that once they have gotten mature enough, they will have advantages for smaller scale solutions as well, but that is a problem for another day.
More posts in "That No SQL Thing" series:
- (03 Jun 2010) Video
- (14 May 2010) Column (Family) Databases
- (09 May 2010) Why do I need that again?
- (07 May 2010) Scaling Graph Databases
- (06 May 2010) Graph databases
- (22 Apr 2010) Document Database Migrations
- (21 Apr 2010) Modeling Documents in a Document Database
- (20 Apr 2010) The relational modeling anti pattern in document databases
- (19 Apr 2010) Document Databases – usages
Comments
Yeah, I think there will always be a little stigma to advocating the use of a newer, unknown persistence technology. RDBMSs have over 30 years of maturity behind them, with ubiquitous skills in the marketplace, making them the 'safest choice' to go with. The problem is they are best suited for storing relational data, and although you can potentially map any data structure to a relational one, most of the time this requires more effort than an elegantly designed NoSQL solution.
At the same time, NoSQL solutions have manifested themselves because of deficiencies in existing RDBMS technology. Unfortunately, because of this they are collectively referred to as 'NoSQL' solutions, which is a rather aggressive label promoting an 'Us' vs 'RDBMS' mentality and the consequent flame war. The problem with this is that they each exist to solve a different set of problems arising from data persistence:
- Cassandra, Hadoop and Scalaris - built to scale
- Redis and MongoDB - built for speed
- CouchDB - not as efficient as the above, but REST+JSON based, making it a good solution for Ajax apps
- Neo4J - built for traversing social graphs
Then there are other technologies like RabbitMQ, ZeroMQ, MSMQ, ActiveMQ, etc., which are just message queue technologies that could potentially fit under the 'NoSQL' definition, but as there exists a more appropriate label for them, they don't.
Now you could potentially build an RDBMS solution to solve all the above scenarios, except it will likely require more effort to develop, run slower and handle less load which is the reason for their existence in the first place.
Though at the same time, you're likely only going to be hitting RDBMS limits when dealing with internet scale, so most of the time, inside the firewalls of most Enterprises, RDBMSs are still going to be your safest and best choice. Our company is actually hitting these limits: we have tables with 2M+ rows and we're no longer able to deliver a non-indexed query in a reasonable time frame, so in these cases we lose the benefits and rich querying abilities of SQL. We fare a little better performance-wise when we query across our sharded databases, as we're able to parallelize the queries, but this does require custom development effort to collate, sort and stitch the results back up.
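The collate/sort/stitch work mentioned above can be sketched as a scatter-gather over shards (a minimal Python illustration; the shard lists and `query_shard` are hypothetical stand-ins for per-shard SQL queries returning pre-sorted rows):

```python
import heapq

# Each shard returns its own rows, already sorted by date.
shards = [
    [("2010-05-01", "post 3"), ("2010-05-03", "post 7")],   # shard 0
    [("2010-05-02", "post 5"), ("2010-05-04", "post 9")],   # shard 1
]

def query_shard(shard, limit):
    # Stand-in for a per-shard "SELECT ... ORDER BY date LIMIT n" query.
    return shard[:limit]

def sharded_query(shards, limit):
    # Scatter: ask every shard for its top `limit` rows.
    partials = [query_shard(s, limit) for s in shards]
    # Collate + sort: merge the pre-sorted partial results.
    merged = heapq.merge(*partials)
    # Stitch: re-apply the global limit in application code.
    return [row for _, row in zip(range(limit), merged)]

print(sharded_query(shards, 3))
# [('2010-05-01', 'post 3'), ('2010-05-02', 'post 5'), ('2010-05-03', 'post 7')]
```

Note that each shard must be over-queried (every shard returns up to `limit` rows) so the global limit can be applied after the merge; this is part of the extra development effort sharding imposes.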
As we care deeply about the performance of our web services, we've decided to use Redis in conjunction with our sharded PostgreSQL persistence to deliver maximum performance with the least development effort. Managing user sessions, real-time notifications, and lists of 'live posts' and 'online users' are examples of solutions that we're elegantly solving with Redis, and that would require more development effort and have slower runtime performance in an equivalent RDBMS-based solution.
Google analytics... runs on mysql. Just to name a large application.
I know you're actually trolling, but I just want to say that I agree with Tony here, and want to add that just because the Digg team needed some NoSQL db doesn't mean it proves the case that if you need a big database you need a nosql db. For example, Slashdot runs on mysql as well. It could well prove that the Digg programmers actually have no idea what they're doing.
I also think that claiming that because there are multiple machines involved, the RDBMS advantages are out the window is a bit of a silly move, namely: prove it. Prove that multiple machines running an RDBMS can't possibly scale, run transactions etc. and basically are a dumb pile of turds who only eat cash. No, some social site built by non-programmers in an interpreted language isn't proof.
You say you know RDBMS-s, so it must be easy to claim why a large server farm with oracle blades can't possibly scale well, and why a multi terabyte DB2 database ran on multiple machines using a shared file cabinet can't possibly scale at all, because of the sole reason that they're a sql based db and not a nosql db...
So, Oren, what will you do with the data in 20 years time, when your database is still running, but the application once using the db is long gone? There's no schema, no query projection system to work with the data, as the data is stored in a nosql fashion using the object types of the application the database was built for. Oh of course, the database doesn't live that long, right? dream on.
Oops! I fed the troll after all...
Frans,
"shared file cabinet" is a single point of failure.
As for proving it, well, what about Amazon? Are they big enough case study for you?
What about Microsoft? Take a look at how something like MSN Messenger works. Or Azure, where your relational databases are limited to a puny 10 GB.
I assume you have seen 20-year-old RDBMSs; they don't have any schema either :-)
Not true at all.
In document/graph & column family databases, the data is stored in a way that can be read / written to without needing to have the application code to do so.
With key/value stores, you have a byte[] array, and it is the app responsibility to use it properly.
Then again, that is how files work.
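The key/value division of responsibility described above can be sketched in a few lines (a Python illustration; a plain dict stands in for the store, and JSON is just one possible encoding the application might choose):

```python
import json

# With a pure key/value store, the value is just bytes and the
# application owns the encoding and decoding.
store = {}  # key -> bytes, as a key/value store would expose it

def put(key, obj):
    # The app picks the serialization format; the store doesn't care.
    store[key] = json.dumps(obj).encode("utf-8")

def get(key):
    # The app must know the format to make sense of the bytes.
    return json.loads(store[key].decode("utf-8"))

put("user:1", {"Name": "Ayende"})
print(get("user:1"))  # {'Name': 'Ayende'}
```

If the application had chosen an opaque binary format instead of JSON, the stored bytes would be meaningless without the application code, which is exactly the point being debated here.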
Oh, so it must be bad eh? Ever looked at these things? How they make sure they don't drop dead when a bit keels over?
If you're writing the new Amazon killer, great. Everyone else (99.9999999% of the developers out there) isn't writing any amazon killer, and can benefit from a relational database. Of course, everything we've learned to love in the past is of course bullshit, and NoSql solves our problems properly this time, right? Don't kid yourself. NoSQL db's are simply OODBs and we all know how well they did.
It's not about badly designed legacy cruft, it's about "I have this blob of data and what it means is beyond me, because what gave meaning to the data is gone... a long time ago".
So you're saying that when I toss away the application, I can have a nosql db, which has consistent data (hard to believe without constraints and transactions) and which is usable in whatever other application there is, be it OO, a procedural app or whatever? I don't think so.
So, how are you going to make certain your data in your precious nosql db is consistent and correct without distributed transactions and constraints? Remember, digg and fellow sites contain a big pile of bullshit data no-one gives a **** about if it's consistent or not. data of banks, hospitals, financial institutions, large companies... you really think they'll store it in a nosql db, because you said it was more scalable? (you still haven't proven sql based db's aren't scalable) Without transactions and constraints? haha :)
@Frans Bouma
Everyone seems to reference the same applications that have been made to scale. They tend to forget that Google and the other big internet companies dealing with large volumes of data (i.e. Google, Amazon, Facebook, LinkedIn, Twitter, etc.) actually thought they would get more benefit from building and using their own NoSQL solutions than from getting their existing RDBMS solutions to scale; it's weird how they all ended up with the same conclusion - it must be because their PhDs are incapable and not certified Oracle-savvy. Every other large internet company with significant volume, like eBay, has made RDBMSs scale by partitioning their data into sharded databases. I'm not sure of the internals of Google Analytics' setup, but I would be very surprised if they are not using a sharded solution as well, in which case a lot of the benefits of using an RDBMS are lost.
I guess because they ran out of money? No one is suggesting that RDBMSs can't be made to scale, only that it is a lot more expensive and harder to do. I have not read any case studies where an RDBMS is more efficient than a comparative NoSQL solution for handling large volumes of data, only the contrary. It seems that everyone, before switching to NoSQL, has done their due diligence in comparing the benefits and trade-offs of each.
Here is another article explaining why Digg moved to NoSQL (spoiler: it's cost): stu.mp/.../...l-vs-rdbms-let-the-flames-begin.html
Otherwise here is a slide by twitter's architect detailing the use of NoSQL at twitter, he includes honorary mentions of potentially using an RDBMS as well:
www.slideshare.net/.../nosql-at-twitter-nosql-e...
Does this qualify as proof? or are all the folk at Twitter muppets as well? Otherwise if you really are interested in viewing case studies of people scaling with NoSQL solutions, you can hang out at: http://highscalability.com/
Most NoSQL solutions are open source and maintain a schema-less, self-describing format. In other words, you are not locked into a proprietary vendor's format (as many have experienced when their data is locked into a proprietary db format). So if the worst happened and every project maintainer of your selected NoSQL solution disappeared from this planet, you still have the 'open source code' for others to continue enhancing the product, or the 'open data format' which would make it trivial to export into the next big 'NoSQL thing' should you need to.
Now, I've always advocated using the right tool for the right job (evident by the caveats on most of my comments). I also consider myself to be very pragmatic, which is why I continually evaluate different solutions and tools, ensuring that I'm as productive as possible and knowledgeable about the strengths and weaknesses of each solution. I've always maintained that for corporate Intranet applications the best choice is usually to go the RDBMS route, although if your data is not relational and you are only persisting/retrieving data (i.e. not querying with SQL) then the schema-less nature of NoSQL solutions does offer a productivity enhancement.
So, Oren, what will you do with the data in 20 years time, when your database is still running, but the application once using the db is long gone? There's no schema, no query projection system to work with the data, as the data is stored in a nosql fashion using the object types of the application the database was built for.
How is self-describing data different from using a schema, other than being more flexible? It is very reasonable to state that there is still a sort of grouping available and that many documents will share a similar structure. I don't see how this would be more of a problem over 20 years than when using an RDBMS with schemas.
-Mark
@Mark as long as there's a descriptive 'schema' (I use that term to cover the concept) giving meaning to the bits in the storage, it's fine by me.
I find 'self-describing data' a little bit odd though. Data is just data, bytes in a row. What makes one string a name and the other one an id? That's what the schema defines. If that's tied to the application built on top of the document DB, it's not going to be useful after the app dies. With a relational database you don't have that, as the relational model is self-contained: it defines what's inside itself.
If a document DB also contains a separate schema, it's effectively the same thing.
@Demis
"Most NoSQL solutions are open source, and maintain a schema-less, self-describing format. In other words not locked into a proprietary vendor's format (that many others may have experienced when their data is locked into a proprietary db format). "
It's not the format of the tables, it's the schema definition, i.e.: table DDL SQL, FK constraint DDL SQL etc. How that's stored inside the db is irrelevant, as you can ask the db to give it to you.
@Frans
Agreed, the lack of any FK in the NoSQL schema does make it less knowledgeable about itself (especially if you can only refer to the data itself), and because it's an application-level constraint rather than a data-persistence-level constraint, it's potentially less verifiable. Although the lack of a rigid, well-defined schema is both its blessing and potentially its curse, as a lot of the schema manipulations and migrations can be done at runtime by the application. It does make it easier for bugs to creep in and corrupt your data, but at the same time it can be more productive, as your code doesn't have to map to an intermediary schema; most of the time it's just C# -> Data Store.
This may not be true for all NoSQL solutions, but it is in my C# Redis client, where the client natively persists POCO objects; the C# Type is the master schema. So if you wanted to, you could simply define statically-typed references using C# attributes. At the same time, I think it will be pretty rare to only have access to the data store and not the application that created it. In most of these cases I imagine the data on its own to be pretty useless outside of reporting purposes anyway (the first RDBMS example that comes to mind would be MS CRM's internal db schema).
For anyone who might be interested, I have an example of what I mean when I say the 'C# Type is the schema' where I follow ayende's simple blog application and show how you would build an equivalent solution in Redis:
code.google.com/.../DesigningNoSqlDatabase
@Demis: your example is exactly what I was referring to. If in 10-15 years' time C# is overtaken by some other language/platform, your db isn't really usable, and what's worse: your data isn't either.
That's why a rigid (I don't see it that way btw) schema is essential for data which is used for anything useful. Sites like Reddit, Twitter, Facebook, Digg: they all host big piles of nonsense. If something keels over (like at the moment, with everyone having 0 followers at Twitter ;)), there's nothing really lost, and in 10 years' time I don't think anyone will give a hoot about the tweets, urls etc. hammered into those systems today. That's also why it's dangerous to use these sites as examples of how great and useful nosql dbs are: a financial system with the requirement that data is consistent and correct, that's a good example of whether a db can pull it off or not.
@Frans Bouma
Just so we're clear, the data and the underlying schema are not tied to the C# language; they're tied to the English description of the IL/.NET declaration type. If you don't use complex types and just use UTF-8 string values and persist them directly in Redis's rich native data structures (i.e. List / Set / SortedSet / Hash) then there is no problem, and every language binding in Redis has equal access to your data.
However, if you do decide to persist POCO types, then the niceties of the C# language offer a DSL-like way to define schemas (i.e. POCOs); in this way it is no different than, say, a '.proto' definition for the protocol buffers serialization format. Also, there is no .NET type information serialized with the schema, so any POCO type that is serialized can be de-serialized into a loosely-typed string Dictionary (or dynamic JavaScript object, etc.), which is one example of not needing the original master schema to access the data. Here is an example which shows just how resilient the format is:
"Painless data migrations with schema-less NoSQL datastores and Redis"
code.google.com/.../MigrationsUsingSchemalessNoSql
Now ultimately the (open source) serialization format doesn't include any .NET types per se so is easily mappable to a POJO/JSON object with a similar structure, it is in a human readable and lexically parse-able format so a machine or human will not have any problems understanding it. By utilizing a code-first approach I am also able to use the rich reflective abilities of the .NET runtime to code-gen different models in different languages (e.g. similar to the cross-language Thrift framework), so it would be trivial for me to add first-class support for objective-c and JavaScript languages (which is in the current roadmap :). Also the popularity and open nature of Redis ensures that there is a language binding for almost every language actively in-use today.
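The migration resilience described in the linked article can be sketched as follows (a Python illustration of the general schema-less idea, not ServiceStack's actual serializer; the `NEW_DEFAULTS` "schema" and `load` helper are hypothetical):

```python
import json

# A document written by an old version of the application, before
# the "IsActive" field existed.
stored = json.dumps({"Name": "mythz", "Age": 31})

# The newer application's notion of the type: known fields + defaults.
NEW_DEFAULTS = {"Name": "", "Age": 0, "IsActive": True}

def load(blob, defaults):
    # Read back into the newer shape: unknown fields are ignored,
    # missing fields are filled with defaults. No migration script needed.
    data = json.loads(blob)
    return {field: data.get(field, default) for field, default in defaults.items()}

print(load(stored, NEW_DEFAULTS))
# {'Name': 'mythz', 'Age': 31, 'IsActive': True}
```

Because the stored document carries its own field names, the reader never needs the original writer's class definition, only its own current one.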
Now, I've made a conscious decision to support a code-first, convention-based POCO development approach in all technologies that I develop, as I find that it enables more code reuse and I can build richer and more powerful frameworks with the minimal amount of development effort. E.g. in @ServiceStack.net, using the existing technologies it is trivial to develop a generic solution to receive an async web service DTO request (@ServiceStack web services), store it in Redis (C# Redis client), drop the request in an offline mq (currently rabbitmq / redismq / in-memory-mq), pick it up in a windows service (ServiceStack.Messaging), populate an RDBMS table (ServiceStack.OrmLite) and maintain a Lucene index, all off the same POCO. I'm not saying it's always a good idea (i.e. Separation of Concerns, etc.), but when you conform your technologies and map them to a convention-based POCO, I find you enable a lot more functionality than what is otherwise possible.
Not that it needs to be (for the reasons above), but I think the .NET platform has as much chance of becoming obsolete as C structs (i.e. not much). But in any case, I can use OrmLite today and re-use the POCO types as-is to create and populate 1:1 RDBMS tables, should I see fit to.
I'm not sure which side of the tables-first/code-first fence you're on (I'm heavily rooted in the latter), but there have not been many times where I'm expected to export data out of a legacy system without the underlying source code, with access only to the persisted data store. I'm into developing SOA-style architectures anyway, so I never access the internal table structures of an application directly; instead I prefer to expose a language-neutral HTTP+XML facade on top of the application in order to access the underlying data source. That way, all the internal nuances of the underlying persisted format are stitched together by the master application and exposed in a way that makes sense to an external client who knows nothing about the internal schema.
Frans,
By that one statement, you've just blown your credibility out of the water. If you'd taken 10 minutes to look at the properties of a NoSQL DB such as Cassandra, you'd not have made such an erroneous statement. You may be an expert on RDBMSs, but it's not wise to argue a case in an A vs B debate unless you know both A & B. Otherwise you just end up with egg on your face. I think Ayende knows A & B. So who's the real troll here? ;-p
@Frans
Considering the current state of the financial sector, I very much doubt they have "the requirement that data is consistent and correct" ;)
We are about to embark on redesigning an existing system. The key requirements are high availability and near-linear cost scaling. There will be a lot of analytics run on the data, so we need to keep an eye on TCO, and we are working against tight deadlines (6 months), so we need efficient dev tools, build automation, etc. It is clear that an RDBMS will scale, but the license and hardware costs will not be linear. We also want the ability to add capacity without downtime, and we have very spiky demand at the end of each month, which means we have to consider the dreaded Cloud. Nearly all our devs' primary skill sets are C# & SQL. At the moment we are trying to assess where our main pain points will be if we adopt NoSQL: will reporting be a nightmare for our analysts, will the availability story be better or worse, will our dev productivity be reduced?
Charlie,
It is hard to answer those questions without more data.
In particular, I would say that unless you have well defined analysis needs, you want to separate OLTP and Reporting stores.
Give the analysts a reporting database to play with. Do the transactions on a separate data store.
@Charlie Barker
Higher up in the comment list I've included a pretty complete list of the best NoSQL solutions to choose from, i.e.: Cassandra, Hadoop, Scalaris, Redis, MongoDB, etc.
What is missing from this list are the hosted and managed cloud data services, which depending on your requirements may or may not be an option for you. If this is an option, then my first recommendation would be to check out Microsoft Azure services as it offers an elastic solution with costs that are fairly linear but importantly you can re-use your existing teams skill sets.
Other leading hosted, highly available data services are of course Amazon SimpleDB and Google's App Engine + BigTable. Unfortunately there is not much C# love here, as there is only preliminary support (it's ugly) for C# with Amazon SimpleDB and no support for C# in Google's App Engine. Google only offers Python and Java APIs, although the Java API may be more familiar to C# developers.
If a hosted cloud service is not an option, then you may want to consider a NoSQL solution. Before we delve too deeply into that, you may want to consider the conventional approach to building a fairly efficient application, i.e. have an RDBMS back-end but maintain aggressive, intelligent front-line caches. We do this at work on top of a sharded PostgreSQL solution (for zero licensing costs), which works well.
Still no good? Then it's off to NoSQL land we go. What are your data characteristics: is it dataset volume or load throughput? The reason I ask is that Cassandra, Hadoop/HBase and CouchDB can handle large volumes of data, while Redis and MongoDB can handle large loads very efficiently, as they primarily operate in memory but offer back-end HDD storage. The way they achieve high availability is via replication. HBase/Cassandra are BigTable-inspired 'column-oriented' data stores, while CouchDB and MongoDB are document-oriented databases. Redis OTOH provides rich server-side data structures and is serialization-format agnostic.
Note: Ayende is due to release RavenDB into production soon, which is a managed (fully?) document-oriented NoSQL implementation with built-in support for Lucene.
With NoSQL, like most open source projects, C# is treated like a second-class citizen. The exceptions are MongoDB, which has a fairly complete C# client here ( http://github.com/samus/mongodb-csharp), and Redis, where I actively maintain first-class support for the C# Redis client here ( code.google.com/.../ServiceStackRedis). The differences are that MongoDB provides a loosely-typed, document-centric API (that is queryable), while I maintain a native strongly-typed API (with built-in .NET collections for Redis's server-side collections), but with little support for querying.
The alternative solutions are still usable from C# clients, as they offer a REST-like interface to store and manipulate their data. Personally, for large data requirements I would go with CouchDB, as I prefer the simplicity of its pure RESTful API. However, if elastic scaling is one of your main requirements, then I would consider exploring the technologies with more engineering effort in this area, namely Cassandra/Hadoop.
So the remaining questions that will help evaluate the best NoSQL solution for your requirements are: do you need to handle large volumes of data? Large throughput? Rich querying support? etc. As always, I invite you to do your own reading so you can find the solution that best meets your requirements. Hopefully I have been at least helpful in pointing you in the right direction :)
Full disclosure: unless it wasn't made expressly clear before, as the maintainer of the ServiceStack C# client I have a general bias/preference for my own client. At the same time, I wouldn't be maintaining it if I didn't think it offered a superior, productive solution/API. I believe Redis offers simple comp-sci-like data structures that can provide elegant solutions to many problems.
@Demis, Thanks for all the really useful info lots of reading to do :)
Plaah, the person you quote clearly doesn't know what multi-host relational databases can do, so, limited to the little knowledge of single-box databases he has come across, he considers all to be the same. That's prejudice. While I agree you lose FKs and triggers in most cases, transactions and joins work very well with distributed RDBMSs. Take Teradata, Greenplum, Netezza. You can have databases spanning tens of physical hosts and you will almost not even notice it.
Bunter,
The person I am talking to is one of the people building Microsoft Live services.
And tens of physical hosts isn't really impressive, how do you think a transaction will work with 5000 machines participating in it?
Great article. People will always complain and ask "why" would you do something. Why write a database in .Net at all? "everyone" knows C++ is faster so just do it there.
I, for one, applaud your efforts. I started my first C# database in 2003. It scaled to handle 1.6 TB of data, which we used at that time for spam filtration. Now I have the commercial database VistaDB, but it is heavy compared to document databases, mostly due to the requirements of relational data and SQL in general.
SQL is a pain, no two ways about it. I love LINQ. We have been working on a LINQ database internally for about a year (nothing public yet). I wish you the best of luck on your project. We need more people writing things in C# that others say can't be done!
So, 'MikeS', you know what I do and don't understand, eh? It's not about the database system the data is in, it's about what system is used to transform 'data' (i.e. 'bits') into 'information' (i.e. bits with a meaning). In RDBMSs, schemas are used to give meaning to bits; if you don't have a schema, what's giving meaning to the bits? That was my point. Self-describing data is thus bullshit, as data can't describe itself. How is "FooBar" describing itself? You need context! That's right, Mike, context. So what's giving that FooBar string the context it needs to have any meaning? That's the question. With that reply you quoted out of context, I was responding to the 'self-describing data, we don't need a damn schema' remark. Self-describing data is an oxymoron, like self-describing code is too (code doesn't embed design decisions, WHY things are the way they are, only the result of the final decision).
The troll remark was about the blogpost's intention. As if RDBMSs can't do what the majority of developers need. And more.
Btw, 'MikeS', I spent the last 15 years living in database land and breathing databases in all shapes and forms. Please don't assume I don't understand squat about nosql dbs and what's important about data and what's not. Also, who cares if 'Cassandra' has some kind of descriptive nature built in? It's not about a given implementation of some concept, it's the concept itself.
sigh. Your post at least shows 'NoSql' is another topic in the list of topics you can't debate on the web without being butchered by fanboys, who attack you personally and claim you don't know shit about the topic, because you're not part of the in-crowd fans.
What's next? Saying you like SQL and RDBMSs and getting slaughtered by fanboys of nosql similar to what happens when you say 'waterfall has its place' and getting slaughtered by agile pundits ? Because that's what your reaction looks like, 'MikeS'.
Document/OODBs have their usages, like any tool does (for you: that's also RDBMs). That's why I called the post a troll as it claims RDBMSs don't. Discussing the topic like nosql is the new way of doing databases and RDBMs and SQL is for dinosaurs and for people who don't get it or are too stupid is exactly what you shouldn't do. As it will soon position Nosql next to OODBs (which are very much alike btw) in the same niche corner.
@Ayende
Who on earth needs a 5000 machine wide transaction? You? In which application?
Frans,
{ "Name": "Ayende" }, however, is very self-describing.
As for the 5k tx, I don't need it, but you will, if you want a relational database that works on a large set of machines in a distributed fashion.
You won't be able to get consistency otherwise.
Ayende - it can also be a (few) hundred boxes with petabytes of data, and still do transactions. But most certainly not thousands, true. However, they are targeted at a different type of application than web-based applications with massive numbers (tens of thousands) of concurrent users, each living on a small isolated island of data. But not all applications are like that; in fact, most are not. Just don't throw out dogmas like "cannot do X with Y if I have more than one box" or "can do X with Y but hammer nail" and other embarrassing childish metaphors.
Bunter,
When I say something like "the bottom of all ships is wet", you can very well point me to a ship in dry dock and say that its bottom is dry.
But I expect readers to add the common-sense qualifications where they are appropriate, without me having to preface every statement with a dozen qualifications that don't really matter for the subject at hand.
And no, you can't do transactions when you have to touch many machines and handle failure and keep reasonable performance.
@Frans,
I understand your frustration, personal attacks are not cool - especially when most of the time we're just voicing our own opinions and preferences.
Obviously you prefer the safety of maintaining an external RDBMS schema. As we all know, that is still the conventional way most of us develop applications today. There are also times when it will always be the preferred option, i.e. when you have multiple applications accessing the same database (although from a purist SOA POV it's not something I advocate).
Now, the biggest friction point I have while working on a deployed app is trying to manage an evolving RDBMS schema, to the point where even in an RDBMS world I've shunned my uni normalization teachings and am now storing complex KVO types as text blobs in a table for performance and agility reasons.
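To make the blob-in-a-table idea concrete, here is a minimal sketch of the pattern: a complex, evolving structure serialized to a JSON text column instead of being normalized into child tables. It uses Python's stdlib sqlite3 as a stand-in for whatever RDBMS is in play; the table and column names are made up for illustration.

```python
import json
import sqlite3

# In-memory SQLite stands in for the real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_prefs (user_id INTEGER PRIMARY KEY, blob TEXT)")

# A nested, frequently-changing structure stored as one text blob,
# so schema changes don't require ALTER TABLE migrations.
prefs = {"theme": "dark", "filters": ["music", "tech"], "page_size": 25}
conn.execute("INSERT INTO user_prefs VALUES (?, ?)", (1, json.dumps(prefs)))

# Reading it back: deserialization happens in the application layer,
# which is also where the 'master schema' lives.
row = conn.execute("SELECT blob FROM user_prefs WHERE user_id = ?", (1,)).fetchone()
loaded = json.loads(row[0])
print(loaded["filters"])  # ['music', 'tech']
```

The trade-off is the one discussed in this thread: the database can no longer query or validate what is inside the blob; only the application can.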
There are also times when I think storing in an RDBMS is just bad practice, e.g. in transient message queues and long-running process services.
Here you have no querying requirements and just need to retrieve, persist and send arbitrary datasets, where needing to map a complex request DTO onto an ORM configured against an RDBMS is just a waste of development effort.
A lot of the time this friction is a result of trying to map the schema of our hierarchical programming model to an external RDBMS schema. In these cases the programming model maintains the 'master schema' while the RDBMS maintains the 'persisted schema'.
In NoSQL you no longer have the 'full-schema' in the persisted NoSQL db but you have a self-describing one which for a person using JSON might look like:
{"Name": "mythz", "Age": 31, "IsActive": true}
Now the Name, Age and IsActive fields here are as useful as the equivalently named columns in an RDBMS table. JSON also supports primitive types, so we can infer that "Name" is a string field, "Age" is an integer and "IsActive" holds a boolean. What we can't infer is whether this is the complete list of fields for the person type.
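The point above can be sketched in a few lines: each field in a JSON document carries its own type, so a consumer can recover a partial, per-document schema, but nothing says whether the field list is complete. A minimal illustration in Python:

```python
import json

doc = json.loads('{"Name": "mythz", "Age": 31, "IsActive": true}')

# Each value carries its type, so a consumer can infer a partial schema...
inferred = {field: type(value).__name__ for field, value in doc.items()}
print(inferred)  # {'Name': 'str', 'Age': 'int', 'IsActive': 'bool'}

# ...but nothing in the document itself says whether these three fields
# are the complete set for the "person" type; that knowledge lives in
# the application that wrote the document.
```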
This 'lack of full-schema' only matters when you're accessing the data directly in the NoSQL db, by-passing the application that created it. I don't envisage that happening a lot, since most applications come with a built-in UI. Once the data is back in 'app land' there is no problem, as you now have both the data and the strongly-typed 'master schema' that created it. If the application is an XML-based web service, then the 'full-schema' is still available to external clients outside of the service boundary via the application's XSD/WSDL.
Anyway, I'm just re-iterating these details to show that although the persisted NoSQL database doesn't have full knowledge of your data, that is ultimately no real concern, since the knowledge is maintained in other places.
BTW Frans, I've read a bit of your blog and you seem to be a math/algorithm-oriented person, so I think you will appreciate a lot of the features of Redis, as it uses comp-sci data structures to provide an alternative (IMHO sometimes more elegant) approach to a traditional NoSQL DocumentDB. It also has powerful notification/messaging mechanisms built in, which you may find some use-cases for.
@Ayende
That's more than just data, isn't it? I don't care in what shape or form the context in which data is stored is specified, as long as there IS a context definition. With NoSQL dbs, there's no true context which specifies what's valid and what's not; I can store many instances of 1000 different Customer types in the same db. In an RDBMS I can store many instances of only a single type of Customer, the one which fits in the Customer table (relational-model-based inheritance implementations aside).
That's not a bad thing, but it's important to know. Not for when you're using the DB with the app it was written for. It's important when the app is gone but the DB isn't and some other team has to write a new app on top of the database, as the data in the database is the ASSET of the company, the source of information the business drives on.
What I personally think is best is not important, let that be clear. What I'd like to see in this nosql - RDBMS - OODB - XML file - I_dont_care_about_databases debate is that instead of bashing the other camps, the unknowing user gets enough information about the strong and weak points of the technologies available. So for example, with an RDBMS, being a centralized shared resource, scaling has consequences. It also has advantages: it's no problem to throw away the application the database was designed for and in 10-15 years' time start over with a different app and re-use the exact same database and, more importantly, its contents. I'm not convinced that NoSQL, with its lack of information about the structure and context the data is stored in (COMPARED to rdbms-s!) and its reliance on the application to control what goes in and out of the db, what's valid and what's not, and what can relate to what, can offer the same level of re-usability for the DATA, which serves as the source of information a company is based on.
That however may be perfectly fine. For example for applications which have a 1000:1 read/write ratio or only store data which is worthless within a week / month / year (twitter! ;)), it's unlikely that the data is key for a company's existence. But do the readers of your and other people's blog know that? Do they have enough information to make that important decision: is the data stored in a database THROUGH my application going to last for a long time to come so I have to take into account the re-usability of the data?
And please... I'm not stupid. I know you can traverse using the sparse info available in the db: that 'a' customer instance has a navigational property 'Orders' which navigates you to 'a' set of 'a' type Order. But what if I have 1000 Customer types and 500 Order types (over-exaggerated ;))? Which one stores the info I need? Mind you: every app can store its own types in the db.
I think the debate about these database types can only become a healthy one IF people realize the consequences of the choice of ANY type of db, be it RDBMSs or NoSQL or other. Claiming that one type is always the best choice is of course naive; that's not the point. The point is whether readers of this blog and other blogs/articles get just one side of the story or the full story. That's also why I called your post a troll, as you bash RDBMSs like they're not good for anything and NoSQL dbs are. Fact is, however, that many websites of average companies could even run from a shoddy Access db.
@Demis:
History teaches us that it does happen, and often more frequently than you imagine. Ever heard the phrase "It's a legacy database"? :) (often used as a synonym for 'poorly designed cruft') The db lives on, the app gets rewritten, or a different app is built on top of it. These situations (and others) need attention and could make or break someone's decision of which db to choose. When the original app is necessary to understand what's going on in the DB... it's very fragile.
Reading about NoSQL dbs on various blogs, I get the feeling I'm being sold the holy grail of database technology: what we had (rdbms) is bad, what we have now is better, it solves all your problems. No it doesn't, nor is it the successor of the RDBMS.
RDBMS vendors do the same thing btw, so it's not that the NoSQL world is the bad guy; the whole debate is just polluted and IMHO currently done wrong. And then I'm not even referring to the new trend, started by big software vendors like Microsoft, of lifting services to the level of database systems (OData) so you can program against those instead of a local db. The common factor in all these stories is that the technical tool for working with the data is what's presented as special. But the way the data is preserved and re-usable across applications, now and in the future (so, in a way, system agnostic), is what matters. Tools come and go; the data stays, and only when its context stays with it.
@Frans
Maybe it's just me, but whenever I've had to replace a 'legacy database' there was always a pre-data-migration stage where the old database is projected into a new (and correct) schema. I guess for some very big and important legacy databases the risk is too high and this is not done, in which case I don't really see the benefit of a complete app re-write.
Yeah, though I don't view it as a complete replacement at all, rather a complementary technology, but it does bring some benefits to the table which I'm happy to take advantage of if the use-case permits. e.g. At mflow we use Redis to manage user sessions, handle real-time notifications, hold the security tokens for our 'full previews' and maintain the list of 'live posts' and 'online users', in addition to all our multi-tier caching needs. We've had it in production for nearly 6 months and it's working well, providing great performance to our web services as well as taking lots of load off our sharded PostgreSQL databases.
As a back-end systems developer my 'world-view', if you like, is a little different: all the clients that I serve are bound to loosely-coupled DTOs, or 'web service contracts' if you like. In order to maintain a working system, I know that I just have to maintain that 'contract' in future.
From there my main goals are to fulfill the services in the fastest and most scalable way possible.
At mflow we originally started with a db4o solution plus a 'sqlite db for each user to manage their inbox'. As soon as our internal ORM+LINQ provider was developed (that's another story) we migrated everything to a sharded MySQL RDBMS solution. When we had problems trying to replicate and cluster on MySQL we moved to PostgreSQL, which is where we remain today. Most of the frequently accessed web services don't actually hit the databases directly any more; instead they hit our multi-tiered intelligent caches that are maintained in Redis.
The point of the story is that we've changed our 'data sources' many times, and as long as we didn't change the web service definition, the clients are none the wiser and work without skipping a beat. So it's not the database/data-source that remains, it's the application. This may not be everybody's experience, but it has been mine for a while.
I will say this though: any decision-maker entertaining the idea of a move to a NoSQL database should be well informed of its benefits and limitations. You no longer have the rich querying of SQL and will usually have to maintain your own indexes.
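To illustrate what "maintaining your own indexes" means in practice, here is a minimal in-memory sketch of the pattern a key-value or document store pushes onto the application: a hand-maintained secondary index from a field value to the set of document keys. The data and names are made up; in a real store (e.g. a Redis set per city) the application would have to update this index on every write.

```python
# Primary storage: key -> document, as a document db would hold it.
users = {
    "user:1": {"name": "mythz", "city": "London"},
    "user:2": {"name": "ayende", "city": "Hadera"},
    "user:3": {"name": "frans", "city": "London"},
}

# Hand-maintained secondary index: city -> set of user keys.
# There is no query planner to build this for you; the application
# must keep it in sync with the primary data itself.
index_by_city = {}
for key, doc in users.items():
    index_by_city.setdefault(doc["city"], set()).add(key)

def users_in_city(city):
    """The manual equivalent of SELECT ... WHERE city = ?."""
    return sorted(index_by_city.get(city, ()))

print(users_in_city("London"))  # ['user:1', 'user:3']
```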
That is how most of the data is stored in NoSQL solutions.
And I think that your fear about thousands of types of customer is overstated. You tend to have a mostly predictable system.
If the app is gone, in all the cases I have seen, the database was replaced as well.
It is when the app is still there (along with its extended family) that you see a database continuing to be used.
I wouldn't touch a 10 - 15 years old database with a long pole. I would certainly not reuse it without a DAMN good reason, and "that is how we did things" isn't a good enough reason.
I would extract the content using some ETL process to my database, using my schema.
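An ETL pass of that kind can be sketched in a few lines. This is a hypothetical example using Python's stdlib sqlite3 for both sides; the legacy table and column names are invented to show the shape of the extract-transform-load step, not any real schema.

```python
import sqlite3

# "Legacy" database with an old-style schema.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE CUST (CUSTNAME TEXT, CUSTCITY TEXT)")
legacy.executemany("INSERT INTO CUST VALUES (?, ?)",
                   [("Ayende Rahien", "Hadera"), ("Frans Bouma", "Utrecht")])

# New database with the schema the new application actually wants.
modern = sqlite3.connect(":memory:")
modern.execute(
    "CREATE TABLE customers (first_name TEXT, last_name TEXT, city TEXT)")

# Extract each legacy row, transform it (split the glued-together name),
# and load it into the new schema.
for name, city in legacy.execute("SELECT CUSTNAME, CUSTCITY FROM CUST"):
    first, _, last = name.partition(" ")
    modern.execute("INSERT INTO customers VALUES (?, ?, ?)",
                   (first, last, city))

rows = modern.execute(
    "SELECT first_name, last_name, city FROM customers "
    "ORDER BY first_name").fetchall()
print(rows)
```

Real migrations are of course messier (encodings, nulls, undocumented conventions), but the structure is the same: read from their schema, write to yours.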
I don't know about you, but I was involved in 4 - 5 different projects where we had an initial database, usually 10 years old plus, and we had to migrate.
In one memorable case, we had to hack the database to get access to the data, because it was stored in a proprietary format and the database company refused to allow us to get the data (the client's own data!) out of it.
15 years ago.... That means SQL Server 6.0 and Oracle 7. If I give you one of those databases, good luck getting it to work.
Moreover, perceptions, usage and best practices change over time. You want to tell me that a database developed for (at best) departmental usage in the age of the client/server is an appropriate database for a multi user web based system?
See my point above about always migrating unless there ARE apps using it.
And given that most NoSQL solutions are inherently self-describing (JSON for doc dbs, properties & nodes for graph databases, schemas for column databases), that is absolutely not an issue.
"When I say something like "the bottom of all ships is wet", you can very well point me to a ship in dry dock and say that its bottom is dry."
Sorry, but in this case I feel it's more like lots of people who are not singing the NoSQL gospel are saying ship bottoms are wet, and you are trying to point out that they can be dry :D
"And no, you can't do transactions when you have to touch many machines and handle failure and keep reasonable performance."
Your readers must be really in tune with you if they can easily contextualize "many" and "reasonable performance", I guess. I will try to find the specs for some of the largest installations of MPP DBs. I bet you will be surprised how far the good old relational oldies can go once you move from a single box to multiple boxes, with little more knowhow than "the DB is ze place where my app writes data".