That No SQL Thing
Probably the worst thing about relational databases is that they are so good in what they are doing. Good enough to conquer the entire market on data storage and hold it for decades.
Wait! That is a bad thing? How?
It is a bad thing because relational databases are appropriate for a wide range of tasks, but not for every task. Yet it is exactly that that caused them to be used in contexts where they are not appropriate. In the last month alone, my strong recommendation for two different client was that they need to switch to a non relational data store because it would greatly simplify the work that they need to do.
That met with some (justified) resistance, predictably. People think that RDBMS are the way to store data. I decided to write a series of blog posts about the topic, trying to explain why you might want to work with a No SQL database.
Relational Databases have the following properties:
- Table / Row based
- Rich querying capabilities
- Foreign keys
Just about any of the No SQL approaches give up on some of those properties., usually, it gives up on all of those properties. But think about how useful an RDBMS is, how flexible it can be. Why give it up?
Indeed, the most common reason to want to move from a RDBMS is running into the RDBMS limitations. In short, RDBMS doesn’t scale. Actually, let me phrase that a little more strongly, RDBMS systems cannot be made to scale.
The problem is inherit into the basic requirements of the relational database system, it must be consistent, to handle things like foreign keys, maintain relations over the entire dataset, etc. The problem, however, is that trying to scale a relational database over a set of machine. At that point, you run head on into the CAP theorem, which state that if consistency is your absolute requirement, you need to give up on either availability or partition tolerance.
In most high scaling environments, it is not feasible to give up on either option, so relational databases are out. That leaves you with the No SQL options, I am going to dedicate a post for each of the following, let me know if I missed something:
- Key/Value store
- Key/Value store – sharding
- Key/Value store - replication
- Key/Value store – eventually consistent
- Document Databases
- Graph Databases
- Column (Family) Databases
Other databases, namely XML databases and Object databases, exists. Object databases suffer from the same problem regarding CAP as relational databases, and XML databases are basically a special case of a document database.
What do you think about using Document Database as a Time Series database ? Time series will be used as whole chunks- it is not really useful to have a single sample, usually you need the whole document to perform some computation ie a forecasting -. If you think is appropriate at a frist glance, what about dealing with different series version, in which the different version differ for a little portion of samples ?
The only thing preventing us from moving to nosql is ease of reporting and easy access to data outside of an API. Are there any solutions to this?
I would need more information to tell.
You shouldn't be reporting from an OLTP anyway, the ETL solutions for NoSQL -> OLAP are straigtforward
Why would reporting in no sql be more complicated than in sql?
What do you mean with without an API, something like sql manager studio?
Yes. I mean ease of extracting data without having to do it through code. Is there any tooling to help with extraction process to OLAP? I have never been able to find information about this.
You say it is straightforward. It would be great if you could back this up with an example. Thx
Please remind me about this in a week, when I am done with the currently scheduled posts.
@Ayende Sure. Will do. Many thx
I would be interested in what happens to performance when you turn some features off like foreign keys or ACID-properties. As far as I know, you can turn those off easily in mysql, but in mssql there are surely hacks like a filter driver which turns off fsync or a ramdisk.
It doesn't really matter, it isn't about performance, it is about _scaling_, two different things.
Ayende, looking forward to these posts, but I think you should include a "how a reporting process would work with a document store nosql db" at this moment we're using rhino etl,but pushing to another sql instance. I would love to see your recommendations and perceived advantages in reporting.
how about bigtable?
I am very much looking forward to your posts. However, I agree with the other people, writing about solutions for dealing with reporting needs would be very, very, relevant and would answer questions that many developers have when the subject of No SQL comes up.
Sharding, replication, eventual consistency aren't exclusive properties of key-value stores. They are part of column/document databases as well. I know you know this, but you might want to might it clear to the readers.
Reporting -- Those who are asking about reporting really need to decide whether they are talking about OLTP or OLAP scenarios. For OLTP reporting, you often don't have to do anything special in NoSQL solutions because your data is already denormalized. There is plenty of good information out there on this:
NoSQL databases seem like they're a good solution for websites that store large globs of non-reportable free text, like Digg or Facebook do. As long as the DB can handle searching properly, they're probably a better choice than a RDBMS.
But, NoSQL still has an uphill battle in a lot of areas. It's nice for some kinds of applications, but I'm not convinced it's a drop-in replacement for a real RDBMS in more than a few narrow scenarios. How do you handle DR? Also, most large organizations have a major investment in tools and reporting staff that already speak SQL. Not everyone is a programmer, and if an IT staff can offload reporting tasks to other departments, all the better. How do non-programmers write queries against NoSQL dbs? If you need transactions and ACID compliance, how do you accomplish that? Point-in-time backups? What hosting companies allow you to run a NoSQL db?
NoSQL dbs seem to be a great tool for certain kinds of applications with few developers, but as an enterprise offering, it seems a bit immature at the moment. And all this talk about not scaling seems a bit misleading - sure, I'll agree that you can't easily scale an RDBMS to handle loads like what Facebook, Digg, Amazon, and their ilk handle. But how many real world apps need to handle that much traffic? Having run a multi-terabyte data warehouse and a bunches of 100+ GB OLTP databases, the scalability wall is so far distant on an RDBMS that it's silly to even consider it for most apps. I'm not yet convinced, though I'm pretty interested in playing with Raven DB to see if it'll change my mind at all!
BigTable is a Column Family DB
I really like your explanation Ayende. I'll follow the series.
My question is; Is "scaling" the main reason to chose nosql db instead of sql db?
Within the confines of a company many many applications don't need to scale to any level where a RDBMS breaks into any sweat. Also pretty cool was the move to introduce a standardized DSL to query a data store: A thing that made NH talking to RDBMS of different vendors possible.
Anyway, our last project was a system that is completely temporal (all entites and their relations are completely traceable) and after reading some wicked posts from Udi and hitting quite some complexity with our approach, I will gladly listen to alternative approaches in persisting stuff.
Theoretically rdms might not scale, in reality they do it a lot better than most of the homegrown K/V engines...
It isn't theory, that is plain fact. With more than enough evidence about that all around you.
Excellent! I am going to email the link to every post in this series to my co-workers. If I hit them with a hammer enough times, certain ones might get the point. I'm looking at you, Kon. (Yes, that's really his name. abbreviated actually. Go ahead, I'm sure he's never heard the Star Trek joke before.)
NoSQL have very limited transaction support or none at all and sometimes you can't live without acid
Most No SQL DBs have full ACID support, there is nothing in No SQL that forbid transactions.
Let me just point to:
About relational databases, that they don't scale: I don't really know what you've been using but that's utter nonsense.
At some point, RDBMS can't handle the load without running into CAP.
At that point, they stop working for real world usage.
See the links that I provided.
At that point, they stop working for real world usage."
I've been using relational databases for many many years, and I've yet to see a relational database that big that was too slow to keep up. Mind you: millions and millions of new rows per day isn't too much.
For the vast majority of the people using databases in their applications, relational databases are just fine, only for the very few who write the new amazon.com, or the google competitor might need different databases, but how many are those? a handful.
I'm not saying NoSQL should just die off as it has no value, it does have value, but the scope is pretty limited as the data contained in document databases is rigid and you need the software to give meaning to the data, plus creating new projections on the data without the software (e.g. in different software) is hard. If that's of no concern, be my guest, but people should realize that.
Btw. NoSQL means Not Only SQL. It doesn't mean !SQL.
It is actually quite easy to get to a point where SQL don't really work for you. We are dealing with a lot more scale than in the past.
Sure, for apps that have relatively few users (where few is tens of thousands), that is not an issue, but when you start talking about large number of users, large amount of data, complex interactions that you need to deal with, you run into the limitations of CAP, and when you do that, you can't use RDBMS.
Case in point, http://www.gilt.com/, this is an online shopping site where they have a sale every day at noon, they have huge peeks at those times, they tried using RDBMS, and then moved to other options, because they handled insane peeks much better.
Even few thousands users hammering the site at a given point in time is likely to kill it. And we haven't talked yet about how to handle tables with billions of rows where you _are actually touching those rows_, rather than just write them out.
Yes, RDBMS are perfectly fine for a lot of apps, but I think that I did a good job in establishing the context in which you shouldn't use them.
Data in doc dbs is rigid? What gave you that idea?
I haven't noticed the data in document db's to be rigid at all. I have made many changes to the layout of my data in my current application with almost no need to think about it. At least no more thought than I would have using a typical instance of SQL Server.
I think what you may be thinking of are object databases that serialize objects to persist them. Then you do run into versioning issues. However, most modern document db's do their best to make the requested data fit into whatever object type you specify (and even easier if your in a dynamic language where it can always make it fit).
Where did you get the cazry notion?
People use replciated databases all the time. By their very nature they don't have foreign keys, nor are they necessarily consistent with the main database at any given time. (Where I work each table is anywhere from a couple of seconds to 2 hours delayed depending on how important it is and how often it changes.)
And then there are true reporting databases. These are denormalized versions of the data that are specifically designed to be fast to query. They may even be reduced all the way down to key-value pairs. (Again, where I work we have XML blobs that were distilled from a dozen tables.)
You seem to be suffering from the same mistake a lot of novice database developers make, which is thinking that the game being and ends with a single, highly normalized, transactional database. But in a real system you could have dozens of databases serving different roles, often using different schema to share the same data.
I think you are wrong. Postgrsql can scale
Nope, it doesn't matter what the DB product is, relational DBs breaks down at scale not because of implementation, but because of their very nature.
You aren't working with a single database, you are working with copies of it. That isn't a single RDBMS spanning multiple machines and allowing you to query on all the data in the DB.
"dozens of databases serving different roles" - agreed, that IS my point.
I am familiar with this (well, I am familiar with this on Oracle's RAC side).
The problem is that those solutions run into CAP head on. If one of your servers goes down, Bad Things happen.
Look at the 4th paragraph in the link that you provided, that lay out the problem in easy to understand terms.
All of the solutions provided in that link are ways of dealing with information on a single server, and just load balancing the load.
The only one that provide a multi server query execution is the paragraph with: Multi-Server Parallel Query Execution
And that does seem to provide for solutions for failover, relies on a single server and is going to hit a perf cap very quickly (with a 100 servers, your query time is going to be consumed with managing the query, not actually querying).
I'm glad you're weighing on this topic as well Oren, as there have been a flamefest brewing on the Internet of late, mostly from people who haven't used NoSql databases before and think they're 10 years of RDBMS experience gives them enough qualifications to comment on it - it really doesn't NoSql datastores have solved a lot of problems that have been typically hard to do with RDBMS.
Note NoSql is not a replacement for RDBMS, they actually complement each other quite well. It's still all about choosing the best tool for the job. Quite simply RDBMS is good (and is still the best at) storing Relational, tabular data, their is no disputing that and that statement still holds true. It's not however so good for storing deep hierarchical data or for storing alternate data structures i.e. Message Queues, etc. (it can still be done, but like any hammer it's not a good fit).
Ok I've noticed a couple of one-line comments that indicate that RDBMS (or even their particular brand of RDBMS) can actually scale. Scaling for all intents and purposes means 'horizontal scaling' i.e. you can throw an extra commodity server in the cluster and you can handle 1/n more load than it did before. Usually this means that there are no single bottlenecks (i.e. central servers) each request goes through which allows you to evenly distribute your requests evenly over your app servers and data stores. Most NoSql databases all clients include consistent hashing algorithms which allow you to do this.
In the RDBMS world the way we typically scale is to use either Master/ replicated Read slaves or partition your data in a sharded architecture. These are still good approaches to scaling RDBMS they are however more complex to configure and typically cost more to run and maintain than their NoSql equivalents.
There are other good reasons to try the NoSql route, namely speed (e.g. Redis can perform 110,000 write operations on an entry level linux box) and schema-less designs. Both these topics are too big to cover in a single comment so I'll try cover them in my own blog posts when I can find the time.
Now most of the time we're developing enterprise applications for internal use so we're lucky enough to never hit the scaling limits of RDBMS's in these cases it's safe to ignore NoSql datastores for your own use (although there are still other benefits). Unfortunately as an architect of a social media service (mflow), performance and scalability considerations are mandatory requirements that must be factored into our design which basically consists of an in-memory 'cached data views' mirroring our persisted data which resides on multiple sharded postgresql databases.
For those that are interested in trying out NoSql datastores I recommend looking at Redis for which I maintain a rich C# redis client (and windows server builds) at:
How does load balancing differ from scalability ?
Also I see solutions there does involve multiple servers.
In above read queries are balanced. Thus gives us a degree of scalability.
Statement-Based Replication Middleware
Same goes with above. I assume you read all these.
If you are after something like map-reduce, I would say, map reduce isn't the only way of scalability. Load balancing , even if the query is executed on a single server means scaling in my book. Am I wrong ?
If you're storing XML as text blobs in the database you may want to checkout my Open Source C# POCO serializer:
It's a 3.5x faster and 2.6x more compact than .NET's XML DataContract serializer. It's also cleaner and more resilient to schema changes, supports inheritence, late-bound object properties, etc and can work with any C# POCO type, not just DTO's. Effectively it was made for blobbing data in a fast, clean text format.
In short all RDBMS scaling options require access to a central server (which given a large enough load becomes the bottleneck). They are also more costly in hardware and maintenance costs, here's a good article explaining it in a bit more detail:
"In short all RDBMS scaling options require access to a central server (which given a large enough load becomes the bottleneck)"
I fail to see how above is correct. Please read my earlier post carefully. I don't see a central server (at least for read-only queries) is obligatory.
Replicated read slaves still takes processing resources from the central master database server(s). When the master goes down all system writes effectively cease.
Depending on whether you can partition your data then sharding can be a superior solution. In these cases there is still a master table maintained on a central server(s) that normally serves as a lookup as to which shard the partitioned data (e.g. User) lives. When adding new users you still need writeable access to the master table, again both cases rely on a central database server(s).
In contrast with NoSql datastores if one server goes down, only the users that live on that store are affected. New users can still be stored in one of the available datastores.
We've opted to go with a sharded PostgreSql solution for our persistence needs though we still utilize intelligent in-memory tiered data views for both increased perf and to take processing off the db servers. Any read/write operation we can do on Redis we consider to be a No-op. Even at our small scale we are hitting RDBMS limits as we have services that routinely perform 1000+ writes, on Redis this can easily be achieved <1sec, while on Postgresql (even sharded) it takes a lot longer.
Ok I missed the 'Statement-Based Replication Middleware' solution you've quoted. This very much looks like Master-Master replication which by the sounds of it is suggesting is happening on the app/middle-tier level. Master-Master replication is a good option for a small datasets but doesn't help much with scaling as your writes are effectively multiplied amongst available 'master servers' and not divided in partitioned datastores which is obviously the most efficient solution. Another factor which does not make it an ideal candidate for 'horizonal scaling' is that the dataset is not partitioned and each master effectively has the entire copy of the dataset.
Master-Master replication also adds extra complexity in maintaining consistency across all master servers. Are the writes asynchronous or blocking? How do you compensate for masters that are down? Again this is all do able with RDBMS, it just takes a lot more effort and resources.
I'm not sure why everyone's so concerned with scalability, when the biggest benefits of a OODB is the fact that they're so easy to use.
The fact that no longer being required to to polute 60-75% of our code base with handrolled SQL or being forced to use an ORM just to communicate with a Database should be reason enough to at least look at using a NoSQL solution.
And, as always, if you are doing reporting off your transaction database, you are Doing It Wrong (tm). Quite frankly, it amazes me that it even comes up as an arguement.
Put simply, what when your data set outgrows what can be stored on a single machine? On 5 machines?
I think the space consideration is not the concern here in terms of scalability. Will I be able to do next google ? Hell no. (Yet yahoo uses a highly modified Postgresql) . My very point is the statement "RDBMS does not scale" is wrong.
@Demis please see my first comment in this post that links to Postgresql site. The options are mentioned there.
It doesn't take a lot of data to move outside the realm of a single machine.
If you are assuming read clone, the cost of large amount of storage goes up very quickly.
Why aren't you worried about "corrupt" data as you put it with NoSQL databases?
Giving up transactions lets say is worse than "Read Uncommitted" in MSSQL, and you went on on on how that would lead to corrupt data what gives?
My point is RDBMS give you all those properties if you want them, and many will allow you to turn them off if you want performance.
RDBMS can scale just like a NoSQL solution if you are ok with eventual consistency and lack of transactions and constraints, but they give you the option to sacrafice scalability for correctness, where a NoSQL solution may not.
BTW Google adwords runs on MySQL, is that enough scalability for you?
NoSQL doesn't mean no transactions.
Most (all?) NoSQL solutions are transactionable.
Ayende how do we deal with revisions in object databases ? For eg, if I have a person object. After 6 months in production I add a new column "ExternalID" . At this point how do I deal with data that is already there ? If I don't modify them, then how am I going to show the object in the view ?
BTW where is your gravtar image ? Find it difficult to track your replies!
From your newer post: "Transactions – while it is possible to offer transaction guarantees in a key value store, those are usually only offer in the context of a single key put. It is possible to offer those on multiple keys, but that really doesn’t work when you start thinking about a distributed key value store, where different keys may reside on different machines."
Transactions on a single key put would be like an RDBMS only offering transactions on a single row update at a time. This is LESS isolated than even "Read Uncommitted" in MSSQL, oh no corrupt data!!!
"Some data stores offer no transaction guarantees."
I thought most (all?) offered transactions, which is it?
RDMBS's are being scaled using shared-cache architectures (e.g. Oracle RAC) and in memory data grids (e.g. Oracle Coherence) with ACID intact. But these are expensive solutions in my opinion - a NoSQL approach fronted by data services providing "most" ACID qualities feels like would give better performance throughput per $ spent. Additionally domain models generally map better as object models are more cleanly exposed via such data services.
Like many other latest-and-greatest solutions, this one (NoSQL) solves a problem I do not have. If I did suddenly have terrible scaling problems and run head on into CAP, I would definitely check out NoSQL.
So I am grateful for the explanation of the problem Ayende. One day I hope to have that problem, when I invent the next YouTube and get Google to buy me for way too much money.
Until then, I'll happily use my rusty little SQL database and continue to try to milk as much value as possible out of the technology castle I've built for my company these past 12 years.
I don't know object databases well enough to answer that.
In general, you don't use multiple keys to store a single value, hence, there is no need to handle multi key transaction.
In other words, you bring relational thinking to non relational world, and then complain that things are broken.