NoSQL and Data Warehousing
I recently got this question on email, and I thought it would be a good subject for a post.
I wanted to get your thoughts about using NoSQL for data warehouse solutions. I have read mixed thoughts about this and curious where you stand.
Before we can talk about this, we need to understand what data warehousing is, using wise geek definition, that is:
Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated.
And if you follow that definition, it make an absolute sense to ask about data warehousing in a NoSQL situation. But remember, one of the things that tend to lead people to the NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, cheaper TCO) than is possible using a SQL solution. In order to achieve that goal, you have to be willing to accept the tradeoff associated with that, which is reduced flexibility. You can query a relational database every which way, but most NoSQL solutions have very strict rules about how you can query them, for example.
By the way, I am probably abusing the term SQL here. I meant the whole set of technologies generally associated with relational databases, so in this case, I am talking about OLAP data stores, which are the typical solution for data warehousing scenarios. OLAP is usually queried with MDX, which looks like this:
SELECT { [Measures].[Sales Amount], [Measures].[Tax Amount] } ON COLUMNS, { [Date].[Fiscal].[Fiscal Year].&[2002], [Date].[Fiscal].[Fiscal Year].&[2003] } ON ROWS FROM [Adventure Works] WHERE ( [Sales Territory].[Southwest] )
OLAP & MDX, like the relational database & SQL, gives us a lot of flexibility and power. But like relational databases, those come at a cost. At some point, if you have enough data, it gets impractical to store it all in a single server, and the usual arguments for NoSQL solutions come to the fore.
At that point, we have to decide what is it that we want to get from the data warehouse. In other words, we need to design our solution to match the kind of reports that we want to get out. Of the NoSQL solutions out there (Key/Value stores, Document Databases, Graph Databases, Column Family Databases) I would probably choose a Column Family database for such a task, since my primary concern is probably being able to handle large amount of data.
The type of reports that I would need would dictate how I would store the data itself, but once I built the schema, everything else should just work.
In short, for data warehousing, I think that the relational / OLAP world has significant advantages, mostly because in many BI scenarios, you want to allow the users to explore the data, which is easy with the SQL toolset, and harder with NoSQL solutions. But when you get too large (and large in OLAP scenarios is really large), you might want to consider limiting the users’ options and going with a NoSQL solution tailor to what they need.
Comments
Leaving aside scaling, if I look at the current wide-column family of solutions (Cassandra, HBase, Hypertable) in terms of queries all I get is primary key indexes and range scans. I'd say that this is quite restrictive in terms of BI. You'll probably want to plug in something like Hadoop, Hive or Pig to really get closer to the features needed for a data warehouse.
Ayende, could you provide or point me to an explanation of why this is true:
"one of the things that tend to lead people to the NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, cheaper TCO) than is possible using a SQL solution."
I gather that will be the first question I have to answer to the DBA :)
@Steve, check out the articles in the NoSQL category here on Ayende's blog: http://ayende.com/Blog/category/565.aspx. In particular, this artical is really good at showing where NoSQL can scale in situations where a relational model cannot: ayende.com/.../that-no-sql-thing.aspx
Alex,
Sure, which is why I said that the NoSQL solution has less flexibility than the alternative.
Map/reduce solutions are good for data crunching to generate the report, as part of the solution (preparing ahead of time).
Steve,
Take a look at my NoSQL category, I have a lot of info about that
What kind of reporting tools are out there that work well with a NoSQL solution?
Thanks,
"one of the things that tend to lead people to the NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, cheaper TCO) than is possible using a SQL solution."
Go tell that to the guys at Vertica, Greenplum, or VoltDB. SQL seems to scale just fine for them.
SQL nor the relational model has much if anything to do with scalability, its more about which parts of ACID you turn off.
Thanks for answering my question. I was thinking in the same line as you. Since analysis and reporting of data is essential to DW/BI, current NoSQL solutions are not up to par with this yet. Given time, more tool sets will be available to accommodate for this need in the NoSQL arena.
VoltDB is not a RDBMS,not in a traditional way,first you can do unlimited JOINs only to replicated tables(those table aren`t partitioned and should be mostly read-only and small if you want to scale) or to 1 non-replicated table.
also your queries need to be single-sited(their term for hitting only one partition of the table those only hitting one server) in order to get a fast query.
this is how i see VoltDB:
a auto shrading layer on top of a in-memory relational-like storage without even auto balancing or adding nodes online,in my opinion using a sharded Tokyo Cabient or some other fast KV store you will get better performance.
in short:VoltDB is a NoSQL(in sense it is not really relational) with a SQL interface :)
NoSQL is a very wrong term,not relational is way better.
i can give you a SQL interface over whatever you want without joins!
and if you would read a bit more about VoltDB you will see that the joins are very limited!
f you think about it is a in-memory row store with a ability to do lookups to other read-only tables!
you can`t create a fully normalized schema with VoltDB you will be doing too many joins.
for me relational = joins because that is the way you model your data,if you don`t have joins(or have a limited form of joins) you start modeling your data in a very different way.
i really recommend you to install VoltDB(very easy install) and try the auction db,then go to the adhoc query tool(need python for that) and try to do some joins,see what you get :)
by the way VoltDB is more comparable to TimesTen or if we don`t look at their fancy SQL interface to redis(which i think is way better than VoltDB)
"for me relational = joins because that is the way you model your data"
That is not how the creator of the relational model defines it and VoltDB themselves are pretty explicit in stating their goals where a relational db to complete with NoSQL alternatives.
From their FAQ :"VoltDB is a SQL RDBMS for database applications that have extreme transaction processing (XTP) workloads" And "VoltDB is a relational store that uses SQL for its query language and stored procedures for its transactional unit of work. "
So what are they lying?
Again every relational implementation has limitations based on it's architecture.
Relations are Tables, SQL is relational algebra, even without joins.
Your just trying to move the goal posts.
Comment preview