NoSQL and Data Warehousing


I recently got this question by email, and I thought it would be a good subject for a post.

I wanted to get your thoughts about using NoSQL for data warehouse solutions. I have read mixed thoughts about this and curious where you stand.

Before we can talk about this, we need to understand what data warehousing is. Here is the wiseGEEK definition:

Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated.

And if you follow that definition, it makes absolute sense to ask about data warehousing in a NoSQL situation. But remember, one of the things that tends to lead people to NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, lower TCO) beyond what is possible using a SQL solution. In order to achieve that goal, you have to be willing to accept the tradeoff associated with it, which is reduced flexibility. You can query a relational database every which way, but most NoSQL solutions have very strict rules about how you can query them, for example.
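To make that tradeoff concrete, here is a minimal sketch rather than a benchmark or a real deployment. It assumes a tiny, made-up set of sales records and uses Python's built-in sqlite3 module for the relational side and a plain dict as a stand-in for a key/value store. The names SALES, sql_total_by_year and kv_get are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical sample data: (order_id, territory, fiscal_year, amount)
SALES = [
    (1, "Southwest", 2002, 100.0),
    (2, "Southwest", 2003, 150.0),
    (3, "Northwest", 2002, 80.0),
]

# Relational side: any column can be filtered or grouped on after the fact.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, territory TEXT, "
    "fiscal_year INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SALES)


def sql_total_by_year(territory):
    """Ad hoc question the schema was never designed around."""
    rows = conn.execute(
        "SELECT fiscal_year, SUM(amount) FROM sales "
        "WHERE territory = ? GROUP BY fiscal_year ORDER BY fiscal_year",
        (territory,),
    )
    return rows.fetchall()


# Key/value side (a dict standing in for the store): the only access
# path is the key that was chosen at write time.
kv_store = {oid: (territory, year, amount) for oid, territory, year, amount in SALES}


def kv_get(order_id):
    """Lookup by key is the one query this store supports directly."""
    return kv_store[order_id]
```

Answering "total sales per year for the Southwest" from the key/value side would mean scanning every value or maintaining a second, purpose-built structure, which is exactly the reduced flexibility described above.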

By the way, I am probably abusing the term SQL here. I mean the whole set of technologies generally associated with relational databases, so in this case, I am talking about OLAP data stores, which are the typical solution for data warehousing scenarios. OLAP is usually queried with MDX, which looks like this:

SELECT
    { [Measures].[Sales Amount], 
        [Measures].[Tax Amount] } ON COLUMNS,
    { [Date].[Fiscal].[Fiscal Year].&[2002], 
        [Date].[Fiscal].[Fiscal Year].&[2003] } ON ROWS
FROM [Adventure Works]
WHERE ( [Sales Territory].[Southwest] )

OLAP & MDX, like the relational database & SQL, give us a lot of flexibility and power. But as with relational databases, those come at a cost. At some point, if you have enough data, it gets impractical to store it all on a single server, and the usual arguments for NoSQL solutions come to the fore.

At that point, we have to decide what it is that we want to get from the data warehouse. In other words, we need to design our solution to match the kind of reports that we want to get out. Of the NoSQL solutions out there (Key/Value stores, Document Databases, Graph Databases, Column Family Databases) I would probably choose a Column Family database for such a task, since my primary concern is probably being able to handle large amounts of data.

The types of reports that I need would dictate how I store the data itself, but once I have built the schema, everything else should just work.
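As a rough illustration of letting the report dictate the storage layout, here is a toy sketch that assumes the only report needed is "sales and tax per fiscal year for a given territory", similar in shape to the MDX query above. It uses an in-memory dict of dicts as a stand-in for a single column family, with the territory as row key and one column per year and measure. It does not use any real database API, and the class, column naming scheme and sample figures are all made up.

```python
from collections import defaultdict

# Hypothetical fact records: (territory, fiscal_year, sales_amount, tax_amount)
FACTS = [
    ("Southwest", 2002, 100.0, 8.0),
    ("Southwest", 2002, 50.0, 4.0),
    ("Southwest", 2003, 200.0, 16.0),
    ("Northwest", 2002, 75.0, 6.0),
]


class SalesByTerritory:
    """Toy stand-in for one column family: row key -> {column name -> value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(float))

    def record(self, territory, fiscal_year, sales, tax):
        # Aggregate at write time, into the exact shape the report reads.
        row = self.rows[territory]
        row[f"{fiscal_year}:sales"] += sales
        row[f"{fiscal_year}:tax"] += tax

    def report(self, territory, years):
        # One row lookup answers the whole report; no scan, no join.
        row = self.rows.get(territory, {})
        return {
            year: (row.get(f"{year}:sales", 0.0), row.get(f"{year}:tax", 0.0))
            for year in years
        }


store = SalesByTerritory()
for fact in FACTS:
    store.record(*fact)

southwest = store.report("Southwest", [2002, 2003])
```

The flip side of this layout is that a question it was not built for, such as sales per product, would need its own column family populated at write time.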

In short, for data warehousing, I think that the relational / OLAP world has significant advantages, mostly because in many BI scenarios, you want to allow the users to explore the data, which is easy with the SQL toolset, and harder with NoSQL solutions. But when you get too large (and large in OLAP scenarios is really large), you might want to consider limiting the users’ options and going with a NoSQL solution tailored to what they need.