My article in ACM Queue just got published; it discusses implementing LINQ providers for NoSQL databases.
You can read all about it here: http://queue.acm.org/detail.cfm?id=2001564
“We tried using NoSQL, but we are moving to Relational Databases because they are easier…”
That was the gist of a conversation that I had with a client. I wasn't quite sure what was going on there, so I invited myself to their offices and took a peek at the code. Their actual scenario is classified, so we will use the standard blog model to show a similar example. In this case, we have three entities: the BlogPost, the User and the Comment. What they wanted to ensure is that when a user comments on a blog post, it will update the comments count on the blog post, update the posted comments count on the user and insert the new comment.
The catch was that they wanted the entire thing to be atomic, to either happen completely or not at all. The other catch was that they were using MongoDB. The code looked something like this:
public ActionResult AddComment(string postId, string userId, Comment comment)
{
    int state = 0;
    var blogPost = database.GetCollection<BlogPost>("BlogPosts").FindOneById(postId);
    var user = database.GetCollection<User>("Users").FindOneById(userId);
    try
    {
        database.GetCollection<Comment>("Comments").Save(comment);
        state = 1;

        blogPost.CommentsCount++;
        database.GetCollection<BlogPost>("BlogPosts").Save(blogPost);
        state = 2;

        user.PostedCommentsCount++;
        database.GetCollection<User>("Users").Save(user);
        state = 3;

        return Json(new { CommentAdded = true });
    }
    catch (Exception)
    {
        // if (state == 0) // nothing happened yet, don't need to do anything
        if (state >= 1)
        {
            database.GetCollection<Comment>("Comments").Remove(Query.EQ("_id", comment.Id), RemoveFlags.Single);
        }
        if (state >= 2)
        {
            blogPost.CommentsCount--;
            database.GetCollection<BlogPost>("BlogPosts").Save(blogPost);
        }
        if (state >= 3)
        {
            user.PostedCommentsCount--;
            database.GetCollection<User>("Users").Save(user);
        }
        throw;
    }
}
Take a moment or two to go over the code and figure out what was going on in there. It took me a while to really figure that one out.
Important: before I continue with this post, I feel that I need to explain what the problem is and why it is there. Put simply, MongoDB doesn't support multi document transactions. The reason is that the way MongoDB auto sharding works, different documents may reside on different shards, which would require synchronization between different machines, something no one has managed to make scalable and efficient. MongoDB chose, for reasons of scalability and performance, not to implement this feature. This is a documented and well known part of the product.
It makes absolute sense, except that it leads to code like the one above when the user really does want atomic multi document writes. Just to be certain that the point has been hammered home: the code above still does not ensure atomic multi document writes. For example, if the server shuts down immediately after we set state to 2, there is nothing that the code can do to revert the previous writes (after all, it can't contact the server to tell it to revert them).
And there are other problems with this approach: the code is ugly, and it is extremely brittle. It is very easy to update one part and not the other… but at this point I think that I am busy explaining why horse excrement isn't suitable for gourmet food.
The major problem with this code is that it is trying to do something that the underlying database doesn’t support. I sat down with the customer and talked about the advantages and disadvantages of staying with a document database vs. moving to a relational database. A relational database would handle atomic multi row writes easily, but would require many reads and many joins to show a single page.
That was the point where I put the disclaimer “I am speaking about my own product, and I am likely biased, be aware of that”.
The same code in RavenDB would be:
public ActionResult AddComment(string postId, string userId, Comment comment)
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(comment);
        session.Load<BlogPost>(postId).CommentsCount++;
        session.Load<User>(userId).PostedCommentsCount++;
        session.SaveChanges(); // Atomic, either all are saved or none are
    }
    return Json(new { CommentAdded = true });
}
There are a couple of things to note here:
We also support change tracking for loaded entities, so we didn’t even need to tell it to save the loaded instances. All in all, I also think that the code is prettier, easier to follow and would produce correct results in the case of an error.
My session with the .NET user group in Aarhus, Denmark was recorded and is available.
I tried to cover (in the space of an hour) the major types of NoSQL databases. It is a bit rushed, but I think it is pretty good.
The application data is one of the most precious assets that we have. And for a long time, there wasn't any question about where we are going to put this data. The RDBMS was the only game in town. The initial drive away from the RDBMS was indeed driven by the need to scale. But that was just the original impetus to start developing the NoSQL solutions. Once those solutions came into being and matured, it wasn't just the "we need web-scale" players that benefited.

Proven & mature NoSQL solutions aren't applicable only at the high end of scaling. NoSQL solutions provide a lot of benefits even for applications that will never need to scale beyond a single machine. Document databases drastically simplify things like user defined fields, or working with aggregates. The performance of a NoSQL solution can often exceed a comparable RDBMS solution, because the NoSQL solution will usually focus on a very small subset of the feature set that an RDBMS has.
I recently got this question on email, and I thought it would be a good subject for a post.
I wanted to get your thoughts about using NoSQL for data warehouse solutions. I have read mixed thoughts about this and curious where you stand.
Before we can talk about this, we need to understand what data warehousing is; using the wiseGEEK definition, that is:
Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated.
And if you follow that definition, it makes absolute sense to ask about data warehousing in a NoSQL situation. But remember, one of the things that tends to lead people to NoSQL land is the desire to scale in some manner (more data, more users, higher concurrency, cheaper TCO) beyond what is possible using a SQL solution. In order to achieve that goal, you have to be willing to accept the tradeoff that comes with it, which is reduced flexibility. You can query a relational database every which way, but most NoSQL solutions have very strict rules about how you can query them, for example.
By the way, I am probably abusing the term SQL here. I meant the whole set of technologies generally associated with relational databases, so in this case, I am talking about OLAP data stores, which are the typical solution for data warehousing scenarios. OLAP is usually queried with MDX, which looks like this:
SELECT
    {
        [Measures].[Sales Amount],
        [Measures].[Tax Amount]
    } ON COLUMNS,
    {
        [Date].[Fiscal].[Fiscal Year].&[2002],
        [Date].[Fiscal].[Fiscal Year].&[2003]
    } ON ROWS
FROM [Adventure Works]
WHERE ( [Sales Territory].[Southwest] )
OLAP & MDX, like relational databases & SQL, give us a lot of flexibility and power. But like relational databases, those come at a cost. At some point, if you have enough data, it becomes impractical to store it all on a single server, and the usual arguments for NoSQL solutions come to the fore.
At that point, we have to decide what it is that we want to get from the data warehouse. In other words, we need to design our solution to match the kind of reports that we want to get out. Of the NoSQL solutions out there (Key/Value stores, Document Databases, Graph Databases, Column Family Databases), I would probably choose a Column Family database for such a task, since my primary concern is being able to handle large amounts of data.
The type of reports that I would need would dictate how I would store the data itself, but once I built the schema, everything else should just work.
In short, for data warehousing, I think that the relational / OLAP world has significant advantages, mostly because in many BI scenarios you want to allow the users to explore the data, which is easy with the SQL toolset and harder with NoSQL solutions. But when you get too large (and large in OLAP scenarios is really large), you might want to consider limiting the users' options and going with a NoSQL solution tailored to what they need.
I spent 4 hours talking about the different NoSQL approaches; you can find the video here.
Column family databases are probably best known because of Google's BigTable implementation. They are very similar on the surface to relational databases, but they are actually quite a different beast. Some of the difference is storing data by rows (relational) vs. storing data by columns (column family databases). But a lot of the difference is conceptual in nature. You can't apply the same sort of solutions that you used in relational form to a column database.
That is because column databases are not relational; for that matter, they don't even have what an RDBMS advocate would recognize as tables.
Nitpicker corner: this post is about the concept, I am going to ignore actual implementation details where they don’t illustrate the actual concepts.
Note: If you want more information, I highly recommend this post, explaining about data modeling in a column database.
The following concepts are critical to understanding how column databases work:
Columns and super columns in a column database are sparse, meaning that they take exactly 0 bytes if they don't have a value in them. Column families are the nearest thing that we have to a table, since they are about the only thing that you need to define upfront. Unlike a table, however, the only things that you define in a column family are the name and the key sort options (there is no schema).
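To make those concepts concrete, here is a minimal conceptual sketch in C#. The names and types are my own invention, not any real client library; the point is only that a column family boils down to a sorted map of rows, each row a sorted map of column name to value, with reads offered by key or by key range only (more on that below).

using System.Collections.Generic;

// Hypothetical, in-memory model of a single column family (illustration only).
public class ColumnFamily
{
    // row key -> (column name -> value), both kept in sort order
    private readonly SortedDictionary<string, SortedDictionary<string, byte[]>> rows =
        new SortedDictionary<string, SortedDictionary<string, byte[]>>();

    public void Insert(string rowKey, string columnName, byte[] value)
    {
        if (rows.TryGetValue(rowKey, out var columns) == false)
            rows[rowKey] = columns = new SortedDictionary<string, byte[]>();
        columns[columnName] = value; // unset columns simply never appear
    }

    // Query by key: fetch one row and its (sorted) columns.
    public SortedDictionary<string, byte[]> Get(string rowKey)
    {
        rows.TryGetValue(rowKey, out var columns);
        return columns;
    }

    // Query by key range: walk the rows in key order between two keys.
    public IEnumerable<KeyValuePair<string, SortedDictionary<string, byte[]>>> GetRange(string fromKey, string toKey)
    {
        foreach (var row in rows)
        {
            if (string.CompareOrdinal(row.Key, fromKey) < 0) continue;
            if (string.CompareOrdinal(row.Key, toKey) > 0) yield break;
            yield return row;
        }
    }
}

A real CFDB, of course, persists this to disk and spreads the rows across machines by key, which is exactly where the rest of this post is going.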
Personally, I think that column family databases are probably the best proof of leaky abstractions. Just about everything in CFDB (as I’ll call them from now on) is based around the idea of exposing the actual physical model to the users so they can make efficient use of that.
It is important to understand that schema design in a CFDB is of utmost importance; if you don't build your schema right, you literally can't get the data out. A CFDB usually offers one of two forms of queries: by key or by key range. This makes sense, since a CFDB is meant to be distributed, and the key determines where the actual physical data would be located. The data is stored based on the sort order of the column family, and you have no real way of changing the sorting (except choosing between ascending and descending).
The sort order, unlike in a relational database, isn’t affected by the columns values, but by the column names.
Let us assume that in the Users column family, in the row "@ayende", we have the column "name" set to "Ayende Rahien" and the column "location" set to "Israel". The CFDB will physically sort them like this in the Users column family file:
@ayende/location = "Israel"
@ayende/name = "Ayende Rahien"
This is because "location" sorts lower than "name". If we had a super column involved, for example in the Friends column family, and the user "@ayende" had two friends, they would be physically stored like this in the Friends column family file:
@ayende/friends/arava = 945
@ayende/friends/rose = 14
Remember this, because this property is quite important to understanding how things work in a CFDB. Let us take the Twitter model as our example. We need to store users and tweets, so we define three column families: Users, Tweets and UsersTweets.
Let us create the user (a note about the notation: I am using named parameters to denote a column's name & value here. The key parameter is the row key, and the column family is Users):
cfdb.Users.Insert(key: "@ayende", name: "Ayende Rahien", location: "Israel", profession: "Wizard");
You can see a visualization of how this looks below. Note that this doesn't look at all like how we would typically visualize a row in a relational database.
Now let us create a tweet:
var firstTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(
    key: firstTweetKey,
    application: "TweekDeck",
    text: "Err, is this on?",
    private: true);

var secondTweetKey = "Tweets/" + SequentialGuid.Create();
cfdb.Tweets.Insert(
    key: secondTweetKey,
    app: "Twhirl",
    version: "1.2",
    text: "Well, I guess this is my mandatory hello world",
    public: true);
And here is how it actually looks:
There are several things to notice here:
That last point bears some talking about. In a relational database, we would define a column called UserId, and that would give us the ability to link back to the user. Moreover, a relational database would allow us to query the tweets by the user id, letting us get the user's tweets. A CFDB doesn't give us this option; there is no way to query by column value. For that matter, there is no way to query by column (which is a familiar trick if you are using something like Lucene).
Instead, the only thing that a CFDB gives us is a query by key. In order to answer the question of which tweets belong to a user, we need the UsersTweets column family:
cfdb.UsersTweets.Insert(
    key: "@ayende",
    timeline: { SequentialGuid.Create(): firstTweetKey });

cfdb.UsersTweets.Insert(
    key: "@ayende",
    timeline: { SequentialGuid.Create(): secondTweetKey });
On the CFDB, it looks like this:
And now we need a bit more explanation about the notation. Here we insert into the UsersTweets column family, into the row with the key "@ayende", two columns under the super column timeline; the name of each column is a sequential guid, which means that we can sort by it. What this actually does is create a single row with a single super column, holding two columns, where each column name is a guid and the value of each column is the key of a row in the Tweets column family.
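Using the same physical layout notation as earlier (a sketch only; the real column names are the generated sequential guids, which I am writing here as {first guid} and {second guid}), the row would look roughly like this:

@ayende/timeline/{first guid} = firstTweetKey
@ayende/timeline/{second guid} = secondTweetKey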
Question: Couldn’t we create a super column in the Users’ column family to store the relationship? Well, yes, we could, but a column family can contain either columns or super columns, it cannot contain both.
Now, in order to get tweets for a user, we need to execute:
var tweetIds = cfdb.UsersTweets.Get("@ayende")
    .Fetch("timeline")
    .OrderByDescending()
    .Take(25)
    .Select(x => x.Value);

var tweets = cfdb.Tweets.Get(tweetIds);
In essence, we execute two queries, one on the UsersTweets column family, requesting the columns & values in the “timeline” super column in the row keyed “@ayende”, then execute another query against the Tweets column family to get the actual tweets.
Because the data is sorted by the column name, and because we chose to sort in descending order, we get the last 25 tweets for this user.
What would happen if I wanted to show the last 25 tweets overall (for the public timeline)? Well, that is actually very easy: all I need to do is query the Tweets column family, ordering the rows by descending key order.
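Using the same imaginary API (GetRange and SortOrder below are my inventions, just like the rest of it), that query might look something like this:

var publicTimeline = cfdb.Tweets
    .GetRange(order: SortOrder.Descending)
    .Take(25);

Because the row keys are built from sequential guids, walking the Tweets column family in descending key order gives us the most recent tweets first.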
Nitpicker corner: No, there is no such API for a CFDB in .NET that I know of; I made it up so it would be easier to discuss the topic.
Why is a CFDB so limiting?
You might have noticed how many times I noted differences between an RDBMS and a CFDB. I think that it is the CFDB that is the hardest to understand, since it is so close, on the surface, to the relational model. But it seems to suffer from so many limitations: no joins, no real querying capability (except by primary key), nothing like the richness that we get from a relational database. Hell, SQLite or Access gives me more than that. Why is it so limited?
The answer is quite simple. A CFDB is designed to run on a large number of machines and to store huge amounts of information. You literally cannot store that amount of data in a relational database, and even multi-machine relational databases, such as Oracle RAC, will fall over and die very rapidly at the size of data and queries that a typical CFDB handles easily.
Do you remember that I noted that a CFDB is really all about removing abstractions? A CFDB is what you get when you take a database and strip away everything that makes it hard to run on a cluster.
The reason that a CFDB doesn't provide joins is that joins require you to be able to scan the entire data set. That requires either some place that has a view of the whole database (resulting in a bottleneck and a single point of failure) or actually executing a query over all machines in the cluster. Since that number can be pretty high, we want to avoid that.
A CFDB doesn't provide a way to query by column or value because that would necessitate either an index of the entire data set (or even just of a single column family), which is, again, not practical, or running the query on all machines, which is not feasible. By limiting queries to key only, a CFDB ensures that it knows exactly which node a query can run on. It means that each query runs on a small set of data, making queries much cheaper.
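To give a feel for why key-only queries are so cheap, here is a hedged sketch of how a row key might be mapped to a node, assuming naive hash based placement (real systems use consistent hashing or order preserving partitioning, plus replication, all of which I am ignoring; every name here is made up):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class KeyRouter
{
    // Map a row key to the node that owns it. Every client can compute this
    // locally, so a by-key query goes straight to the single node that owns the row.
    public static string NodeForKey(string key, IList<string> nodes)
    {
        using (var md5 = MD5.Create())
        {
            var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            var bucket = (BitConverter.ToInt32(hash, 0) & 0x7fffffff) % nodes.Count;
            return nodes[bucket];
        }
    }
}

A query by value has no such shortcut; there is no way to know which node holds the matching rows without asking all of them.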
It requires a drastically different mode of thinking, and while I don’t have practical experience with CFDB, I would imagine that migrations using them are… unpleasant affairs, but they are one of the ways to get really high scalability out of your data storage.
Waiting expectantly for the commenters who will say that relational databases are the BOMB, that I have no idea what I am talking about, that I should read Codd, and that no one really needs to use this sort of stuff except maybe Google, and even then only because Google has no idea how an RDBMS works (except maybe the team that worked on AdWords).
This is a pure "scratch an itch" design post; since I started thinking about the topic, I might as well put my thoughts in a structured format. I am going to use Twitter for this example, and assume that my end goal is to be able to do graph traversal, not just find neighbors. Applying this to other social network scenarios should be pretty easy.
I generated this graph using this tool; it works on mentions rather than the following / followers graph, though. It will serve to give you an idea of what I am talking about.
One of the fundamentals of computer science is that localizing data leads to greater read performance. This is true whether we are talking about keeping the data in the CPU L1 cache or in a distributed networked system. Typically, part of a sharding strategy is to keep all the data related to a root in a single location. The problem is, of course, that graphs don't really have roots. And in most social network graphs, there is no such thing as a closed graph. There are, on average, ~7 million users within three hops of any Twitter user. Now, it would be very easy to put 7 million users on a single machine, except, as they say, they aren't the same 7 million users.
Given that, I think that I can come up with an approach that allows more efficient queries and higher localization in the graph. The approach assumes an open and dynamic model (indeed, it relies on that).
We start with geographical distribution. When we create a new user, we place it in a shard dedicated to the geographical location the user is in. This is a piece of data that we can get cheaply, and it has the advantage that users who interact with their circle of physical friends will tend to be clustered together anyway.
Next, we start assigning weights to associations. We only take into account outgoing associations (which solves the problem of outliers for incoming associations, such as @aplusk), but with a small twist: the weight of each outgoing association is taken as a portion of the total number of outgoing associations. In other words, the value of an outgoing association when you are following 10 friends is 0.1, but when you are following 500 friends, the value of each association is 0.002.
Next, we place some value on each mention that the user tweets. A mention indicates that the association is active. For that matter, we probably need to create silent associations if a user keeps mentioning someone that they are not following. For now, we will say that each mention is worth 1% of the value of an association. That means that if I am following 100 users and I mentioned another user a hundred times, the value of that association is 0.02.
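To spell out the arithmetic (this is just my reading of the scheme above, with hypothetical names):

// An association (explicit or silent) starts at 1 / (number of users being followed),
// and each mention adds another 1% of that base value.
static double AssociationValue(int followingCount, int mentions)
{
    var baseWeight = 1.0 / followingCount;
    return baseWeight + (mentions * 0.01 * baseWeight);
}

// Following 100 users and mentioning someone a hundred times:
// 0.01 + (100 * 0.01 * 0.01) = 0.02, the value quoted above.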
How does gravity come into play here? Well, each outgoing association exerts a pull on a node. But moving a node between shards is expensive, so we give shards an escape velocity. When we check whether we need to re-shard a node, we aggregate the pull of the node's associations per shard. Only if one shard's pull is higher than the current shard's pull + the escape velocity will the node be shifted to the new shard.
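Here is a sketch of that check; Association, Weight and the escape velocity value are all invented for illustration:

using System.Collections.Generic;
using System.Linq;

public class Association
{
    public string TargetShard;
    public double Weight;
}

public static class Resharder
{
    // How much stronger another shard's pull must be before it is worth moving the node.
    public const double EscapeVelocity = 0.5; // made-up threshold

    public static string DecideShard(string currentShard, IEnumerable<Association> outgoing)
    {
        // Aggregate the pull that each shard exerts on this node.
        var pulls = outgoing
            .GroupBy(a => a.TargetShard)
            .ToDictionary(g => g.Key, g => g.Sum(a => a.Weight));

        if (pulls.Count == 0)
            return currentShard;

        pulls.TryGetValue(currentShard, out var currentPull);

        var strongest = pulls.OrderByDescending(p => p.Value).First();

        // Only move when the strongest pull beats the current shard's pull plus the escape velocity.
        return strongest.Value > currentPull + EscapeVelocity
            ? strongest.Key
            : currentShard;
    }
}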
Over time, this will tend to migrate most users close to the users that they are actively associated with. With that in mind, we can now move on to queries.
As I mentioned, I am interested in this mostly for graph traversal, and the good thing about this approach is that, for the most part, most of the relevant information is located on the same shard. When the time comes to perform a query, we can say that queries, too, need an escape velocity to cross a shard boundary. Unless there are enough outgoing connections to a given shard to overcome the escape velocity, outgoing connections to that shard are ignored. We can limit the cost of remote calls further if we increase the escape velocity for the query as the query searches deeper. In other words, the escape velocity at depth = 0 would be 10, at depth = 1 it would be 100, at depth = 2 it would be 1000, etc.
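In code, the per-depth threshold from that example is simply (a sketch; the base of 10 is taken from the example above, not from any measurement):

// Escape velocity for crossing a shard boundary at a given search depth:
// 10 at depth 0, 100 at depth 1, 1000 at depth 2, and so on.
static double QueryEscapeVelocity(int depth)
{
    return Math.Pow(10, depth + 1);
}

A cross-shard hop at a given depth is followed only when there are enough outgoing connections to that shard to overcome QueryEscapeVelocity(depth).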
While this represents a loss in the accuracy of the results, it also means that in most cases, the results will tend to be more relevant.
Is this important if I don’t care for graph traversal?
There is another issue to consider, quite aside from graph traversal. The gravity approach outlined above tends toward higher localization, and most operations are local. Consider writing a status update to all the people interested in it: when you have good locality, the cost of such an operation drops drastically. In fact, I would say it is highly likely that many such operations could be completed entirely locally.
During the course of this series, I got a lot of people declaring: "But you can do that with an RDBMS if you use XYZ". The problem inherent in this statement is that it ignores the fact that if you really want to, you can use a hammer to drive screws; it isn't nice or efficient, but you can do it.
I ran across this post by Dare Obasanjo that explains the actual problem in beautiful terms:
What tends to happen once you’ve built a partitioned/sharded SQL database architecture is that you tend to notice that you’ve given up most of the features of an ACID relational database. You give up the advantages of the relationships by eschewing foreign keys, triggers and joins since these are prohibitively expensive to run across multiple databases. Denormalizing the data means that you give up on Atomicity, Consistency and Isolation when updating or retrieving results. And the end all you have left is that your data is Durable (i.e. it is persistently stored) which isn’t much better than you get from a dumb file system. Well, actually you also get to use SQL as your programming model which is nicer than performing direct file I/O operations.
It is unsurprising that after being at this point for years, some people in our industry have wondered whether it doesn’t make more sense to use data stores that are optimized for the usage patterns of large scale websites instead of gloriously misusing relational databases. A good example of the tradeoffs is the blog post from the Digg team on why they switched to Cassandra. The database was already sharded which made performing joins to calculate results of queries such as “which of my friends Dugg this item?” to be infeasible. So instead they had to perform two reads from SQL (all Diggs on an item and all of the user’s friends) then perform the intersection operation on the PHP front end code.
I can't use most of the relational database's traditional advantages anyway, not the moment that I step outside a single machine boundary. At that point, I have to re-evaluate what I am doing to see if it makes sense to keep dealing with the traditional relational database disadvantages. In many cases, the answer is no, and a solution that fits the problem better can be found.
This is where the NoSQL databases started. I think that once they got mature enough, they gained advantages for smaller scale solutions as well, but that is a topic for another day.
Yesterday I talked about graph databases, outlining what they are and how they work. One of the interesting things about this series is that in many cases, I am posing a question (to myself), trying to answer it, then go and find out what other people do.
When thinking about scaling scenarios for a graph database, I had the following scenario in mind: a graph of nodes that is spread across multiple servers, where each member in the graph may reside on any machine in the system. The following diagram demonstrates what I am thinking about; each rectangle represents a different machine in the system:
Why is this important?
A single machine solution is obviously a barrier to scaling (and to safety, but that is another concern). In a graph database, having relations between the nodes is the point, and that makes sharding a bit more complicated, because unless you store the entire graph on a single machine, you are forced to query across machine boundaries. And you can't store the graph on a single machine, for the simple reason that it is unlikely that you can limit a graph to be that small. Think about the implications of Six Degrees of Separation for graph databases and it will be clear what the problem is. In real world graphs, everyone is connected to everyone.
The problem with breaking the entire graph across machines is that it is now much more expensive to do graph traversals. The following query, which previously ran on a single machine:
new GraphDatabaseQuery
{
    SourceNode = ayende,
    MaxDepth = 3,
    RelationsToFollow = new[] { "As Known As", "Family", "Friend", "Romantic", "Ex" },
    Where = node => node.Location == ayende.Location,
    SearchOrder = SearchOrder.BreadthFirst
}.Execute();
now needs to touch 3 different machines. Worse, it isn't the number of machines that determines the cost, but the spread of graph nodes across the machines in the system.
After spending some time thinking about it, I came to the conclusion that I can't envision any general way to solve the problem. Oh, I can think of several ways of reducing the problem:
The solution most likely to be successful is limiting the depth of cross machine node searches. In many cases, that is acceptable, I think. If we put the depth limit on 3, we can still give pretty good answers in a reasonable time frame. But the only way this can be made to work is with good batching support.
The algorithm may look like:
public IEnumerable<Node> DistributedTraverse(Node sourceNode, int depth, string relationToFollow, Func<Node, bool> predicate)
{
    if (depth == 0) // feeling defensive
        yield break;

    var related = GetRelatedNodes(sourceNode.ShardName, relationToFollow, predicate);
    foreach (var result in related)
        yield return result;

    if (depth == 1) // don't even bother asking down the line
        yield break;

    foreach (var relationsByShard in related.GroupBy(x => x.ShardName))
    {
        var shard = GetShardProxy(relationsByShard.Key);
        var results = shard.BatchedQuery(relationsByShard.ToArray(), depth - 1, relationToFollow, predicate);
        foreach (var result in results)
            yield return result;
    }
}
This gives us a maximum of (depth * number_of_machines_in_cluster) – depth remote calls: with a depth of 3 and 3 machines in the cluster, we would have a max of 6 calls.
With that theory out of our heads, let us examine how real world Graph DBs tried to resolve this issue…
Neo4j (which seems to be pretty much the default for graph DBs) doesn't handle this currently; there are some hints that they intend to offer cluster wide replication, but nothing about design or implementation details. Neo4j does offer a write-master/read-slaves approach for scaling out, which is really nice, but even that approach hits a limit at some point, and in this post I am focusing on what happens when you go beyond that point.
FlockDB (which is what is used by Twitter) does include, as part of its design goals, "horizontal scaling including replication". However, FlockDB isn't trying to solve the problem outlined above; indeed, graph traversal is a non-goal for it. FlockDB is more about finding one level of relations very fast than anything else.
In summary, I believe that while you can shard a graph database, doing so places very strict limits on the type of graph walking queries you can make. Now, just to give you an idea, Neo4j, for example, appears to be able to routinely handle billions of nodes on a single machine, so you might not need to scale higher than that.