You’ll pry transactions from my dead, cold, broken hands
“We tried using NoSQL, but we are moving to Relational Databases because they are easier…”
That was the gist of a conversation that I had with a client. I wasn’t quite sure what was going on there, so I invited myself to their offices and took a peek at the code. Their actual scenario is classified, so we will use the standard blog model to show a similar example. In this case, we have three entities: the BlogPost, the User and the Comment. What they wanted to ensure is that when a user comments on a blog post, it will update the comments count on the blog post, update the posted comments count on the user and insert the new comment.
The catch was that they wanted the entire thing to be atomic, to either happen completely or not at all. The other catch was that they were using MongoDB. The code looked something like this:
public ActionResult AddComment(string postId, string userId, Comment comment)
{
    int state = 0;
    var blogPost = database.GetCollection<BlogPost>("BlogPosts").FindOneById(postId);
    var user = database.GetCollection<User>("Users").FindOneById(userId);
    try
    {
        database.GetCollection<Comment>("Comments").Save(comment);
        state = 1;

        blogPost.CommentsCount++;
        database.GetCollection<BlogPost>("BlogPosts").Save(blogPost);
        state = 2;

        user.PostedCommentsCount++;
        database.GetCollection<User>("Users").Save(user);
        state = 3;

        return Json(new { CommentAdded = true });
    }
    catch (Exception)
    {
        // state == 0: nothing happened yet, no compensation needed
        if (state >= 1)
        {
            database.GetCollection<Comment>("Comments")
                .Remove(Query.EQ("_id", comment.Id), RemoveFlags.Single);
        }
        if (state >= 2)
        {
            blogPost.CommentsCount--;
            database.GetCollection<BlogPost>("BlogPosts").Save(blogPost);
        }
        if (state >= 3)
        {
            user.PostedCommentsCount--;
            database.GetCollection<User>("Users").Save(user);
        }
        throw;
    }
}
Take a moment or two to go over the code and figure out what was going on in there. It took me a while to really figure that one out.
Important: before I continue with this post, I feel that I need to explain what the problem is and why it is there. Put simply, MongoDB doesn’t support multi document transactions. The reason it doesn’t is that with the way MongoDB auto sharding works, different documents may be on different shards, therefore requiring synchronization between different machines, which no one has managed to make scalable and efficient. MongoDB chose, for reasons of scalability and performance, not to implement this feature. This is a documented and well known part of the product.
It makes absolute sense, except that it leads to code like the one above, when users really do want to have atomic multi document writes. Just to be certain that the point has been hammered home: the code above still does not ensure atomic multi document writes. For example, if the server shuts down immediately after setting state to 2, there is nothing that the code can do to revert the previous writes (after all, it can’t contact the server to tell it to revert them).
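For contrast, the one atomic building block MongoDB does give you is the single document update. Here is a minimal sketch of what that looks like with the same legacy C# driver as above (illustrative only, not the client’s code):

    // Atomic within the single BlogPost document: $inc needs no
    // read-modify-write cycle, so there is nothing to compensate for.
    database.GetCollection<BlogPost>("BlogPosts").Update(
        Query.EQ("_id", postId),
        Update.Inc("CommentsCount", 1));

That is fine for one counter in isolation, but it does nothing to make the three writes above atomic as a unit.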
And there are other problems with this approach: the code is ugly, and it is extremely brittle. It is very easy to update one part and not the other… but at this point I think that I am busy explaining why horse excrement isn’t suitable for gourmet food.
The major problem with this code is that it is trying to do something that the underlying database doesn’t support. I sat down with the customer and talked about the advantages and disadvantages of staying with a document database vs. moving to a relational database. A relational database would handle atomic multi row writes easily, but would require many reads and many joins to show a single page.
That was the point where I put the disclaimer “I am speaking about my own product, and I am likely biased, be aware of that”.
The same code in RavenDB would be:
public ActionResult AddComment(string postId, string userId, Comment comment)
{
    using (var session = documentStore.OpenSession())
    {
        session.Store(comment);
        session.Load<BlogPost>(postId).CommentsCount++;
        session.Load<User>(userId).PostedCommentsCount++;
        session.SaveChanges(); // Atomic, either all are saved or none are
    }
    return Json(new { CommentAdded = true });
}
There are a couple of things to note here:
- RavenDB supports atomic multi document writes out of the box, with nothing extra required.
- This isn’t the best RavenDB code; ideally the session would be created in the infrastructure rather than here, but you get the point.
We also support change tracking for loaded entities, so we didn’t even need to tell it to save the loaded instances. All in all, I also think that the code is prettier, easier to follow and would produce correct results in the case of an error.
Comments
Isn't this solution useless if two people post a comment at the same time?
I would have thought the best solution would be to create a postwithcomments view to provide the count.
Bob, This is really something that depends on your usage scenarios. In RavenDB, you can tell it to fail the transaction because of a concurrency conflict, or you can do patching (still within the same transaction), etc. You’ve got options, and a lot of them are really good ones.
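For example, opting into optimistic concurrency is one line on the session. A minimal sketch, assuming the same documentStore as in the post:

    using (var session = documentStore.OpenSession())
    {
        session.Advanced.UseOptimisticConcurrency = true;
        session.Load<BlogPost>(postId).CommentsCount++;
        // Throws a ConcurrencyException if someone else changed the post
        // meanwhile, failing the whole transaction instead of losing a write.
        session.SaveChanges();
    }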
Ayende, you should explain what makes this possible in RavenDB, and why transactions are possible with multiple documents in a sharded setup.
Teleo, There are several things involved here:
a) For a single server, we support atomic multi document writes natively. (Note that this isn't the case for Mongo even for a single server.)
b) For multiple servers, we strongly recommend that your sharding strategy localize documents, meaning that the actual update only happens on a single server.
c) For multi server, multi document atomic updates, we rely on distributed transactions.
The last is not really recommended for common use, because it has known scalability issues.
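To give a rough idea of what (b) looks like in practice, you shard on an id that keeps related documents together, so a whole write stays on one server. The exact API varies between RavenDB versions, so treat the names below as a sketch; the PostId property on Comment is assumed for illustration:

    // Illustrative sketch: two shards, with posts and their comments colocated.
    var shards = new Dictionary<string, IDocumentStore>
    {
        { "shard1", new DocumentStore { Url = "http://server1:8080" } },
        { "shard2", new DocumentStore { Url = "http://server2:8080" } },
    };

    var shardStrategy = new ShardStrategy(shards)
        .ShardingOn<BlogPost>(post => post.Id)
        // Hypothetical PostId property keeps a comment on its post's shard.
        .ShardingOn<Comment>(comment => comment.PostId);

    var documentStore = new ShardedDocumentStore(shardStrategy).Initialize();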
This is a great article, but you could still go one step further. I mean, why even increment the counts on any other documents? You could have blown your client's mind by showing them your ability to project counts through indexes. By doing so, it reduces the ultimate solution down to 1 or 2 lines.
RavenDB just keeps getting better!
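Here is a sketch of what Khalid is describing: a map/reduce index that maintains the count, so the documents never store it. This assumes a PostId property on Comment and is illustrative rather than production code:

    using System.Linq;
    using Raven.Client.Indexes;

    public class Comments_CountByPost : AbstractIndexCreationTask<Comment, Comments_CountByPost.Result>
    {
        public class Result
        {
            public string PostId { get; set; }
            public int Count { get; set; }
        }

        public Comments_CountByPost()
        {
            // Map: emit one entry per comment.
            Map = comments => from comment in comments
                              select new { comment.PostId, Count = 1 };

            // Reduce: RavenDB keeps the per-post sum up to date
            // as comments come and go.
            Reduce = results => from result in results
                                group result by result.PostId into g
                                select new { PostId = g.Key, Count = g.Sum(x => x.Count) };
        }
    }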
Khalid, Yes, that is not a good solution in terms of RavenDB, but the sample is mostly focused on demonstrating a very specific feature.
How is it possible that your client chose MongoDB knowing that they would need transactional processing? This is not a shortcoming of MongoDB, it's by design - mongo authors dropped transaction support in favor of performance. MongoDB wasn't too difficult for your client - designing an application was. Probably sticking to good old SQL was the only good decision to make in their case; I don't think using Raven instead of Mongo would improve their chances of success.
Rafal, They ran into this requirement about a year after they started working on the system; it wasn't something that they initially had to worry about. It was a requirement that came out of new features popping up that weren't foreseen.
Well, it had to be 'the straw that broke the camel's back' if they decided to throw away the underlying database in order to handle a new requirement. Wonder how many problems do they have now and how it will affect their ability to deliver anything working anytime soon?
Ayende, you are always good at choosing which features you support and which you don't. Transactions were a good choice.
I do not know how anyone could actually use MongoDB in production without transactions. That must bite you all the time. Basically, every bug in your app that causes a request to crash mid-way has the potential to corrupt data. I consider this to be completely unacceptable for most types of application.
Nice, that comment that I spent half an hour typing out was rolled back because some stupid counts could not be updated :)
I'd blame your client here, load and save User, Post to update some count?? Mongo $inc anyone?
And by the way, I was surprised to read that Mongo has a global reader/writer lock across collections! Choose wisely, but I guess if it's good enough for Foursquare it is good enough for the rest of us...
Ajai
Ajai, As I said, that is really an issue of how you want to deal with things. In their scenario, it absolutely made sense to have it happen in this fashion. The blog model is a very simple one, one that is very easy to work with and explain, but it is not something where you can say: "NEVER lose this data". The actual scenario did require them to have all or nothing semantics, so please do not try to read too much into the sample; it is intentionally simplified to make it easy to understand.
You forgot to pass 'userId' into 'session.Load<User>()'.
Mike, Thanks, minor detail, but I fixed it.
So the general idea here is MongoDB does not support multiple document transactions, RavenDB does. However, as you mention, if you have to shard and don't/can't localize your documents, you have to use distributed transactions, which you seem to recommend against.
If somebody reached that point with RavenDB, many shards, non local documents, what would you recommend? You'd have to change the model right? Or if possible just use map-reduces for counts and the like.
In other words, if your data gets big enough, you'll probably run against this issue anyway, Mongo or Raven?
Peter,
Note that RavenDB allows you to grow from a single server (itself able to serve a lot of data) to multiple servers. That growth means changes to your application, certainly. But I think that it is better than saying "this feature is hard to implement using shards, we won't allow it ever"
Ayende, Thanks for the response. I was genuinely curious, not trying to score a point on either side. I think you're right on, and in practice it is much better to support multi-doc transactions in those scenarios. From a purely theoretical modeling standpoint, do you think it's fair to say that if you are using a document database and have a lot of multi doc transactions, that's probably a warning sign?
Peter, That really depends. Usually we are talking about modeling documents as aggregates, but there are a lot of associations between those aggregates. In most scenarios, you probably are wrong to require multi doc transactions, because in most cases it is okay to do this without them. Most of the time it is an indication of bad aggregate boundaries, but there are good reasons to want to have multiple documents (for example, different reasons for updating something in the same aggregate mean that it is split into two documents) that you then need to modify in tandem. This is usually the case of practical reasons causing the splitting of a single aggregate.
Greatest. Comment. Ever. :D
“We tried using NoSQL, but we are moving to Relational Databases because they are easier…”
I think this article should probably make mention that there are other NoSQL databases that support transactions since it's a little misleading as-is:
http://nosql.mypopescu.com/post/6732339201/multi-document-transactions-in-ravendb-vs-other-nosql
In the NoSQL space, there are a couple of other solutions that support transactions (see the link above).
I'm not particularly well versed in either NoSQL or DDD/CQRS but would this scenario be a candidate for event sourcing?
It seems as if storing a PostAdded document could offload the state management and transactional logic to some other process. If interrupted, said process could simply pick up where it left off. Not truly transactional I know, but could remove some of the issues with original code snippet.
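Purely as a sketch of Neil's idea (the event shape and names here are invented for illustration): the web request does a single atomic insert of an event document, and a separate worker applies the writes, retrying after a crash:

    // Hypothetical event document; one atomic insert per request.
    public class CommentAddedEvent
    {
        public ObjectId Id { get; set; }
        public string PostId { get; set; }
        public string UserId { get; set; }
        public Comment Comment { get; set; }
        public bool Applied { get; set; }
    }

    // In the request: a single document write, which MongoDB does make atomic.
    database.GetCollection<CommentAddedEvent>("Events").Save(new CommentAddedEvent
    {
        PostId = postId, UserId = userId, Comment = comment, Applied = false
    });

    // In a background worker: pick up unapplied events and process them.
    // A crash midway leaves the event unapplied, so it is retried; each write
    // must therefore be safe to repeat, which is exactly the hard part.
    foreach (var pending in database.GetCollection<CommentAddedEvent>("Events")
                                    .Find(Query.EQ("Applied", false)))
    {
        ApplyCommentWrites(pending); // hypothetical helper doing the three updates
        pending.Applied = true;
        database.GetCollection<CommentAddedEvent>("Events").Save(pending);
    }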
@Neil +1 I was thinking the same exact thing... why not "event source" it and be done? And while it is not in a nice "transaction" it could still be err... transactional. It almost sounds like they were trying to get a "consistent read" out of it...
Demis, You are correct that there are some other NoSQL dbs out there that offer transactions, but most often, one of the laments against NoSQL is that there are no transactions.
Neil, The example is intentionally oversimplified, to make a point. Yes, there are better ways of doing that. As for "not truly transactional", that is a scary concept. Having transactions is like being pregnant; you can't be half & half.
Ayende,
I get the point that Raven is probably the best document db, but what do you mean by "This isn’t the best RavenDB code, ideally I wouldn’t have to create the session here"? I would like to know how I could write the same code even better / shorter.
Many thanks in advance!
Daniel, Take a look at the RaccoonBlog sample app (which also powers this blog); it is an example of what I consider to be a well designed RavenDB application. The basic idea is that you don't really need to worry about session life cycle and calling save changes in the controller.
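The gist of that pattern, sketched from memory (the actual RaccoonBlog code differs in its details): a base controller owns the session, and SaveChanges runs once per successful action, so individual actions never touch the session life cycle:

    public abstract class RavenController : Controller
    {
        public static IDocumentStore DocumentStore { get; set; }

        protected IDocumentSession RavenSession { get; private set; }

        protected override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            // One session per request; sessions are cheap to create.
            RavenSession = DocumentStore.OpenSession();
        }

        protected override void OnActionExecuted(ActionExecutedContext filterContext)
        {
            using (RavenSession)
            {
                // Still a single atomic SaveChanges per request.
                if (filterContext.Exception == null)
                    RavenSession.SaveChanges();
            }
        }
    }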
Ayende, Understood that it's a simplification of the real problem. As you say, the original snippet is ugly and brittle and I agree that trying to approximate transactions is a not an ideal solution. How then, would you propose working around the issue if changing the entire persistence mechanism is not a justifiable option?
Neil, I am not. You can't simulate transactions if the db doesn't support them. You are left with either:
- Avoid requiring transactions (which can be hard, but is possible)
- Choose a db that supports them
Thanks Ayende. Makes sense really, choose your db based on your essential requirements.
You can definitely write transaction support on top of a non-transactional data store. It just takes far too much time to be useful for most people whose product is not a transactional data store.
Chris, Well, yes, but while it is also possible to walk from Los Angeles to Chicago, you don't see people do that very often. In fact, by most people's perception, "you can't walk from Los Angeles to Chicago" is a true statement.
Hi Ayende, I usually don't post very often, but this time I must say that I am absolutely shocked that any serious software developer could produce such "horse excrement" (as you put it) in a production environment. Has the above code really been implemented in a production environment???
Marcel, Yes, this has been implemented in production. To be fair, it is a stopgap measure while they research a better alternative.
Refer to http://www.mongodb.org/display/DOCS/two-phase+commit for 10gen's suggested way to handle multi-doc transaction with MongoDB.
AJ, You are kidding, right? This still doesn't solve the problem of crashing midway; consider the case of a failure in the middle of step 2.
More to the point, this is a LOT of code, it is VERY complicated, it has tons of failure scenarios, hard to detect bugs, etc.
Sorry, the fact that you can hop on one leg from New York to Las Vegas doesn't imply that this is a viable means of transportation.
Step 2 is idempotent. A failure of it midway can be simply restarted. It only pushes if not already pushed. So a repeat of step 2 will not harm (by duplication).
All the steps are either atomic or idempotent. Crash is handled either through restart or rollback at each step (detailed in the documentation). Sure it is not pretty or easy. But for rare transactional need on a non-transactional database, one can give it a serious thought.
I am not comparing to RavenDB (or other RDBMS) true transactional feature. Just that it was not tried out well enough in MongoDB by your client.
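For readers following the recipe, its basic building block looks roughly like this (a fragment, not the full protocol): each state transition is itself one atomic document operation, which is what the retry-after-crash argument relies on:

    // Illustrative fragment of the two-phase commit recipe (legacy C# driver):
    // atomically claim a transaction document and move it from "initial" to "pending".
    var result = database.GetCollection("Transactions").FindAndModify(
        Query.EQ("state", "initial"),
        SortBy.Null,
        Update.Set("state", "pending"));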
AJ, Who is going to restart this step? Where is the transaction coordinator? Where is the information about the tx itself stored?
The documentation says 'These "repair" jobs should be run at application startup and possibly at regular interval to catch any unfinished transaction.'
Repair jobs will repair if something is in failed state. If not, no action is taken by that repair step. I will expand on it later.
AJ, Let us assume that you have just crashed in the middle of step 2. But let us also assume that you have more than a single server running. That means that you can't just "get the list of pending or applied txs", because there are other processes that are going to be actually processing them.
What it comes down to is that because MongoDB doesn't have transactions, you have to build your own distributed transaction coordinator with the basic building blocks of atomic swap. I am sorry, but I see no point at which it makes sense to do something like that for real software. I am willing to bet that most people's attempts to write a transaction manager are going to be riddled with holes for a variety of edge cases, and that is even before we include the fact that they actually recommend adding business logic to the transaction handler part.