Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969

, @ Q c

Posts: 6,124 | Comments: 45,483

filter by tags archive

The fallacy of distributed transactions

time to read 4 min | 785 words

This can be a very short post, just: See CAP. Unfortunately, we have a lot of people who actually have experience in using distributed transactions, and have a good reason to believe that they work. The answer is that yes, they do, as long as you don’t run into some of the interesting edge cases.

By the way, that is not a new observation, see The Two Generals.

Allow me to demonstrate. Assume that we have a distributed system with the following actors:


This is a fairly typical setup. You have a worker that pull messages from a queue and read/write to a database based on those messages. To coordinate between them, it uses a transaction coordinator such as MSDTC.

Transaction coordinators use a two phase commit (or sometimes a three phase commit protocols) to ensure that either all the data would be committed, or none of it would be.

The general logics goes like this:

  • For each participant in the transaction, send a prepare message.
    • If the participant answered “prepared”, it is guaranteeing that the transaction can be committed.
  • If any of the participants failed on prepare, abort the whole transaction.
  • If all of the participants successfully prepared, send the commit message to all of them.

The actual details are a bit more involved, obviously, but that is pretty much it.

Now, let us take a look at an interesting scenario. Worker #1 is pulling (in a distributed transaction) a message from the queue, and based on that message, it modify the database. Then it tells the transaction coordinator that it can commit the transaction. At this point, the TC is sending the prepare message to the database and the queue. Both reply that they have successfully prepared the transaction to be committed. The TC sends a commit message to the queue, completing the transaction from its point of view, however, at this point, it is unable to communicate with the database.

What happens now?

Before I answer that, let us look at another such scenario. The TC needs to commit a transaction, it sends a prepare message to the database, and receive a successful reply. However, before it manages to send a prepare message to the queue, it becomes unavailable.

Note that from the point of view of the database, the situation looks exactly the same. It got (and successfully replied) to a Prepare message, then it didn’t hear back from the transaction coordinator afterward.

Now, that is where it gets interesting. In an abstract world, we can just wait with the pending transaction until the connection with the coordinator is resumed, and we can actually get a “commit / abort” notification.

But we aren’t in abstract world. When we have such a scenario, we are actually locking records in the database (because they are in the process of being modified). What happens when another client comes to us and want to modify the same record?

For example, it is quite common for to host the business logic, queue and transaction coordinator on the same worker instance, while the database is on a separate machine. That means that in the image above, if Worker #1 isn’t available, we recover by directing all the users to the 2nd worker. However, at that point, we have a transaction that was prepared, but not committed.

When the user continue to make requests to our system, the 2nd worker, which has its own queue and transaction coordinator is going to try and access the user’s record. The same user whose record are currently locked because of the ongoing transaction.

If we just let it hang in this manner, we have essentially created a situation where the user’s data become utterly unavailable (at least for writes). In order to resolve that, transactions comes with a timeout. So after the timeout has expired, we can roll back that transaction. Of course, that leads to a very interesting situation.

Let us go back to the first scenario we explored. In this scenario, the queue got both Prepare & Commit messages, while the database got just a Prepare message. The timeout has expired, and the database has rolled back the transaction.  In other words, as far as the queue is concerned, the transaction committed, and the message is gone. As far as the database is concerned, that transaction was rolled back, and never happened.

Of course, the chance that something like that can happen in one of your systems? Probably one in a million.


Oğuzhan Eren

A good paper for distributed transactions: http://www.eaipatterns.com/docs/IEEESoftwareDesign_2PC.pdf


According to Google all you need is a bunch of atomic clocks and GPS satellites... problem solved :D


Interesting. My understanding is that this is a recoverable scenario. In case of a disconnect/crash during phase 2 of the two phase commit process, the relevant system (assume database) would be in a pending commit state. After recovering, it could query the MSDTC and ask what happened at which point the result, commit or rollback, would be communicated.

See the answer (and comments) here: http://stackoverflow.com/questions/55878/how-do-distributed-transactions-work-eg-msdtc

Any thoughts on this?

Jimmy Zimmerman

Actually the statement that the message on the queue is gone isn't always true. That's actually an implementation issue on the underlying transactional resource. For instance, this is where things like transaction logs actually come into play.

Of course the moral of the story you outline is Distributed Transactions don't automagically make it all right; You've got to understand your environment you're enlisting in. As I tell my engineers: your 2 Phase Commit is only as strong as your weakest participant.

Jimy Zimmerman

Err I should say the data isn't necessarily gone. The message is however.

Ayende Rahien

Rohland, Sure, that assume that the DTC is up. What if it isn't? What happens from the point of view of the database at that point?

That is the whole point of this article.

Ayende Rahien

Jimmy, The DTC protocol doesn't have a way of saying: "remember commit T92131, we actually need to revert it". That means that to restore the message, you have to manually use the queue tools to recover the data. As far as the queue is concerned, everything works, and we are done.

Also, your link returns 404.


It's been a while since I've looked into this, but in the case of a distributed transaction, the MSDTC service on all participating servers will be involved. If the DTC on a participating server is down, then we know the transaction will fail outright as phase 1 will fail (presumably with a timeout). If the DTC goes down during phase 2, then I think there are mechanisms to deal with this.

If the server crashes then so does the database service since the participating DTC will be on the same server - the initiating DTC is still online. Upon restart, the database and DTC could collaborate to ascertain what the result was (i.e. Communicate with initiating DTC).

If the DTC service hangs/crashes but database remains online, I'm not convinced that the database simply rolls back after a timeout as you suggest. See this article here: http://www.eraofdata.com/orphaned-msdtc-transactions-2-spids/

I only have a peripheral understanding of how it all hangs together, however I would think that there are various other mechanisms (such as retries) built in that, in addition to the above, drastically reduce the odds of inconsistencies in a distributed transaction.

Ayende Rahien

Rohland, Not really, no. That assumes that you can have MSDTC on all servers. What if it is an Oracle db running on Linux? Or a RavenDB running on the cloud? In both cases, a proxy has to be used.

And the links you pointed to are exactly the problem that I'm talking about. You have to go and manually resolve those issues, because as a result of this issue, you have permanently locked transactions.


Yes, correct on both counts. I am aware that the MSDTC only works for a small subset of technologies (SQL Server, MSMQ etc).

I just wanted to make the points that (1) orphaned transactions seem very unlikely to occur as MSDTC seems to support mechanisms to attempt to resolve and finalize a transaction after crash/network partition; (2) the database will not simply rollback a transaction during phase2 if it hasn't received the final go/no-go - this is something that needs to be resolved manually in rare cases.

I hope this further clarifies the points you were raising in this post. For some time I've searched for thorough documentation of exactly how the MSDTC works to clarify these things. If anyone has such a reference, please post it!

Ayende Rahien

Rohland, If the database doesn't roll back the transaction, you run the assured risk that you are going to effectively take the database down, because you have locked records that will never be freed.

Ayende Rahien

Rohland, Assume that we have an orphaned transaction that locked the user's record. And assume that we want to update the last login date for that user. Under this scenario, the entire system is unavailable for this user.

Ayende Rahien

Rohland, Also, documentation about how MSDTC actually work are pretty much non existant.

Jimmy Zimmerman


Yes you're totally correct. This is a manual process (or better put, not an automagial process). I was more noting that in the above example many transactional stores have journaling concepts baked into them that can be used as part of a rectification process for a data critical consistency processes. Important (and used) in environments like trading systems.

However you're correct to point out that a TM cannot guarantee consistency in all circumstances. Just in a broad enough set of them to be widely useful.

-PS stay safe

Comment preview

Comments have been closed on this topic.


  1. RavenDB 3.5 whirl wind tour: I’ll find who is taking my I/O bandwidth and they SHALL pay - 5 hours from now
  2. The design of RavenDB 4.0: Physically segregating collections - about one day from now
  3. RavenDB 3.5 Whirlwind tour: I need to be free to explore my data - 2 days from now
  4. RavenDB 3.5 whirl wind tour: I'll have the 3+1 goodies to go, please - 5 days from now
  5. The design of RavenDB 4.0: Voron has a one track mind - 6 days from now

And 12 more posts are pending...

There are posts all the way to May 30, 2016


  1. RavenDB 3.5 whirl wind tour (14):
    02 May 2016 - You want all the data, you can’t handle all the data
  2. The design of RavenDB 4.0 (13):
    03 May 2016 - Making Lucene reliable
  3. Tasks for the new comer (2):
    15 Apr 2016 - Quartz.NET with RavenDB
  4. Code through the looking glass (5):
    18 Mar 2016 - And a linear search to rule them
  5. Find the bug (8):
    29 Feb 2016 - When you can't rely on your own identity
View all series


Main feed Feed Stats
Comments feed   Comments Feed Stats