The fallacy of distributed transactions

time to read 4 min | 785 words

This can be a very short post, just: See CAP. Unfortunately, we have a lot of people who actually have experience in using distributed transactions, and have a good reason to believe that they work. The answer is that yes, they do, as long as you don’t run into some of the interesting edge cases.

By the way, that is not a new observation, see The Two Generals.

Allow me to demonstrate. Assume that we have a distributed system with the following actors:

This is a fairly typical setup. You have a worker that pull messages from a queue and read/write to a database based on those messages. To coordinate between them, it uses a transaction coordinator such as MSDTC.

Transaction coordinators use a two phase commit (or sometimes a three phase commit protocols) to ensure that either all the data would be committed, or none of it would be.

The general logics goes like this:

For each participant in the transaction, send a prepare message.

If the participant answered “prepared”, it is guaranteeing that the transaction can be committed.

If any of the participants failed on prepare, abort the whole transaction.
If all of the participants successfully prepared, send the commit message to all of them.

The actual details are a bit more involved, obviously, but that is pretty much it.

Now, let us take a look at an interesting scenario. Worker #1 is pulling (in a distributed transaction) a message from the queue, and based on that message, it modify the database. Then it tells the transaction coordinator that it can commit the transaction. At this point, the TC is sending the prepare message to the database and the queue. Both reply that they have successfully prepared the transaction to be committed. The TC sends a commit message to the queue, completing the transaction from its point of view, however, at this point, it is unable to communicate with the database.

What happens now?

Before I answer that, let us look at another such scenario. The TC needs to commit a transaction, it sends a prepare message to the database, and receive a successful reply. However, before it manages to send a prepare message to the queue, it becomes unavailable.

Note that from the point of view of the database, the situation looks exactly the same. It got (and successfully replied) to a Prepare message, then it didn’t hear back from the transaction coordinator afterward.

Now, that is where it gets interesting. In an abstract world, we can just wait with the pending transaction until the connection with the coordinator is resumed, and we can actually get a “commit / abort” notification.

But we aren’t in abstract world. When we have such a scenario, we are actually locking records in the database (because they are in the process of being modified). What happens when another client comes to us and want to modify the same record?

For example, it is quite common for to host the business logic, queue and transaction coordinator on the same worker instance, while the database is on a separate machine. That means that in the image above, if Worker #1 isn’t available, we recover by directing all the users to the 2nd worker. However, at that point, we have a transaction that was prepared, but not committed.

When the user continue to make requests to our system, the 2nd worker, which has its own queue and transaction coordinator is going to try and access the user’s record. The same user whose record are currently locked because of the ongoing transaction.

If we just let it hang in this manner, we have essentially created a situation where the user’s data become utterly unavailable (at least for writes). In order to resolve that, transactions comes with a timeout. So after the timeout has expired, we can roll back that transaction. Of course, that leads to a very interesting situation.

Let us go back to the first scenario we explored. In this scenario, the queue got both Prepare & Commit messages, while the database got just a Prepare message. The timeout has expired, and the database has rolled back the transaction. In other words, as far as the queue is concerned, the transaction committed, and the message is gone. As far as the database is concerned, that transaction was rolled back, and never happened.

Of course, the chance that something like that can happen in one of your systems? Probably one in a million.

Tweet Share Share 14 comments

Tags:

Comments

17 Jul 2014
10:05 AM

Oğuzhan Eren

A good paper for distributed transactions: http://www.eaipatterns.com/docs/IEEE_Software_Design_2PC.pdf

17 Jul 2014
17:45 PM

Dan

According to Google all you need is a bunch of atomic clocks and GPS satellites... problem solved :D

17 Jul 2014
18:35 PM

Rohland

Interesting. My understanding is that this is a recoverable scenario. In case of a disconnect/crash during phase 2 of the two phase commit process, the relevant system (assume database) would be in a pending commit state. After recovering, it could query the MSDTC and ask what happened at which point the result, commit or rollback, would be communicated.

See the answer (and comments) here: http://stackoverflow.com/questions/55878/how-do-distributed-transactions-work-eg-msdtc

Any thoughts on this?

17 Jul 2014
21:46 PM

Jimmy Zimmerman

Actually the statement that the message on the queue is gone isn't always true. That's actually an implementation issue on the underlying transactional resource. For instance, this is where things like transaction logs actually come into play.

Of course the moral of the story you outline is Distributed Transactions don't automagically make it all right; You've got to understand your environment you're enlisting in. As I tell my engineers: your 2 Phase Commit is only as strong as your weakest participant.

17 Jul 2014
23:11 PM

Jimy Zimmerman

Err I should say the data isn't necessarily gone. The message is however.

18 Jul 2014
05:59 AM

Ayende Rahien

Rohland, Sure, that assume that the DTC is up. What if it isn't? What happens from the point of view of the database at that point?

That is the whole point of this article.

18 Jul 2014
06:01 AM

Ayende Rahien

Jimmy, The DTC protocol doesn't have a way of saying: "remember commit T92131, we actually need to revert it". That means that to restore the message, you have to manually use the queue tools to recover the data. As far as the queue is concerned, everything works, and we are done.

Also, your link returns 404.

18 Jul 2014
07:08 AM

Rohland

It's been a while since I've looked into this, but in the case of a distributed transaction, the MSDTC service on all participating servers will be involved. If the DTC on a participating server is down, then we know the transaction will fail outright as phase 1 will fail (presumably with a timeout). If the DTC goes down during phase 2, then I think there are mechanisms to deal with this.

If the server crashes then so does the database service since the participating DTC will be on the same server - the initiating DTC is still online. Upon restart, the database and DTC could collaborate to ascertain what the result was (i.e. Communicate with initiating DTC).

If the DTC service hangs/crashes but database remains online, I'm not convinced that the database simply rolls back after a timeout as you suggest. See this article here: http://www.eraofdata.com/orphaned-msdtc-transactions-2-spids/

I only have a peripheral understanding of how it all hangs together, however I would think that there are various other mechanisms (such as retries) built in that, in addition to the above, drastically reduce the odds of inconsistencies in a distributed transaction.

18 Jul 2014
07:22 AM

Ayende Rahien

Rohland, Not really, no. That assumes that you can have MSDTC on all servers. What if it is an Oracle db running on Linux? Or a RavenDB running on the cloud? In both cases, a proxy has to be used.

And the links you pointed to are exactly the problem that I'm talking about. You have to go and manually resolve those issues, because as a result of this issue, you have permanently locked transactions.

18 Jul 2014
07:41 AM

Rohland

Yes, correct on both counts. I am aware that the MSDTC only works for a small subset of technologies (SQL Server, MSMQ etc).

I just wanted to make the points that (1) orphaned transactions seem very unlikely to occur as MSDTC seems to support mechanisms to attempt to resolve and finalize a transaction after crash/network partition; (2) the database will not simply rollback a transaction during phase2 if it hasn't received the final go/no-go - this is something that needs to be resolved manually in rare cases.

I hope this further clarifies the points you were raising in this post. For some time I've searched for thorough documentation of exactly how the MSDTC works to clarify these things. If anyone has such a reference, please post it!

18 Jul 2014
07:46 AM

Ayende Rahien

Rohland, If the database doesn't roll back the transaction, you run the assured risk that you are going to effectively take the database down, because you have locked records that will never be freed.

18 Jul 2014
07:47 AM

Ayende Rahien

Rohland, Assume that we have an orphaned transaction that locked the user's record. And assume that we want to update the last login date for that user. Under this scenario, the entire system is unavailable for this user.

18 Jul 2014
07:47 AM

Ayende Rahien

Rohland, Also, documentation about how MSDTC actually work are pretty much non existant.

18 Jul 2014
17:11 PM

Jimmy Zimmerman

Oren,

Yes you're totally correct. This is a manual process (or better put, not an automagial process). I was more noting that in the above example many transactional stores have journaling concepts baked into them that can be used as part of a rectification process for a data critical consistency processes. Important (and used) in environments like trading systems.

However you're correct to point out that a TM cannot guarantee consistency in all circumstances. Just in a broad enough set of them to be widely useful.

-PS stay safe

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB