Patterns for Distributed Hash TablesCheap Cross Item Transactions

time to read 5 min | 846 words

And yet another post in my series on using Distributed Hash Tables (DHT).

The previous ones are:

    And in this post, I want to discuss how we can implement cross item transactions. This is a bit more complex than it sounds, because most of the DHT out there will not allow you to have any atomic operation on more than a single item. This makes sense, since DHT is well... distributed, trying to coordinate a transaction in a DHT would require a distributed transaction, something that has a very high performance penalty.

    Again, I'll say upfront that if you need transactions in a DHT, you are probably doing something wrong. A DHT is based on a key / value lookup, trying to turn that into something that resemble a RDBMS will cause problems. Use this approach with care.

    Let us consider the following case, we have Customer, Order and the association between customer and the orders. We have decided to implement this in the following fashion:

    • Customer #1 - customer data
    • Customer #1 Orders - associated orders ids
    • Order #14 - order id

    I am repeating again that a probable better way would be to aggregate all the customer information (including its orders) in the customer data, instead of spreading that around on multiple keys, but I have to use some example here, and I choose that one.

    We want to be able to add an order to the customer in a single transaction, that is, with all the usual ACID properties. Let us say that adding an order will modify customer and the associated orders as well, so we can't have just update the associated key and end with that, we have to have a smarter approach, because we are updating two keys, which must be in sync.

    One aspect of the propose solution is that we have to think ahead and design what our transaction boundary will be. In effect, this is very similar to the way we draw transaction boundaries in DDD, although in this case this is actually mandatory.

    The proposed solution is quite simple, instead of putting the data in the cache as described above, we can put it using the following approach:

    • {Customer #1 Group} Customer #1
    • {Customer #1 Group} Customer #1 Orders
    • {Customer #1 Group} Order #14

    The important bit is the use of an item group, because we have this, we can now perform an update using the following approach:

    LOCK "Customer #1 Group"
    customerGroup = GET "Customer #1 Group"
    customer, associated_orders = GET customerGroup + " Customer #1", customerGroup + " Customer #1 Orders"
    // Do business logic to update customer & orders
    customerGroup = Guid.NewGuid()
    PUT customerGroup + " Customer #1", customer
    PUT customerGroup + " Customer #1 Orders", associated
    PUT "Customer #1 Group", customerGroup
    UNLOCK "Customer #1 Group"
    // optional: delete previous version, or just let it expire

    The logic here is simple, we lock on the group key, thus preventing anyone else from updating on it. Then we get the group key and the current state of the items in the group. We update them, generate a new group key and put the new items there, finally, we update the customer group and unlock it.

    Let us look at it from the ACID properties that we manage to get:

    • Atomicity - until we update the group key, which is the final stage, all of our changes are private, and not visible to the outside world. In fact, if another "transaction" has started reading while this was running, it will get the previous version of the value, while running concurrently with this one, instead of blocking.
    • Consistency - see atomcity, since updating the group key is the last step, we always publish a consistent state.
    • Isolation - see atomicity, no one can access the new values until we publish the group key
    • Durability - this depends on the guarantees that the DHT makes

    This is, I think, a very elegant solution. I would caution you again from adopting it across the board, use a DHT as it is meant to be used, not try to force it to be a RDBMS.

    More posts in "Patterns for Distributed Hash Tables" series:

    1. (09 Aug 2008) Range Queries
    2. (09 Aug 2008) Lookup by property
    3. (09 Aug 2008) Cheap Cross Item Transactions
    4. (09 Aug 2008) Item Groups
    5. (08 Aug 2008) Locality