Ayende @ Rahien

Unnatural acts on source code

Break the algorithm: Distributed Lock

The scenario for this is to create a locking mechanism in a Distributed Hash Table where nodes are allowed to fail without taking the entire DHT down.

Now, don’t expect too much out of this, I thought this out at 2 AM or so, and just sat down to hurriedly write it before it escape my mind.

The environment in which it runs is a DHT, where a key may reside on several nodes (usually 1 or 3).  Taking a look means placing a lock item in over half of the nodes. Lock expires after a set amount of time (because we can’t trust the client to clear them). We assume a system that share a clock (or synchronize clocks).

The annoying thing is that we need to recover from situations in which some of the nodes holding the key are down or inaccessible.

Here is the pseudo code:

def LockKey(key, recursionDepth) as bool:
topology = dht.GetTopologyFor(key)
successfulLocks = 0
lockExpiry = DateTime.Now.AddMinutes(1)
lockKey = key+"_lock"
for server in topology:
try:
server.WriteIfDoesNotExistsOrSameServer(lockKey, currentServerName, lockExpiry)
successfulLocks += 1
except ServerDown:
ignore error
except KeyAlreadyExists:
if ScavengeExpiredLocks(lockKey):
return LockKey(key, recursionDepth+1) if recursionDepth < 3
return false

return (successfulLocks/2) >= (topology.Count/2) //at least half the servers have the lock



def ScavengeExpiredLocks(key):
topology = dht.GetTopologyFor(key)
for server in topology:
try:
val = server.ReadKey(lockKey)
if HasExpired(val):
server.RemoveKey(lock, val.Version)
else:
return false
except ServerDown:
ignore error
except KeyVersionChanged:
return false

return true

def ClearLock(key):
topology = dht.GetTopologyFor(key)
for server in topology:
try:
val = server.ReadKey(lockKey)
if BelongsToCurrentServer(val):
server.RemoveKey(lock, val.Version)
except ServerDown:
ignore error
except KeyVersionChanged:
ignore error

So, how many critical bugs do I have here?

Comments

Jason
09/05/2009 02:23 PM by
Jason

Don't at least n/2+1 servers need to return success in order for the lock to be considered as 'entered'? Otherwise two nodes could enter the lock, each with half the nodes.

This is sort of like the Google Chubby protocol. They use a lock revision # that is incremented 'atomically' across nodes to ensure that two nodes can't lock on the same key.

Ayende Rahien
09/05/2009 03:42 PM by
Ayende Rahien

What do you mean by automatically incrementing the lock revision?

Can you explain?

Ayende Rahien
09/05/2009 03:42 PM by
Ayende Rahien

Jason,

You are probably right.

Jason
09/05/2009 03:52 PM by
Jason

As I understand it, the Chubby service associates a number that is incremented for locking requests. It's a variant of the Paxos algorithm and the number is needed to form a consensus on who owns the lock.

That may be overkill for this use, but I think the idea is if you need to change something in the DHT from one state to the next, this gives you some 'consensus' of the 'start state' before you begin.

Uriel Katz
09/05/2009 05:17 PM by
Uriel Katz

what interest me is that your pseudo is in Boo/Python(or really close to it) :)

Justin Chase
09/05/2009 06:09 PM by
Justin Chase

That looks more like boo than psuedo code :-P

Ayende Rahien
09/05/2009 09:59 PM by
Ayende Rahien

Uriel,

I love Boo for its clean syntax.

It is almost pseudo code

Alex Yakunin
09/06/2009 11:13 AM by
Alex Yakunin

Oren, I'd recommend you to read about Chubby & distributed consensus algorithms (Paxos, etc.). You'll see principal issues, rather than just technical.

Btw, we evaluated DHT approach for distributed storage for DO databases. And finally decided this approach won't work for storages we typically need: index seek can't beimplemented there well, but this is essential in quite many cases.

Note that e.g. BigTable is not DHT.

Alex Yakunin
09/06/2009 03:49 PM by
Alex Yakunin

P.S. For me the worst issues here are:

  • Operations aren't atomic. Its completely unclear what guarantees are you going to provide after their completion.

-There must be issues related to difference in time

  • Its completely unclear what will happen when new server wakes up after temporary failure (e.g. network outage).

  • It is unclear how they're classified as down / working. What invariants are guaranteed to be maintained?

P.S. One more good article to read is Microsoft Boxwood project description.

Alex Yakunin
09/06/2009 03:54 PM by
Alex Yakunin

Don't at least n/2+1 servers need to return success in order for the lock to be considered as 'entered'?

As far as I can judge from above, there is no code related to distributed consensus. So no any "global" state guarantees. Thus it's difficult to judge if this will work at all: no one can predict how such a system will work after failure, because state of recovered node initially can be completely unexpected.

Alex Yakunin
09/06/2009 03:57 PM by
Alex Yakunin

The annoying thing is that we need to recover from situations in which some of the nodes holding the key are down or inaccessible.

Ah, I see... That's the most complex problem. Check out the links ;)

Howard Pinsley
09/08/2009 12:51 PM by
Howard Pinsley

Ayende:

I'm confused by this line:

return (successfulLocks/2) >= (topology.Count/2)

Why is it not

return successfulLocks >= (topology.Count/2)

Are you somehow trying to deal with and even number of servers?

Ayende Rahien
09/08/2009 02:04 PM by
Ayende Rahien

Howard,

I would like to deal with even number of servers, yes.

But you are right, your code is simpler, much simpler.

Howard Pinsley
09/08/2009 11:09 PM by
Howard Pinsley

Actually, now that I think about it, since you want a majority for quorum, it probably should be:

return successfulLocks >= (topology.Count / 2) + 1

Comments have been closed on this topic.