Below, I'll describe three pathological situations that I personally encountered while working with ZooKeeper and that I feel deserve broader attention. The first concerns a general problem with distributed locks, while the other two are specific to ZooKeeper.

1. Distributed locks are not Mutexes

This is not new information — Martin Kleppmann wrote about it in 2016. Unfortunately, misconceptions about this persist — mostly due to bad documentation, and a little bit due to poor API design.

Take this recent (from Dec 2019!) GitHub issue from etcd as an example:

As a part of our Jepsen testing, we've demonstrated that etcd's locks aren't actually locks: like all "distributed locks" they cannot guarantee mutual exclusion when processes are allowed to run slow or fast, or crash, or messages are delayed, or their clocks are unstable, etc.

The fundamental problem is that in a system that uses timeouts to reclaim locks (to prevent starvation in case the lock holder permanently goes away), it is impossible to guarantee that no two lock contenders will simultaneously "think" that they have the lock. This can happen, for example, if the ZooKeeper client thread at the original lock holder is not scheduled for the entire duration of the session timeout.
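
To make the hazard concrete, here is a minimal sketch of the naive pattern. The Lock and Store interfaces are hypothetical stand-ins for whatever locking client and shared resource are actually in use; the point is only the window between acquiring the lock and writing.

```java
// Hypothetical Lock/Store pair standing in for the real coordination
// service and shared resource; only the comments matter here.
interface Lock {
    void acquire() throws Exception;
    void release() throws Exception;
}

interface Store {
    void write(String key, String value);
}

class NaiveLockUser {
    void update(Lock lock, Store store) throws Exception {
        lock.acquire();                 // contender A now believes it holds the lock
        try {
            // If A's process stalls here (GC pause, swapping, a descheduled
            // client thread) for longer than the session timeout, the service
            // reclaims the lock and contender B acquires it. A cannot tell.
            store.write("config", "A's value");   // two writers, no mutual exclusion
        } finally {
            lock.release();
        }
    }
}
```

No amount of care inside update() closes that window; only the resource itself can reject the late write, which is where fencing (below) comes in.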


An interesting thing to observe is that none of the underlying consensus protocols — ZAB, Raft, or Paxos — provide native distributed locks. However, all the coordination services built on top — ZooKeeper, etcd, and Chubby — provide locking either as a first-class API (etcd and Chubby) or as a client-side recipe (ZooKeeper). I posit that this has happened because software engineers seek familiar tools in the new, unfamiliar environment of distributed systems.

There are a couple of workarounds, though neither is perfect:

  1. Google's Chubby delays lock re-acquisition if the lock was reclaimed as a result of the holder failing:

... if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a period called the lock-delay. Clients may specify any lock-delay up to some bound, currently one minute; this limit prevents a faulty client from making a lock (and thus some resource) unavailable for an arbitrarily long time. While imperfect, the lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts.

It's worth noting that this is not possible in ZooKeeper since locking is not provided as a first-class server API, and a client will never know whether a lock was released explicitly or reclaimed.

  2. Use fencing, as suggested by Martin in his blog — fencing requires tighter integration between the locking service and whatever shared resource is being contended for (see the sketch below).
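
Here is a rough sketch of how fencing could look with ZooKeeper. The FencedStore interface and the class names are hypothetical; the only real ZooKeeper detail used is that the czxid (creation transaction id) of the lock znode grows monotonically across successive lock owners, which makes it a plausible fencing token, along the lines of the zxid/version tokens Martin's post suggests.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical interface: a store that rejects writes carrying a fencing
// token lower than the highest one it has already accepted.
interface FencedStore {
    void write(String key, byte[] value, long fencingToken) throws StaleTokenException;
}

class StaleTokenException extends Exception {}

class FencedWriter {
    void writeUnderLock(ZooKeeper zk, String myLockZnode, FencedStore store,
                        String key, byte[] value)
            throws KeeperException, InterruptedException, StaleTokenException {
        // The zxid at which the lock znode was created (czxid) only ever grows
        // from one lock owner to the next, so it can serve as a fencing token.
        Stat stat = zk.exists(myLockZnode, false);
        long token = stat.getCzxid();
        // The store, not the lock holder, enforces ordering: a write that
        // arrives late with a stale token is rejected.
        store.write(key, value, token);
    }
}
```

The crucial design point is that the ordering check lives in the shared resource rather than in the lock holder, which is exactly the "tighter integration" mentioned above.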

2. Creating ephemeral znodes

The ZooKeeper locking recipe leverages ephemeral (+ sequential) znodes for lock contention — a contender creates a znode with the sequence and ephemeral flags set. It then checks whether the created znode has the smallest sequence number among all the znodes contending for the lock. If it does, it assumes ownership of the lock. Otherwise, it places a watch on the znode with the next-lower sequence number and waits for it to go away. The ephemeral flag ensures that a contender or the lock holder is automatically removed when its session expires.
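
For concreteness, here is a simplified sketch of that recipe against the plain ZooKeeper Java API. It assumes the parent path /locks/resource already exists and deliberately omits the error handling discussed in the next section.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class LockRecipeSketch {
    private static final String LOCK_DIR = "/locks/resource";  // assumed to exist
    private final ZooKeeper zk;

    LockRecipeSketch(ZooKeeper zk) { this.zk = zk; }

    /** Blocks until the lock is held; returns the path of our contender znode. */
    String acquire() throws KeeperException, InterruptedException {
        // 1. Create an ephemeral, sequential contender znode.
        String myPath = zk.create(LOCK_DIR + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myNode = myPath.substring(LOCK_DIR.length() + 1);

        while (true) {
            // 2. List all contenders; zero-padded sequence numbers sort lexicographically.
            List<String> children = new ArrayList<>(zk.getChildren(LOCK_DIR, false));
            Collections.sort(children);

            int myIndex = children.indexOf(myNode);
            if (myIndex == 0) {
                return myPath;                    // smallest sequence number: lock is ours
            }

            // 3. Wait for the znode immediately ahead of ours to go away.
            String predecessor = LOCK_DIR + "/" + children.get(myIndex - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(predecessor, event -> gone.countDown()) == null) {
                continue;                         // it vanished already; re-check
            }
            gone.await();
        }
    }
}
```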

Failure

Unfortunately, the create() call can fail even though the creation of the znode succeeded on the ZooKeeper servers, as noted on the ZooKeeper recipes page:

Important Note About Error Handling ... When creating a sequential ephemeral node there is an error case in which the create() succeeds on the server but the server crashes before returning the name of the node to the client. When the client reconnects its session is still valid and, thus, the node is not removed. The implication is that it is difficult for the client to know if its node was created or not. The recipes below include measures to handle this.

Effect on locking

A reasonable thing to do when create() fails, if one doesn't know about the above failure mode, is to simply retry the operation. The trouble is that this can cause a deadlock, sometimes long after the failure happened. The first attempt leaves behind an orphaned znode that the client doesn't know it owns, and when that orphan eventually becomes the one with the smallest sequence number, its creator has no idea that it holds the lock; it is itself blocked, through the znode created on the retry, on other contenders in the chain, who are in turn waiting on the orphan. A sketch of the measure the recipes page alludes to follows below.
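
The measure the recipes page describes is to embed a unique identifier in the contender znode's name, so that after a connection loss the client can check whether its earlier create() actually went through before retrying. The class name here is made up, but the guid-in-the-name technique is the documented one.

```java
import java.util.UUID;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class SafeContenderCreate {
    /** Creates a contender znode, surviving the create-succeeded-but-looked-failed case. */
    static String createContender(ZooKeeper zk, String lockDir)
            throws KeeperException, InterruptedException {
        String guid = UUID.randomUUID().toString();
        String prefix = lockDir + "/" + guid + "-lock-";
        while (true) {
            try {
                return zk.create(prefix, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            } catch (KeeperException.ConnectionLossException e) {
                // The create() may or may not have gone through. Look for our
                // guid among the children before retrying, so we never leave an
                // orphaned contender behind.
                for (String child : zk.getChildren(lockDir, false)) {
                    if (child.startsWith(guid)) {
                        return lockDir + "/" + child;   // it did succeed
                    }
                }
                // Not found: the first attempt really failed; safe to retry.
            }
        }
    }
}
```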