Failure detectors are key components of any distributed system. I put together the following summary of failure detector research while looking into building failure detection for the Libra network. If anything is missing or misrepresented, please let me know and I will be happy to correct it.
Papers:
- Unreliable Failure Detectors for Reliable Distributed Systems [1996]
- A Gossip-Style Failure Detection Service [1998]
- Introduces heartbeat counters disseminated via gossip: each member keeps a per-member heartbeat counter and last-update time, merges gossiped lists by adopting higher counters, and suspects members whose counters stop increasing within a timeout (minimal sketch below)
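A minimal sketch of that scheme; the class name (`GossipFailureDetector`) and the timeout value (`T_FAIL`) are my own choices, not the paper's:

```python
import time

T_FAIL = 10.0  # seconds without a counter increase before suspecting; illustrative

class GossipFailureDetector:
    """Toy version of the [1998] gossip-style heartbeat table."""

    def __init__(self, self_id):
        self.self_id = self_id
        # member id -> (heartbeat counter, local time the counter last grew)
        self.table = {self_id: (0, time.monotonic())}

    def tick(self):
        """Bump our own heartbeat before gossiping the table to a random peer."""
        hb, _ = self.table[self.self_id]
        self.table[self.self_id] = (hb + 1, time.monotonic())

    def merge(self, gossiped_table):
        """Adopt any strictly higher counters and refresh their timestamps."""
        now = time.monotonic()
        for member, (hb, _) in gossiped_table.items():
            local_hb, _ = self.table.get(member, (-1, now))
            if hb > local_hb:
                self.table[member] = (hb, now)

    def suspects(self):
        """Members whose counters have not grown for T_FAIL seconds."""
        now = time.monotonic()
        return [m for m, (_, ts) in self.table.items()
                if m != self.self_id and now - ts > T_FAIL]
```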
- On the Quality of Service of Failure Detectors [2000]
- Provides 3 key metrics (as random variables) for measuring the QoS of a failure detector:
- Detection time: time from crash to permanent suspicion
- Mistake recurrence time: time between two consecutive mistakes in a period with no failure
- Mistake duration: time it takes for the FD to correct a mistake (toy example below)
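To make the last two metrics concrete, a toy extraction of mistake-duration and mistake-recurrence samples from a suspect/trust trace; the function, trace, and names are all mine:

```python
def mistake_samples(transitions):
    """Extract samples of mistake duration and mistake recurrence time from
    a failure-free trace. `transitions` is a list of (timestamp,
    now_suspecting) pairs that strictly alternates, starting with a
    wrongful suspicion (False -> True)."""
    starts = [t for t, suspecting in transitions if suspecting]
    ends = [t for t, suspecting in transitions if not suspecting]
    durations = [e - s for s, e in zip(starts, ends)]
    recurrences = [b - a for a, b in zip(starts, starts[1:])]
    return durations, recurrences

# Two false suspicions during a period with no actual failure:
trace = [(5.0, True), (5.4, False), (30.0, True), (30.2, False)]
durations, recurrences = mistake_samples(trace)
print([round(d, 3) for d in durations])    # [0.4, 0.2] -> mistake durations
print([round(r, 3) for r in recurrences])  # [25.0]     -> mistake recurrence
```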
- An Adaptive Failure Detection Protocol [2001]
- Model: assumes reliable communication channels; more specifically, messages are never dropped
- Introduces the idea of lazy failure detectors, which rely on application messages to track health and send control messages only when no application messages are pending
- The detector tracks the maximum observed message round-trip time (RTT) and suspects a node when its oldest pending message has been unacknowledged for longer than that maximum (see the sketch below)
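A minimal sketch of that rule; the class and method names are mine, and ack matching is reduced to message ids:

```python
import time

class LazyFailureDetector:
    """Toy lazy detector: piggyback on application traffic and suspect when
    the oldest pending message has waited longer than the largest RTT seen
    so far."""

    def __init__(self):
        self.pending = {}    # message id -> send time
        self.max_rtt = 0.0   # largest acknowledged round-trip time observed
        self._next_id = 0

    def on_send(self):
        """Record an outgoing application message; returns its id."""
        self._next_id += 1
        self.pending[self._next_id] = time.monotonic()
        return self._next_id

    def on_ack(self, msg_id):
        """An ack arrived: update the maximum observed RTT."""
        sent = self.pending.pop(msg_id, None)
        if sent is not None:
            self.max_rtt = max(self.max_rtt, time.monotonic() - sent)

    def suspected(self):
        """Suspect once the oldest unacknowledged message is older than
        max_rtt. If nothing is pending, the protocol would send a control
        message instead of suspecting."""
        if not self.pending:
            return False
        oldest = min(self.pending.values())
        return time.monotonic() - oldest > self.max_rtt
```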
- SWIM [2002]
- Improvement over the [1998] paper on gossip-style failure detection
- Non-adaptive, but uses indirect probes (ping-req relayed through k randomly chosen peers) and a suspicion/rebuttal scheme to reduce false positives (compressed sketch below)
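A compressed sketch of one SWIM protocol period under those rules; the callback names and `K_INDIRECT` are placeholders of mine, not the paper's notation:

```python
import random

K_INDIRECT = 3  # number of helpers for indirect probes; illustrative

def swim_probe_round(target, members, direct_ping, indirect_ping, suspect):
    """One SWIM protocol period, heavily compressed. The three callbacks
    are placeholders: direct_ping(node) and indirect_ping(helper, node)
    return True iff an ack came back in time; suspect(node) starts the
    suspicion (not death!) timer, giving the target a chance to rebut."""
    if direct_ping(target):
        return  # healthy
    helpers = random.sample([m for m in members if m != target],
                            min(K_INDIRECT, max(len(members) - 1, 0)))
    if any(indirect_ping(h, target) for h in helpers):
        return  # reachable via someone else; only our path was bad
    suspect(target)  # gossip the suspicion; the target can still refute it
```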
- The Phi Accrual Failure Detector [2004]
- Provides a continuous suspicion value (phi) rather than a Boolean healthy/unhealthy verdict
- Uses heartbeats (push) rather than health probes (pull)
- Assumes a normal distribution for heartbeat inter-arrival times (sketch below)
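A small sketch of the accrual computation under the normal assumption: phi = -log10(P_later(Δt)), where P_later(Δt) is the probability that the next heartbeat arrives more than Δt after the last one. The window size and clamping constants are my choices:

```python
import math
import statistics

class PhiAccrualDetector:
    """Toy phi accrual detector with a normal model of inter-arrival times."""

    def __init__(self, window=1000):
        self.window = window
        self.intervals = []
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-self.window:]
        self.last_arrival = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.fmean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-6)  # guard against 0
        z = ((now - self.last_arrival) - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - normal CDF
        return -math.log10(max(p_later, 1e-300))     # avoid log10(0)
```

Callers then suspect once phi crosses an application-chosen threshold; Cassandra's phi_convict_threshold, for instance, defaults to 8.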
- A New Adaptive Accrual Failure Detector for Dependable Distributed Systems [2007]
- Detecting failures in distributed systems with the FALCON spy network [2011]
- Reliable failure detector: it guarantees reliable reports by killing a suspected layer before declaring it down
- Not suitable for decentralized systems; targeted at data-center/enterprise environments
- Currently deployed at 4 layers: application, OS, VMM, and network switch. The spy at each layer monitors that layer and the spy at the layer above. A client library gathers input from all the spies to give the client an UP/DOWN status.
- Fireflies [2015]
- Models probing as a negative binomial experiment with parameter r = 1 (i.e., geometric: count probes until the first success)
- Also derives the probability of a probe succeeding from the measured packet-loss rate, and uses a simple exponential smoothing model to estimate the future loss rate; refer to the paper for details (a sketch of the smoothing follows this entry)
- The Byzantine fault tolerance of Fireflies sits above the failure detector layer
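A sketch of the loss-rate smoothing half of this; the smoothing weight and the assumption of independent per-direction loss are mine, not the paper's:

```python
ALPHA = 0.1  # exponential smoothing weight; illustrative

class LossRateEstimator:
    """Exponentially smoothed packet-loss rate, used to predict the chance
    that the next probe round trip succeeds."""

    def __init__(self):
        self.loss = 0.0

    def observe_round(self, sent, acked):
        sample = 1.0 - acked / sent if sent else 0.0
        self.loss = ALPHA * sample + (1.0 - ALPHA) * self.loss

    def p_probe_success(self):
        # Assuming independent loss in each direction, the ping and the ack
        # must both get through:
        return (1.0 - self.loss) ** 2

est = LossRateEstimator()
est.observe_round(sent=100, acked=95)
p = est.p_probe_success()
print(p, 1.0 / p)  # success probability; expected probes until the first
                   # success under the r = 1 negative binomial model
```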
- A Weibull distribution accrual failure detector for cloud computing [2017]
- Proposes the Weibull distribution as a better fit for heartbeat inter-arrival times than the normal distribution. Nothing else! (sketch below)
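The substitution is mechanical: swap the normal survival function in the phi computation for the Weibull one. Parameter fitting, which the paper covers, is omitted here, and the example values are mine:

```python
import math

def weibull_phi(elapsed, shape_k, scale_lmbda):
    """Suspicion level under a Weibull model of heartbeat inter-arrivals:
    P(next heartbeat later than `elapsed`) = exp(-(elapsed/lambda)^k),
    so phi = -log10 of that survival probability."""
    p_later = math.exp(-((elapsed / scale_lmbda) ** shape_k))
    return -math.log10(max(p_later, 1e-300))

# With k = 1 this degrades to an exponential model; k > 1 means the hazard
# rate grows with elapsed time, which suits regular heartbeat traffic.
print(weibull_phi(elapsed=3.0, shape_k=2.0, scale_lmbda=1.0))  # ≈ 3.9
```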
- Lifeguard [2018] (video: https://www.hashicorp.com/resources/failure-detection-in-the-era-of-gray-failures)
- Adds local health awareness to the probe interval, probe timeout, and suspicion timeout parameters in SWIM
- Probe interval and timeout are adjusted according to the number of pending pings, suspicion messages against self, missing nack messages from indirect ping requests, and successful probes (see the sketch after this entry)
- Suspicion timeout is reduced as suspicions about the same remote peer arrive from other peers
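A sketch of the Local Health Multiplier bookkeeping behind this; the saturation cap, base timings, and linear (score + 1) scaling are illustrative simplifications:

```python
LHM_MAX = 8  # saturation cap on the multiplier; the exact cap is a tunable

class LocalHealth:
    """Toy Local Health Multiplier: a saturating counter that stretches the
    probe interval and timeout when the local node itself looks slow."""

    def __init__(self, base_interval=1.0, base_timeout=0.5):
        self.score = 0
        self.base_interval = base_interval
        self.base_timeout = base_timeout

    def on_probe_success(self):
        self.score = max(self.score - 1, 0)

    def on_failed_probe(self):    # our probe got no ack
        self._bump()

    def on_suspected_self(self):  # someone suspected us
        self._bump()

    def on_missed_nack(self):     # an indirect ping request produced no nack
        self._bump()

    def _bump(self):
        self.score = min(self.score + 1, LHM_MAX)

    def probe_interval(self):
        return self.base_interval * (self.score + 1)

    def probe_timeout(self):
        return self.base_timeout * (self.score + 1)
```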
- Capturing and Enhancing In Situ System Observability for Failure Detection [OSDI 2018]
- Also targeted at detecting gray failures rather than just crash failures
- Uses information at network endpoints about communication success/failure to build a model of the health of remote peers
- No separate heartbeat/probing thread running (toy sketch below)
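A toy version of that endpoint-observation idea; the store, the names, and the crude majority-vote verdict are mine, not the paper's aggregation logic:

```python
from collections import defaultdict, deque

WINDOW = 20  # observations kept per (observer, subject) pair; illustrative

class ObservationStore:
    """Components report the outcome of their real interactions with peers,
    and a health verdict is derived from those observations; there is no
    dedicated prober anywhere."""

    def __init__(self):
        self.obs = defaultdict(lambda: deque(maxlen=WINDOW))

    def report(self, observer, subject, ok):
        """Called at network endpoints after each real request."""
        self.obs[(observer, subject)].append(ok)

    def verdict(self, subject):
        outcomes = [ok for (_, subj), results in self.obs.items()
                    if subj == subject for ok in results]
        if not outcomes:
            return "UNKNOWN"
        failure_rate = outcomes.count(False) / len(outcomes)
        return "UNHEALTHY" if failure_rate > 0.5 else "HEALTHY"
```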
Other references:
- Netflix Hystrix
- Opens a circuit when the error rate over a rolling window crosses a threshold (given a minimum request volume), half-opens it after a sleep window, and closes it again on a successful trial request (sketch below)
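A sketch of that state machine; the knobs loosely mirror Hystrix's errorThresholdPercentage, requestVolumeThreshold, and sleepWindow settings, but the rolling-window bookkeeping is simplified to plain counters and the defaults are mine:

```python
import time

class CircuitBreaker:
    """Toy Hystrix-style circuit breaker: CLOSED -> OPEN -> HALF_OPEN."""

    def __init__(self, error_threshold=0.5, volume_threshold=20,
                 sleep_window=5.0):
        self.error_threshold = error_threshold
        self.volume_threshold = volume_threshold
        self.sleep_window = sleep_window
        self.successes = self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.sleep_window:
                self.state = "HALF_OPEN"  # let one trial request through
                return True
            return False
        return True

    def record(self, ok):
        if self.state == "HALF_OPEN":
            # The trial request decides: close on success, re-open on failure.
            self.state = "CLOSED" if ok else "OPEN"
            self.opened_at = time.monotonic()
            self.successes = self.failures = 0
            return
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        total = self.successes + self.failures
        if (total >= self.volume_threshold
                and self.failures / total >= self.error_threshold):
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```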
- Service Fabric [2018]
- The novel idea in Service Fabric is to use lightweight arbitrator groups to keep membership consistent within a ring neighborhood [this looks similar to Rapid]
- Rapid [2018]
- Uses a pluggable failure detection module (illustrative interface below)
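What a plug point like that can look like, as a hypothetical interface; the names are mine, not Rapid's actual API:

```python
from abc import ABC, abstractmethod
from typing import Callable

class EdgeFailureDetector(ABC):
    """Illustrative plug-in interface in the spirit of Rapid's pluggable
    edge monitoring."""

    @abstractmethod
    def monitor(self, subject: str, on_failure: Callable[[str], None]) -> None:
        """Start monitoring `subject` and invoke on_failure(subject) once it
        is deemed faulty. Any scheme above (SWIM-style probing, phi accrual,
        ...) can sit behind this interface."""
```

The value of such a plug point is that membership decisions stay consistent regardless of which detection scheme is wired in behind it.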