These are my notes after learning the Paxos algorithm. The primary goal here is to sharpen my own understanding of the algorithm, but maybe someone will find this explanation of Paxos useful! This post assumes fluency with mathematical notation.
I must confess it took me a long time to understand distributed consensus. I’ve read a whole bunch of papers (Part Time Parliament, Paxos Made Simple, Practical BFT, In Search of an Understandable Consensus Algorithm, CASPaxos: Replicated State Machines without logs), but they didn’t make sense. Or rather, nothing specific was unclear, but, at the same time, I was unable to answer the core question:
What breaks if this particular condition is removed?
That means that I didn’t actually understand the algorithm.
What finally made the whole thing click are
I now think that the thing is actually much simpler than it is made to believe :-)
Buckle in, we are starting!
Paxos is an algorithm for implementing distributed consensus. Suppose you have N machines which communicate over a faulty network. The network may delay, reorder, and loose messages (it can not corrupt them though). Some machines might die, and might return later. Due to network delays, “machine is dead” and “machine is temporary unreachable” are indistinguishable. What we want to do is to make machines agree on some value. “Agree” here means that if some machine says “value is X”, and another machine says “value is Y”, then X necessary is equal to Y. It is OK for machine to answer “I don’t know yet”.
The problem with this formulation is that Paxos is an elementary, but subtle algorithm. To understand it (at least for me), a precise, mathematical formulation is needed. So, let’s try again.
What is Paxos? Paxos is a theorem about sets! This is definitely mathematical, and is true (as long as you base math on set theory), but is not that helpful. So, let’s try again.
What is Paxos? Paxos is a theorem about nondeterministic state machines!
A system is characterized by a state. The system evolves in discrete steps: each step takes system from state to state'. Transitions are non-deterministic: from a single current s1, you may get to different next states s2 and s3. (non-determinism models a flaky network). An infinite sequence of system’s states is called a behavior:
1
state_0 → state_1 → ... → state_n → ...
Due to non-determinism, there’s a potentially infinite number of possible behaviors. Nonetheless, depending on the transition function, we might be able to prove that some condition is true for any state in any behavior.
Let’s start with a simple example, and also introduce some notation. I won’t use TLA+, as I don’t enjoy its concrete syntax. Instead, math will be set in monospaced unicode.
The example models an integer counter. Each step the counter decrements or increments (non-deterministically), but never gets too big or too small
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Sets:
ℕ -- Natural numbers with zero
Vars:
counter ∈ ℕ
Init ≡
counter = 0
Next ≡
(counter < 9 ∧ counter' = counter + 1)
∨ (counter > 0 ∧ counter' = counter - 1)
Theorem:
∀ i: 0 ≤ counter_i ≤ 9
-- Notation
-- ≡: equals by definition
-- ∧: "and", conjunction
-- ∨: "or", disjunction
The sate of the system is a single variable — counter. It holds a natural number. In general, we will represent a state of any system by a fixed set of variables. Even if the system logically consists of several components, we model it using a single unified state.
The Init formula specifies the initial state, the counter is zero. Note that = is a mathematical equality, and not an assignment. Init is a predicate on states.