https://lwn.net/Articles/844224/

Lockless algorithms are of interest for the Linux kernel when traditional locking primitives either cannot be used or are not performant enough. For this reason they come up every now and then on LWN; one of the most recent mentions, which prompted me to write this article series, was last July. Topics that arise even more frequently are read-copy-update (RCU — these articles from 2007 are still highly relevant), reference counting, and ways of wrapping lockless primitives into higher-level, more easily understood APIs. These articles will delve into the concepts behind lockless algorithms and how they are used in the kernel.

Low-level knowledge of the memory model is universally recognized as advanced material that can scare even the most seasoned kernel hackers; our editor wrote (in the July article) that "it takes a special kind of mind to really understand the memory model". It's been said that the Linux kernel memory model (and in particular Documentation/memory-barriers.txt) can be used to frighten small children, and the same is probably true of just the words "acquire" and "release".

At the same time, mechanisms like RCU and seqlocks are in such widespread use in the kernel that almost every developer will sooner or later encounter fundamentally lockless programming interfaces. For this reason, it is a good idea to equip yourself with at least a basic understanding of lockless primitives. Throughout this series I will describe what acquire and release semantics are really about, and present five relatively simple patterns that alone can cover most uses of the primitives.

Acquire, release, and "happens before"

In order to make the big step from the (relative) comfort of synchronization primitives to lockless programming, we will first take a look at why locks work at all. Usually, this is taught in terms of mutual exclusion: locks prevent multiple threads of execution from reading or writing the same data concurrently. But what does "concurrently" really mean? And what happens when thread T is done with that data and thread U starts using it?

In order to answer these questions, we can turn to a theoretical framework that Leslie Lamport established in his 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System". According to the paper, the events in a distributed system can be ordered according to whether an event P happens before another event Q:

- If P and Q happen in the same thread of execution, and P is executed before Q, then P happens before Q.
- If P is the sending of a message and Q is the reception of that same message, then P happens before Q.
- The relation is transitive: if P happens before R, and R happens before Q, then P happens before Q.

The "happens before" relation is a partial ordering: it is possible to have two events P and Q such that neither happens before the other. When this happens, the two events are concurrent. Remember how a lock prevents concurrent accesses to the same data structure? That is because, when you protect a data structure with a lock, all accesses to the data structure form a total ordering, as if they came from a single thread. Lamport's framework also provides a basic idea of what happens when a lock is handed off from a thread to another: some kind of "message passing" ensures that the unlock operation of thread T "happens before" the lock operation of thread U.

As it turns out, this is not just theory: in order to ensure that their caches are coherent, CPUs exchange messages over buses such as Intel's QPI or AMD's HyperTransport. However, this level of detail is definitely too much for our purposes. Instead, we will generalize the "happens before" definition to cover any kind of synchronization primitive.

Lamport's fundamental insight was that synchronization happens when two threads perform symmetric operations on the same data structure. In our generalization, we will list pairs of operations (such as sending and receiving a message through the same queue) that synchronize one thread with another. Furthermore, we will classify the operations in each pair as either release or acquire; equivalently, we can say that they "have release (or acquire) semantics".

Within a pair, the release operation synchronizes with the corresponding acquire operation. "Synchronizing" means that, whenever one thread performs a release operation and another thread performs the corresponding acquire operation, the "happens before" relation grows edges from the releasing thread to the acquiring thread. "Happens before" remains a partial order in the general case, but it now spans two threads, or more than two, thanks to transitivity. More formally:

    If an event P happens before a release operation in thread T, and the corresponding acquire operation in thread U happens before an event Q, then P happens before Q.

The old definition follows by giving release semantics to message send and acquire semantics to message receive: a message send synchronizes the sending thread with the thread (or threads) that receives the message. We can also rephrase our previous findings according to this new definition: for locks to work, unlocking must have release semantics and must synchronize with locking, which in turn has acquire semantics. Whether or not the lock is contended, the resulting "happens before" edges ensure a smooth hand-off from one thread to the other.
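The same pairing is visible, for example, in C11's atomic operations, where a store with memory_order_release synchronizes with a load with memory_order_acquire that reads the stored value. Here is a minimal sketch of message passing through a flag (the variable names are made up for illustration):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int message;          /* the payload, written before the flag */
    static atomic_int ready;     /* the flag: 1 means "message sent" */

    static void *sender(void *arg)
    {
        message = 42;            /* happens before the release store... */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *receiver(void *arg)
    {
        /* The acquire load synchronizes with the release store... */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                    /* spin until the flag is set */
        /* ...so everything the sender did before the store is visible. */
        printf("%d\n", message); /* guaranteed to print 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t s, r;

        pthread_create(&s, NULL, sender, NULL);
        pthread_create(&r, NULL, receiver, NULL);
        pthread_join(s, NULL);
        pthread_join(r, NULL);
        return 0;
    }

The release store is the "message send", and the acquire load that observes it is the "message receive": once the receiver sees ready == 1, everything that happened before the store in the sender, including the assignment to message, has happened before everything after the load in the receiver.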

Acquire and release semantics may seem like an abstract concept, but they truly provide simple explanations of many common multithreaded programming practices. For example, consider two user-space threads accessing a global variable s:
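A minimal sketch of such a program might look as follows; everything beyond the variable s itself is an illustrative assumption:

    #include <pthread.h>
    #include <stdio.h>

    static const char *s;        /* the global variable shared by the threads */

    static void *t2(void *arg)
    {
        puts(s);                 /* guaranteed to print "hello" */
        s = "world";
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        s = "hello";             /* happens before pthread_create()... */
        pthread_create(&t, NULL, t2, NULL);
        pthread_join(t, NULL);   /* ...and t2's exit happens before the join */
        puts(s);                 /* guaranteed to print "world" */
        return 0;
    }

POSIX guarantees memory synchronization around pthread_create() and pthread_join(): creating a thread behaves like a release in the parent paired with an acquire in the child, while joining pairs a release at the child's exit with an acquire in the parent. That is why thread 2 is guaranteed to read "hello" and the final puts() is guaranteed to read "world", even though no lock protects s.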