A viable solution for Python concurrency

Concerns over the performance of programs written in Python are often overstated — for some use cases, at least. But there is no getting around the problem imposed by the infamous global interpreter lock (GIL), which severely limits the concurrency of multi-threaded Python code. Various efforts to remove the GIL have been made over the years, but none have come anywhere near the point where they would be considered for inclusion into the CPython interpreter. Now, though, Sam Gross has entered the arena with a proof-of-concept implementation that may solve the problem for real.

The concurrency restrictions in the CPython interpreter are driven by its garbage-collection approach, which uses reference counts on objects to determine when they are no longer in use. These counts are busy; many types of access to a Python object require a reference-count increment and (eventually) decrement. In a multi-threaded program, reference-count operations must be performed in a thread-safe manner; the alternative is to risk corrupted counts on objects. Given the frequency of these operations, corruption in multi-threaded programs would be just a matter of time, and perhaps not much time at that. To avoid such problems, the GIL only allows one thread to be running in the interpreter (i.e. to actually be running Python code) at a time; that takes away almost all of the advantage of using threads in any sort of compute-intensive code.
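To see why, consider the following standalone C program (purely illustrative; the names are hypothetical and this is not CPython code). Two threads bump a shared counter with plain increments, much as Py_INCREF() bumps ob_refcnt; without the GIL serializing them, updates are routinely lost:

    /* Deliberate data race, for demonstration purposes only: two
     * threads perform plain, non-atomic increments on a shared count,
     * the way CPython manipulates ob_refcnt under the GIL's protection.
     * Without that protection, "count++" (a load, an add, and a store)
     * can interleave between threads and lose updates. */
    #include <pthread.h>
    #include <stdio.h>

    static long count;    /* stand-in for an object's reference count */

    static void *bump(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            count++;      /* not atomic: increments can be lost */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, bump, NULL);
        pthread_create(&b, NULL, bump, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Almost always prints a value well short of 2000000. */
        printf("count = %ld\n", count);
        return 0;
    }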

The reference-count problem can be trivially solved (for a relatively small value of "trivially") by using atomic operations to increment and decrement the counts. There is just one little problem with this solution, as outlined in this design document posted by Gross:

The simplest change would be to replace non-atomic reference count operations with their atomic equivalents. However, atomic instructions are more expensive than their non-atomic counterparts. Replacing Py_INCREF and Py_DECREF with atomic variants would result in a 60% average slowdown on the pyperformance benchmark suite.
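For illustration, here is roughly what that "simplest change" could look like using C11 atomics; the object layout and names are hypothetical stand-ins, not CPython's actual definitions. Each operation becomes an atomic read-modify-write cycle that the processor must serialize, which is where the slowdown comes from:

    #include <stdatomic.h>
    #include <stdlib.h>

    /* Hypothetical object layout; CPython's real PyObject differs. */
    typedef struct obj {
        atomic_long ob_refcnt;
    } obj_t;

    static void free_object(obj_t *op) { free(op); }

    /* Atomic equivalents of Py_INCREF()/Py_DECREF(): thread-safe, but
     * every reference-count operation now pays for an atomic
     * read-modify-write. */
    static inline void incref_atomic(obj_t *op)
    {
        atomic_fetch_add_explicit(&op->ob_refcnt, 1, memory_order_relaxed);
    }

    static inline void decref_atomic(obj_t *op)
    {
        if (atomic_fetch_sub_explicit(&op->ob_refcnt, 1,
                                      memory_order_acq_rel) == 1)
            free_object(op);
    }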

Given that the vast majority of Python programs are single-threaded (and likely to remain so), it is not surprising that there has never been much appetite for solutions that impose this sort of cost.

Biases, immortals, and deferrals

Gross has taken three different approaches to the CPython reference-count problem, the first of which is called "biased reference counts" and is described in this paper by Jiho Choi et al. With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.

Whenever the owning thread drops a reference to an object, it checks both reference counts against zero. If both the local and the shared count are zero, the object can be freed, since no other references exist. If the local count is zero but the shared count is not, a special bit is set to indicate that the owning thread has dropped the object; any subsequent decrements of the shared count will then free the object if that count goes to zero.
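Put together, a much-simplified sketch of the scheme might look like the code below. All of the names and the bit encoding are invented for illustration, and the real algorithm handles corner cases this sketch ignores (for example, the owner acquiring new references after dropping its last one, or handing ownership to another thread):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    /* The shared count is kept in units of 2 so that bit 0 can record
     * that the owner has dropped its last local reference. */
    #define SHARED_UNIT     2
    #define OWNER_DONE_BIT  1

    typedef struct obj {
        pthread_t   owner;       /* thread that created the object */
        long        local_cnt;   /* owner-only; plain arithmetic   */
        atomic_long shared_cnt;  /* everyone else; atomic ops      */
    } obj_t;

    static void free_object(obj_t *op) { free(op); }

    static obj_t *new_object(void)
    {
        obj_t *op = malloc(sizeof(*op));
        op->owner = pthread_self();
        op->local_cnt = 1;               /* creator holds one reference */
        atomic_init(&op->shared_cnt, 0);
        return op;
    }

    static void incref(obj_t *op)
    {
        if (pthread_equal(pthread_self(), op->owner))
            op->local_cnt++;             /* fast, non-atomic path */
        else
            atomic_fetch_add(&op->shared_cnt, SHARED_UNIT);
    }

    static void decref(obj_t *op)
    {
        if (pthread_equal(pthread_self(), op->owner)) {
            if (--op->local_cnt != 0)
                return;
            /* The owner has dropped its last reference: publish that
             * fact, and free immediately if no shared references exist. */
            if (atomic_fetch_or(&op->shared_cnt, OWNER_DONE_BIT) == 0)
                free_object(op);
        } else {
            /* Free if this was the last shared reference and the owner
             * has already dropped its own. */
            if (atomic_fetch_sub(&op->shared_cnt, SHARED_UNIT)
                    == (SHARED_UNIT | OWNER_DONE_BIT))
                free_object(op);
        }
    }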

This algorithm improves reference-count performance because, of all the objects that any thread will create, few will be shared with other threads. So, most of the time, the shared reference count will be unused and the cost of using atomic operations to manipulate that count will be avoided.

There are, naturally, some subtleties in how the reference counts are handled. One of those is that, for reasons to be described next, the two least-significant bits of the local reference count are reserved. An increment to the local reference count, thus, adds four to that count. These details are hidden in the Py_INCREF() and Py_DECREF() macros, so most code need not be aware of them.

Some objects are heavily shared between threads, though; these include singletons like True, False, and None, as well as small integer values, some type objects, and more. These objects will also never go away during the execution of the program — they are "immortal" objects for which reference counting is a waste. Gross's CPython interpreter marks these objects by setting the least-significant bit in the local reference count. If that bit is set, the interpreter doesn't bother tracking references for the relevant object at all. That avoids contention (and cache-line bouncing) for the reference counts in these heavily-used objects. This "optimization" actually slows single-threaded accesses down slightly, according to the design document, but that penalty becomes worthwhile once multi-threaded execution becomes possible.
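In rough terms (the concrete bit assignments and names below are invented for illustration, not taken from Gross's code), the owner-local count might be handled like this:

    /* Two reserved low bits in the local count; a "count of one" is
     * therefore stored as 4. */
    #define IMMORTAL_BIT  0x1   /* object is never deallocated    */
    #define DEFERRED_BIT  0x2   /* counting deferred to a GC pass */
    #define REFCNT_UNIT   0x4   /* real references come in fours  */

    static void incref_local(long *local_cnt)
    {
        if (*local_cnt & IMMORTAL_BIT)
            return;                 /* True, None, small ints, ... */
        *local_cnt += REFCNT_UNIT;
    }

Decrements take the same early exit, so an immortal object's reference count is never written at all and its cache line is never dirtied.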

Other objects in Python programs may not be immortal, but they are still long-lived; functions and modules fall into this category. Here, too, it can make sense to avoid the cost of reference counting. The idea makes even more sense when one realizes that many function and module objects, by virtue of appearing in the globals dictionary, essentially form reference-count cycles anyway and their counts will never go to zero. For these objects, a technique called "deferred reference counting" is used; the second-least-significant bit in the local reference count is set, and (most) reference counting is skipped. Instead, a garbage-collection pass is used to find and free unused objects.

"Most" reference counting is skipped because the CPython interpreter does not, on its own, have a complete picture of whether an object using deferred reference counting is truly unused. Specifically, extension code written in C could be holding references that the interpreter cannot see. For this reason, reference counting is only skipped within the interpreter itself; any other code will manipulate the reference counts as usual.

Other changes and results

The reference-counting changes are a key part of Gross's work, but not all of it. The interpreter's memory allocator has been replaced with mimalloc, which is thread-safe, fast, and is able to easily support garbage-collection operations. The garbage collector itself has been modified to take advantage of mimalloc, but is still "a single-threaded, stop-the-world implementation". A lot of work has gone into the list and dict implementations to make them thread-safe. And so on.

Gross has also put some significant work into improving the performance of the CPython interpreter in general. This was done to address the concern that has blocked GIL-removal work in the past: the performance impact on single-threaded code. The end result is that the new interpreter is 10% faster than CPython 3.9 for single-threaded programs. That will certainly sweeten the pot when it comes to acceptance of this work, though, as Guido van Rossum noted, the Python developers could always just take the performance improvements without the concurrency work and be even faster yet.

That seems like an unlikely outcome, though, if this work stands up to closer scrutiny. Running the "ccbench" benchmark, Gross reported speedups of 18-20x with 20 threads. That is the kind of concurrency speedup that Python developers have been wanting for a long time, so it is unsurprising that this work has seen an enthusiastic reception. As an added bonus, almost all Python programs will run on the modified interpreter without changes. The biggest source of problems might be multi-threaded programs with concurrency-related bugs that have been masked by the GIL until now. Extensions written in C will need to be recompiled, but most of them will not need to be changed unless they (as some evidently do) access reference counts directly rather than using the provided macros.

The end result thus appears to be a GIL-removal effort that has a rather better-than-average chance of making it into the CPython interpreter. That would be cause for a lot of rejoicing among Python developers. That said, a change this fundamental is unlikely to be rushed into the CPython mainline; it will take a lot of testing to convince the community that it is ready for production use. Interested developers may be able to hasten that process by testing this work with their programs and reporting the results.
