Following system colour scheme Selected dark colour scheme Selected light colour scheme

Python Enhancement Proposals

PEP 583 – A Concurrency Memory Model for Python

Author:
Jeffrey Yasskin <jyasskin at google.com>
Status:
Withdrawn
Type:
Informational
Created:
22-Mar-2008
Post-History:


Table of Contents

Abstract

This PEP describes how Python programs may behave in the presence of concurrent reads and writes to shared variables from multiple threads. We use a happens before relation to define when variable accesses are ordered or concurrent. Nearly all programs should simply use locks to guard their shared variables, and this PEP highlights some of the strange things that can happen when they don’t, but programmers often assume that it’s ok to do “simple” things without locking, and it’s somewhat unpythonic to let the language surprise them. Unfortunately, avoiding surprise often conflicts with making Python run quickly, so this PEP tries to find a good tradeoff between the two.

Rationale

So far, we have 4 major Python implementations – CPython, Jython, IronPython, and PyPy – as well as lots of minor ones. Some of these already run on platforms that do aggressive optimizations. In general, these optimizations are invisible within a single thread of execution, but they can be visible to other threads executing concurrently. CPython currently uses a GIL to ensure that other threads see the results they expect, but this limits it to a single processor. Jython and IronPython run on Java’s or .NET’s threading system respectively, which allows them to take advantage of more cores but can also show surprising values to other threads.

So that threaded Python programs continue to be portable between implementations, implementers and library authors need to agree on some ground rules.

A couple definitions

Variable
A name that refers to an object. Variables are generally introduced by assigning to them, and may be destroyed by passing them to del. Variables are fundamentally mutable, while objects may not be. There are several varieties of variables: module variables (often called “globals” when accessed from within the module), class variables, instance variables (also known as fields), and local variables. All of these can be shared between threads (the local variables if they’re saved into a closure). The object in which the variables are scoped notionally has a dict whose keys are the variables’ names.
Object
A collection of instance variables (a.k.a. fields) and methods. At least, that’ll do for this PEP.
Program Order
The order that actions (reads and writes) happen within a thread, which is very similar to the order they appear in the text.
Conflicting actions
Two actions on the same variable, at least one of which is a write.
Data race
A situation in which two conflicting actions happen at the same time. “The same time” is defined by the memory model.

Two simple memory models

Before talking about the details of data races and the surprising behaviors they produce, I’ll present two simple memory models. The first is probably too strong for Python, and the second is probably too weak.

Sequential Consistency

In a sequentially-consistent concurrent execution, actions appear to happen in a global total order with each read of a particular variable seeing the value written by the last write that affected that variable. The total order for actions must be consistent with the program order. A program has a data race on a given input when one of its sequentially consistent executions puts two conflicting actions next to each other.

This is the easiest memory model for humans to understand, although it doesn’t eliminate all confusion, since operations can be split in odd places.

Happens-before consistency

The program contains a collection of synchronization actions, which in Python currently include lock acquires and releases and thread starts and joins. Synchronization actions happen in a global total order that is consistent with the program order (they don’t have to happen in a total order, but it simplifies the description of the model). A lock release synchronizes with all later acquires of the same lock. Similarly, given t = threading.Thread(target=worker):

  • A call to t.start() synchronizes with the first statement in worker().
  • The return from worker() synchronizes with the return from t.join().
  • If the return from t.start() happens before (see below) a call to t.isAlive() that returns False, the return from worker() synchronizes with that call.

We call the source of the synchronizes-with edge a release operation on the relevant variable, and we call the target an acquire operation.

The happens before order is the transitive closure of the program order with the synchronizes-with edges. That is, action A happens before action B if:

  • A falls before B in the program order (which means they run in the same thread)
  • A synchronizes with B
  • You can get to B by following happens-before edges from A.

An execution of a program is happens-before consistent if each read R sees the value of a write W to the same variable such that:

  • R does not happen before W, and
  • There is no other write V that overwrote W before R got a chance to see it. (That is, it can’t be the case that W happens before V happens before R.)

You have a data race if two conflicting actions aren’t related by happens-before.

An example

Let’s use the rules from the happens-before model to prove that the following program prints “[7]”:

class Queue:
    def __init__(self):
        self.l = []
        self.cond = threading.Condition()

    def get():
        with self.cond:
            while not self.l:
                self.cond.wait()
            ret = self.l[0]
            self.l = self.l[1:]
            return ret

    def put(x):
        with self.cond:
            self.l.append(x)
            self.cond.notify()

myqueue = Queue()

def worker1():
    x = [7]
    myqueue.put(x)

def worker2():
    y = myqueue.get()
    print y

thread1 = threading.Thread(target=worker1)
thread2 = threading.Thread(target=worker2)
thread2.start()
thread1.start()
  1. Because myqueue is initialized in the main thread before thread1 or thread2 is started, that initialization happens before worker1 and worker2 begin running, so there’s no way for either to raise a NameError, and both myqueue.l and myqueue.cond are set to their final objects.
  2. The initialization of x in worker1 happens before it calls myqueue.put(), which happens before it calls myqueue.l.append(x), which happens before the call to myqueue.cond.release(), all because they run in the same thread.
  3. In worker2, myqueue.cond will be released and re-acquired until myqueue.l contains a value (x). The call to myqueue.cond.release() in worker1 happens before that last call to myqueue.cond.acquire() in worker2.
  4. That last call to myqueue.cond.acquire() happens before myqueue.get() reads myqueue.l, which happens before myqueue.get() returns, which happens before print y, again all because they run in the same thread.
  5. Because happens-before is transitive, the list initially stored in x in thread1 is initialized before it is printed in thread2.

Usually, we wouldn’t need to look all the way into a thread-safe queue’s implementation in order to prove that uses were safe. Its interface would specify that puts happen before gets, and we’d reason directly from that.

Surprising behaviors with races

Lots of strange things can happen when code has data races. It’s easy to avoid all of these problems by just protecting shared variables with locks. This is not a complete list of race hazards; it’s just a collection that seem relevant to Python.

In all of these examples, variables starting with r are local variables, and other variables are shared between threads.

Zombie values

This example comes from the Java memory model:

Initially p is q and p.x == 0.
Thread 1 Thread 2
r1 = p r6 = p
r2 = r1.x r6.x = 3
r3 = q
r4 = r3.x
r5 = r1.x

Can produce r2 == r5 == 0 but r4 == 3, proving that p.x went from 0 to 3 and back to 0.

A good compiler would like to optimize out the redundant load of p.x in initializing r5 by just re-using the value already loaded into r2. We get the strange result if thread 1 sees memory in this order:

Evaluation Computes Why
r1 = p
r2 = r1.x r2 == 0
r3 = q r3 is p
p.x = 3 Side-effect of thread 2
r4 = r3.x r4 == 3
r5 = r2 r5 == 0 Optimized from r5 = r1.x because r2 == r1.x.

Inconsistent Orderings

From N2177: Sequential Consistency for Atomics, and also known as Independent Read of Independent Write (IRIW).

Initially, a == b == 0.
Thread 1 Thread 2 Thread 3 Thread 4
r1 = a r3 = b a = 1 b = 1
r2 = b r4 = a

We may get r1 == r3 == 1 and r2 == r4 == 0, proving both that a was written before b (thread 1’s data), and that b was written before a (thread 2’s data). See Special Relativity for a real-world example.

This can happen if thread 1 and thread 3 are running on processors that are close to each other, but far away from the processors that threads 2 and 4 are running on and the writes are not being transmitted all the way across the machine before becoming visible to nearby threads.

Neither acquire/release semantics nor explicit memory barriers can help with this. Making the orders consistent without locking requires detailed knowledge of the architecture’s memory model, but Java requires it for volatiles so we could use documentation aimed at its implementers.

A happens-before race that’s not a sequentially-consistent race

From the POPL paper about the Java memory model [#JMM-popl].

Initially, x == y == 0.
Thread 1 Thread 2
r1 = x r2 = y
if r1 != 0: if r2 != 0:
y = 42 x = 42

Can r1 == r2 == 42???

In a sequentially-consistent execution, there’s no way to get an adjacent read and write to the same variable, so the program should be considered correctly synchronized (albeit fragile), and should only produce r1 == r2 == 0. However, the following execution is happens-before consistent:

Statement Value Thread
r1 = x 42 1
if r1 != 0: true 1
y = 42 1
r2 = y 42 2
if r2 != 0: true 2
x = 42 2

WTF, you are asking yourself. Because there were no inter-thread happens-before edges in the original program, the read of x in thread 1 can see any of the writes from thread 2, even if they only happened because the read saw them. There are data races in the happens-before model.

We don’t want to allow this, so the happens-before model isn’t enough for Python. One rule we could add to happens-before that would prevent this execution is:

If there are no data races in any sequentially-consistent execution of a program, the program should have sequentially consistent semantics.

Java gets this rule as a theorem, but Python may not want all of the machinery you need to prove it.

Self-justifying values

Also from the POPL paper about the Java memory model [#JMM-popl].

Initially, x == y == 0.
Thread 1 Thread 2
r1 = x r2 = y
y = r1 x = r2

Can x == y == 42???

In a sequentially consistent execution, no. In a happens-before consistent execution, yes: The read of x in thread 1 is allowed to see the value written in thread 2 because there are no happens-before relations between the threads. This could happen if the compiler or processor transforms the code into:

Thread 1 Thread 2
y = 42 r2 = y
r1 = x x = r2
if r1 != 42:
y = r1

It can produce a security hole if the speculated value is a secret object, or points to the memory that an object used to occupy. Java cares a lot about such security holes, but Python may not.

Uninitialized values (direct)

From several classic double-checked locking examples.

Initially, d == None.
Thread 1 Thread 2
while not d: pass d = [3, 4]
assert d[1] == 4

This could raise an IndexError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.

Thread 2 may actually be implemented as:

r1 = list()
r1.append(3)
r1.append(4)
d = r1

Because the assignment to d and the item assignments are independent, the compiler and processor may optimize that to:

r1 = list()
d = r1
r1.append(3)
r1.append(4)

Which is obviously incorrect and explains the IndexError. If we then look deeper into the implementation of r1.append(3), we may find that it and d[1] cannot run concurrently without causing their own race conditions. In CPython (without the GIL), those race conditions would produce undefined behavior.

There’s also a subtle issue on the reading side that can cause the value of d[1] to be out of date. Somewhere in the implementation of list, it stores its contents as an array in memory. This array may happen to be in thread 1’s cache. If thread 1’s processor reloads d from main memory without reloading the memory that ought to contain the values 3 and 4, it could see stale values instead. As far as I know, this can only actually happen on Alphas and maybe Itaniums, and we probably have to prevent it anyway to avoid crashes.

Uninitialized values (flag)

From several more double-checked locking examples.

Initially, d == dict() and initialized == False.
Thread 1 Thread 2
while not initialized: pass d[‘a’] = 3
r1 = d[‘a’] initialized = True
r2 = r1 == 3
assert r2

This could raise a KeyError, fail the assertion, or, without some care in the implementation, cause a crash or other undefined behavior.

Because d and initialized are independent (except in the programmer’s mind), the compiler and processor can rearrange these almost arbitrarily, except that thread 1’s assertion has to stay after the loop.

Inconsistent guarantees from relying on data dependencies

This is a problem with Java final variables and the proposed data-dependency ordering in C++0x.

First execute:
g = []
def Init():
    g.extend([1,2,3])
    return [1,2,3]
h = None

Then in two threads:

Thread 1 Thread 2
while not h: pass r1 = Init()
assert h == [1,2,3] freeze(r1)
assert h == g h = r1

If h has semantics similar to a Java final variable (except for being write-once), then even though the first assertion is guaranteed to succeed, the second could fail.

Data-dependent guarantees like those final provides only work if the access is through the final variable. It’s not even safe to access the same object through a different route. Unfortunately, because of how processors work, final’s guarantees are only cheap when they’re weak.

The rules for Python

The first rule is that Python interpreters can’t crash due to race conditions in user code. For CPython, this means that race conditions can’t make it down into C. For Jython, it means that NullPointerExceptions can’t escape the interpreter.

Presumably we also want a model at least as strong as happens-before consistency because it lets us write a simple description of how concurrent queues and thread launching and joining work.

Other rules are more debatable, so I’ll present each one with pros and cons.

Data-race-free programs are sequentially consistent

We’d like programmers to be able to reason about their programs as if they were sequentially consistent. Since it’s hard to tell whether you’ve written a happens-before race, we only want to require programmers to prevent sequential races. The Java model does this through a complicated definition of causality, but if we don’t want to include that, we can just assert this property directly.

No security holes from out-of-thin-air reads

If the program produces a self-justifying value, it could expose access to an object that the user would rather the program not see. Again, Java’s model handles this with the causality definition. We might be able to prevent these security problems by banning speculative writes to shared variables, but I don’t have a proof of that, and Python may not need those security guarantees anyway.

Restrict reorderings instead of defining happens-before

The .NET [#CLR-msdn] and x86 [#x86-model] memory models are based on defining which reorderings compilers may allow. I think that it’s easier to program to a happens-before model than to reason about all of the possible reorderings of a program, and it’s easier to insert enough happens-before edges to make a program correct, than to insert enough memory fences to do the same thing. So, although we could layer some reordering restrictions on top of the happens-before base, I don’t think Python’s memory model should be entirely reordering restrictions.

Atomic, unordered assignments

Assignments of primitive types are already atomic. If you assign 3<<72 + 5 to a variable, no thread can see only part of the value. Jeremy Manson suggested that we extend this to all objects. This allows compilers to reorder operations to optimize them, without allowing some of the more confusing uninitialized values. The basic idea here is that when you assign a shared variable, readers can’t see any changes made to the new value before the assignment, or to the old value after the assignment. So, if we have a program like:

Initially, (d.a, d.b) == (1, 2), and (e.c, e.d) == (3, 4). We also have class Obj(object): pass.
Thread 1 Thread 2
r1 = Obj() r3 = d
r1.a = 3 r4, r5 = r3.a, r3.b
r1.b = 4 r6 = e
d = r1 r7, r8 = r6.c, r6.d
r2 = Obj()
r2.c = 6
r2.d = 7
e = r2

(r4, r5) can be (1, 2) or (3, 4) but nothing else, and (r7, r8) can be either (3, 4) or (6, 7) but nothing else. Unlike if writes were releases and reads were acquires, it’s legal for thread 2 to see (e.c, e.d) == (6, 7) and (d.a, d.b) == (1, 2) (out of order).

This allows the compiler a lot of flexibility to optimize without allowing users to see some strange values. However, because it relies on data dependencies, it introduces some surprises of its own. For example, the compiler could freely optimize the above example to:

Thread 1 Thread 2
r1 = Obj() r3 = d
r2 = Obj() r6 = e
r1.a = 3 r4, r7 = r3.a, r6.c
r2.c = 6 r5, r8 = r3.b, r6.d
r2.d = 7
e = r2
r1.b = 4
d = r1

As long as it didn’t let the initialization of e move above any of the initializations of members of r2, and similarly for d and r1.

This also helps to ground happens-before consistency. To see the problem, imagine that the user unsafely publishes a reference to an object as soon as she gets it. The model needs to constrain what values can be read through that reference. Java says that every field is initialized to 0 before anyone sees the object for the first time, but Python would have trouble defining “every field”. If instead we say that assignments to shared variables have to see a value at least as up to date as when the assignment happened, then we don’t run into any trouble with early publication.

Two tiers of guarantees

Most other languages with any guarantees for unlocked variables distinguish between ordinary variables and volatile/atomic variables. They provide many more guarantees for the volatile ones. Python can’t easily do this because we don’t declare variables. This may or may not matter, since python locks aren’t significantly more expensive than ordinary python code. If we want to get those tiers back, we could:

  1. Introduce a set of atomic types similar to Java’s [5] or C++’s [6]. Unfortunately, we couldn’t assign to them with =.
  2. Without requiring variable declarations, we could also specify that all of the fields on a given object are atomic.
  3. Extend the __slots__ mechanism [7] with a parallel __volatiles__ list, and maybe a __finals__ list.

Sequential Consistency

We could just adopt sequential consistency for Python. This avoids all of the hazards mentioned above, but it prohibits lots of optimizations too. As far as I know, this is the current model of CPython, but if CPython learned to optimize out some variable reads, it would lose this property.

If we adopt this, Jython’s dict implementation may no longer be able to use ConcurrentHashMap because that only promises to create appropriate happens-before edges, not to be sequentially consistent (although maybe the fact that Java volatiles are totally ordered carries over). Both Jython and IronPython would probably need to use AtomicReferenceArray or the equivalent for any __slots__ arrays.

Adapt the x86 model

The x86 model is:

  1. Loads are not reordered with other loads.
  2. Stores are not reordered with other stores.
  3. Stores are not reordered with older loads.
  4. Loads may be reordered with older stores to different locations but not with older stores to the same location.
  5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
  6. In a multiprocessor system, stores to the same location have a total order.
  7. In a multiprocessor system, locked instructions have a total order.
  8. Loads and stores are not reordered with locked instructions.

In acquire/release terminology, this appears to say that every store is a release and every load is an acquire. This is slightly weaker than sequential consistency, in that it allows inconsistent orderings, but it disallows zombie values and the compiler optimizations that produce them. We would probably want to weaken the model somehow to explicitly allow compilers to eliminate redundant variable reads. The x86 model may also be expensive to implement on other platforms, although because x86 is so common, that may not matter much.

Upgrading or downgrading to an alternate model

We can adopt an initial memory model without totally restricting future implementations. If we start with a weak model and want to get stronger later, we would only have to change the implementations, not programs. Individual implementations could also guarantee a stronger memory model than the language demands, although that could hurt interoperability. On the other hand, if we start with a strong model and want to weaken it later, we can add a from __future__ import weak_memory statement to declare that some modules are safe.

Implementation Details

The required model is weaker than any particular implementation. This section tries to document the actual guarantees each implementation provides, and should be updated as the implementations change.

CPython

Uses the GIL to guarantee that other threads don’t see funny reorderings, and does few enough optimizations that I believe it’s actually sequentially consistent at the bytecode level. Threads can switch between any two bytecodes (instead of only between statements), so two threads that concurrently execute:

i = i + 1

with i initially 0 could easily end up with i==1 instead of the expected i==2. If they execute:

i += 1

instead, CPython 2.6 will always give the right answer, but it’s easy to imagine another implementation in which this statement won’t be atomic.

PyPy

Also uses a GIL, but probably does enough optimization to violate sequential consistency. I know very little about this implementation.

Jython

Provides true concurrency under the Java memory model and stores all object fields (except for those in __slots__?) in a ConcurrentHashMap, which provides fairly strong ordering guarantees. Local variables in a function may have fewer guarantees, which would become visible if they were captured into a closure that was then passed to another thread.

IronPython

Provides true concurrency under the CLR memory model, which probably protects it from uninitialized values. IronPython uses a locked map to store object fields, providing at least as many guarantees as Jython.

References

Acknowledgements

Thanks to Jeremy Manson and Alex Martelli for detailed discussions on what this PEP should look like.


Source: https://github.com/python/peps/blob/main/pep-0583.rst

Last modified: 2019-03-01 19:39:16 GMT