Memory Consistency Models for Programming Languages

A memory (consistency) model for a programming language specifies what behavior is allowed if parallel threads access the same data. With Java and C++ two mainstream language have a memory model. The too-long-didnt-read version is: "Sequential Consistency for data race free programs". As a programmer, if you properly use synchronized and locks and stuff, then your program is data race free. This means you can assume Sequential Consistency (SC) as a model, when you debug your parallelism problems. This is great, because SC is relatively simple.

Sequential Consistency

This model was published by Leslie Lamport in 1979 in How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. The paper is less than two pages long. Lamport defines SC with two requirements:

Requirement R1: Each processor issues memory requests in the order specified by its program.

Requirement R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue.

R1 is primarly about the single-threaded parts, where memory access must correspond directly to the program. In modern literature the phrase "program order" is used and you find equivalent requirements in all memory models. R1 says nothing about parallel code executed in different threads.

R2 defines a global FIFO which synchronizes all memory accesses. This makes it (relatively) easy to reason about a parallel program. You can single-step through it, like "thread A does this, then thread B does that".

So what is the problem with this model? Performance. Today, CPUs have caches, pipelining, out-of-order execution, and whatnot. Compilers optimize, reorder, and squeeze your program. SC would prevent a lot performance potential.

Seriously, only compiler developers and performance engineers need to read the stuff below. You will not be able to unread the Lovecraftian horrors lingering beyond the locks. You could read things nobody has every written out of thin air.

The two important examples are Java and C++. Java cares about security and safety. For example, you must not dupe the security manager via leaks in the memory model. This means that even when there are data races, Java must specify the semantics, while C++11 declares that undefined behavior and gives no semantics. Also C++ cares much more for performance so it provides more atomics, some with weaker guarantees than Java-volatile. For details, Foundations of the C++ Concurrency Memory Model describes why C++ differs from the Java memory model.

Data Races

While the models are different, they both maintain the "Sequential Consistency for data race free programs" property. The more formal version is: "A program whose SC executions have no data races must have only SC executions". So, before you look into the actual models, you have to understand data races. (Don't confuse data races with race conditions. Explanation of difference.)

A correct definition is: Two unsynchronized memory accesses to the same address and at least one of them writes. This means if you carefully put synchronization everywhere, there are no races, thus the memory models guarantee SC.

Look at an example. Let us assume, some flag x=0 initially. Now in parallel set the flag by writing x=1 and repeatedly read x in another thread. This is a data race. Unsynchronized write is of course faster than synchronizing the write, so we would prefer that. However, the memory model does not guarantee SC in this case. What does that mean? Although we read x again and again, we might never see it change to 1.

Another example: Assume two flags x and y, both set to 0 initially. Now two threads:

x = 1	y = 1
r1 = y	r2 = x

In a sequentially consistent world, there are three possible outcomes for r1 and r2.

x = 1 y = 1 r1 = y r2 = x	x = 1 r1 = y y = 1 r2 = x	y = 1 r2 = x x = 1 r1 = y
r1=1 and r2=1	r1=0 and r2=1	r1=1 and r2=0

The result r1=0 and r2=0 is not possible in SC. However, remember that our memory models only guarantee SC, if there are no data races. Since we have a data race, non-SC executions are possible. Thus, r1=0 and r2=0 is possible. How? For example, the compiler might reorder the instructions, the CPU might reorder the instructions, or the reads are from an outdated cache.

Could we simply forbid or remove data races? Yes we can. For example, in Rust "data races are mostly prevented through Rusts ownership system", so they are forbidden and the programmer will be thwarted by the type system. Alternatively, a compiler could implicitly put synchronization instructions where potentially necessary. Now data races are impossible, but of course performance suffers due to all those additional instructions. Java and C++ consider both options unacceptable. While researchers try to make the compiler smarter, a performance regression is expected currently. For example, "Compiler techniques for high performance sequentially consistent Java programs" reports 10% to 26% slowdown depending on the hardware.

Values out of thin air

This is the most weird part about memory models. Some models allow for value to appear out of thin air, which means values which are impossible in SC executions. How can this happen? It usually requires a cyclic dependency like this:

r1 = x	r2 = y
if (r1 == 42)	if (r2 == 42)
y = r1	x = r2

If we assume x=y=0 initially, r1=42 or r2=42 is impossible. The reasoning how it might happen: Compiler changes the program, so it stores y speculatively. The hardware sees the speculation and acts on it. Two threads see each others speculation, act on it, and it becomes reality. As far as I know this as not been observed in practice.

However, there was a similar issue with GCC 4.3. The code snippet looks like a normal use of mutex_trylock. How else would you use it?

int res = pthread_mutex_trylock(&mutex);
if (res == 0)
    ++acquires_count;

It is compiled via GCC 4.3 -O1 into:

call    pthread_mutex_trylock
cmpl    $1, %eax                ; test res
movl    acquires_count, %edx    ; load
adcl    $0, %edx                ; maybe add 1
movl    %edx, acquires_count    ; store

You can see that the load and store of acquires_count is done even if the condition is false. The condition only controls if it is incremented before the store. If a concurrent thread holds the lock and changes acquires_count, the store might overwrite the change.

For more thinking about out-of-thin-air, read Outlawing Ghosts: Avoiding Out-of-Thin-Air Results. The problem with out-of-thin-air is that it is hard to define. These behavior are by definition gaps in the specification. Closing those gaps (if possible) requires a lot of effort for little gain.

Java

The JSR-133: Java Memory Model and Thread Specification is the definitive text for Java. On 40 pages it explains Javas current memory model. First a sequential consistent model is provided, which is simple, but unnecessarily slow. Then a weak model is shown, which is efficient, but impractical to impossible to use. Finally, the proposed model is described as being between those extrema.

Work on the Java memory model is ongoing. For example, my collegue Lochbihler suggested some changes and also Aspinall and Ševčík. Note that these publications build a formalization and machine-checked proofs, which makes them much more trustworthy.

C/C++

For C++ there is the C++14 Standard, If you search for "memory model", you will find the description of the virtual machine for which C++ is defined instead of anything about concurrency. Instead look to §1.10 "Multi-threaded executions and data races". And some additional parts in §29 and §30.

This is prosaic language, though. Formal mathematical language is more precise. So instead, I recommend Mathematizing C++ Concurrency and the followup Clarifying and Compiling C/C++ Concurrency: from C++11 to POWER.

A more practical introduction is from Herb Sutter in his two-part talk "atomic<> Weapons" at "C++ and Beyond 2012".

Generic

Everybody agrees that "Sequential consistency for data race free programs" is the important slogan for programmers. Memory models disagree about details in pathological cases (which are hideous to debug if they do happen though). So maybe we can define something generic?

This is what A Theory of Memory Models did. It provides a framework which can be instantiated into the C++ or the Java memory model or even more variants. This shows how much of memory models we agree on.

Thanks to Dr. Andreas Lochbihler for proofreading.