% memorder/memorder.tex
\QuickQuizChapter{chp:memorder:Memory Ordering}{Memory Ordering}
\OriginallyPublished{Chapter}{chp:memorder:Memory Ordering}{Linux Kernel Memory Barriers}{kernel.org}{Howells2009membartxt}
\OriginallyPublished{Chapter}{chp:memorder:Memory Ordering}{A formal kernel memory-ordering model}{Linux Weekly News}{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2}
%
\Epigraph{The art of progress is to preserve order amid change and to preserve change amid order.}{\emph{Alfred North Whitehead}}
Causality and sequencing are deeply intuitive, and hackers often
tend to have a much stronger grasp of these concepts than does
the general population.
These intuitions can be extremely powerful tools when writing, analyzing,
and debugging both sequential code and parallel code that makes
use of standard mutual-exclusion mechanisms, such as locking and
RCU.
Unfortunately, these intuitions break down completely in the face of
code that fails to use standard mechanisms, one important example
of course being the code that implements these standard mechanisms.
This chapter will help you gain an understanding of memory ordering
sufficient to implement synchronization primitives and other
performance-critical code.
@@@ roadmap @@@
\section{Ordering: Why and How?}
\label{sec:memorder:Ordering: Why and How?}
One motivation for memory ordering can be seen in the trivial-seeming
litmus test in
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
(\path{C-SB+o-o+o-o.litmus}),
which at first glance might
appear to guarantee that the \co{exists} clause never triggers.\footnote{
Purists would instead say that the \co{exists} clause is
never \emph{satisfied}, but we use ``trigger'' here by
analogy with assertions.}
After all, if \nbco{0:r2=0} as shown in the \co{exists} clause,\footnote{
That is, Thread~\co{P0()}'s instance of local variable \co{r2}
equals zero.
See Section~\ref{sec:formal:Anatomy of a Litmus Test}
for documentation of litmus-test nomenclature.}
we might hope that Thread~\co{P0()}'s
load from~\co{x1} must have happened before Thread~\co{P1()}'s store to~\co{x1},
which might raise
further hopes that Thread~\co{P1()}'s load from~\co{x0} must happen after
Thread~\co{P0()}'s store to~\co{x0}, so that \nbco{1:r2=2},
thus not triggering the \co{exists} clause.
The example is symmetric, so similar reasoning might lead
us to hope that \nbco{1:r2=0} guarantees that \nbco{0:r2=2}.
Unfortunately, the lack of memory barriers dashes these hopes.
The CPU is within its rights to reorder
the statements within both Thread~\co{P0()} and Thread~\co{P1()},
even on relatively strongly ordered systems such as x86.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-SB+o-o+o-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x0, 2);
  r2 = READ_ONCE(*x1);
}


P1(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x1, 2);
  r2 = READ_ONCE(*x0);
}

exists (1:r2=0 /\ 0:r2=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Memory Misordering: Store-Buffering Litmus Test}
\label{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
\end{listing}
\QuickQuiz{}
The compiler can also reorder Thread~\co{P0()}'s and
Thread~\co{P1()}'s memory accesses in
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test},
right?
\QuickQuizAnswer{
In general, compiler optimizations carry out more extensive
and profound reorderings than CPUs can.
However, in this case, the volatile accesses in
\co{READ_ONCE()} and \co{WRITE_ONCE()}
prevent the compiler from reordering.
		They also prevent the compiler from doing much else besides,
		which is why the examples in this section make heavy use of
		\co{READ_ONCE()} and \co{WRITE_ONCE()}.
% @@@ forward reference to a section describing READ_ONCE()
% @@@ and WRITE_ONCE().
} \QuickQuizEnd
This willingness to reorder can be confirmed using tools such as
\co{litmus7}~\cite{Alglave:2014:HCM:2594291.2594347},
which found that the counter-intuitive ordering happened
314 times out of 100,000,000 trials on my x86 laptop.
Oddly enough, the perfectly legal outcome where both loads return the
value 2 occurred less frequently, in this case, only 167 times.\footnote{
Please note that results are sensitive to the exact hardware
configuration,
how heavily the system is loaded, and much else besides.}
The lesson here is clear: Increased counterintuitivity does not necessarily
imply decreased probability!
% Run on June 23, 2017:
% litmus7 -r 1000 -carch X86 C-SB+o-o+o-o.litmus
% Test C-SB+o-o+o-o Allowed
% Histogram (4 states)
% 314 *>0:r2=0; 1:r2=0;
% 49999625:>0:r2=2; 1:r2=0;
% 49999894:>0:r2=0; 1:r2=2;
% 167 :>0:r2=2; 1:r2=2;
The following sections show exactly where this intuition breaks down,
and then put forward a mental model of memory ordering that can help
you avoid these pitfalls.
Section~\ref{sec:memorder:Why Hardware Misordering?}
gives a brief overview of why hardware misorders memory accesses, and then
Section~\ref{sec:memorder:How to Force Ordering?}
gives an equally brief overview of how you can thwart evil hardware
misordering schemes.
Finally, Section~\ref{sec:memorder:Basic Rules of Thumb}
lists some basic rules of thumb, which will be further refined in
later sections.
\subsection{Why Hardware Misordering?}
\label{sec:memorder:Why Hardware Misordering?}
But why does memory misordering happen in the first place?
Can't CPUs keep track of ordering on their own?
Isn't that why we have computers in the first place, to keep track of things?
Many people do indeed expect their computers to keep track of things,
but many also insist that they keep track of things quickly.
However, as seen in Chapter~\ref{chp:Hardware and its Habits},
one difficulty that modern computer-system vendors face is that
the main memory cannot keep up with the CPU---modern CPUs can execute
hundreds of instructions in the time required to fetch a single variable
from memory.
CPUs therefore sport increasingly large caches, as seen back in
Figure~\ref{fig:cpu:System Hardware Architecture}, which means that
although the first load by a given CPU from a given variable will
result in an expensive \emph{cache miss} as was discussed in
Section~\ref{sec:cpu:Cache Misses}, subsequent
repeated loads from that variable by that CPU might execute
very quickly because the initial cache miss will have loaded that
variable into that CPU's cache.
However, it is also necessary to accommodate frequent concurrent stores
from multiple CPUs to a set of shared variables.
In cache-coherent systems, if the caches hold multiple copies of a given
variable, all the copies of that variable must have the same value.
This works extremely well for concurrent loads, but not so well for
concurrent stores: Each store must do something about all
copies of the old value (another cache miss!), which, given the finite
speed of light and the atomic nature of matter, will be slower
than impatient software hackers would like.
\begin{figure}[tb]
\centering
\resizebox{2.5in}{!}{\includegraphics{memorder/SystemArchSB}}
\caption{System Architecture With Store Buffers}
\label{fig:memorder:System Architecture With Store Buffers}
\end{figure}
CPUs therefore come equipped with store buffers, as shown in
Figure~\ref{fig:memorder:System Architecture With Store Buffers}.
When a given CPU does a store to a variable that
is not present in that CPU's cache, then the new value
is instead placed in that CPU's store buffer.
The CPU can then proceed immediately, without having to wait for the
store to do something about all the old values of that variable
residing in other CPUs' caches.
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}}
\caption{CPUs Can Do Things Out of Order}
\ContributedBy{Figure}{fig:memorder:CPUs Can Do Things Out of Order}{Melissa Broussard}
\end{figure}
Although store buffers can greatly increase performance,
they can cause instructions and memory references to execute out
of order, which can in turn cause serious confusion, as illustrated in
Figure~\ref{fig:memorder:CPUs Can Do Things Out of Order}.
In particular, these store buffers can cause the memory misordering
shown in the store-buffering litmus test in
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\begin{table*}
\small
\centering\OneColumnHSpace{-0.1in}
\begin{tabular}{r||l|l|l||l|l|l}
& \multicolumn{3}{c||}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cline{2-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\hline
\hline
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
\hline
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
\hline
3 & \tco{r2 = x1;} (0) & \tco{x0==2} & \tco{x1==0} &
\tco{r2 = x0;} (0) & \tco{x1==2} & \tco{x0==0} \\
\hline
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
\hline
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
\end{tabular}
\caption{Memory Misordering: Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
\end{table*}
Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
shows how this memory misordering can happen.
Row~1 shows the initial state, where CPU~0 has \co{x1} in its cache
and CPU~1 has \co{x0} in its cache, both variables having a value of zero.
Row~2 shows the state change due to each CPU's store (lines~9 and~18 of
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because neither CPU has the stored-to variable in its cache, both CPUs
record their stores in their respective store buffers.
\QuickQuiz{}
But wait!!!
On row~2 of
Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
both \co{x0} and \co{x1} each have two values at the same time,
namely zero and two.
How can that possibly work???
\QuickQuizAnswer{
There is an underlying cache-coherence protocol that straightens
things out, which will be discussed later.
% @@@ Add forward reference.
But if you think that a given variable having two values at
the same time is surprising, just wait until you get to
Section~\ref{sec:memorder:Variables With Multiple Values}!
} \QuickQuizEnd
Row~3 shows the two loads (lines~10 and~19 of
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because the variable being loaded by each CPU is in that CPU's cache,
each load immediately returns the cached value, which in both cases
is zero.
But the CPUs are not done yet: Sooner or later, they must empty their
store buffers.
Because caches move data around in relatively large blocks called
\emph{cachelines}, and because each cacheline can hold several
variables, each CPU must get the cacheline into its own cache so
that it can update the portion of that cacheline corresponding
to the variable in its store buffer, but without disturbing any
other part of the cacheline.
Each CPU must also ensure that the cacheline is not present in any other
CPU's cache, for which a read-invalidate operation is used.
As shown on row~4, after both read-invalidate operations complete,
the two CPUs have traded cachelines, so that CPU~0's cache now contains
\co{x0} and CPU~1's cache now contains \co{x1}.
Once these two variables are in their new homes, each CPU can flush
its store buffer into the corresponding cache line, leaving each
variable with its final value as shown on row~5.
\QuickQuiz{}
But don't the values also need to be flushed from the cache
to main memory?
\QuickQuizAnswer{
Perhaps surprisingly, not necessarily!
On some systems,
if the two variables are being used heavily, they might
be bounced back and forth between the CPUs' caches and never
land in main memory.
} \QuickQuizEnd
In summary, store buffers are needed to allow CPUs to handle
store instructions efficiently, but they can result in
counter-intuitive memory misordering.
But what do you do if your algorithm really needs its memory
references to be ordered?
For example, suppose that you are communicating with a driver using
a pair of flags, one that says whether or not the driver is running
and the other that says whether there is a request pending for that
driver.
The requester needs to set the request-pending flag, then check
the driver-running flag, and, if that flag is clear, wake up the driver.
Once the driver has serviced all the pending requests that it knows about,
it needs to clear its driver-running flag, then check the request-pending
flag to see if it needs to restart.
This very reasonable approach cannot work unless there is some way
to make sure that the hardware processes the stores and loads in order.
This is the subject of the next section.
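
Listing~\ref{lst:memorder:Request Driver Flag Handshake Sketch}
gives a minimal sketch of this handshake, using hypothetical
\co{request_pending} and \co{driver_running} flags along with
hypothetical \co{wake_driver()} and \co{restart_driver()} helpers,
and with comments marking the two points at which store-to-load
ordering is required.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
/* Hypothetical sketch; all names are illustrative. */
int request_pending;
int driver_running;

void requester(void)
{
  WRITE_ONCE(request_pending, 1);
  /* Store-to-load ordering needed here. */
  if (!READ_ONCE(driver_running))
    wake_driver();       /* Hypothetical wakeup. */
}

void driver_idle(void)
{
  WRITE_ONCE(driver_running, 0);
  /* Store-to-load ordering needed here. */
  if (READ_ONCE(request_pending))
    restart_driver();    /* Hypothetical restart. */
}
\end{verbbox}
}
\centering
\theverbbox
\caption{Request/Driver Flag Handshake (Sketch)}
\label{lst:memorder:Request Driver Flag Handshake Sketch}
\end{listing}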
\subsection{How to Force Ordering?}
\label{sec:memorder:How to Force Ordering?}
It turns out that there are compiler directives and standard
synchronization primitives (such as locking and RCU)
that are responsible for maintaining the illusion of ordering through use of
\emph{memory barriers} (for example, \co{smp_mb()} in the Linux kernel).
These memory barriers can be explicit instructions, as they are on
ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions,
as they often are on x86.
Since these standard synchronization primitives preserve the illusion of
ordering, your path of least resistance is to simply use these primitives,
thus allowing you to stop reading this section.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-SB+o-mb-o+o-mb-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x0, 2);
  smp_mb();
  r2 = READ_ONCE(*x1);
}


P1(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x1, 2);
  smp_mb();
  r2 = READ_ONCE(*x0);
}

exists (1:r2=0 /\ 0:r2=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Memory Ordering: Store-Buffering Litmus Test}
\label{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
\end{listing}
However, if you need to implement the synchronization primitives
themselves, or if you are simply interested in understanding how memory
ordering works, read on!
The first stop on the journey is
Listing~\ref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
(\path{C-SB+o-mb-o+o-mb-o.litmus}),
which places an \co{smp_mb()} Linux-kernel full memory barrier between
the store and load in both \co{P0()} and \co{P1()}, but is otherwise
identical to
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
% Test C-SB+o-mb-o+o-mb-o Allowed
% Histogram (3 states)
% 49553298:>0:r2=2; 1:r2=0;
% 49636449:>0:r2=0; 1:r2=2;
% 810253:>0:r2=2; 1:r2=2;
% No
These barriers prevent the counter-intuitive outcome from happening
in 100,000,000 trials on my x86 laptop.
Interestingly enough, the added overhead due to these barriers causes the
legal outcome where both loads return the value two to happen more
than 800,000 times, as opposed to only 167 times for the
barrier-free code in
Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\begin{table*}
\small
\centering\OneColumnHSpace{-0.1in}
\begin{tabular}{r||l|l|l||l|l|l}
& \multicolumn{3}{c||}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cline{2-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\hline
\hline
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
\hline
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
\hline
3 & \tco{smp_mb();} & \tco{x0==2} & \tco{x1==0} &
\tco{smp_mb();} & \tco{x1==2} & \tco{x0==0} \\
\hline
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
\hline
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
\hline
6 & \tco{r2 = x1;} (2) & & \tco{x1==2} &
\tco{r2 = x0;} (2) & & \tco{x0==2} \\
\end{tabular}
\caption{Memory Ordering: Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}
\end{table*}
These barriers have a profound effect on ordering, as can be seen in
Table~\ref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}.
Although the first two rows are the same as in
Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
and although the \co{smp_mb()} instructions on row~3
do not change state
in and of themselves, they do cause the stores to complete
(rows~4 and~5) before the
loads (row~6), which rules out the counter-intuitive outcome shown in
Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}.
Note that variables \co{x0} and \co{x1} each still have more than one
value on row~2; however, as promised earlier, the \co{smp_mb()}
instances straighten things out in the end.
\begin{table*}
\small
\centering\OneColumnHSpace{-0.7in}
\begin{tabular}{l||c||c|c|c|c||c|c|c|c|c|c|c}
& & \multicolumn{4}{c||}{Prior Ordered Operation} &
\multicolumn{7}{c}{Subsequent Ordered Operation} \\
\cline{3-6} \cline{7-13}
Operation Providing Ordering & C &
Self & R & W & RMW & Self & R & W & DR & DW & RMW & SV \\
\hline
\hline
Store, for example, \tco{WRITE_ONCE()} & &
Y & & & & & & & & & & Y \\
\hline
Load, for example, \tco{READ_ONCE()} & &
Y & & & & & & & & Y & & Y \\
\hline
Unsuccessful RMW operation & &
Y & & & & & & & & Y & & Y \\
\hline
\tco{smp_read_barrier_depends()} & &
& Y & & & & & & Y & Y & & \\
\hline
\tco{*_dereference()} & &
Y & & & & & & & Y & Y & & Y \\
\hline
Successful \tco{*_acquire()} & &
R & & & & & Y & Y & Y & Y & Y & Y \\
\hline
Successful \tco{*_release()} & C &
& Y & Y & Y & W & & & & & & Y \\
\hline
\tco{smp_rmb()} & &
& Y & & R & & Y & & Y & & R & \\
\hline
\tco{smp_wmb()} & &
& & Y & W & & & Y & & Y & W & \\
\hline
\tco{smp_mb()} and \tco{synchronize_rcu()} & CP &
& Y & Y & Y & & Y & Y & Y & Y & Y & \\
\hline
Successful full-strength non-\tco{void} RMW & CP &
Y & Y & Y & Y & Y & Y & Y & Y & Y & Y & Y \\
\hline
\tco{smp_mb__before_atomic()} & CP &
& Y & Y & Y & & a & a & a & a & Y & \\
\hline
\tco{smp_mb__after_atomic()} & CP &
& a & a & Y & & Y & Y & Y & Y & & \\
\multicolumn{13}{c}{~} \\
\multicolumn{1}{r}{Key:} &
\multicolumn{12}{l}{C: Ordering is cumulative} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{P: Ordering propagates} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{R: Read, for example, \tco{READ_ONCE()}, or read portion of RMW} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{W: Write, for example, \tco{WRITE_ONCE()}, or write portion of RMW} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{Y: Provides the specified ordering} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{a: Provides specified ordering given intervening RMW atomic operation} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{DR: Dependent read (address dependency, Section~\ref{sec:memorder:Address Dependencies})} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{DW: Dependent write (address, data, or control dependency, Sections~\ref{sec:memorder:Address Dependencies}--\ref{sec:memorder:Control Dependencies})} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{RMW: Atomic read-modify-write operation} \\
\multicolumn{1}{r}{} &
\multicolumn{12}{l}{SV: Same-variable access} \\
\end{tabular}
\caption{Linux-Kernel Memory-Ordering Cheat Sheet}
\label{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
\end{table*}
Although full barriers such as \co{smp_mb()} have extremely strong
ordering guarantees, their strength comes at a high price.
A great many situations can be handled with much weaker ordering guarantees
that use much cheaper memory-ordering instructions, or, in some cases, no
memory-ordering instructions at all.
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
provides a cheatsheet of the Linux kernel's ordering primitives and their
guarantees.
Each row corresponds to a primitive or category of primitives that might
or might not provide ordering, with the columns labeled
``Prior Ordered Operation'' and ``Subsequent Ordered Operation''
being the operations that might (or might not) be ordered against.
Cells containing ``Y'' indicate that ordering is supplied unconditionally,
while other characters indicate that ordering is supplied only partially or
conditionally.
Blank cells indicate that no ordering is supplied.
The \co{*_acquire} row covers \co{smp_load_acquire()},
\co{cmpxchg_acquire()}, \co{xchg_acquire()}, and so on;
the \co{*_release} row covers \co{smp_store_release()},
\co{cmpxchg_release()}, \co{xchg_release()}, and so on; and
the ``Successful full-strength non-\co{void} RMW'' row covers
\co{atomic_add_return()}, \co{atomic_add_unless()}, \co{atomic_dec_and_test()},
\co{cmpxchg()}, \co{xchg()}, and so on.
The ``Successful'' qualifiers apply to primitives such as
\co{atomic_add_unless()}, \co{cmpxchg_acquire()}, and \co{cmpxchg_release()},
	which have no effect on either memory or ordering when they indicate
failure, as indicated by the earlier ``Unsuccessful RMW operation'' row.
Column ``C'' indicates cumulativity and propagation, as explained in
Sections~\ref{sec:memorder:Cumulativity}
and~\ref{sec:memorder:Propagation}.
In the meantime, this column can usually be ignored when there
are at most two threads involved.
\QuickQuiz{}
The rows in
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
seem quite random and confused.
Whatever is the conceptual basis of this table???
\QuickQuizAnswer{
The rows correspond roughly to hardware mechanisms of increasing
power and overhead.
The \co{WRITE_ONCE()} row captures the fact that accesses to
a single variable are always fully ordered, as indicated by
		the ``SV'' column.
Note that all other operations providing ordering against an
access to a
specific variable also provide this same-variable ordering.
The \co{READ_ONCE()} row captures the fact that (as of 2017) compilers
and CPUs do not indulge in user-visible speculative stores, so that
any store whose address, data, or execution depends on a prior load
will happen after that load completes.
At least assuming that these dependencies have been constructed
carefully as described in
Sections~\ref{sec:memorder:Address- and Data-Dependency Restrictions}
and~\ref{sec:memorder:Control-Dependency Restrictions}.
The ``Unsuccessful RMW operation'' row captures the fact that
even an unsuccessful RMW has done a read, and that read is
every bit as good as a \co{READ_ONCE()}.
The \co{smp_read_barrier_depends()} row captures the fact that, with the
notable exception of DEC Alpha, compilers and CPUs do not indulge
in user-visible breakage of address dependencies, again assuming
that these dependencies have been constructed carefully as described in
Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}.
The \co{*_dereference()} row captures the address and data
dependency ordering provided by \co{lockless_dereference()},
\co{rcu_dereference()}, and friends.
The ``Successful \co{*_acquire()}'' row captures the fact that many
CPUs have special ``acquire'' forms of loads and of atomic RMW
instructions,
and that many other CPUs have light-weight memory-barrier
instructions that order prior loads against subsequent loads
and stores.
The ``Successful \co{*_release()}'' row captures the fact that many
CPUs have special ``release'' forms of stores and of atomic RMW
instructions, and that many other CPUs have light-weight memory-barrier
instructions that order prior loads and stores against
subsequent stores.
The \co{smp_rmb()} row captures the fact that many CPUs have
light-weight memory-barrier instructions that order prior loads against
subsequent loads.
Similarly,
the \co{smp_wmb()} row captures the fact that many CPUs have
light-weight memory-barrier instructions that order prior stores against
subsequent stores.
None of the ordering operations thus far require prior stores to be
ordered against subsequent loads, which means that these operations
need not interfere with store buffers, whose main purpose in life
is in fact to reorder prior stores against subsequent loads.
The light-weight nature of these operations is precisely due to
their policy of store-buffer non-interference.
		However, as noted earlier, it is sometimes necessary to interfere
		with the store buffer in order to prevent prior stores from being
		reordered against later loads, which brings us to the remaining
		rows in this table.
		The \co{smp_mb()} row corresponds to the full memory barrier
available on most platforms, with Itanium being the exception
that proves the rule.
The ``Successful full-strength non-\co{void} RMW'' row captures
the fact that on some platforms (such as x86) atomic RMW instructions
provide full ordering both before and after.
The Linux kernel therefore requires that full-strength non-\co{void}
atomic RMW operations provide full ordering in cases where these
operations succeed.
		(Full-strength atomic RMW operations' names do not end in
		\co{_relaxed}, \co{_acquire}, or \co{_release}.)
However, the Linux kernel does not require that \co{void} atomic
RMW operations provide any ordering whatsoever, with the
canonical example being \co{atomic_inc()}.
Therefore, these operations, along with failing non-\co{void}
		atomic RMW operations, may be preceded by \co{smp_mb__before_atomic()}
and followed by \co{smp_mb__after_atomic()} to provide full
ordering for any accesses preceding or following both.
No ordering need be provided for accesses between the
\co{smp_mb__before_atomic()} (or, similarly, the
\co{smp_mb__after_atomic()}) and the atomic RMW operation, as
indicated by the ``a'' entries on the \co{smp_mb__before_atomic()}
and \co{smp_mb__after_atomic()} rows of the table.
In short, any randomness in the table is due to the properties
of the underlying hardware, which are constrained by nothing other
than the laws of physics, as was explained back in
Chapter~\ref{chp:Hardware and its Habits}.
% forward reference to rf co fr section
} \QuickQuizEnd
It is important to note that this table is just a cheat sheet,
and is therefore in no way a replacement for a good understanding
of memory ordering.
To begin building such an understanding, the next section will
present some basic rules of thumb.
\subsection{Basic Rules of Thumb}
\label{sec:memorder:Basic Rules of Thumb}
This section presents some basic rules of thumb that are ``good and
sufficient'' for a great many situations.
In fact, you could write a great deal of concurrent code having
excellent performance and scalability without needing anything more
than these rules of thumb.
\QuickQuiz{}
But how can I know that a given project can be designed
and coded within the confines of these rules of thumb?
\QuickQuizAnswer{
Much of the purpose of the remainder of this chapter is
to answer exactly that question!
} \QuickQuizEnd
\paragraph{A given thread sees its own accesses in order.}
This rule assumes that loads and stores from/to shared variables use
\co{READ_ONCE()} and \co{WRITE_ONCE()}, respectively.
Otherwise, the compiler can profoundly scramble\footnote{
Many compiler writers prefer the word ``optimize'' instead of
``scramble'', but we all have our preferences.}
your code, and sometimes the CPU can do a bit of scrambling as well.
% @@@ Itanium forward reference?
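
As a rough sketch of the sort of scrambling at issue, consider the
hypothetical polling loop in
Listing~\ref{lst:memorder:Compiler Scrambling of a Plain Polling Loop Sketch},
in which \co{need_to_stop} is a shared flag set by some other thread:
Because the compiler can see that the plain-access loop body never stores
to \co{need_to_stop}, it is within its rights to load that flag only once,
producing an infinite loop, whereas \co{READ_ONCE()} forces a fresh load
on each pass.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
int need_to_stop;  /* Hypothetical flag set by another thread. */
long count = 0;

/* Buggy: compiler may load need_to_stop only once. */
while (!need_to_stop)
  count++;

/* Safer: READ_ONCE() forces a load on each pass. */
while (!READ_ONCE(need_to_stop))
  count++;
\end{verbbox}
}
\centering
\theverbbox
\caption{Compiler Scrambling of a Plain Polling Loop (Sketch)}
\label{lst:memorder:Compiler Scrambling of a Plain Polling Loop Sketch}
\end{listing}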
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{memorder/memorybarrier}}
\caption{Memory Barriers Provide Conditional If-Then Ordering}
\label{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
\end{figure}
\paragraph{Ordering has conditional if-then semantics.}
Figure~\ref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
illustrates this for memory barriers.
Assuming that both memory barriers are strong enough (and when in doubt, you
can always use \co{smp_mb()}), if CPU~1's access Y1 happens after CPU~0's
access Y0, then CPU~1's access X1 is guaranteed to happen after CPU~0's
access X0.
\QuickQuiz{}
How can you tell which memory barriers are strong enough for
a given use case?
\QuickQuizAnswer{
Ah, that is a deep question whose answer requires most of the
rest of this chapter.
} \QuickQuizEnd
Listing~\ref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
is a case in point.
The \co{smp_mb()} instances on lines~10 and~20 serve as the barriers,
the store to \co{x0} on line~9 as X0, the load from \co{x1} on line~11
as Y0, the store to \co{x1} on line~19 as Y1, and the load from
\co{x0} on line~21 as X1.
Applying the if-then rule step by step, we know that the store to
\co{x1} on line~19 happens after the load from \co{x1} on line~11 if
\co{P0()}'s local variable \co{r2} is set to the value zero.
The if-then rule would then state that the load from \co{x0} on
line~21 happens after the store to \co{x0} on line~9.
In other words,
\co{P1()}'s local variable \co{r2} is guaranteed
to end up with the value two \emph{only if}
\co{P0()}'s local variable \co{r2} ends up with the value zero.
Again, memory ordering guarantees are conditional, not absolute.
Although
Figure~\ref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
specifically mentions memory barriers, the same rule applies to the
rest of the Linux kernel's ordering operations.
\paragraph{Ordering operations must be paired.}
If you carefully order the operations in one thread, but then fail to do
so in another thread, then there is no ordering.
Both threads must provide ordering for the if-then rule to apply.\footnote{
In Section~\ref{sec:memorder:Propagation}, pairing will be
generalized to cycles.}
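
For example, consider a hypothetical variant of
Listing~\ref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
in which only \co{P0()} retains its \co{smp_mb()}, as sketched in
Listing~\ref{lst:memorder:Half-Paired Store-Buffering Litmus Test Sketch}.
Because \co{P1()}'s store and load can still be reordered, for example,
by its store buffer, the \co{exists} clause can still trigger, barrier
or no barrier in \co{P0()}.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
C C-SB+o-mb-o+o-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x0, 2);
  smp_mb();
  r2 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
  int r2;

  WRITE_ONCE(*x1, 2);
  r2 = READ_ONCE(*x0);
}

exists (1:r2=0 /\ 0:r2=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Half-Paired Store-Buffering Litmus Test (Sketch)}
\label{lst:memorder:Half-Paired Store-Buffering Litmus Test Sketch}
\end{listing}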
\paragraph{Ordering operations almost never speed things up.}
If you find yourself tempted to add a memory barrier in an attempt
to force a prior store to be flushed to memory faster, resist!
Adding ordering usually slows things down.
Of course, there are situations where adding instructions speeds things
up, but careful benchmarking is required in such cases.
And even then, it is quite possible that although you sped things up
a little bit on \emph{your} system, you might well have slowed things
down significantly on your users' systems.
Or on your future system.
\paragraph{Ordering operations are not magic.}
When your program is failing due to some race condition, it is often
tempting to toss in a few memory-ordering operations in an attempt
to barrier your bugs out of existence.
A far better reaction is to use higher-level primitives in a carefully
designed manner.
With concurrent programming, it is almost always easier to design
your bugs out of existence than to hack them out of existence!
\paragraph{These are only rough rules of thumb.}
Although these rules of thumb cover the vast majority of situations
seen in actual practice, as with any set of rules of thumb, they
do have their limits.
The next section will demonstrate some of these limits by introducing
trick-and-trap litmus tests that are intended to insult your
intuition while increasing your understanding.
These litmus tests will also illuminate many of the concepts
represented by the Linux-kernel memory-ordering cheat sheet shown in
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
Section~\ref{sec:memorder:Where is Memory Ordering Needed?} will
circle back to this cheat sheet in light of learnings from all the
intervening tricks and traps.
\section{Tricks and Traps}
\label{sec:memorder:Tricks and Traps}
Now that you know that hardware can reorder memory accesses and that you
can prevent it from doing so, the next step is to get you to admit
that your intuition has a problem.
This painful task is taken up by
Section~\ref{sec:memorder:Variables With Multiple Values},
which presents some code demonstrating that scalar variables can
take on multiple values simultaneously,
and by
Sections~\ref{sec:memorder:Memory-Reference Reordering} through
\ref{sec:memorder:Multicopy Atomicity},
which show a series of intuitively correct code fragments that fail miserably
on real hardware.
Once your intuition has made it through the grieving process, later
sections will summarize the basic rules that memory ordering follows.
But first, let's take a quick look at just how many values a single
variable might have at a single moment of time.
\subsection{Variables With Multiple Values}
\label{sec:memorder:Variables With Multiple Values}
It is natural to think of a variable as taking on a well-defined
sequence of values in a well-defined, global order.
Unfortunately, the next stop on the journey says ``goodbye'' to this comforting fiction.
Hopefully, you already started to say ``goodbye'' in response to row~2 of
Tables~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
and~\ref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events},
and if so, the purpose of this section is to drive this point home.
To this end, consider the program fragment shown in
Listing~\ref{lst:memorder:Software Logic Analyzer}.
This code fragment is executed in parallel by several CPUs.
Line~1 sets a shared variable to the current CPU's ID, line~2
initializes several variables from a \co{gettb()} function that
delivers the value of a fine-grained hardware ``timebase'' counter that is
synchronized among all CPUs (not available from all CPU architectures,
unfortunately!), and the loop from lines~3-8 records the length of
time that the variable retains the value that this CPU assigned to it.
Of course, one of the CPUs will ``win'', and would thus never exit
the loop if not for the check on lines~6-7.
\QuickQuiz{}
What assumption is the code fragment
in Listing~\ref{lst:memorder:Software Logic Analyzer}
making that might not be valid on real hardware?
\QuickQuizAnswer{
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
The actual code used to produce the data in the figures
discussed later in this section was therefore somewhat more
complex.
} \QuickQuizEnd
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 state.variable = mycpu;
2 lasttb = oldtb = firsttb = gettb();
3 while (state.variable == mycpu) {
4 lasttb = oldtb;
5 oldtb = gettb();
6 if (lasttb - firsttb > 1000)
7 break;
8 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Software Logic Analyzer}
\label{lst:memorder:Software Logic Analyzer}
\end{listing}
Upon exit from the loop, \co{firsttb} will hold a timestamp
taken shortly after the assignment and \co{lasttb} will hold
a timestamp taken before the last sampling of the shared variable
that still retained the assigned value, or a value equal to \co{firsttb}
if the shared variable had changed before entry into the loop.
This allows us to plot each CPU's view of the value of \co{state.variable}
over a 532-nanosecond time period, as shown in
Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}.
This data was collected in 2006 on a 1.5\,GHz POWER5 system with 8 cores,
each containing a pair of hardware threads.
CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test.
The timebase counter period was about 5.32\,ns, sufficiently fine-grained
to allow observations of intermediate cache states.
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{memorder/MoreThanOneValue}}
\caption{A Variable With Multiple Simultaneous Values}
\label{fig:memorder:A Variable With Multiple Simultaneous Values}
\end{figure}
Each horizontal bar represents the observations of a given CPU over time,
with the black regions to the left indicating the time before the
corresponding CPU's first measurement.
During the first 5\,ns, only CPU~3 has an opinion about the value of the
variable.
During the next 10\,ns, CPUs~2 and~3 disagree on the value of the variable,
but thereafter agree that the value is~``2'', which is in fact
the final agreed-upon value.
However, CPU~1 believes that the value is~``1'' for almost 300\,ns, and
CPU~4 believes that the value is~``4'' for almost 500\,ns.
\QuickQuiz{}
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
\QuickQuizAnswer{
As discussed in
Section~\ref{sec:memorder:Why Hardware Misordering?},
many CPUs have store buffers that record the values of
recent stores, which do not become globally visible until
the corresponding cache line makes its way to the CPU.
Therefore, it is quite possible for each CPU to see a
different value for a given variable at a single point
in time---and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
} \QuickQuizEnd
\QuickQuiz{}
Why do CPUs~2 and~3 come to agreement so quickly, when it
takes so long for CPUs~1 and~4 to come to the party?
\QuickQuizAnswer{
CPUs~2 and~3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a NUMA, or, more accurately, a NUCA effect.
This leads to the question of why CPUs~2 and~3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory-ordering operations in the code fragment.
} \QuickQuizEnd
And if you think that the situation with four CPUs was intriguing, consider
Figure~\ref{fig:memorder:A Variable With More Simultaneous Values},
which shows the same situation, but with 15~CPUs each assigning their
number to a single shared variable at time~$t=0$. Both diagrams in the
figure are drawn in the same way as
Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}.
The only difference is that the unit of the horizontal axis is timebase ticks,
with each tick lasting about 5.3~nanoseconds.
The entire sequence therefore lasts a bit longer than the events recorded in
Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values},
consistent with the increase in the number of CPUs.
The upper diagram shows the overall picture, while the lower one shows
a close-up of the first 50~timebase ticks.
Again, CPU~0 coordinates the test, so does not record any values.
\begin{figure*}
\centering
\resizebox{5in}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}}
\caption{A Variable With More Simultaneous Values}
\ContributedBy{Figure}{fig:memorder:A Variable With More Simultaneous Values}{Akira Yokosawa}
\end{figure*}
All CPUs eventually agree on the final value of~9, but not before
the values~15 and~12 take early leads.
Note that there are fourteen different opinions on the variable's value
at time~21, as indicated by the vertical line in the lower diagram.
Note also that all CPUs see sequences whose orderings are consistent with
the directed graph shown in
Figure~\ref{fig:memorder:Possible Global Orders With More Simultaneous Values}.
Nevertheless, both figures underscore the importance of
proper use of memory-ordering operations, such as memory barriers,
for code that cares about memory ordering.
\begin{figure}[htb]
\centering
\resizebox{2.0in}{!}{\includegraphics{memorder/store15tred}}
\caption{Possible Global Orders With More Simultaneous Values}
\label{fig:memorder:Possible Global Orders With More Simultaneous Values}
\end{figure}
How many values can a single variable take on at a single point in
time?
As many as one per store buffer in the system!
We have therefore entered a regime where we must bid a fond farewell to
comfortable intuitions about values of variables and the passage of time.
This is the regime where memory-ordering operations are needed.
All that aside, it is important to remember the lessons from
Chapters~\ref{chp:Hardware and its Habits}
and~\ref{cha:Partitioning and Synchronization Design}.
Having all CPUs store concurrently to the same variable
is absolutely no way to design a parallel program, at least
not if performance and scalability are at all important to you.
Unfortunately, memory ordering has many other ways of insulting your
intuition, and not all of these ways conflict with performance and
scalability.
The next section will give an overview of the reordering of unrelated
memory references.
\subsection{Memory-Reference Reordering}
\label{sec:memorder:Memory-Reference Reordering}
Section~\ref{sec:memorder:Why Hardware Misordering?}
showed that even relatively strongly ordered systems like x86
can reorder prior stores with later loads, at least when the
store and load are to different variables.
This section builds on that result, looking at the other combinations of
loads and stores.
% @@@ Rationale for further reordering.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-wmb-o+o-o

{
}

P0(int* x0, int* x1) {

  WRITE_ONCE(*x0, 2);
  smp_wmb();
  WRITE_ONCE(*x1, 2);
}


P1(int* x0, int* x1) {

  int r2;
  int r3;

  r2 = READ_ONCE(*x1);
  r3 = READ_ONCE(*x0);
}

exists (1:r2=2 /\ 1:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Message-Passing Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Load}
\label{sec:memorder:Load Followed By Load}
Listing~\ref{lst:memorder:Message-Passing Litmus Test (No Ordering)}
(\path{C-MP+o-wmb-o+o-o.litmus})
shows the classic \emph{message-passing} litmus test, where \co{x0} is
the message and \co{x1} is a flag indicating whether or not a message
is available.
In this test, the \co{smp_wmb()} forces \co{P0()}'s stores to be ordered,
but no ordering is specified for the loads.
Relatively strongly ordered architectures, such as x86, do enforce ordering.
However, weakly ordered architectures often do
not~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on line~23 of the listing \emph{can}
trigger.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-wmb-o+o-rmb-o

{
}

P0(int* x0, int* x1) {

  WRITE_ONCE(*x0, 2);
  smp_wmb();
  WRITE_ONCE(*x1, 2);
}


P1(int* x0, int* x1) {

  int r2;
  int r3;

  r2 = READ_ONCE(*x1);
  smp_rmb();
  r3 = READ_ONCE(*x0);
}

exists (1:r2=2 /\ 1:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Enforcing Order of Message-Passing Litmus Test}
\label{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
\end{listing}
Thus, portable code relying on ordering in this case must
add explicit ordering, for example, the \co{smp_rmb()} shown on
line~20 of
Listing~\ref{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
(\path{C-MP+o-wmb-o+o-rmb-o.litmus}), which prevents
the \co{exists} clause from triggering.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+o-o+o-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x1);
  WRITE_ONCE(*x0, 2);
}


P1(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x0);
  WRITE_ONCE(*x1, 2);
}

exists (1:r2=2 /\ 0:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Load-Buffering Litmus Test (No Ordering)}
\label{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Store}
\label{sec:memorder:Load Followed By Store}
Listing~\ref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
(\path{C-LB+o-o+o-o.litmus})
shows the classic \emph{load-buffering} litmus test.
Although relatively strongly ordered systems such as x86
or the IBM Mainframe do not reorder prior loads with subsequent stores,
more weakly ordered architectures really do allow such
reordering~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on line~22 really can trigger.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+o-r+a-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x1);
  smp_store_release(x0, 2);
}


P1(int *x0, int *x1)
{
  int r2;

  r2 = smp_load_acquire(x0);
  WRITE_ONCE(*x1, 2);
}

exists (1:r2=2 /\ 0:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Enforcing Ordering of Load-Buffering Litmus Test}
\label{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
\end{listing}
Although it is rare for actual hardware to
exhibit this reordering~\cite{LucMaranget2017aarch64},
portable code must nevertheless enforce any required ordering, for example,
as shown in
Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
(\path{C-LB+o-r+a-o.litmus}).
This guarantees that the \co{exists} clause on line~22 never triggers.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-o+o-rmb-o

{
}

P0(int* x0, int* x1) {

  WRITE_ONCE(*x0, 2);
  WRITE_ONCE(*x1, 2);
}


P1(int* x0, int* x1) {

  int r2;
  int r3;

  r2 = READ_ONCE(*x1);
  smp_rmb();
  r3 = READ_ONCE(*x0);
}

exists (1:r2=2 /\ 1:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Message-Passing Litmus Test, No Writer Ordering (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test, No Writer Ordering (No Ordering)}
\end{listing}
\subsubsection{Store Followed By Store}
\label{sec:memorder:Store Followed By Store}
Listing~\ref{lst:memorder:Message-Passing Litmus Test, No Writer Ordering (No Ordering)}
(\path{C-MP+o-o+o-rmb-o.litmus})
once again shows the classic message-passing litmus test, but without
explicit ordering for \co{P0()}'s stores and with the \co{smp_rmb()}
providing ordering for \co{P1()}'s loads.
Again, the relatively strongly ordered architectures do enforce ordering,
but weakly ordered architectures do not necessarily do
so~\cite{JadeAlglave2011ppcmem}, which means that the
\co{exists} clause can trigger.
Therefore, portable code must explicitly order the stores, for
example, as shown in
Listing~\ref{lst:memorder:Enforcing Order of Message-Passing Litmus Test},
thus preventing the \co{exists} clause from triggering.
\subsection{Address Dependencies}
\label{sec:memorder:Address Dependencies}
An \emph{address dependency} occurs when the value returned by a load
instruction is used to compute the address used by a later memory-reference
instruction.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-wmb-o+o-addr-o

{
  int y=1;
  int *x1 = &y;
}

P0(int* x0, int** x1) {

  WRITE_ONCE(*x0, 2);
  smp_wmb();
  WRITE_ONCE(*x1, x0);
}


P1(int** x1) {

  int *r2;
  int r3;

  r2 = READ_ONCE(*x1);
  r3 = READ_ONCE(*r2);
}

exists (1:r2=x0 /\ 1:r3=1)
\end{verbbox}
}
\centering
\theverbbox
\caption{Message-Passing Address-Dependency Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)}
\end{listing}
Listing~\ref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)}
(\path{C-MP+o-wmb-o+o-addr-o.litmus})
shows a linked variant of the message-passing pattern.
The head pointer is \co{x1}, which initially
references the \co{int} variable \co{y} (line~5), which is in turn
initialized to the value $1$ (line~4).
\co{P0()} updates head pointer \co{x1} to reference \co{x0} (line~12),
but only after initializing \co{x0} to $2$ (line~10) and forcing ordering
(line~11).
\co{P1()} picks up the head pointer \co{x1} (line~21), and then loads
the referenced value (line~22).
There is thus an address dependency from the load on line~21 to the
load on line~22.
In this case, the value returned by line~21 is exactly the address
used by line~22, but many variations are possible,
including field access using the C-language \co{->} operator,
addition, subtraction, and array indexing.
One might hope that line~21's load from the head pointer would be ordered
before line~22's dereference.
However, this is not the case on DEC Alpha, which can in effect use
a speculated value for the dependent load, as described in more detail in
Section~\ref{sec:memorder:Alpha}.
Therefore,
Listing~\ref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)}'s
\co{exists} clause \emph{can} trigger.
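
For example, the hypothetical fragment in
Listing~\ref{lst:memorder:Address Dependency Through a Structure Field Sketch}
carries the address dependency through a structure field rather than
through the bare pointer value; the \co{struct foo} and \co{headp} names
are purely illustrative.
The same DEC Alpha caveat applies to this variation: Ordering the two
loads on that platform requires \co{lockless_dereference()}, as described
below.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo {
  int data;
};
struct foo *headp;  /* Hypothetical head pointer. */

struct foo *p;
int r;

p = READ_ONCE(headp);    /* Load the pointer... */
r = READ_ONCE(p->data);  /* ...whose value forms this load's address. */
\end{verbbox}
}
\centering
\theverbbox
\caption{Address Dependency Through a Structure Field (Sketch)}
\label{lst:memorder:Address Dependency Through a Structure Field Sketch}
\end{listing}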
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-wmb-o+ld-addr-o

{
  int y=1;
  int *x1 = &y;
}

P0(int* x0, int** x1) {

  WRITE_ONCE(*x0, 2);
  smp_wmb();
  WRITE_ONCE(*x1, x0);
}


P1(int** x1) {

  int *r2;
  int r3;

  r2 = lockless_dereference(*x1);
  r3 = READ_ONCE(*r2);
}

exists (1:r2=x0 /\ 1:r3=1)
\end{verbbox}
}
\centering
\theverbbox
\caption{Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
\label{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
\end{listing}
Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
(\path{C-MP+o-wmb-o+ld-addr-o.litmus})
shows how to make this work portably, even on DEC Alpha, by
replacing line~21's \co{READ_ONCE()} with \co{lockless_dereference()},
which acts like \co{READ_ONCE()} on all platforms other than DEC Alpha,
where it acts like a \co{READ_ONCE()} followed by an \co{smp_mb()},
thereby forcing the required ordering on all platforms, in turn
preventing the \co{exists} clause from triggering.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-S+o-wmb-o+o-addr-o

{
  int y=1;
  int *x1 = &y;
}

P0(int* x0, int** x1) {

  WRITE_ONCE(*x0, 2);
  smp_wmb();
  WRITE_ONCE(*x1, x0);
}


P1(int** x1) {

  int *r2;

  r2 = READ_ONCE(*x1);
  WRITE_ONCE(*r2, 3);
}


exists (1:r2=x0 /\ x0=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{S Address-Dependency Litmus Test}
\label{lst:memorder:S Address-Dependency Litmus Test}
\end{listing}
But what happens if the dependent operation is a store rather than
a load, for example, in the \emph{S}
litmus test~\cite{JadeAlglave2011ppcmem} shown in
Listing~\ref{lst:memorder:S Address-Dependency Litmus Test}
(\path{C-S+o-wmb-o+o-addr-o.litmus})?
Because no production-quality platform speculates stores,
it is not possible for the \co{WRITE_ONCE()} on line~10 to overwrite
the \co{WRITE_ONCE()} on line~21, meaning that the \co{exists}
clause on line~25 cannot trigger, even on DEC Alpha, even
without the \co{lockless_dereference()} that is required in the
dependent-load case.
\QuickQuiz{}
But how do we know that \emph{all} platforms really avoid
triggering the \co{exists} clauses in
Listings~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
and~\ref{lst:memorder:S Address-Dependency Litmus Test}?
\QuickQuizAnswer{
Answering this requires identifying three major groups of platforms:
(1)~Total-store-order (TSO) platforms,
		(2)~Weakly ordered platforms, and
(3)~DEC Alpha.
The TSO platforms order all pairs of memory references except for
prior stores against later loads.
Because the address dependency on lines~21 and~22 of
Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
is instead a load followed by another load, TSO platforms preserve
this address dependency.
They also preserve the address dependency on lines~20 and~21 of
Listing~\ref{lst:memorder:S Address-Dependency Litmus Test}
because this is a load followed by a store.
Because address dependencies must start with a load, TSO platforms
implicitly but completely respect them.
Weakly ordered platforms don't necessarily maintain ordering of
unrelated accesses.
		However, the accesses in
		Listings~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
		and~\ref{lst:memorder:S Address-Dependency Litmus Test}
		are not unrelated: They are linked by address dependencies.
The hardware tracks dependencies and maintains the needed
ordering.
There is one (famous) exception to this rule for weakly ordered
platforms, and that exception is DEC Alpha for load-to-load
address dependencies.
And this is why DEC Alpha requires the explicit memory barrier
supplied for it by the \co{lockless_dereference()} on line~21 of
		Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}.
However, DEC Alpha does track load-to-store address dependencies,
which is why line~20 of
		Listing~\ref{lst:memorder:S Address-Dependency Litmus Test}
does not have a \co{lockless_dereference()}.
To sum up, current platforms either respect address dependencies
implicitly, as is the case for TSO platforms (x86, mainframe,
SPARC,~...), have hardware tracking for address dependencies
(ARM, PowerPC, MIPS,~...), or have the required memory barriers
supplied by \co{lockless_dereference()} (DEC Alpha).
} \QuickQuizEnd
However, it is important to note that address dependencies can
be fragile and easily broken by compiler optimizations, as discussed in
Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}.
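
As a hint of the sort of breakage discussed there, consider the
hypothetical fragment in
Listing~\ref{lst:memorder:Pointer Comparison Breaking an Address Dependency Sketch},
in which all names are illustrative:
Within the \co{if} statement's body, the compiler knows that \co{p}
equals \co{&default_foo}, so it is free to use the address of
\co{default_foo} directly in the dependent load, severing the address
dependency on the load from \co{gp}.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo {
  int a;
};
struct foo default_foo;
struct foo *gp = &default_foo;  /* Hypothetical globals. */

void reader(void)
{
  struct foo *p;
  int r;

  p = READ_ONCE(gp);
  if (p == &default_foo)
    /* Compiler may load from default_foo.a directly,
       breaking the address dependency on gp. */
    r = READ_ONCE(p->a);
}
\end{verbbox}
}
\centering
\theverbbox
\caption{Pointer Comparison Breaking an Address Dependency (Sketch)}
\label{lst:memorder:Pointer Comparison Breaking an Address Dependency Sketch}
\end{listing}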
\subsection{Data Dependencies}
\label{sec:memorder:Data Reordering}
A \emph{data dependency} occurs when the value returned by a load
instruction is used to compute the data stored by a later store
instruction.
Note well the ``data'' above: If the value returned by a load
was instead used to compute the address used by a later store
instruction, that would instead be an address dependency.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+o-r+o-data-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x1);
  smp_store_release(x0, 2);
}


P1(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x0);
  WRITE_ONCE(*x1, r2);
}

exists (1:r2=2 /\ 0:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Load-Buffering Data-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
\end{listing}
Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
(\path{C-LB+o-r+o-data-o.litmus})
is similar to
Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
except that \co{P1()}'s ordering between lines~18 and~19 is
enforced not by an acquire load, but instead by a data dependency:
The value loaded by line~18 is what line~19 stores.
The ordering provided by this data dependency is sufficient to prevent
the \co{exists} clause from triggering.
\QuickQuiz{}
Why doesn't line~18 of
Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
need a \co{lockless_dereference()}?
\QuickQuizAnswer{
Data dependencies are always load-to-store dependencies, and
so all platforms respect them, even DEC Alpha, and for the
same reasons that they respect load-to-store address dependencies.
} \QuickQuizEnd
Just as with address dependencies, data dependencies are
fragile and can be easily broken by compiler optimizations, as discussed in
Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}.
In fact, data dependencies can be even more fragile than are address
dependencies.
The reason for this is that address dependencies normally involve
pointer values.
In contrast, as shown in
Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test},
it is tempting to carry data dependencies through integral values,
which the compiler has much more freedom to optimize into nonexistence.
For but one example, if the integer loaded were multiplied by the constant
zero, the compiler would know that the result was zero, and could therefore
substitute the constant zero for the value loaded, thus breaking
the dependency.
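
Listing~\ref{lst:memorder:Multiplication by Zero Breaking a Data Dependency Sketch}
gives a minimal sketch of this hazard, based on \co{P1()} from
Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}:
The first store below preserves the data dependency, but the compiler is
entitled to break the dependency in the second.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
int r2;

r2 = READ_ONCE(*x0);
WRITE_ONCE(*x1, r2);      /* Data dependency preserved. */
WRITE_ONCE(*x1, r2 * 0);  /* Compiler may emit WRITE_ONCE(*x1, 0),
                             breaking the data dependency. */
\end{verbbox}
}
\centering
\theverbbox
\caption{Multiplication by Zero Breaking a Data Dependency (Sketch)}
\label{lst:memorder:Multiplication by Zero Breaking a Data Dependency Sketch}
\end{listing}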
\QuickQuiz{}
But wait!!!
Line~18 of
Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
uses \co{READ_ONCE()}, which marks the load as volatile,
which means that the compiler absolutely must emit the load
instruction even if the value is later multiplied by zero.
So do you really need to work so hard to keep the compiler from
breaking your data dependencies?
\QuickQuizAnswer{
Yes, the compiler absolutely must emit a load instruction for
a volatile load.
But if you multiply the value loaded by zero, the compiler is
well within its rights to substitute a constant zero for the
result of that multiplication, which will break the data
dependency on many platforms.
Worse yet, if the dependent store does not use \co{WRITE_ONCE()},
the compiler could hoist it above the load, which would cause
even TSO platforms to fail to provide ordering.
} \QuickQuizEnd
In short, you can rely on data dependencies, but only if you take care
to prevent your compiler from breaking them.
\subsection{Control Dependencies}
\label{sec:memorder:Control Dependencies}
A \emph{control dependency} occurs when the value returned by a load
instruction is tested to determine whether or not a later store instruction
is executed.
Note well the ``later store instruction'': Many platforms do not respect
load-to-load control dependencies.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+o-r+o-ctrl-o
{
}

P0(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x1);
  smp_store_release(x0, 2);
}


P1(int *x0, int *x1)
{
  int r2;

  r2 = READ_ONCE(*x0);
  if (r2 >= 0)
    WRITE_ONCE(*x1, 2);
}

exists (1:r2=2 /\ 0:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Load-Buffering Control-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
\end{listing}
Listing~\ref{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
(\path{C-LB+o-r+o-ctrl-o.litmus})
shows another load-buffering example, this time using a control
dependency (line~19) to order the load on line~18 and the store on
line~20.
The ordering is sufficient to prevent the \co{exists} from triggering.
However, control dependencies are even more susceptible to being optimized
out of existence than are data dependencies, and
Section~\ref{sec:memorder:Control-Dependency Restrictions}
describes some of the rules that must be followed in order to prevent
your compiler from breaking your control dependencies.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP+o-r+o-ctrl-o
{
}
P0(int* x0, int* x1) {
WRITE_ONCE(*x0, 2);
smp_store_release(x1, 2);
}
P1(int* x0, int* x1) {
int r2;
int r3 = 0;
r2 = READ_ONCE(*x1);
if (r2 >= 0)
r3 = READ_ONCE(*x0);
}
exists (1:r2=2 /\ 1:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Message-Passing Control-Dependency Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
\end{listing}
It is worth reiterating that control dependencies provide ordering only
from loads to stores.
Therefore, the load-to-load control dependency shown on lines~17--19 of
Listing~\ref{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
(\path{C-MP+o-r+o-ctrl-o.litmus})
does \emph{not} provide ordering, and therefore does \emph{not}
prevent the \co{exists} clause from triggering.
In summary, control dependencies can be useful, but they are
high-maintenance items.
You should therefore use them only when performance considerations
permit no other solution.
\QuickQuiz{}
Wouldn't control dependencies be more robust if they were
mandated by language standards???
\QuickQuizAnswer{
In the fullness of time, perhaps they will be so mandated.
} \QuickQuizEnd
\subsection{Cache Coherence}
\label{sec:memorder:Cache Coherence}
On cache-coherent platforms, all CPUs agree on the order of loads and
stores to a given variable.
Fortunately, when \co{READ_ONCE()} and \co{WRITE_ONCE()} are used,
almost all platforms are cache-coherent, as indicated by the ``SV''
column of the cheat sheet shown in
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
Unfortunately, this property is so popular that it has been named
multiple times, with ``single-variable SC'',\footnote{
Recall that SC stands for sequentially consistent.}
``single-copy atomic''~\cite{Stone:1995:SP:623262.623912},
and just plain ``coherence''~\cite{JadeAlglave2011ppcmem}
having seen use.
Rather than further compound the confusion by inventing yet another term
for this concept, this book uses ``cache coherence'' and ``coherence''
interchangeably.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-CCIRIW+o+o+o-o+o-o
{
int x = 0;
}
P0(int *x)
{
WRITE_ONCE(*x, 1);
}
P1(int *x)
{
WRITE_ONCE(*x, 2);
}
P2(int *x)
{
int r1;
int r2;
r1 = READ_ONCE(*x);
r2 = READ_ONCE(*x);
}
P3(int *x)
{
int r3;
int r4;
r3 = READ_ONCE(*x);
r4 = READ_ONCE(*x);
}
exists(2:r1=1 /\ 2:r2=2 /\ 3:r3=2 /\ 3:r4=1)
\end{verbbox}
}
\centering
\theverbbox
\caption{Cache-Coherent IRIW Litmus Test}
\label{lst:memorder:Cache-Coherent IRIW Litmus Test}
\end{listing}
Listing~\ref{lst:memorder:Cache-Coherent IRIW Litmus Test}
(\path{C-CCIRIW+o+o+o-o+o-o.litmus})
shows a litmus test that tests for cache coherence,
where ``IRIW'' stands
for ``independent reads of independent writes''.
Because this litmus test uses only one variable,
\co{P2()} and \co{P3()} must agree
on the order of \co{P0()}'s and \co{P1()}'s stores.
In other words, if \co{P2()} believes that \co{P0()}'s store
came first, then \co{P3()} had better not believe that
\co{P1()}'s store came first.
And in fact the \co{exists} clause on line~35 will trigger if this
situation arises.
It is tempting to speculate that different-sized overlapping loads
and stores to a single region of memory (as might be set up using
the C-language \co{union} keyword) would provide similar ordering
guarantees.
However, Flur et al.~discovered some surprisingly simple
litmus tests that demonstrate that these guarantees can be violated on
real hardware~\cite{Flur:2017:MCA:3093333.3009839}.
It is therefore necessary to restrict code to non-overlapping
same-sized aligned accesses to a given variable, at least if portability
is a consideration.\footnote{
There is reason to believe that using atomic RMW operations
(for example, \co{xchg()}) for all the stores will
provide sequentially consistent ordering, but this has not
yet been proven either way.}
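For example, the following sketch (the \co{union} and variable names are
hypothetical) shows the sort of mixed-size overlapping access that
Flur et al.'s results caution against:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 union u {
  2   unsigned int wide;      /* Often 32 bits... */
  3   unsigned short half[2]; /* ...overlapping two 16-bit halves. */
  4 };
  5 union u shared;
  6
  7 WRITE_ONCE(shared.half[0], 1); /* Small store... */
  8 r1 = READ_ONCE(shared.wide);   /* ...overlapping wider load. */
\end{verbatim}
\end{minipage}
\vspace{5pt}

Sticking to same-sized aligned accesses of \co{shared.wide} (or,
alternatively, of the individual halves) sidesteps the problem.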
Adding more variables and threads increases the scope for reordering
and other counterintuitive behavior, as discussed in the next section.
\subsection{Multicopy Atomicity}
\label{sec:memorder:Multicopy Atomicity}
\begin{figure}[tb]
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/SystemArchBus}}
\caption{Global System Bus And Multi-Copy Atomicity}
\label{fig:memorder:Global System Bus And Multi-Copy Atomicity}
\end{figure}
Threads running on a \emph{multicopy atomic}~\cite{Stone:1995:SP:623262.623912}
platform are guaranteed
to agree on the order of stores, even to different variables.
A useful mental model of such a system is the single-bus architecture
shown in
Figure~\ref{fig:memorder:Global System Bus And Multi-Copy Atomicity}.
If each store resulted in a message on the bus, and if the bus could
accommodate only one store at a time, then any pair of CPUs would
agree on the order of all stores that they observed.
Unfortunately, building a computer system as shown in the figure,
without store buffers or even caches, would result in glacial computation.
CPU vendors interested in providing multicopy atomicity have therefore
instead provided the slightly weaker
\emph{other-multicopy atomicity}~\cite[Section B2.3]{ARMv8A:2017},
which excludes the CPU doing a given store from the requirement that all
CPUs agree on the order of all stores.
This means that if only a subset of CPUs are doing stores, the
other CPUs will agree on the order of stores, hence the ``other''
in ``other-multicopy atomicity''.
In contrast with multicopy-atomic platforms, other-multicopy-atomic
platforms permit the CPU doing a given store to observe that store early,
which allows its later loads to obtain the newly stored value directly
from the store buffer.
This in turn avoids abysmal performance.
\QuickQuiz{}
Can you give a specific example showing different behavior for
multicopy atomic on the one hand and other-multicopy atomic
on the other?
\QuickQuizAnswer{
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-MP-OMCA+o-o-o+o-rmb-o
{
}
P0(int *x, int *y)
{
int r0;
WRITE_ONCE(*x, 1);
r0 = READ_ONCE(*x);
WRITE_ONCE(*y, r0);
}
P1(int *x, int *y)
{
int r1;
int r2;
r1 = READ_ONCE(*y);
smp_rmb();
r2 = READ_ONCE(*x);
}
exists (1:r1=1 /\ 1:r2=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\label{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\end{listing}
Listing~\ref{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
(\path{C-MP-OMCA+o-o-o+o-rmb-o.litmus})
shows such a test.
On a multicopy-atomic platform, \co{P0()}'s store to \co{x} on
line~10 must become visible to both \co{P0()} and \co{P1()}
simultaneously.
Because this store becomes visible to \co{P0()} on line~11, before
\co{P0()}'s store to \co{y} on line~12, \co{P0()}'s store to
\co{x} must become visible before its store to \co{y} everywhere,
including \co{P1()}.
Therefore, if \co{P1()}'s load from \co{y} on line~20 returns the
value 1, so must its load from \co{x} on line~22, given that
the \co{smp_rmb()} on line~21 forces these two loads to execute
in order.
Therefore, the \co{exists} clause on line~25 cannot trigger on a
multicopy-atomic platform.
In contrast, on an other-multicopy-atomic platform, \co{P0()}
could see its own store early, so that there would be no constraint
on the order in which the two stores become visible to \co{P1()},
which in turn allows the \co{exists} clause to trigger.
} \QuickQuizEnd
Perhaps there will come a day when all platforms provide some flavor
of multi-copy atomicity, but
in the meantime, non-multicopy-atomic platforms do exist, and so software
does need to deal with them.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-WRC+o+o-data-o+o-rmb-o
{
}
P0(int *x)
{
WRITE_ONCE(*x, 1);
}
P1(int *x, int* y)
{
int r1;
r1 = READ_ONCE(*x);
WRITE_ONCE(*y, r1);
}
P2(int *x, int* y)
{
int r2;
int r3;
r2 = READ_ONCE(*y);
smp_rmb();
r3 = READ_ONCE(*x);
}
exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{WRC Litmus Test With Dependencies (No Ordering)}
\label{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
\end{listing}
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
(\path{C-WRC+o+o-data-o+o-rmb-o.litmus})
demonstrates multicopy atomicity, that is, on a multicopy-atomic platform,
the \co{exists} clause on line~29 cannot trigger.
In contrast, on a non-multicopy-atomic
platform this \co{exists} clause can trigger, despite
\co{P1()}'s accesses being ordered by a data dependency and \co{P2()}'s
accesses being ordered by an \co{smp_rmb()}.
Recall that the definition of multicopy atomicity requires that all
threads agree on the order of stores, which can be thought of as
all stores reaching all threads at the same time.
Therefore, a non-multicopy-atomic platform can have a store reach
different threads at different times.
In particular, \co{P0()}'s store might reach \co{P1()} long before it
reaches \co{P2()}, which raises the possibility that \co{P1()}'s store
might reach \co{P2()} before \co{P0()}'s store does.
\begin{figure}[tb]
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/NonMCAplatform}}
\caption{Shared Store Buffers And Multi-Copy Atomicity}
\label{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}
\end{figure}
This leads to the question of why a real system constrained by the
usual laws of physics would ever trigger the \co{exists} clause of
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}.
A cartoonish diagram of such a real system is shown in
Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
CPU~0 and CPU~1 share a store buffer, as do CPUs~2 and~3.
This means that CPU~1 can load a value out of the store buffer, thus
potentially immediately seeing a value stored by CPU~0.
In contrast, CPUs~2 and~3 will have to wait for the corresponding cache
line to carry this new value to them.
\QuickQuiz{}
Then who would even \emph{think} of designing a system with shared
store buffers???
\QuickQuizAnswer{
This is in fact a very natural design for any system having
multiple hardware threads per core.
Natural from a hardware point of view, that is!
} \QuickQuizEnd
\begin{table*}
\small
\centering\OneColumnHSpace{-0.8in}
\begin{tabular}{r||l|l|l|l||l|l|l}
& \tco{P0()} & \multicolumn{2}{c|}{\tco{P0()} \& \tco{P1()}} & \tco{P1()} & \multicolumn{3}{c}{\tco{P2()}} \\
\cline{2-8}
& Instruction & Store Buffer & Cache & Instruction &
Instruction & Store Buffer & Cache \\
\hline
\hline
1 & (Initial state) & & \tco{y==0} &
(Initial state) &
(Initial state) & & \tco{x==0} \\
\hline
2 & \tco{x = 1;} & \tco{x==1} & \tco{y==0} &
& & & \tco{x==0} \\
\hline
3 & (Read-Invalidate \tco{x}) & \tco{x==1} & \tco{y==0} & \tco{r1 = x} (1)
& & & \tco{x==0} \\
\hline
4 & & \tco{x==1} \tco{y==1} & \tco{y==0} & \tco{y = r1}
& \tco{r2 = y} & & \tco{x==0} \\
\hline
5 & & \tco{x==1} & \tco{y==1} & (Finish store)
& (Read \tco{y}) & & \tco{x==0} \\
\hline
6 & (Respond \tco{y}) & \tco{x==1} & \tco{y==1} &
& (\tco{r2==1}) & & \tco{x==0} \tco{y==1} \\
\hline
7 & & \tco{x==1} & \tco{y==1} &
& \tco{smp_rmb()} & & \tco{x==0} \tco{y==1} \\
\hline
8 & & \tco{x==1} & \tco{y==1} &
& \tco{r3 = x} (0) & & \tco{x==0} \tco{y==1} \\
\hline
9 & & \tco{x==1} & \tco{x==0} \tco{y==1} &
& (Respond \tco{x}) & & \tco{y==1} \\
\hline
10 & (Finish store) & & \tco{x==1} \tco{y==1} &
& & & \tco{y==1} \\
\end{tabular}
\caption{Memory Ordering: WRC Sequence of Events}
\label{tab:memorder:Memory Ordering: WRC Sequence of Events}
\end{table*}
Table~\ref{tab:memorder:Memory Ordering: WRC Sequence of Events}
shows one sequence of events that can result in the \co{exists} clause in
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
triggering.
This sequence of events will depend critically on \co{P0()} and
\co{P1()} sharing both a cache and a store buffer in the manner shown in
Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
\QuickQuiz{}
But just how is it fair that P0 and P1 must share a store buffer
and a cache, but P2 gets one each of its very own???
\QuickQuizAnswer{
Presumably there is a P3, as is in fact shown in
Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity},
that shares P2's store buffer and cache.
But not necessarily.
Some platforms allow different cores to disable different numbers
of threads, allowing the hardware to adjust to the needs of the
workload at hand.
For example, a single-threaded critical-path portion of the workload
might be assigned to a core with only one thread enabled, thus
allowing the single thread running that portion of the workload
to use the entire capabilities of that core.
Other more highly parallel but cache-miss-prone portions of the
workload might be assigned to cores with all hardware threads
enabled to provide improved throughput.
This improved throughput could be due to the fact that while one
hardware thread is stalled on a cache miss, the other hardware
threads can make forward progress.
In such cases, performance requirements override quaint human
notions of fairness.
} \QuickQuizEnd
Row~1 shows the initial state, with the initial value of \co{y} in
P0's and P1's shared cache, and the initial value of \co{x} in
P2's cache.
Row~2 shows the immediate effect of \co{P0()} executing its store on line~8.
Because the cacheline containing \co{x} is not in \co{P0()}'s and \co{P1()}'s
shared cache, the new value (\co{1}) is stored in the shared store buffer.
Row~3 shows two transitions.
First, \co{P0()} issues a read-invalidate operation to fetch the cacheline
containing \co{x} so that it can flush the new value for \co{x} out of
the shared store buffer.
Second, \co{P1()} loads from \co{x} (line~15), an operation that completes
immediately because the new value of \co{x} is immediately available
from the shared store buffer.
Row~4 also shows two transitions.
First, it shows the immediate effect of \co{P1()} executing its store to
\co{y} (line~16), placing the new value into the shared store buffer.
Second, it shows the start of \co{P2()}'s load from \co{y} (line~24).
Row~5 continues the tradition of showing two transitions.
First, it shows \co{P1()} complete its store to \co{y}, flushing
from the shared store buffer to the cache.
Second, it shows \co{P2()} request the cacheline containing \co{y}.
Row~6 shows \co{P2()} receive the cacheline containing \co{y}, allowing
it to finish its load into \co{r2}, which takes on the value \co{1}.
Row~7 shows \co{P2()} execute its \co{smp_rmb()} (line~25), thus keeping
its two loads ordered.
Row~8 shows \co{P2()} execute its load from \co{x}, which immediately
returns with the value zero from \co{P2()}'s cache.
Row~9 shows \co{P2()} \emph{finally} responding to \co{P0()}'s request for
the cacheline containing \co{x}, which was made way back up on row~3.
Finally, row~10 shows \co{P0()} finish its store, flushing its value of
\co{x} from the shared store buffer to the shared cache.
Note well that the \co{exists} clause on line~29 has triggered.
Both \co{r1} and \co{r2} end up with the value one, and
\co{r3} ends up with the value zero.
This strange result occurred because \co{P0()}'s new value of \co{x} was
communicated to \co{P1()} long before it was communicated to \co{P2()}.
\QuickQuiz{}
Referring to
Table~\ref{tab:memorder:Memory Ordering: WRC Sequence of Events},
why on earth would \co{P0()}'s store take so long to complete when
\co{P1()}'s store completes so quickly?
In other words, does the \co{exists} clause on line~29 of
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
really trigger on real systems?
\QuickQuizAnswer{
You need to face the fact that it really can trigger.
Akira Yokosawa used the \co{litmus7} tool to run this litmus test
on a Power8 system.
Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause.
Thus, triggering the \co{exists} clause is not merely a one-in-a-million
occurrence, but rather a one-in-a-hundred-million occurrence.
But it nevertheless really does trigger on real systems.
} \QuickQuizEnd
This counter-intuitive result happens because although dependencies
do provide ordering, they provide it only within the confines of their
own thread.
This three-thread example requires stronger ordering, which
is the subject of
Sections~\ref{sec:memorder:Cumulativity}
through~\ref{sec:memorder:Release-Acquire Chains}.
\subsubsection{Cumulativity}
\label{sec:memorder:Cumulativity}
The three-thread example shown in
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
requires \emph{cumulative} ordering, or \emph{cumulativity}.
A cumulative memory-ordering operation orders not just any given
access preceding it, but also earlier accesses by any thread to that
same variable.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-WRC+o+o-r+a-o
{
}
P0(int *x)
{
WRITE_ONCE(*x, 1);
}
P1(int *x, int* y)
{
int r1;
r1 = READ_ONCE(*x);
smp_store_release(y, r1);
}
P2(int *x, int* y)
{
int r2;
int r3;
r2 = smp_load_acquire(y);
r3 = READ_ONCE(*x);
}
exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{WRC Litmus Test With Release}
\label{lst:memorder:WRC Litmus Test With Release}
\end{listing}
Dependencies do not provide cumulativity,
which is why the ``C'' column is blank for
both the \co{READ_ONCE()} and the \co{smp_read_barrier_depends()} rows
of Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
However, as indicated by the ``C'' in their ``C'' column,
release operations do provide cumulativity.
Therefore,
Listing~\ref{lst:memorder:WRC Litmus Test With Release}
(\path{C-WRC+o+o-r+a-o.litmus})
substitutes a release operation for
Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}'s
data dependency.
Because the release operation is cumulative, its ordering applies not only to
Listing~\ref{lst:memorder:WRC Litmus Test With Release}'s
load from \co{x} by \co{P1()} on line~15, but also to the store to \co{x}
by \co{P0()} on line~8---but only if that load returns the value stored,
which matches the \co{1:r1=1} in the \co{exists} clause on line~28.
This means that \co{P2()}'s load-acquire suffices to force the
load from \co{x} on line~25 to happen after the store on line~8, so
the value returned is one, which does not match \co{2:r3=0}, which
in turn prevents the \co{exists} clause from triggering.
\begin{figure*}[htbp]
\centering
\includegraphics{memorder/memorybarriercum}
\caption{Cumulativity}
\label{fig:memorder:Cumulativity}
\end{figure*}
These ordering constraints are depicted graphically in
Figure~\ref{fig:memorder:Cumulativity}.
Note also that cumulativity is not limited to a single step back in time.
If there were another load from \co{x} or store to \co{x} from any thread
that came before the store on line~8, that prior load or store would also
be ordered before \co{P1()}'s store on line~16, though only if \co{r1}
and \co{r2} both end up containing the value stored on line~8.
In short, use of cumulative ordering operations can suppress
non-multicopy-atomic behaviors in some situations.
Cumulativity nevertheless has limits, which are examined in the next section.
\subsubsection{Propagation}
\label{sec:memorder:Propagation}
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-W+RWC+o-r+a-o+o-mb-o
{
int x = 0;
int y = 0;
int z = 0;
}
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
smp_store_release(y, 1);
}
P1(int *y, int *z)
{
int r1;
int r2;
r1 = smp_load_acquire(y);
r2 = READ_ONCE(*z);
}
P2(int *z, int *x)
{
int r3;
WRITE_ONCE(*z, 1);
smp_mb();
r3 = READ_ONCE(*x);
}
exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{W+RWC Litmus Test With Release (No Ordering)}
\label{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
\end{listing}
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
(\path{C-W+RWC+o-r+a-o+o-mb-o.litmus})
shows the limitations of cumulativity and of store-release,
even with a full memory barrier helping out.
The problem is that although the \co{smp_store_release()} on
line~12 has cumulativity, and although that cumulativity does
order \co{P2()}'s load on line~30, the \co{smp_store_release()}'s
ordering cannot propagate through the combination of \co{P1()}'s
load (line~21) and \co{P2()}'s store (line~28).
This means that the \co{exists} clause on line~33 really can trigger.
\QuickQuiz{}
But it is not necessary to worry about propagation unless
there are at least three threads in the litmus test, right?
\QuickQuizAnswer{
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-R+o-wmb-o+o-mb-o
{
}
P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 1);
smp_wmb();
WRITE_ONCE(*x1, 1);
}
P1(int *x0, int *x1)
{
int r2;
WRITE_ONCE(*x1, 2);
smp_mb();
r2 = READ_ONCE(*x0);
}
exists (1:r2=0 /\ x1=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{R Litmus Test With Write Memory Barrier (No Ordering)}
\label{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
\end{listing}
Wrong.
Listing~\ref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
(\path{C-R+o-wmb-o+o-mb-o.litmus})
shows a two-thread litmus test that requires propagation due to
the fact that it only has store-to-store and load-to-store
links between its pair of threads.
Even though \co{P0()} is fully ordered by the \co{smp_wmb()} and
\co{P1()} is fully ordered by the \co{smp_mb()}, the
counter-temporal nature of the links means that
the \co{exists} clause on line~22 really can trigger.
To prevent this triggering, the \co{smp_wmb()} on line~8
must become an \co{smp_mb()}, bringing propagation into play
twice, once for each non-temporal link.
} \QuickQuizEnd
\begin{figure}[htbp]
\centering
\resizebox{\columnwidth}{!}{\includegraphics{memorder/fr}}
\caption{Load-to-Store is Counter-Temporal}
\label{fig:memorder:Load-to-Store is Counter-Temporal}
\end{figure}
This situation might seem completely counter-intuitive, but keep
in mind that the speed of light is finite and computers are of
non-zero size.
It therefore takes time for the effect of \co{P2()}'s store to
\co{z} to propagate to \co{P1()}, which in turn means that it is possible
that \co{P1()}'s read from \co{z} happens much later in time, but
nevertheless still sees the old value of zero.
This situation is depicted in
Figure~\ref{fig:memorder:Load-to-Store is Counter-Temporal}:
Just because a load sees the old value does \emph{not} mean that
this load executed at an earlier time than did the store of the
new value.
Note that
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
also shows the limitations of memory-barrier pairing, given that
there are not two but three processes.
These more complex litmus tests can instead be said to have \emph{cycles},
where memory-barrier pairing is the special case of a two-thread cycle.
The cycle in
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
goes through \co{P0()} (lines~11 and~12), \co{P1()} (lines~20 and~21),
\co{P2()} (lines~28, 29, and~30), and back to \co{P0()} (line~11).
The \co{exists} clause delineates this cycle:
the \co{1:r1=1} indicates that the \co{smp_load_acquire()} on line~20
returned the value stored by the \co{smp_store_release()} on line~12,
the \co{1:r2=0} indicates that the \co{WRITE_ONCE()} on line~28 came
too late to affect the value returned by the \co{READ_ONCE()} on line~21,
and finally the \co{2:r3=0} indicates that the
\co{WRITE_ONCE()} on line~11 came too late to affect the value returned
by the \co{READ_ONCE()} on line~30.
In this case, the fact that the \co{exists} clause can trigger means that
the cycle is said to be \emph{allowed}.
In contrast, in cases where the \co{exists} clause cannot trigger,
the cycle is said to be \emph{prohibited}.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-W+RWC+o-mb-o+a-o+o-mb-o
{
int x = 0;
int y = 0;
int z = 0;
}
P0(int *x, int *y)
{
WRITE_ONCE(*x, 1);
smp_mb();
WRITE_ONCE(*y, 1);
}
P1(int *y, int *z)
{
int r1;
int r2;
r1 = smp_load_acquire(y);
r2 = READ_ONCE(*z);
}
P2(int *z, int *x)
{
int r3;
WRITE_ONCE(*z, 1);
smp_mb();
r3 = READ_ONCE(*x);
}
exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{W+RWC Litmus Test With More Barriers}
\label{lst:memorder:W+WRC Litmus Test With More Barriers}
\end{listing}
But what if we need to prevent the \co{exists} clause on line~33 of
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
from triggering?
One solution is to replace \co{P0()}'s \co{smp_store_release()}
with an \co{smp_mb()}, which
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
shows to have not only cumulativity, but also propagation.
The result is shown in
Listing~\ref{lst:memorder:W+WRC Litmus Test With More Barriers}
(\path{C-W+RWC+o-mb-o+a-o+o-mb-o.litmus}).
\QuickQuiz{}
But given that \co{smp_mb()} has the propagation property,
why doesn't the \co{smp_mb()} on line~29 of
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
prevent the \co{exists} clause from triggering?
\QuickQuizAnswer{
As a rough rule of thumb, the \co{smp_mb()} barrier's
propagation property is sufficient to maintain ordering
through only one store-to-load link between
processes.
Unfortunately,
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
has not one but two store-to-load links, with the
first being from the \co{READ_ONCE()} on line~21 to the
\co{WRITE_ONCE()} on line~28 and the second being from
the \co{READ_ONCE()} on line~30 to the \co{WRITE_ONCE()}
on line~11.
Therefore, preventing the \co{exists} clause from triggering
should be expected to require not one but two
instances of \co{smp_mb()}.
As a special exception to this rule of thumb, a release-acquire
chain can have one load-to-store link between processes
and still prohibit the cycle.
} \QuickQuizEnd
\begin{figure}[tbp]
\centering
\resizebox{\columnwidth}{!}{\includegraphics{memorder/co}}
\caption{Store-to-Store is Counter-Temporal}
\label{fig:memorder:Store-to-Store is Counter-Temporal}
\end{figure}
For completeness,
Figure~\ref{fig:memorder:Store-to-Store is Counter-Temporal}
shows that the ``winning'' store among a group of stores to the
same variable is not necessarily the store that started last.
This should not come as a surprise to anyone who carefully examined
Figure~\ref{fig:memorder:A Variable With More Simultaneous Values}.
But sometimes time is on our side, as shown in the next section.
\subsubsection{Happens-Before}
\label{sec:memorder:Happens-Before}
\begin{figure}[tbp]
\centering
\resizebox{\columnwidth}{!}{\includegraphics{memorder/rf}}
\caption{Store-to-Load is Temporal}
\label{fig:memorder:Store-to-Load is Temporal}
\end{figure}
As shown in
Figure~\ref{fig:memorder:Store-to-Load is Temporal},
on non-speculative platforms, if a load returns the value from a
particular store, then, courtesy of the finite speed of light and
the non-zero size of modern computing systems, the store absolutely
has to have executed at an earlier time than did the load.
This means that carefully constructed programs can rely on the
passage of time itself as a memory-ordering operation.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+a-o+o-data-o+o-data-o
{
}
P0(int *x0, int *x1)
{
int r2;
r2 = smp_load_acquire(x0);
WRITE_ONCE(*x1, 2);
}
P1(int *x1, int *x2)
{
int r2;
r2 = READ_ONCE(*x1);
WRITE_ONCE(*x2, r2);
}
P2(int *x2, int *x0)
{
int r2;
r2 = READ_ONCE(*x2);
WRITE_ONCE(*x0, r2);
}
exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{LB Litmus Test With One Acquire}
\label{lst:memorder:LB Litmus Test With One Acquire}
\end{listing}
Of course, just the passage of time by itself is not enough, as
was seen in
Listing~\ref{lst:memorder:Load-Buffering Litmus Test (No Ordering)},
which has nothing but store-to-load links and still can
trigger its \co{exists} clause.
However, as long as each thread provides even the weakest possible
ordering, the \co{exists} clause cannot trigger.
For example,
Listing~\ref{lst:memorder:LB Litmus Test With One Acquire}
(\path{C-LB+a-o+o-data-o+o-data-o.litmus})
shows \co{P0()} ordered with an \co{smp_load_acquire()} and
both \co{P1()} and \co{P2()} ordered with data dependencies.
These orderings, which are close to the top of
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet},
suffice to prevent the \co{exists} clause from triggering.
\QuickQuiz{}
Can you construct a litmus test similar to that in
Listing~\ref{lst:memorder:LB Litmus Test With One Acquire}
that uses \emph{only} dependencies?
\QuickQuizAnswer{
Listing~\ref{lst:memorder:LB Litmus Test With No Acquires}
shows a somewhat nonsensical but very real example.
Creating a more useful (but still real) litmus test is left
as an exercise for the reader.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+o-data-o+o-data-o+o-data-o
{
int x0=0;
int x1=1;
int x2=2;
}
P0(int *x0, int *x1)
{
int r2;
r2 = READ_ONCE(*x0);
WRITE_ONCE(*x1, r2);
}
P1(int *x1, int *x2)
{
int r2;
r2 = READ_ONCE(*x1);
WRITE_ONCE(*x2, r2);
}
P2(int *x2, int *x0)
{
int r2;
r2 = READ_ONCE(*x2);
WRITE_ONCE(*x0, r2);
}
exists (0:r2=2 /\ 1:r2=0 /\ 2:r2=1)
\end{verbbox}
}
\centering
\theverbbox
\caption{LB Litmus Test With No Acquires}
\label{lst:memorder:LB Litmus Test With No Acquires}
\end{listing}
} \QuickQuizEnd
An important, to say nothing of more useful, use of time for ordering
memory accesses is covered in the next section.
\subsubsection{Release-Acquire Chains}
\label{sec:memorder:Release-Acquire Chains}
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-LB+a-r+a-r+a-r+a-r
{
}
P0(int *x0, int *x1)
{
int r2;
r2 = smp_load_acquire(x0);
smp_store_release(x1, 2);
}
P1(int *x1, int *x2)
{
int r2;
r2 = smp_load_acquire(x1);
smp_store_release(x2, 2);
}
P2(int *x2, int *x3)
{
int r2;
r2 = smp_load_acquire(x2);
smp_store_release(x3, 2);
}
P3(int *x3, int *x0)
{
int r2;
r2 = smp_load_acquire(x3);
smp_store_release(x0, 2);
}
exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2 /\ 3:r2=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Long LB Release-Acquire Chain}
\label{lst:memorder:Long LB Release-Acquire Chain}
\end{listing}
A minimal release-acquire chain was shown in
Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
but these chains can be much longer, as shown in
Listing~\ref{lst:memorder:Long LB Release-Acquire Chain}
(\path{C-LB+a-r+a-r+a-r+a-r.litmus}).
The longer the release-acquire chain, the more benefit is gained
from the passage of time, so that no matter how many threads are
involved, the corresponding \co{exists} clause cannot trigger.
Give or take the increased computational demands of the larger number
of threads, of course!
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-ISA2+o-r+a-r+a-r+a-o
{
}
P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 2);
smp_store_release(x1, 2);
}
P1(int *x1, int *x2)
{
int r2;
r2 = smp_load_acquire(x1);
smp_store_release(x2, 2);
}
P2(int *x2, int *x3)
{
int r2;
r2 = smp_load_acquire(x2);
smp_store_release(x3, 2);
}
P3(int *x3, int *x0)
{
int r1;
int r2;
r1 = smp_load_acquire(x3);
r2 = READ_ONCE(*x0);
}
exists (1:r2=2 /\ 2:r2=2 /\ 3:r1=2 /\ 3:r2=0)
\end{verbbox}
}
\centering
\theverbbox
\caption{Long ISA2 Release-Acquire Chain}
\label{lst:memorder:Long ISA2 Release-Acquire Chain}
\end{listing}
Although release-acquire chains are inherently store-to-load creatures,
it turns out that they can tolerate one load-to-store step, despite
such steps being counter-temporal, as shown in
Figure~\ref{fig:memorder:Load-to-Store is Counter-Temporal}.
For example,
Listing~\ref{lst:memorder:Long ISA2 Release-Acquire Chain}
(\path{C-ISA2+o-r+a-r+a-r+a-o.litmus})
shows a three-step release-acquire chain, but where \co{P3()}'s
final access is a \co{READ_ONCE()} from \co{x0}, which is
accessed via \co{WRITE_ONCE()} by \co{P0()}, forming a non-temporal
load-to-store link between these two processes.
However, because \co{P0()}'s \co{smp_store_release()} (line~12)
is cumulative, if \co{P3()}'s \co{READ_ONCE()} returns zero,
this cumulativity will force the \co{READ_ONCE()} to be ordered
before \co{P0()}'s \co{smp_store_release()}.
In addition, the release-acquire chain (lines~12, 20, 21, 28, 29, and~37)
forces \co{P3()}'s \co{READ_ONCE()} to be ordered after \co{P0()}'s
\co{smp_store_release()}.
Because \co{P3()}'s \co{READ_ONCE()} cannot be both before and after
\co{P0()}'s \co{smp_store_release()}, either or both of two things must
be true:
\begin{enumerate}
\item \co{P3()}'s \co{READ_ONCE()} came after \co{P0()}'s
\co{WRITE_ONCE()}, so that the \co{READ_ONCE()} returned
the value two, so that the \co{exists} clause's \co{3:r2=0}
is false.
\item The release-acquire chain did not form, that is, one or more
of the \co{exists} clause's \co{1:r2=2}, \co{2:r2=2}, or \co{3:r1=2}
is false.
\end{enumerate}
Either way, the \co{exists} clause cannot trigger, despite this litmus
test containing a notorious load-to-store link between
\co{P3()} and \co{P0()}.
But never forget that release-acquire chains can tolerate only one
load-to-store link, as was seen in
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
C C-Z6.2+o-r+a-r+a-r+a-o
{
}
P0(int *x0, int *x1)
{
WRITE_ONCE(*x0, 2);
smp_store_release(x1, 2);
}
P1(int *x1, int *x2)
{
int r2;
r2 = smp_load_acquire(x1);
smp_store_release(x2, 2);
}
P2(int *x2, int *x3)
{
int r2;
r2 = smp_load_acquire(x2);
smp_store_release(x3, 2);
}
P3(int *x3, int *x0)
{
int r2;
r2 = smp_load_acquire(x3);
WRITE_ONCE(*x0, 3);
}
exists (1:r2=2 /\ 2:r2=2 /\ 3:r2=2 /\ x0=2)
\end{verbbox}
}
\centering
\theverbbox
\caption{Long Z6.2 Release-Acquire Chain}
\label{lst:memorder:Long Z6.2 Release-Acquire Chain}
\end{listing}
Release-acquire chains can also tolerate a single store-to-store step,
as shown in
Listing~\ref{lst:memorder:Long Z6.2 Release-Acquire Chain}
(\path{C-Z6.2+o-r+a-r+a-r+a-o.litmus}).
As with the previous example, \co{smp_store_release()}'s cumulativity
combined with the temporal nature of the release-acquire chain
prevents the \co{exists} clause on line~36 from triggering.
But beware: Adding a second store-to-store step would allow the correspondingly
updated \co{exists} clause to trigger.
% @@@ QQ on having a release-acquire chain with one load-to-store and
% @@@ one store-to-store link.
\QuickQuiz{}
There are store-to-load links, load-to-store links, and
store-to-store links.
But what about load-to-load links?
\QuickQuizAnswer{
The problem with the concept of load-to-load links is that
if the two loads from the same variable return the same
value, there is no way to determine their ordering.
The only way to determine their ordering is if they return
different values, in which case there had to have been an
intervening store.
And that intervening store means that there is no load-to-load
link, but rather a load-to-store link followed by a
store-to-load link.
} \QuickQuizEnd
In short, properly constructed release-acquire chains form a peaceful
island of intuitive bliss surrounded by a strongly counter-intuitive
sea of more complex memory-ordering constraints.
% Exercises?
% Hardware details from Appendix?
\section{Compile-Time Consternation}
\label{sec:memorder:Compile-Time Consternation}
Most languages, including C, were developed on uniprocessor systems
by people with little or no parallel-programming experience.
As a result, unless explicitly told otherwise, these languages assume
that nothing else is reading or writing memory.
This in turn means that these languages' compilers' optimizers
are ready, willing, and oh so able to make dramatic changes to the
order, number, and sizes of memory references that your program
executes.
In fact, the reordering carried out by hardware can seem quite tame
by comparison.
This section will help you tame your compiler, thus avoiding a great
deal of compile-time consternation.
Section~\ref{sec:memorder:Memory-Reference Restrictions}
describes how to keep the compiler from destructively optimizing
your code's memory references,
Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}
describes how to protect address and data dependencies,
and finally,
Section~\ref{sec:memorder:Control-Dependency Restrictions}
describes how to protect those delicate control dependencies.
\subsection{Memory-Reference Restrictions}
\label{sec:memorder:Memory-Reference Restrictions}
Again, unless told otherwise, compilers assume that nothing else
is affecting the variables being accessed by the generated code.
This assumption is not simply some design error, but is instead enshrined in
various standards.\footnote{
Or perhaps it is a standardized design error.}
This assumption means that compilers are within their rights
(as defined by the standards) to optimize the following code
so as to hoist the load from \co{a} out of the loop, at least
in cases where the compiler can prove that \co{do_something()}
does not modify \co{a}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 while (a)
2 do_something();
\end{verbatim}
\end{minipage}
\vspace{5pt}
The optimized code might look something like this, essentially
fusing an arbitrarily large number of intended loads into a single
actual load:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 if (a)
2 for (;;)
3 do_something();
\end{verbatim}
\end{minipage}
\vspace{5pt}
This optimization might come as a fatal surprise to code elsewhere
that expected to terminate this loop by storing a zero to \co{a}.
Prevent this by using \co{READ_ONCE()}, for example, as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 while (READ_ONCE(a))
2 do_something();
\end{verbatim}
\end{minipage}
\vspace{5pt}
\co{READ_ONCE()} marks the load with a \co{volatile} cast.
Now \co{volatile} was originally designed for accessing
memory-mapped I/O (MMIO) registers, which are accessed using
the same load and store instructions that are used when
accessing normal memory.
However, MMIO registers need not act at all like normal memory.
Storing a value to an MMIO register does not necessarily mean
that a subsequent load from that register will return the value
stored.
Loading from an MMIO register might well have side effects,
for example, changing the device state or affecting the response
to subsequent loads and stores involving other MMIO registers.
Loads and stores of different sizes to the same MMIO address
might well have different effects.
This means that, even on a uniprocessor system, changing the
order, number, or size of MMIO accesses must be strictly
forbidden.
And this is exactly the purpose of the C-language \co{volatile} keyword,
to constrain the compiler to allow implementation of reliable device
drivers.
This is why \co{READ_ONCE()} prevents the destructive hoisting of
the load from \co{a} out of the loop: Doing so changes the number
of \co{volatile} loads from \co{a}, so this optimization is disallowed.
However, note well that \co{volatile} does absolutely nothing to
constrain the hardware.
Therefore, if the code following the loop needs to see the result of
any memory references preceding the store of zero that terminated
the loop, you will instead need to use something like
\co{smp_store_release()} to store the zero and \co{smp_load_acquire()}
in the loop condition.
But if all you need is to reliably control the loop without any
other ordering, \co{READ_ONCE()} can do the job.
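For example, here is a sketch of the heavier-weight approach, assuming a
hypothetical shared variable \co{result} whose value must be visible to
the code following the loop:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 /* The looping thread: */
  2 while (smp_load_acquire(&a))
  3   do_something();
  4 r1 = result;  /* Guaranteed to see the store on line 7. */
  5
  6 /* The terminating thread: */
  7 result = 1;
  8 smp_store_release(&a, 0);
\end{verbatim}
\end{minipage}
\vspace{5pt}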
Compilers can also replicate loads.
For example, consider this all-too-real code fragment:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 tmp = p;
2 if (tmp != NULL && tmp <= q)
3 do_something(tmp);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here the intent is that the \co{do_something()} function is never
passed a \co{NULL} pointer or a pointer that is greater than \co{q}.
However, the compiler is within its rights to transform this into the
following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 if (p != NULL && p <= q)
2 do_something(p);
\end{verbatim}
\end{minipage}
\vspace{5pt}
In this transformed code, the value of \co{p} is loaded three
separate times.
This transformation might seem silly at first glance, but it is quite
tempting to compilers when the surrounding code has consumed all of the
machine registers.
It is possible that the current value of \co{p} passes the test on
line~1, but that some other thread stores \co{NULL} to \co{p} before
line~2 executes, and the resulting \co{NULL} pointer could be a fatal
surprise to \co{do_something()}.\footnote{
Your editor made exactly this mistake in the DYNIX/ptx
kernel's memory allocator in the early 1990s.
Tracking down the bug consumed a holiday weekend not just
for your editor, but also for several of his colleagues.
In short, this is not a new problem.}
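One way to avoid this particular fatal surprise is to do the load exactly
once, for example, as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 tmp = READ_ONCE(p);
  2 if (tmp != NULL && tmp <= q)
  3   do_something(tmp);
\end{verbatim}
\end{minipage}
\vspace{5pt}

The volatile cast within \co{READ_ONCE()} forbids the compiler from
reloading \co{p}, so the value tested on line~2 is the same value passed
to \co{do_something()} on line~3.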
Compilers can also fuse stores.
The most infamous example is probably the progress-bar example
shown below:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 while (!am_done()) {
2 do_something(p);
3 progress++;
4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
If the compiler used a feedback-driven optimizer, it might well
notice that the store to the shared variable \co{progress} was
quite expensive, resulting in the following well-intentioned
optimization:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 while (!am_done()) {
2 do_something(p);
3 tmp++;
4 }
5 progress = tmp;
\end{verbatim}
\end{minipage}
\vspace{5pt}
This might well slightly increase performance, but the poor user
watching the progress bar might be forgiven for harboring significant
ill will towards this particular optimization.
The progress bar will after all be stuck at zero for a long time,
then jump at the very end.
The following code will usually prevent this problem:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 while (!am_done()) {
2 do_something(p);
3 WRITE_ONCE(progress, progress + 1);
4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Exceptions can occur if the compiler is able to analyze \co{do_something()}
and learn that it has no accesses to atomic or \co{volatile} variables.
In these cases the compiler could produce two loops, one invoking
\co{do_something()} and the other incrementing \co{progress}.
It may be necessary to replace the \co{WRITE_ONCE()} with something
like \co{smp_store_release()} in the unlikely event that this occurs.
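In that unlikely event, the loop might instead be written as follows
(a sketch reusing the same hypothetical \co{progress} counter):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 while (!am_done()) {
  2   do_something(p);
  3   smp_store_release(&progress, progress + 1);
  4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}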
It is important to note that although the compiler is forbidden from
changing the number, size, or order of \co{volatile} accesses, it is
perfectly within its rights to reorder normal accesses with unrelated
\co{volatile} accesses.
Oddly enough, the compiler is within its rights to use a variable
as temporary storage just before a store to that variable, thus
inventing stores to that variable.
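For example, given the following store, a register-starved compiler might
invent a store of an intermediate value (\co{a} is assumed to be shared,
and \co{first_half()} and \co{second_half()} are hypothetical helpers):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 /* Source code: */
  2 a = first_half() + second_half();
  3
  4 /* One possible compiler transformation: */
  5 a = first_half();    /* Invented store: concurrent readers */
  6 a += second_half();  /* can observe the intermediate value. */
\end{verbatim}
\end{minipage}
\vspace{5pt}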
Fortunately, most compilers avoid this sort of thing, at least outside
of stack variables.
Nevertheless, using \co{WRITE_ONCE()} (or declaring the variable
\co{volatile}) should prevent this sort of thing.
But take care: If you have a translation unit that uses that variable,
and never makes a volatile access to it, the compiler has no way of
knowing that it needs to be careful.
The previous examples involved compiler optimizations that
changed the number of accesses.
Now, it might seem that preventing the compiler from changing the
order of accesses is an act of futility, given that the underlying
hardware is free to reorder them.
However, modern machines have \emph{exact exceptions} and
\emph{exact interrupts}, meaning that any interrupt or exception will
appear to have happened at a specific place in the instruction
stream, so that the handler will see the effect of all prior
instructions, but won't see the effect of any subsequent instructions.
\co{READ_ONCE()} and \co{WRITE_ONCE()} can therefore be used to
control communication between interrupted code and interrupt handlers.
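For example, here is a sketch of such communication, assuming hypothetical
variables \co{msg} and \co{msg_ready} that are shared only between a given
CPU's mainline code and that same CPU's interrupt handler:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 /* Mainline code: */
  2 WRITE_ONCE(msg, 42);
  3 WRITE_ONCE(msg_ready, 1);
  4
  5 /* Interrupt handler on the same CPU: */
  6 if (READ_ONCE(msg_ready))
  7   do_something_with(READ_ONCE(msg));
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because the interrupt is exact and because the compiler cannot reorder
these volatile accesses with each other, a handler that observes
\co{msg_ready} set is guaranteed to also observe the value 42 in
\co{msg}, with no memory-barrier instructions required.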
This leaves changes to the size of accesses, which is known as
\emph{load tearing} and \emph{store tearing} when the actual size
is smaller than desired.
For example, storing the constant \co{0x00010002} into a 32-bit
variable might seem quite safe.
However, there are CPUs that can store small immediate values
directly into memory, and on such CPUs, the compiler can be expected
to split this into two 16-bit stores in order to avoid the overhead
of explicitly forming the 32-bit constant.
This could come as a fatal surprise to another thread concurrently
loading from this variable, which might not expect to see the
result of a half-completed store.
Use of \co{READ_ONCE()} and \co{WRITE_ONCE()} prevents the compiler
from engaging in load tearing and store tearing, respectively.
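For example, assuming a shared 32-bit variable \co{x}, the first of the
following stores might be torn on such a CPU, but the second may not be:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 x = 0x00010002;             /* Might become two 16-bit stores. */
  2 WRITE_ONCE(x, 0x00010002);  /* Must remain a single store. */
\end{verbatim}
\end{minipage}
\vspace{5pt}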
In short, \co{READ_ONCE()}, \co{WRITE_ONCE()}, and
\co{volatile} are valuable tools for preventing the compiler from
optimizing your parallel algorithm out of existence.
Compilers are starting to provide other mechanisms for avoiding
load and store tearing, for example, \co{memory_order_relaxed}
atomic loads and stores.
However, \co{volatile} is still needed to avoid fusing and splitting
of accesses.
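For example, a C11 load and store that cannot be torn might be sketched
as follows (the variable name is arbitrary):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 #include <stdatomic.h>
  2
  3 atomic_int counter;
  4 int r1;
  5
  6 atomic_store_explicit(&counter, 1, memory_order_relaxed);
  7 r1 = atomic_load_explicit(&counter, memory_order_relaxed);
\end{verbatim}
\end{minipage}
\vspace{5pt}

These relaxed accesses cannot be torn, but, not being volatile, they may
still be fused by a sufficiently aggressive compiler, as noted above.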
\subsection{Address- and Data-Dependency Restrictions}
\label{sec:memorder:Address- and Data-Dependency Restrictions}
Compilers do not understand either address or data dependencies,
although there are efforts underway to teach them, or at the very
least, standardize the process of teaching
them~\cite{PaulEMcKennneyConsumeP0190R0,PaulEMcKenney2017markconsumeP0462R1}.
In the meantime, it is necessary to be very careful in order to avoid
giving your compiler the opportunity to break your dependencies.
\subsubsection{Give your dependency chain a good start}
The load that heads your dependency chain must use proper
ordering, for example, \co{lockless_dereference()},
\co{rcu_dereference()}, or
a \co{READ_ONCE()} followed by \co{smp_read_barrier_depends()}.
Failure to follow this rule can have serious side effects:
\begin{enumerate}
\item On DEC Alpha, a dependent load might not be ordered with
the load heading the dependency chain, as described in
Section~\ref{sec:memorder:Alpha}.
\item If the load heading the dependency chain is a
C11 non-volatile \co{memory_order_relaxed} load,
the compiler could omit the load, for example, by using a value
that it loaded in the past.
\item If the load heading the dependency chain is a plain load,
the compiler can omit the load, again by using a value
that it loaded in the past.
Worse yet, it could load twice instead of once, so that
different parts of your code use different values---and
real compilers do this, especially when under register
pressure.
\item The value loaded by the head of the dependency chain must
be a pointer.
In theory, yes, you could load an integer, perhaps to use
it as an array index.
In practice, the compiler knows too much about integers,
and thus has way too many opportunities to break your
dependency chain.
\end{enumerate}
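For example, the following sketch (hypothetical structure, field, and
global pointer) shows a dependency chain given the good start called
for above:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 struct foo {
  2   int a;
  3 };
  4 struct foo *gp;
  5 struct foo *p;
  6
  7 p = rcu_dereference(gp);   /* Properly ordered head of chain. */
  8 if (p != NULL)
  9   do_something_with(p->a); /* Dependent access, ordered after line 7. */
\end{verbatim}
\end{minipage}
\vspace{5pt}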
\subsubsection{Avoid arithmetic dependency breakage}
Although it is just fine to do some arithmetic operations on a pointer in
your dependency chain, you need to be careful to avoid giving the
compiler too much information.
After all, if the compiler learns enough to determine the exact value
of the pointer, it can and sometimes will use that exact value instead
of the pointer itself.
As soon as the compiler does that, the dependency is broken and all
ordering is lost.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 int reserve_int;
2 int *gp;
3 int *p;
4
5 p = rcu_dereference(gp);
6 if (p == &reserve_int)
7 handle_reserve(p);
8 do_something_with(*p); /* buggy! */
\end{verbbox}
}
\centering
\theverbbox
\caption{Breakable Dependencies With Comparisons}
\label{lst:memorder:Breakable Dependencies With Comparisons}
\end{listing}
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 int reserve_int;
2 int *gp;
3 int *p;
4
5 p = rcu_dereference(gp);
6 if (p == &reserve_int) {
7 handle_reserve(&reserve_int);
8 do_something_with(reserve_int); /* buggy! */
9 } else {
10 do_something_with(*p); /* OK! */
11 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Broken Dependencies With Comparisons}
\label{lst:memorder:Broken Dependencies With Comparisons}
\end{listing}
\begin{enumerate}
\item Although it is permissible to compute offsets from a
pointer, these offsets must not result in total cancellation.
For example, given a \co{char} pointer \co{cp},
\co{cp-(uintptr_t)cp} will cancel and can allow the compiler
to break your dependency chain.
On the other hand, canceling offset values with each other
is perfectly safe and legal.
For example, if \co{a} and \co{b} are equal, \co{cp+a-b}
is an identity function, including preserving the dependency,
as sketched following this list.
\item Comparisons can break dependencies.
Listing~\ref{lst:memorder:Breakable Dependencies With Comparisons}
shows how this can happen.
Here global pointer \co{gp} points to a dynamically allocated
integer, but if memory is low, it might instead point to
the \co{reserve_int} variable.
This \co{reserve_int} case might need special handling, as
shown on lines~6 and~7 of the listing.
But the compiler could reasonably transform this code into
the form shown in
Listing~\ref{lst:memorder:Broken Dependencies With Comparisons},
especially on systems where instructions with absolute
addresses run faster than instructions using addresses
supplied in registers.
However, there is clearly no ordering between the pointer
load on line~5 and the dereference on line~8.
Please note that this is simply an example: There are a great
many other ways to break dependency chains with comparisons.
\end{enumerate}
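The following sketch (hypothetical global pointer \co{gcp} and offsets
\co{a} and \co{b}) illustrates item~1 above:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 char *cp;
  2 char *q;
  3
  4 cp = rcu_dereference(gcp);
  5 q = cp + a - b;          /* OK if a == b: dependency preserved.  */
  6 q = cp - (uintptr_t)cp;  /* BUG: total cancellation, so the      */
  7                          /* compiler may substitute a constant.  */
\end{verbatim}
\end{minipage}
\vspace{5pt}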
\QuickQuiz{}
Why can't you simply dereference the pointer before comparing it
to \co{&reserve_int} on line~6 of
Listing~\ref{lst:memorder:Breakable Dependencies With Comparisons}?
\QuickQuizAnswer{
First, it might be necessary to invoke
\co{handle_reserve()} before \co{do_something_with()}.
But more relevant to memory ordering, the compiler is often within
its rights to hoist the comparison ahead of the dereferences,
which would allow the compiler to use \co{&reserve_int} instead
of the variable \co{p} that the hardware has tagged with
a dependency.
} \QuickQuizEnd
\QuickQuiz{}
But it should be safe to compare two pointer variables,
right? After all, the compiler doesn't know the value
of either, so how can it possibly learn anything from the
comparison?
\QuickQuizAnswer{
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 int *gp1;
2 int *gp2;
3 int *p;
4 int *q;
5
6 p = rcu_dereference(gp1);
7 q = READ_ONCE(gp2);
8 if (p == q)
9 handle_equality(p);
10 do_something_with(*p);
\end{verbbox}
}
\centering
\theverbbox
\caption{Breakable Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}
\end{listing}
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 int *gp1;
2 int *gp2;
3 int *p;
4 int *q;
5
6 p = rcu_dereference(gp1);
7 q = READ_ONCE(gp2);
8 if (p == q) {
9 handle_equality(q);
10 do_something_with(*q);
11 } else {
12 do_something_with(*p);
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Broken Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Broken Dependencies With Non-Constant Comparisons}
\end{listing}
Unfortunately, the compiler really can learn enough to
break your dependency chain, for example, as shown in
Listing~\ref{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}.
The compiler is within its rights to transform this code
into that shown in
Listing~\ref{lst:memorder:Broken Dependencies With Non-Constant Comparisons},
and might well make this transformation due to register pressure
if \co{handle_equality()} was inlined and needed a lot of registers.
Line~10 of this transformed code uses \co{q}, which although
equal to \co{p}, is not necessarily tagged by the hardware as
carrying a dependency.
Therefore, this transformed code does not necessarily guarantee
that line~10 is ordered after line~6.\footnote{
Kudos to Linus Torvalds for providing this example.}
} \QuickQuizEnd
% @@@ Quick quiz showing other ways to break dependencies by comparison
\subsubsection{Safe comparison of dependent pointers}
It turns out that there are several safe ways to compare dependent
pointers:
\begin{enumerate}
\item Comparisons against the \co{NULL} pointer.
In this case, all the compiler can learn is that the pointer
is \co{NULL}, in which case you are not allowed to
dereference it in any case.
\item The dependent pointer is never dereferenced, whether before or
after the comparison.
\item The dependent pointer is compared to a pointer that references
objects that were last modified a very long time ago, where
the safest value of ``a very long time ago'' is at compile time.
\end{enumerate}
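For example, the following sketch (reusing earlier hypothetical names)
shows the first of these safe comparisons:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 p = rcu_dereference(gp);
  2 if (p == NULL)
  3   return;               /* Never dereferenced, nothing to order. */
  4 do_something_with(*p);  /* Dependency remains intact. */
\end{verbatim}
\end{minipage}
\vspace{5pt}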
% @@@ convert to \co{uintptr_t} to set/clear bits and convert back.
% @@@ More here
In short, some care is required in order to ensure that dependency
chains in your source code are still dependency chains once the
compiler has gotten done with them.
\subsection{Control-Dependency Restrictions}
\label{sec:memorder:Control-Dependency Restrictions}
Control dependencies are especially tricky because current compilers
do not understand them and can easily break them.
The rules and examples in this section are intended to help you
prevent your compiler's ignorance from breaking your code.
A load-load control dependency requires a full read memory barrier,
not simply a data dependency barrier.
Consider the following bit of code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 <data dependency barrier>
4 q = READ_ONCE(y);
5 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from~\co{y} as having happened before the load from~\co{x}.
In such a case what's actually required is:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 <read barrier>
4 q = READ_ONCE(y);
5 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
However, stores are not speculated.
This means that ordering \emph{is} provided for load-store control
dependencies, as in the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q)
3 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Control dependencies pair normally with other types of ordering operations.
That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()}
are optional!
Without the \co{READ_ONCE()}, the compiler might combine the load
from~\co{x} with other loads from~\co{x}.
Without the \co{WRITE_ONCE()}, the compiler might combine the store
to~\co{y} with other stores to~\co{y}.
Either can result in highly counterintuitive effects on ordering.
Worse yet, if the compiler is able to prove (say) that the value of
variable~\co{x} is always non-zero, it would be well within its rights
to optimize the original example by eliminating the ``\co{if}'' statement
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is tempting to try to enforce ordering on identical stores on both
branches of the ``\co{if}'' statement as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 barrier();
4 WRITE_ONCE(y, 1);
5 do_something();
6 } else {
7 barrier();
8 WRITE_ONCE(y, 1);
9 do_something_else();
10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Unfortunately, current compilers will transform this as follows at high
optimization levels:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 barrier();
3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
4 if (q) {
5 do_something();
6 } else {
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Now there is no conditional between the load from~\co{x} and the store
to~\co{y}, which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory-ordering operations, for example, a release store:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 smp_store_release(&y, 1);
4 do_something();
5 } else {
6 smp_store_release(&y, 1);
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
The initial \co{READ_ONCE()} is still required to prevent the compiler from
proving the value of~\co{x}.
In addition, you need to be careful what you do with the local variable~%
\co{q},
otherwise the compiler might be able to guess the value and again remove
the needed conditional.
For example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q % MAX) {
3 WRITE_ONCE(y, 1);
4 do_something();
5 } else {
6 WRITE_ONCE(y, 2);
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 2);
3 do_something_else();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Given this transformation, the CPU is not required to respect the ordering
between the load from variable~\co{x} and the store to variable~\co{y}.
It is tempting to add a \co{barrier()} to constrain the compiler,
but this does not help.
The conditional is gone, and the \co{barrier()} won't bring it back.
Therefore, if you are relying on this ordering, you should make sure
that \co{MAX} is greater than one, perhaps as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 BUILD_BUG_ON(MAX <= 1);
3 if (q % MAX) {
4 WRITE_ONCE(y, 1);
5 do_something();
6 } else {
7 WRITE_ONCE(y, 2);
8 do_something_else();
9 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Please note once again that the stores to~\co{y} differ.
If they were identical, as noted earlier, the compiler could pull this
store outside of the ``\co{if}'' statement.
You must also avoid excessive reliance on boolean short-circuit evaluation.
Consider this example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q || 1 > 0)
3 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
This example underscores the need to ensure that the compiler cannot
out-guess your code.
More generally, although \co{READ_ONCE()} does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.
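For example, in the following hedged sketch, the loaded value is simply
discarded: the volatile load will appear in the object code, but because
nothing conditions on its value, neither the compiler nor the CPU is
obliged to order the subsequent store after it.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 (void)READ_ONCE(x); /* Load emitted, but value unused. */
2 WRITE_ONCE(y, 1);   /* BUG: Not ordered after the load. */
\end{verbatim}
\end{minipage}
\vspace{5pt}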
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do
not necessarily apply to code following the if-statement:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 WRITE_ONCE(y, 1);
4 } else {
5 WRITE_ONCE(y, 2);
6 }
7 WRITE_ONCE(z, 1); /* BUG: No ordering. */
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to~\co{y} with the condition.
Unfortunately for this line
of reasoning, the compiler might compile the two writes to~\co{y} as
conditional-move instructions, as in this fanciful pseudo-assembly
language:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 ld r1,x
2 cmp r1,$0
3 cmov,ne r4,$1
4 cmov,eq r4,$2
5 st r4,y
6 st $1,z
\end{verbatim}
\end{minipage}
\vspace{5pt}
A weakly ordered CPU would have no dependency of any sort between the load
from~\co{x} and the store to~\co{z}.
The control dependencies would extend
only to the pair of cmov instructions and the store depending on them.
In short, control dependencies apply only to the stores in the ``\co{then}''
and ``\co{else}'' of the ``\co{if}'' in question (including functions
invoked by those two clauses), not to code following that ``\co{if}''.
Finally, control dependencies do \emph{not} provide cumulativity.\footnote{
Refer to Section~\ref{sec:memorder:Cumulativity} for
the meaning of cumulativity.}
This is demonstrated by two related examples, with the initial values
of~\co{x} and~\co{y} both being zero:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 0} & \nf{CPU 1} \\
\hline
\tco{r1 = READ_ONCE(x);} &
\tco{r2 = READ_ONCE(y);} \\
if (r1 > 0) &
if (r2 > 0) \\
~~~\tco{WRITE_ONCE(y, 1);} &
~~~\tco{WRITE_ONCE(x, 1);} \\
\multicolumn{2}{l}{~} \\
\multicolumn{2}{l}{\tco{assert(!(r1 == 1 && r2 == 1));}} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
The above two-CPU example will never trigger the \co{assert()}.
However, if control dependencies guaranteed transitivity (which they do
not), then adding the following CPU would guarantee a related assertion:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l}
\nf{CPU 2} \\
\hline
\tco{WRITE_ONCE(x, 2);} \\
\multicolumn{1}{l}{~} \\
\multicolumn{1}{l}{\tco{assert(!(r1 == 2 && r2 == 1 && x == 2));}} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
But because control dependencies do \emph{not} provide transitivity, the above
assertion can fail after the combined three-CPU example completes.
If you need the three-CPU example to provide ordering, you will need
\co{smp_mb()} between the loads and stores in the CPU~0 and CPU~1 code
fragments, that is, just before or just after the ``\co{if}'' statements.
Furthermore, the original two-CPU example is very fragile and should be avoided.
The two-CPU example is known as LB (load buffering) and the three-CPU
example as WWC~\cite{Maranget2012TutorialARMPower}.
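For example, a minimal sketch of CPU~0's fragment with the needed full
barrier added just before the ``\co{if}'' (CPU~1's fragment would be
modified in the same way) might look like this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 r1 = READ_ONCE(x);
2 smp_mb();
3 if (r1 > 0)
4   WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}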
The following list of rules summarizes the lessons of this section:
\begin{enumerate}
\item Compilers do not understand control dependencies, so it is
your job to make sure that the compiler cannot break your code.
\item Control dependencies can order prior loads against later stores.
However, they do \emph{not} guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
later anything.
If you need these other forms of ordering, use \co{smp_rmb()},
\co{smp_wmb()}, or, in the case of prior stores and later loads,
\co{smp_mb()}.
\item If both legs of the ``\co{if}'' statement begin with identical stores
to the same variable, then those stores must be ordered,
either by preceding both of them with \co{smp_mb()} or by using
\co{smp_store_release()} to carry out the stores.
Please note that it is \emph{not} sufficient to use \co{barrier()}
at the beginning of each leg of the ``\co{if}'' statement because, as shown
by the example above, optimizing compilers can destroy the control
dependency while respecting the letter of the \co{barrier()} law.
\item Control dependencies require at least one run-time conditional
between the prior load and the subsequent store, and this
conditional must involve the prior load.
If the compiler is able to optimize the conditional away, it
will have also optimized away the ordering.
Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help
to preserve the needed conditional.
\item Control dependencies require that the compiler avoid reordering
the dependency into nonexistence.
Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or
\co{atomic64_read()} can help to preserve your control
dependency.
\item Control dependencies apply only to the ``\co{then}'' and
``\co{else}'' of the ``\co{if}'' containing the control
dependency, including any functions that these two clauses call.
Control dependencies do \emph{not} apply to code following the
``\co{if}'' containing the control dependency.
\item Control dependencies pair normally with other types of
memory-ordering operations.
\item Control dependencies do \emph{not} provide transitivity.
If you need transitivity, use \co{smp_mb()}.
\end{enumerate}
In short, many popular languages were designed primarily with
single-threaded use in mind.
Successfully using these languages to construct multi-threaded software
requires that you pay special attention to your memory references and
dependencies.
\section{Hardware Specifics}
\label{sec:memorder:Hardware Specifics}
\OriginallyPublished{Appendix}{sec:memorder:Hardware Specifics}{Memory-Barrier Instructions For Specific CPUs}{Linux Journal}{PaulMcKenney2005i,PaulMcKenney2005j}
\begin{table*}
\small
\centering
\renewcommand*{\arraystretch}{1.2}\OneColumnHSpace{-.6in}
\begin{tabular}{l|p{2in}|c|c|c|c|c|c|c|c|c|c|c}
\multicolumn{2}{l|}{~}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{Alpha}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{ARMv8}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{ARMv7-A/R}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{Itanium}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{MIPS}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{(PA-RISC)}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{PA-RISC CPUs}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{POWER}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{SPARC TSO}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{x86}
\end{picture}
& \begin{picture}(6,60)(0,0)
\rotatebox{90}{z~Systems}
\end{picture}
\\
\hline
% Alpha ARMv8 ARMv7 Itanium MIPS PA-RISC -CPUs PPC SPARC x86 z Systems
\hline
Memory Ordering
& Loads Reordered After Loads or Stores?
& Y & Y & Y & Y & Y & Y & ~ & Y & ~ & ~ & ~ \\
\cline{2-13}
& Stores Reordered After Stores?
& Y & Y & Y & Y & Y & Y & ~ & Y & ~ & ~ & ~ \\
\cline{2-13}
& Stores Reordered After Loads?
& Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & Y \\
\cline{2-13}
& Atomic Instructions Reordered With Loads or Stores?
& Y & Y & Y & ~ & Y & ~ & ~ & Y & ~ & ~ & ~ \\
\cline{2-13}
& Dependent Loads Reordered?
& Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
\cline{2-13}
& Dependent Stores Reordered?
& ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
\cline{2-13}
& Non-Sequentially Consistent?
& Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & Y \\
\cline{2-13}
& Non-Multicopy Atomic?
& Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & ~ \\
\cline{2-13}
& Non-Other-Multicopy Atomic?
& Y & ~ & Y & Y & Y & Y & ~ & Y & Y & Y & ~ \\
\cline{2-13}
& Non-Cache Coherent?
& ~ & ~ & ~ & Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
\hline
\hline
Instructions
& Load-Acquire/Store-Release?
& F & i & F & I & ? & ? & ~ & b & ~ & ~ & ~ \\
\cline{2-13}
& Atomic RMW Instruction Type?
& L & L & L & C & L & ? & ? & L & ? & C & C \\
\cline{2-13}
& Incoherent Instruction Cache/Pipeline?
& Y & Y & Y & Y & Y & ~ & ~ & Y & Y & Y & Y \\
\end{tabular}
\renewcommand*{\arraystretch}{1}
\begin{tabular}{llcl}
~ & ~ & ~ & ~\\
{ \bf Key: } & ~ & ~ & ~ \\
~ & \multicolumn{3}{l}{Load-Acquire/Store-Release?} \\
~ & ~ & b: & Lightweight memory barrier \\
~ & ~ & F: & Full memory barrier \\
~ & ~ & i: & Instruction with lightweight ordering \\
~ & ~ & I: & Instruction with heavyweight ordering \\
~ &\multicolumn{3}{l}{Atomic RMW Instruction Type?} \\
~ & ~ & C: & Compare-and-exchange instruction \\
~ & ~ & L: & Load-linked/store-conditional instruction \\
\end{tabular}
\caption{Summary of Memory Ordering}
\label{tab:memorder:Summary of Memory Ordering}
\end{table*}
Each CPU has its own peculiar approach to memory ordering, which
can make portability a challenge, as indicated by
Table~\ref{tab:memorder:Summary of Memory Ordering}.
In fact, some software environments simply prohibit
direct use of memory-ordering operations, restricting the programmer
to mutual-exclusion primitives that incorporate them to the extent that
they are required. Please note that this section is not intended to be
a reference manual
covering all (or even most) aspects of each CPU family, but rather
a high-level overview giving a rough comparison.
For full details, see the reference manual for the CPU of interest.
Getting back to
Table~\ref{tab:memorder:Summary of Memory Ordering},
the first group of rows looks at memory-ordering
properties and the second group looks at instruction properties.
The first three rows indicate whether a given CPU allows the four
possible combinations of loads and stores to be reordered, as discussed
in
Sections~\ref{sec:memorder:Ordering: Why and How?} and
\ref{sec:memorder:Load Followed By Load}--\ref{sec:memorder:Store Followed By Store}.
The next row (``Atomic Instructions Reordered With Loads or Stores?'')
indicates whether a given CPU allows loads and stores
to be reordered with atomic instructions.
The fifth and sixth rows cover dependency-related reordering, which
was covered in
Sections~\ref{sec:memorder:Address Dependencies}--\ref{sec:memorder:Control Dependencies}
and which is explained in more detail in
Section~\ref{sec:memorder:Alpha}.
The short version is that Alpha requires memory barriers for readers
as well as updaters of linked data structures.
The next row, ``Non-Sequentially Consistent'', indicates whether
the CPU's normal load and store instructions are constrained by
sequential consistency.
Almost all are not constrained in this way for performance reasons.
The next two rows cover multicopy atomicity, which was defined in
Section~\ref{sec:memorder:Multicopy Atomicity}.
The first is full-up (and rare) multicopy atomicity, and the second is the
weaker other-multicopy atomicity.
The next row, ``Non-Cache Coherent'', covers accesses from multiple
threads to a single variable, which was discussed in
Section~\ref{sec:memorder:Cache Coherence}.
The final three rows cover instruction-level choices and issues.
The first of these rows indicates how each CPU implements load-acquire
and store-release, the second classifies CPUs by atomic-instruction
type, and the third and final row
indicates whether a given CPU has an incoherent
instruction cache and pipeline.
Such CPUs require that special instructions be executed for self-modifying
code.
Parenthesized CPU names indicate modes that are architecturally allowed,
but rarely used in practice.
The common ``just say no'' approach to memory-ordering operations
can be eminently reasonable where it applies,
but there are environments, such as the Linux kernel, where direct
use of memory-ordering operations is required.
Therefore,
Linux provides a carefully chosen least-common-denominator
set of memory-ordering primitives, which are as follows:
\begin{description}
\item [\tco{smp_mb()}] (full memory barrier) that orders both loads and
stores.
This means that loads and stores preceding the memory barrier
will be committed to memory before any loads and stores
following the memory barrier.
\item [\tco{smp_rmb()}] (read memory barrier) that orders only loads.
\item [\tco{smp_wmb()}] (write memory barrier) that orders only stores.
\item [\tco{smp_read_barrier_depends()}] that forces subsequent operations
that depend on prior operations to be ordered.
This primitive is a no-op on all platforms except Alpha, and
is normally not used directly, but rather as part of
something like \co{lockless_dereference()} or
\co{rcu_dereference()}.
\item [\tco{smp_mb__before_atomic()}] that forces ordering of accesses
preceding the \co{smp_mb__before_atomic()} against accesses following
a later RMW atomic operation.
This is a no-op on systems that fully order atomic RMW operations.
\item [\tco{smp_mb__after_atomic()}] that forces ordering of accesses
preceding an earlier RMW atomic operation against accesses
following the \co{smp_mb__after_atomic()}.
This is also a no-op on systems that fully order atomic RMW operations.
\item [\tco{mmiowb()}] that forces ordering on MMIO writes that are guarded
by global spinlocks, and is more thoroughly described
in a 2016 LWN article on MMIO~\cite{PaulEMcKenney2016LinuxKernelMMIO}.
\end{description}
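As a simple illustration of the \co{smp_wmb()} and \co{smp_rmb()}
primitives, consider the following message-passing sketch, in which the
variable names \co{msg} and \co{ready} are made up for this example.
If CPU~1's load from \co{ready} returns~1, then the
\co{smp_wmb()}/\co{smp_rmb()} pairing guarantees that its subsequent
load from \co{msg} returns~42:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 /* CPU 0 */                /* CPU 1 */
2 WRITE_ONCE(msg, 42);       r1 = READ_ONCE(ready);
3 smp_wmb();                 smp_rmb();
4 WRITE_ONCE(ready, 1);      r2 = READ_ONCE(msg);
\end{verbatim}
\end{minipage}
\vspace{5pt}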
The \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()}
primitives also force
the compiler to eschew any optimizations that would have the effect
of reordering memory accesses across the barriers.
The \co{smp_read_barrier_depends()} primitive has a similar effect, but
only on Alpha CPUs.
These primitives generate code only in SMP kernels; however, each
also has a UP version ({\tt mb()}, {\tt rmb()}, {\tt wmb()},
and \co{read_barrier_depends()},
respectively) that generates a memory barrier even in UP kernels.
The \co{smp_} versions should be used in most cases.
However, the latter primitives are useful when writing drivers, because
MMIO accesses must remain ordered even in UP kernels.
In the absence of memory-ordering operations,
both CPUs and compilers would happily rearrange these accesses, which at
best would make the device act strangely, and could crash your kernel or,
in some cases, even damage your hardware.
So most kernel programmers need not worry about the memory-ordering
peculiarities of each and every CPU, as long as they stick to these
interfaces.
If you are working deep in a given CPU's architecture-specific code,
of course, all bets are off.
Furthermore,
all of Linux's locking primitives (spinlocks, reader-writer locks,
semaphores, RCU, \ldots) include any needed ordering primitives.
So if you are working with code that uses these primitives, you don't
even need to worry about Linux's memory-ordering primitives.
That said, deep knowledge of each CPU's memory-consistency model
can be very helpful when debugging, to say nothing of when writing
architecture-specific code or synchronization primitives.
Besides, they say that a little knowledge is a very dangerous thing.
Just imagine the damage you could do with a lot of knowledge!
For those who wish to understand more about individual CPUs'
memory-consistency models, the following sections describe those of the
most popular and prominent CPUs.
Although nothing can replace actually reading a given CPU's documentation,
these sections give a good overview.
\subsection{Alpha}
\label{sec:memorder:Alpha}
It may seem strange to say much of anything about a CPU whose end of life
has long since passed, but Alpha is interesting because, with one of
the weakest memory ordering models, it reorders memory operations quite
aggressively.
It therefore has defined many of the Linux-kernel memory-ordering primitives,
which must work on all CPUs, including Alpha.
Understanding Alpha is therefore surprisingly important to the Linux kernel
hacker.
The difference between Alpha and the other CPUs is illustrated by the
code shown in
Listing~\ref{lst:memorder:Insert and Lock-Free Search}.
The \co{smp_wmb()} on line~9 of this listing
guarantees that the element initialization
in lines~6-8 is executed before the element is added to the
list on line~10, so that the lock-free search will work correctly.
That is, it makes this guarantee on all CPUs {\em except} Alpha.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 struct el *insert(long key, long data)
2 {
3 struct el *p;
4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
6 p->next = head.next;
7 p->key = key;
8 p->data = data;
9 smp_wmb();
10 head.next = p;
11 spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16 struct el *p;
17 p = READ_ONCE(head.next);
18 while (p != &head) {
19 /* BUG ON ALPHA!!! */
20 if (p->key == key) {
21 return (p);
22 }
23 p = READ_ONCE(p->next);
24 };
25 return (NULL);
26 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Insert and Lock-Free Search}
\label{lst:memorder:Insert and Lock-Free Search}
\end{listing}
Alpha has extremely weak memory ordering
such that the code on line~20 of
Listing~\ref{lst:memorder:Insert and Lock-Free Search} could see the old
garbage values that were present before the initialization on lines~6-8.
Figure~\ref{fig:memorder:Why smp-read-barrier-depends() is Required}
shows how this can happen on
an aggressively parallel machine with partitioned caches, so that
alternating cache lines are processed by the different partitions
of the caches.
For example, the load of \co{head.next} on line~17 of
Listing~\ref{lst:memorder:Insert and Lock-Free Search}
might access cache bank~0,
and the load of \co{p->key} on line~20 and of \co{p->next} on line~23
might access cache bank~1.
On Alpha, the \co{smp_wmb()} will guarantee that the cache invalidations
performed by lines~6-8 of
Listing~\ref{lst:memorder:Insert and Lock-Free Search}
(for \co{p->next}, \co{p->key}, and \co{p->data}) will reach
the interconnect before that of line~10 (for \co{head.next}), but
makes absolutely no guarantee about the order of
propagation through the reading CPU's cache banks.
For example, it is possible that the reading CPU's cache bank~1 is very
busy, but cache bank~0 is idle.
This could result in the cache invalidations for the new element
(\co{p->next}, \co{p->key}, and \co{p->data}) being
delayed, so that the reading CPU loads the new value for \co{head.next},
but loads the old cached values for \co{p->key} and \co{p->next}.
Yes, this does mean that Alpha can in effect fetch
the data pointed to {\em before} it fetches the pointer itself, strange
but true.
See the documentation~\cite{Compaq01,WilliamPugh2000Gharachorloo}
called out earlier for more information,
or if you think that I am just making all this up.\footnote{
Of course, the astute reader will have already recognized that
Alpha is nowhere near as mean and nasty as it could be,
the (thankfully) mythical architecture in
Section~\ref{sec:app:whymb:Ordering-Hostile Architecture}
being a case in point.}
The benefit of this unusual approach to ordering is that Alpha can use
simpler cache hardware, which in turn permitted higher clock frequency
in Alpha's heyday.
\begin{figure}[tbp]
\centering
\resizebox{\columnwidth}{!}{\includegraphics{memorder/Alpha}}
\caption{Why \tco{smp_read_barrier_depends()} is Required}
\label{fig:memorder:Why smp-read-barrier-depends() is Required}
\end{figure}
One could place an \co{smp_rmb()} primitive
between the pointer fetch and dereference in order to force Alpha
to order the pointer fetch with the later dependent load.
However, this imposes unneeded overhead on systems (such as ARM,
Itanium, PPC, and SPARC) that respect data dependencies on the read side.
An \co{smp_read_barrier_depends()} primitive has therefore been added to the
Linux kernel to eliminate overhead on these systems.
This primitive could be inserted in place of line~19 of
Listing~\ref{lst:memorder:Insert and Lock-Free Search},
but it is better to use the \co{rcu_dereference()} wrapper macro
as shown on lines~17 and~22 of
Listing~\ref{lst:memorder:Safe Insert and Lock-Free Search}.
It is also possible to implement a software barrier
that could be used in place of \co{smp_wmb()}, which would force
all reading CPUs to see the writing CPU's writes in order.
However, this approach was deemed by the Linux community
to impose excessive overhead
on extremely weakly ordered CPUs such as Alpha, and furthermore would
not be considered a reasonable approach by those whose systems must meet
aggressive real-time response requirements.
This software barrier could be implemented by sending inter-processor
interrupts (IPIs) to all other CPUs.
Upon receipt of such an IPI, a CPU would execute a memory-barrier
instruction, implementing a memory-barrier shootdown.
Additional logic is required to avoid deadlocks.
Of course, CPUs that respect data dependencies would define such a barrier
to simply be \co{smp_wmb()}.
Perhaps this decision should be revisited in the future as Alpha
fades off into the sunset, but as of 2017 there is a surprisingly
large number of people who run recent Linux kernels on their lovingly
preserved DEC Alpha systems.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}
1 struct el *insert(long key, long data)
2 {
3 struct el *p;
4 p = kmalloc(sizeof(*p), GFP_ATOMIC);
5 spin_lock(&mutex);
6 p->next = head.next;
7 p->key = key;
8 p->data = data;
9 smp_wmb();
10 head.next = p;
11 spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16 struct el *p;
17 p = rcu_dereference(head.next);
18 while (p != &head) {
19 if (p->key == key) {
20 return (p);
21 }
22 p = rcu_dereference(p->next);
23 };
24 return (NULL);
25 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Safe Insert and Lock-Free Search}
\label{lst:memorder:Safe Insert and Lock-Free Search}
\end{listing}
The Linux memory-barrier primitives took their names from the Alpha
instructions, so \co{smp_mb()} is {\tt mb}, \co{smp_rmb()} is {\tt rmb},
and \co{smp_wmb()} is {\tt wmb}.
Alpha is the only CPU where \co{smp_read_barrier_depends()} is
an \co{smp_mb()} rather than a no-op.
\QuickQuiz{}
Why is Alpha's \co{smp_read_barrier_depends()} an
\co{smp_mb()} rather than \co{smp_rmb()}?
\QuickQuizAnswer{
First, Alpha has only \co{mb} and \co{wmb} instructions,
so \co{smp_rmb()} would be implemented by the Alpha \co{mb}
instruction in either case.
More importantly, \co{smp_read_barrier_depends()} must
order subsequent stores.
For example, consider the following code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 p = global_pointer;
2 smp_read_barrier_depends();
3 if (do_something_with(p->a, p->b) == 0)
4 p->hey_look = 1;
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here the store to \co{p->hey_look} must be ordered,
not just the loads from \co{p->a} and \co{p->b}.
} \QuickQuizEnd
For more on Alpha, see its reference manual~\cite{ALPHA2002}.
\subsection{ARMv7-A/R}
\label{sec:memorder:ARMv7-A/R}
The ARM family of CPUs is extremely popular in embedded applications,
particularly for power-constrained applications such as cellphones.
Its memory model is similar to that of Power
(see Section~\ref{sec:memorder:POWER / PowerPC}), but ARM uses a
different set of memory-barrier instructions~\cite{ARMv7A:2010}:
\begin{description}
\item [\tco{DMB}] (data memory barrier) causes the specified type of
operations to \emph{appear} to have completed before any
subsequent operations of the same type.
The ``type'' of operations can be all operations or can be
restricted to only writes (similar to the Alpha \co{wmb}
and the POWER \co{eieio} instructions).
In addition, ARM allows cache coherence to have one of three
scopes: single processor, a subset of the processors
(``inner'') and global (``outer'').
\item [\tco{DSB}] (data synchronization barrier) causes the specified
type of operations to actually complete before any subsequent
operations (of any type) are executed.
The ``type'' of operations is the same as that of \co{DMB}.
The \co{DSB} instruction was called \co{DWB} (drain write buffer
or data write barrier, your choice) in early versions of the
ARM architecture.
\item [\tco{ISB}] (instruction synchronization barrier) flushes the CPU
pipeline, so that all instructions following the \co{ISB}
are fetched only after the \co{ISB} completes.
For example, if you are writing a self-modifying program
(such as a JIT), you should execute an \co{ISB} between
generating the code and executing it.
\end{description}
None of these instructions exactly match the semantics of Linux's
\co{rmb()} primitive, which must therefore be implemented as a full
\co{DMB}.
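One plausible mapping, sketched here only for illustration (the kernel's
actual definitions depend on configuration and have changed over time),
is therefore:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define smp_mb()  __asm__ __volatile__("dmb ish"   : : : "memory")
2 #define smp_rmb() __asm__ __volatile__("dmb ish"   : : : "memory")
3 #define smp_wmb() __asm__ __volatile__("dmb ishst" : : : "memory")
\end{verbatim}
\end{minipage}
\vspace{5pt}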
The \co{DMB} and \co{DSB} instructions have a recursive definition
of accesses ordered before and after the barrier, which has an effect
similar to that of POWER's cumulativity.
ARM also implements control dependencies, so that if a conditional
branch depends on a load, then any store executed after that conditional
branch will be ordered after the load.
However, loads following the conditional branch will \emph{not}
be guaranteed to be ordered unless there is an \co{ISB}
instruction between the branch and the load.
Consider the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 r1 = x;
2 if (r1 == 0)
3 nop();
4 y = 1;
5 r2 = z;
6 ISB();
7 r3 = z;
\end{verbatim}
\end{minipage}
\vspace{5pt}
In this example, load-store control dependency ordering causes
the load from \co{x} on line~1 to be ordered before the store to
\co{y} on line~4.
However, ARM does not respect load-load control dependencies, so that
the load on line~1 might well happen \emph{after} the
load on line~5.
On the other hand, the combination of the conditional branch on line~2
and the \co{ISB} instruction on line~6 ensures that
the load on line~7 happens after the load on line~1.
Note that inserting an additional \co{ISB} instruction somewhere between
lines~3 and 4 would enforce ordering between lines~1 and 5.
\subsection{ARMv8}
\begin{figure}[tb]
\centering
\resizebox{2in}{!}{\includegraphics{cartoons/r-2014-LDLAR}}
\caption{Half Memory Barrier}
\ContributedBy{Figure}{fig:memorder:Half Memory Barrier}{Melissa Brossard}
\end{figure}
ARMv8 is ARM's 64-bit CPU~\cite{ARMv8A:2017},
in contrast to their 32-bit CPU described in
Section~\ref{sec:memorder:ARMv7-A/R}.
ARMv8's memory model closely resembles its 32-bit counterpart,
but adds load-acquire (\co{LDLARB}, \co{LDLARH}, and \co{LDLAR})
and store-release (\co{STLLRB}, \co{STLLRH}, and \co{STLLR})
instructions.
These instructions act as ``half memory barriers'', so that
ARMv8 CPUs can reorder previous accesses with a later \co{LDLAR}
instruction, but are prohibited from reordering an earlier \co{LDLAR}
instruction with later accesses, as fancifully depicted in
Figure~\ref{fig:memorder:Half Memory Barrier}.
Similarly, ARMv8 CPUs can reorder an earlier \co{STLLR} instruction with
a subsequent access, but are prohibited from reordering
previous accesses with a later \co{STLLR} instruction.
As one might expect, this means that these instructions directly support
the C11 notion of load-acquire and store-release.
However, ARMv8 goes well beyond the C11 memory model by mandating that
the combination of a store-release and load-acquire act as a full
barrier under many circumstances.
For example, in ARMv8, given a store followed by a store-release followed by
a load-acquire followed by a load, all to different variables, all CPUs
would agree that the initial store preceded the final load.
Interestingly enough, most TSO architectures (including x86 and the
mainframe) do not make this guarantee, as the two loads could be
reordered before the two stores.
ARMv8 also has the distinction of being the first CPU whose vendor publicly
defined its memory ordering with a formal model~\cite{ARMv8A:2017}.
\subsection{Itanium}
Itanium offers a weak consistency model, so that in absence of explicit
memory-barrier instructions, Itanium is within its rights to arbitrarily
reorder memory references~\cite{IntelItanium02v2}.
Itanium has a memory-fence instruction named {\tt mf}, but also has
``half-memory fence'' modifiers to loads, stores, and to some of its atomic
instructions~\cite{IntelItanium02v3}.
The {\tt acq} modifier prevents subsequent memory-reference instructions
from being reordered before the {\tt acq}, but permits
prior memory-reference instructions to be reordered after the {\tt acq},
similar to the ARMv8 load-acquire instructions.
Similarly, the {\tt rel} modifier prevents prior memory-reference
instructions from being reordered after the {\tt rel}, but allows
subsequent memory-reference instructions to be reordered before
the {\tt rel}.
These half-memory fences are useful for critical sections, since
it is safe to push operations into a critical section, but can be
fatal to allow them to bleed out.
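In fanciful pseudo-assembly loosely modeled on Itanium syntax (the
register numbers are chosen arbitrarily), an acquire load and a release
store bracketing a group of accesses might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 ld8.acq r1 = [r32]  // later accesses cannot move above this load
2 // ... accesses that must stay between acquire and release ...
3 st8.rel [r33] = r2  // earlier accesses cannot move below this store
\end{verbatim}
\end{minipage}
\vspace{5pt}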
However, as one of the only CPUs with this property,\footnote{
ARMv8 has recently added load-acquire and store-release instructions.}
Itanium defines
Linux's semantics of memory ordering associated with lock acquisition
and release.
Oddly enough, actual Itanium hardware is rumored to implement
both load-acquire and store-release instructions as full barriers.
Nevertheless, Itanium was the first mainstream CPU to introduce the concept
(if not the reality) of load-acquire and store-release into its
instruction set.
The Itanium {\tt mf} instruction is used for the \co{smp_rmb()},
\co{smp_mb()}, and \co{smp_wmb()} primitives in the Linux kernel.
Oh, and despite rumors to the contrary, the ``mf'' mnemonic really
does stand for ``memory fence''.
Finally, Itanium offers a global total order for ``release'' operations,
including the ``mf'' instruction.
This provides the notion of transitivity, where if a given code fragment
sees a given access as having happened, any later code fragment will
also see that earlier access as having happened.
Assuming, that is, that all the code fragments involved correctly use
memory barriers.
\subsection{MIPS}
The MIPS memory model~\cite[Table 6.6]{MIPSvII-A-2015}
appears to resemble that of ARM, Itanium, and Power,
being weakly ordered by default, but respecting dependencies.
MIPS has a wide variety of memory-barrier instructions, but ties them
not to hardware considerations, but rather to the use cases provided
by the Linux kernel and the C++11 standard~\cite{RichardSmith2015N4527}
in a manner similar to the ARM64 additions:
\begin{description}[style=nextline]
\item[\tco{SYNC}]
Full barrier for a number of hardware operations in addition
to memory references.
\item[\tco{SYNC_WMB}]
Write memory barrier, which can be used to implement the
\co{smp_wmb()} primitive in the Linux kernel.
\item[\tco{SYNC_MB}]
Full memory barrier, but only for memory operations.
This may be used to implement the Linux-kernel \co{smp_mb()}
and the C++ \co{atomic_thread_fence(memory_order_seq_cst)}.
\item[\tco{SYNC_ACQUIRE}]
Acquire memory barrier, which may be used to implement the
Linux-kernel \co{smp_load_acquire()} and the C++
\co{atomic_load_explicit(..., memory_order_acquire)}.
\item[\tco{SYNC_RELEASE}]
Release memory barrier, which may be used to implement the
Linux-kernel \co{smp_store_release()} and the C++
\co{atomic_store_explicit(..., memory_order_release)}.
\item[\tco{SYNC_RMB}]
Read memory barrier, which can be used to implement the
\co{smp_rmb()} primitive in the Linux kernel.
\end{description}
Informal discussions with MIPS architects indicate that MIPS has a
definition of transitivity or cumulativity similar to that of
ARM and Power.
However, it appears that different MIPS implementations can have
different memory-ordering properties, so it is important to consult
the documentation for the specific MIPS implementation you are using.
\subsection{PA-RISC}
Although the PA-RISC architecture permits full reordering of loads and
stores, actual CPUs run fully ordered~\cite{GerryKane96a}.
This means that the Linux kernel's memory-ordering primitives generate
no code; however, they do use the gcc {\tt memory} attribute to disable
compiler optimizations that would reorder code across the memory
barrier.
\subsection{POWER / PowerPC}
\label{sec:memorder:POWER / PowerPC}
The POWER and PowerPC\textsuperscript{\textregistered}
CPU families have a wide variety of memory-barrier
instructions~\cite{PowerPC94,MichaelLyons05a}:
\begin{description}
\item [\tco{sync}] causes all preceding operations to {\em appear to have}
completed before any subsequent operations are started.
This instruction is therefore quite expensive.
\item [\tco{lwsync}] (light-weight sync) orders loads with respect to
subsequent loads and stores, and also orders stores.
However, it does {\em not} order stores with respect to subsequent
loads.
Interestingly enough, the {\tt lwsync} instruction enforces
the same ordering as does zSeries, and coincidentally,
SPARC TSO.
The \co{lwsync} instruction may be used to implement
load-acquire and store-release operations.
\item [\tco{eieio}] (enforce in-order execution of I/O, in case you
were wondering) causes all preceding cacheable stores to appear
to have completed before all subsequent stores.
However, stores to cacheable memory are ordered separately from
stores to non-cacheable memory, which means that {\tt eieio}
will not force an MMIO store to precede a spinlock release.
\item [\tco{isync}] forces all preceding instructions to appear to have
completed before any subsequent instructions start execution.
This means that the preceding instructions must have progressed
far enough that any traps they might generate have either happened
or are guaranteed not to happen, and that any side-effects of
these instructions (for example, page-table changes) are seen by the
subsequent instructions.
\end{description}
Unfortunately, none of these instructions line up exactly with Linux's
{\tt wmb()} primitive, which requires {\em all} stores to be ordered,
but does not require the other high-overhead actions of the {\tt sync}
instruction.
But there is no choice: ppc64 versions of {\tt wmb()} and {\tt mb()} are
defined to be the heavyweight {\tt sync} instruction.
However, Linux's \co{smp_wmb()} primitive is never used for MMIO
(since a driver must carefully order MMIOs in UP as well as
SMP kernels, after all), so it is defined to be the lighter weight
{\tt eieio} instruction.
This instruction may well be unique in having a five-vowel mnemonic.
The \co{smp_mb()} primitive is also defined to be the {\tt sync}
instruction, but both \co{smp_rmb()} and \co{rmb()} are defined to
be the lighter-weight {\tt lwsync} instruction.
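Pulling these choices together, one plausible rendering of the
definitions just described is shown below; this is a sketch for
illustration only, as the actual kernel definitions have varied across
versions and configurations:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define mb()      __asm__ __volatile__("sync"   : : : "memory")
2 #define wmb()     __asm__ __volatile__("sync"   : : : "memory")
3 #define rmb()     __asm__ __volatile__("lwsync" : : : "memory")
4 #define smp_mb()  __asm__ __volatile__("sync"   : : : "memory")
5 #define smp_wmb() __asm__ __volatile__("eieio"  : : : "memory")
6 #define smp_rmb() __asm__ __volatile__("lwsync" : : : "memory")
\end{verbatim}
\end{minipage}
\vspace{5pt}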
Power features ``cumulativity'', which can be used to obtain
transitivity.
When used properly, any code seeing the results of an earlier
code fragment will also see the accesses that this earlier code
fragment itself saw.
Much more detail is available from
McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}.
Power respects control dependencies in much the same way that ARM
does, with the exception that the Power \co{isync} instruction
is substituted for the ARM \co{ISB} instruction.
Many members of the POWER architecture have incoherent instruction
caches, so that a store to memory will not necessarily be reflected
in the instruction cache.
Thankfully, few people write self-modifying code these days, but JITs
and compilers do it all the time.
Furthermore, recompiling a recently run program looks just like
self-modifying code from the CPU's viewpoint.
The {\tt icbi} instruction (instruction cache block invalidate)
invalidates a specified cache line from
the instruction cache, and may be used in these situations.
\subsection{SPARC RMO, PSO, and TSO}
Solaris on SPARC uses TSO (total-store order), as does Linux when built for
the ``sparc'' 32-bit architecture.
However, a 64-bit Linux kernel (the ``sparc64'' architecture)
runs SPARC in RMO (relaxed-memory order) mode~\cite{SPARC94}.
The SPARC architecture also offers an intermediate PSO (partial store
order).
Any program that runs in RMO will also run in either PSO or TSO, and similarly,
a program that runs in PSO will also run in TSO.
Moving a shared-memory parallel program in the other direction may
require careful insertion of memory barriers, although, as noted earlier,
programs that make standard use of synchronization primitives need not
worry about memory barriers.
SPARC has a very flexible memory-barrier instruction~\cite{SPARC94}
that permits fine-grained control of ordering:
\begin{description}
\item [\tco{StoreStore}] orders preceding stores before subsequent stores.
(This option is used by the Linux \co{smp_wmb()} primitive.)
\item [\tco{LoadStore}] orders preceding loads before subsequent stores.
\item [\tco{StoreLoad}] orders preceding stores before subsequent loads.
\item [\tco{LoadLoad}] orders preceding loads before subsequent loads.
(This option is used by the Linux \co{smp_rmb()} primitive.)
\item [\tco{Sync}] fully completes all preceding operations before starting
any subsequent operations.
\item [\tco{MemIssue}] completes preceding memory operations before subsequent
memory operations, important for some instances of memory-mapped
I/O.
\item [\tco{Lookaside}] does the same as MemIssue,
but only applies to preceding stores
and subsequent loads, and even then only for stores and loads that
access the same memory location.
\end{description}
The Linux \co{smp_mb()} primitive uses the first four options together,
as in
\co{membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad},
thus fully ordering memory operations.
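Rendered as a definition, and sketched here only to make the syntax
concrete, this might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define smp_mb() \
2   __asm__ __volatile__("membar #LoadLoad | #LoadStore | " \
3                        "#StoreStore | #StoreLoad" : : : "memory")
\end{verbatim}
\end{minipage}
\vspace{5pt}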
So, why is \qco{membar #MemIssue} needed?
Because a \qco{membar #StoreLoad} could permit a subsequent
load to get its value from a store buffer, which would be
disastrous if the write was to an MMIO register that induced side effects
on the value to be read.
In contrast, \qco{membar #MemIssue} would wait until the store buffers
were flushed before permitting the loads to execute,
thereby ensuring that the load actually gets its value from the MMIO register.
Drivers could instead use \qco{membar #Sync}, but the lighter-weight
\qco{membar #MemIssue} is preferred in cases where the additional function
of the more-expensive \qco{membar #Sync} are not required.
The \qco{membar #Lookaside} is a lighter-weight version of
\qco{membar #MemIssue}, which is useful when writing to a given MMIO register
affects the value that will next be read from that register.
However, the heavier-weight \qco{membar #MemIssue} must be used when
a write to a given MMIO register affects the value that will next be
read from {\em some other} MMIO register.
It is not clear why SPARC does not define \co{wmb()} to be
\qco{membar #MemIssue} and \co{smp_wmb()} to be
\qco{membar #StoreStore},
as the current definitions seem vulnerable to bugs in some drivers.
It is quite possible that all the SPARC CPUs that Linux runs on
implement a more conservative memory-ordering model than the architecture
would permit.
SPARC requires a {\tt flush} instruction be used between the time that
an instruction is stored and executed~\cite{SPARC94}.
This is needed to flush any prior value for that location from
the SPARC's instruction cache.
Note that {\tt flush} takes an address, and will flush only that address
from the instruction cache.
On SMP systems, all CPUs' caches are flushed, but there is no
convenient way to determine when the off-CPU flushes complete,
though there is a reference to an implementation note.
\subsection{x86}
Since the x86 CPUs provide ``process ordering'' so that all CPUs agree
on the order of a given CPU's writes to memory, the \co{smp_wmb()} primitive
is a no-op for the CPU~\cite{IntelXeonV3-96a}.
However, a compiler directive is required to
prevent the compiler from performing optimizations that would result
in reordering across the \co{smp_wmb()} primitive.
On the other hand, x86 CPUs have traditionally given
no ordering guarantees for loads, so
the \co{smp_mb()} and \co{smp_rmb()} primitives expand to {\tt lock;addl}.
This atomic instruction acts as a barrier to both loads and stores.
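A rough sketch of the resulting definitions, assuming a 32-bit pre-SSE
configuration and shown here only for illustration, might be:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define barrier() __asm__ __volatile__("" : : : "memory")
2 #define smp_wmb() barrier() /* compiler barrier only */
3 #define smp_mb()  __asm__ __volatile__("lock; addl $0,0(%%esp)" \
4                                        : : : "memory")
5 #define smp_rmb() smp_mb()
\end{verbatim}
\end{minipage}
\vspace{5pt}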
Intel has also published a memory model for
x86~\cite{Intelx86MemoryOrdering2007}.
It turns out that Intel's actual CPUs enforced tighter ordering than
was claimed in the previous specifications, so this model is in effect
simply mandating the earlier de-facto behavior.
Even more recently, Intel published an updated memory model for
x86~\cite[Section 8.2]{Intel64IA32v3A2011}, which mandates a total global order
for stores, although individual CPUs are still permitted to see their
own stores as having happened earlier than this total global order
would indicate.
This exception to the total ordering is needed to allow important
hardware optimizations involving store buffers.
In addition, memory ordering obeys causality, so that if CPU~0 sees a
store by CPU~1, then CPU~0 is guaranteed to see all stores that CPU~1
saw prior to its store.
Software may use atomic operations to override these hardware optimizations,
which is one reason that atomic operations tend to be more expensive
than their non-atomic counterparts.
This total store order is \emph{not} guaranteed on older processors.
It is also important to note that atomic instructions operating
on a given memory location should all be of the same
size~\cite[Section 8.1.2.2]{Intel64IA32v3A2011}.
For example, if you write a program where one CPU atomically increments
a byte while another CPU executes a 4-byte atomic increment on
that same location, you are on your own.
However, note that some SSE instructions are weakly ordered ({\tt clflush}
and non-temporal move instructions~\cite{IntelXeonV2b-96a}).
CPUs that have SSE can use {\tt mfence} for \co{smp_mb()},
{\tt lfence} for \co{smp_rmb()}, and {\tt sfence} for \co{smp_wmb()}.
A few versions of the x86 CPU have a mode bit that enables out-of-order
stores, and for these CPUs, \co{smp_wmb()} must also be defined to
be {\tt lock;addl}.
Although newer x86 implementations accommodate self-modifying code
without any special instructions, to be fully compatible with
past and potential future x86 implementations, a given CPU must
execute a jump instruction or a serializing instruction (e.g., \co{cpuid})
between modifying the code and executing
it~\cite[Section 8.1.3]{Intel64IA32v3A2011}.
\subsection{zSeries}
The zSeries machines make up the IBM\textsuperscript{\texttrademark}
mainframe family, previously
known as the 360, 370, and 390~\cite{IBMzSeries04a}.
Parallelism came late to zSeries, but given that these mainframes first
shipped in the mid 1960s, this is not saying much.
The \qco{bcr 15,0} instruction is used for the Linux \co{smp_mb()},
\co{smp_rmb()}, and \co{smp_wmb()} primitives.
It also has comparatively strong memory-ordering semantics, as shown in
Table~\ref{tab:memorder:Summary of Memory Ordering}, which should allow the
\co{smp_wmb()} primitive to be a {\tt nop} (and by the time you read this,
this change may well have happened).
The table actually understates the situation, as the zSeries memory model
is otherwise sequentially consistent, meaning that all CPUs
will agree on the order of unrelated stores from different CPUs.
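A minimal sketch of the resulting Linux definitions, shown only for
illustration and subject to the possible \co{smp_wmb()} change noted
above, might be:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define smp_mb()  __asm__ __volatile__("bcr 15,0" : : : "memory")
2 #define smp_rmb() smp_mb()
3 #define smp_wmb() smp_mb()
\end{verbatim}
\end{minipage}
\vspace{5pt}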
As with most CPUs, the zSeries architecture does not guarantee a
cache-coherent instruction stream, hence,
self-modifying code must execute a serializing instruction between updating
the instructions and executing them.
That said, many actual zSeries machines do in fact accommodate self-modifying
code without serializing instructions.
The zSeries instruction set provides a large set of serializing instructions,
including compare-and-swap, some types of branches (for example, the
aforementioned \qco{bcr 15,0} instruction), and test-and-set,
among others.
\section{Where is Memory Ordering Needed?}
\label{sec:memorder:Where is Memory Ordering Needed?}
This section revisits
Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
in light of the intervening discussion.
Memory-ordering operations are only required where there is a possibility
of interaction involving at least two variables between at least two
threads.
As always, if a single-threaded program will provide sufficient
performance, why bother with parallelism?\footnote{
Hobbyists and researchers should of course feel free to ignore
this and other cautions.}
After all, avoiding parallelism also avoids the added cost of
memory-ordering operations.
If all thread-to-thread communication in a given cycle uses store-to-load links
(that is, the next thread's load returning the value that the previous thread
stored), minimal ordering suffices.
Minimal ordering includes dependencies, acquires, and all stronger
ordering operations.
If all but one of the links in a given cycle is a store-to-load
link, it is sufficient to use release-acquire pairs for each of
those store-to-load links.
You can replace a given acquire with a dependency in environments permitting
this, keeping in mind that the C11 standard's memory model does \emph{not}
permit this.
Note also that a dependency leading to a load must be headed by
a \co{lockless_dereference()} or an \co{rcu_dereference()}:
\co{READ_ONCE()} is not sufficient.
Never forget to carefully review
Sections~\ref{sec:memorder:Address- and Data-Dependency Restrictions}
and~\ref{sec:memorder:Control-Dependency Restrictions}, because
a dependency broken by your compiler is no help at all!
The two threads sharing the sole non-store-to-load link can
usually substitute \co{WRITE_ONCE()} plus \co{smp_wmb()} for
\co{smp_store_release()} on the one hand,
and \co{READ_ONCE()} plus \co{smp_rmb()} for \co{smp_load_acquire()}
on the other.
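For example, a hedged sketch of this substitution, using a made-up
variable named \co{flag}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 /* In place of smp_store_release(&flag, 1): */
2 smp_wmb();
3 WRITE_ONCE(flag, 1);
4
5 /* In place of r1 = smp_load_acquire(&flag): */
6 r1 = READ_ONCE(flag);
7 smp_rmb();
\end{verbatim}
\end{minipage}
\vspace{5pt}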
If a given cycle contains two or more non-store-to-load links (that is,
a total of two or more load-to-store and store-to-store links), you will
need at least one full barrier between each pair of non-store-to-load
links in that cycle.
Full barriers include \co{smp_mb()}, successful full-strength non-\co{void}
atomic RMW operations, and other atomic RMW operations in conjunction with
either \co{smp_mb__before_atomic()} or \co{smp_mb__after_atomic()}.
Any of RCU's grace-period-wait primitives (\co{synchronize_rcu()} and
friends) also act as full barriers, but at even greater expense than
\co{smp_mb()}.
With strength comes expense, though the overhead of full barriers
usually hurts performance more than it hurts scalability.
Note that these are the \emph{minimum} guarantees.
Different architectures may give
more substantial guarantees,
as discussed in Section~\ref{sec:memorder:Hardware Specifics},
but they may \emph{not}
be relied upon outside of code specifically designed to run only on
the corresponding architecture.
One final word of advice: Again, use of raw memory-ordering primitives is
a last resort.
It is almost always better to use existing primitives, such as locking
or RCU, that take care of memory ordering for you.