% memorder/memorder.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{qqzmemorder}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{the Linux kernel}{Howells2009membartxt}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{Linux Weekly News}{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{ASPLOS '18}{Alglave:2018:FSC:3173162.3177156}
%
\Epigraph{The art of progress is to preserve order amid change and to preserve change amid order.}{Alfred North Whitehead}
Causality and sequencing are deeply intuitive, and hackers often
have a strong grasp of these concepts.
These intuitions can be quite helpful when writing, analyzing, and
debugging not only sequential code, but also parallel code that makes
use of standard mutual-exclusion mechanisms such as locking.
Unfortunately, these intuitions break down completely in complex
concurrent code, such as that in the Linux kernel, which often uses
weakly ordered atomic operations and memory barriers.
One example of such code implements the standard mutual-exclusion
mechanisms themselves, while another example implements fast
paths that use weaker synchronization.
Insults to intuition notwithstanding, some argue that weakness is a
virtue~\cite{JadeAlglave2013-WeaknessIsVirtue}.
Vice or virtue, this chapter will help you gain an understanding of
memory ordering that, with practice, will be sufficient to implement
synchronization primitives and performance-critical fast paths.
First, \cref{sec:memorder:Memory-Model Intuitions}
provides some reliable intuitions and useful rules of thumb, which can
often provide a useful alternative to learning the full Linux-kernel
memory model (LKMM)\@.
However, those working on fast paths or concurrency primitives will
need the additional detail provided by the following sections.
\Cref{sec:memorder:Ordering: Why and How?}
will demonstrate that real computer systems can reorder memory references,
give some reasons why they do so, and provide some information on how
to prevent undesired reordering.
\Cref{sec:memorder:Tricks and Traps,%
sec:memorder:Compile-Time Consternation}
will cover the types of pain that hardware and compilers, respectively,
can inflict on unwary parallel programmers.
\Cref{sec:memorder:Higher-Level Primitives}
gives an overview of the benefits of modeling memory ordering at
higher levels of abstraction.
Finally,
\cref{sec:memorder:Hardware Specifics}
follows up with more detail on a few representative hardware platforms.
\QuickQuiz{
This chapter has been rewritten since the first edition,
and heavily edited since the second edition.
	Did memory ordering change all \emph{that} much since 2014,
	let alone since 2021?
}\QuickQuizAnswer{
The earlier memory-ordering section had its roots in a pair of
Linux Journal articles~\cite{PaulMcKenney2005i,PaulMcKenney2005j}
dating back to 2005.
Since then, the C and C++ memory models~\cite{PeteBecker2011N3242}
have been formalized
(and critiqued~\cite{MarkBatty2013OOTA-WorkingNote,Boehm:2014:OGA:2618128.2618134,Vafeiadis:2015:CCO:2775051.2676995,conf/esop/BattyMNPS15,Lahav:2017:RSC:3140587.3062352,OlivierGiroux2017-P0668R1}),
executable formal memory models for computer systems have become the
norm~\cite{Maranget2012TutorialARMPower,PaulEMcKenney2011ppcmem,test6-pdf,JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306,JadeAlglave2013-cav,Alglave:2014:HCM:2594291.2594347,PaulEMcKenney2014weakaxiom,Flur:2017:MCA:3093333.3009839,ARMv8A:2017},
and there is even a memory model for the Linux
kernel~\cite{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2,Alglave:2018:FSC:3173162.3177156},
along with a paper describing differences between the C11 and
Linux memory models~\cite{PaulEMcKenney2016P0124R6-LKMM}.
The \IXacrf{kcsan}~\cite{MarcoElver2020FearDataRaceDetector1,MarcoElver2020FearDataRaceDetector2},
based in part on
RacerD~\cite{SamBlackshear2018RacerD}
and implementing \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model},
has also been added to the Linux kernel
and is now heavily used.
Finally, there are now better ways of describing the Linux-kernel
memory model (LKMM).
Given all this progress, substantial change was required.
}\QuickQuizEnd
\section{Memory-Model Intuitions}
\label{sec:memorder:Memory-Model Intuitions}
%
\epigraph{Almost all people are intelligent.
It is method that they lack.}
{F. W. Nichol}
This section is for people who would like to avoid learning all the
details of the \IXalthmr{Linux-kernel memory model (\acr{lkmm})}
{Linux kernel}{memory model}, the better
to get on with their concurrency lives.
The good news is that learning a very small fraction of LKMM can get
you most of its benefits.
% @@@ \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
% @@@ and \cref{sec:memorder:Basic Rules of Thumb},
% @@@ summarizing the intervening discussion with some appeals to transitive
% @@@ intuitions and with more sophisticated rules of thumb.
But first, it is necessary to understand that useful ordering is a
property not of any single component of the system, but of the system
as a whole.
For example, it does not help for the hardware to preserve ordering if
the compiler does not also do so.
Nor does it help for both the hardware and compiler to provide ways
to preserve ordering if developers fail to use them.
The Linux kernel adds architecture-specific assembly-language primitives
to the mix, which must also preserve the required orderings.
In short, all the components of the system must work together to provide
any needed ordering.
Next, it is necessary to understand the temporal and non-temporal
nature of communication from one thread to another when using memory
as the communications medium, as will be discussed in detail in
\cref{sec:memorder:Multicopy Atomicity}.
The key point is that although loads and stores are conceptually simple,
on real multicore hardware significant periods of time are required for
their effects to become visible to all other threads.
Strange though it might seem, this means that a given store's value
might not be returned by a load that happens later in wall-clock time,
and it also means that a given store's value might be overwritten by a
store that happens earlier in wall-clock time.
The simple and intuitive case occurs when one thread loads a value that
some other thread stored.
This straightforward cause-and-effect case exhibits temporal behavior, so
that the software can safely assume that the store instruction completed
before the load instruction started.
In real life, the load instruction might well have started quite some
time before the store instruction did, but all modern hardware must
carefully hide such cases from the software.
Software will thus see the expected temporal cause-and-effect
behavior when one thread loads a value that some other thread stores,
as will be discussed in \cref{sec:memorder:Happens-Before}.
This intuitive temporal behavior provides the basis for the next section's
transitive intuitions.
\subsection{Transitive Intuitions}
\label{sec:memorder:Transitive Intuitions}
This section lists intuitions regarding single threads or variables,
locking, release-acquire chains, RCU, and fully ordered code.
\subsubsection{Singular Intuitive Bliss}
\label{sec:memorder:Singular Intuitive Bliss}
A program that has only one variable or only one thread will see
all accesses in order.
There is quite a bit of code that can attain adequate performance
when running single-threaded on modern computer systems, but
this book is primarily about software that needs multiple CPUs.
On, then, to the next section.
\subsubsection{Locking Intuitions}
\label{sec:memorder:Locking Intuitions}
Another transitive intuition involves that much-maligned workhorse,
locking, as will be described in more detail in
\cref{sec:memorder:Locking},
to say nothing of
\cref{chp:Locking}.
This section contains a graphical description followed by a verbal
description.
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/locktrans}}
\caption{Locking Intuitions}
\label{fig:memorder:Locking Intuitions}
\end{figure*}
The graphical description is shown in
\cref{fig:memorder:Locking Intuitions},
which shows a lock being acquired and released by CPUs~0, 1, and~2
in that order.
The solid black arrows depict the unlock-lock ordering.
The dotted lines emanating from them to the wide green arrows
show the effects on ordering.
In particular:
\begin{enumerate}
\item The fact that CPU~0's unlock precedes CPU~1's lock ensures that
any access executed by CPU~0 within or before its critical
section will be seen by accesses executed by CPU~1 within
and after its critical section.
\item The fact that CPU~0's unlock precedes CPU~2's lock ensures that
any access executed by CPU~0 within or before its critical
section will be seen by accesses executed by CPU~2 within
and after its critical section.
\item The fact that CPU~1's unlock precedes CPU~2's lock ensures that
any access executed by CPU~1 within or before its critical
section will be seen by accesses executed by CPU~2 within and
after its critical section.
\end{enumerate}
In short, lock-based ordering is transitive through CPUs~0, 1, and~2.
A key point is that this ordering extends beyond the critical sections,
so that everything before an earlier lock release is seen by everything
after a later lock acquisition.
For those who prefer words to diagrams, code holding a given lock will
see the accesses in all prior critical sections for that same lock,
transitively.
And if such code sees the accesses in a given critical section, it will
also see the accesses in all of that CPU's code preceding
that critical section.
In other words, when a CPU releases a given lock, all
of that lock's subsequent critical sections will see the accesses in
all of that CPU's code preceding that lock release.
Inversely, code holding a given lock will be protected from seeing the
accesses in any subsequent critical sections for that same lock, again,
transitively.
And if such code is protected against seeing the accesses in a given
critical section, it will also be protected against seeing the accesses
in all of that CPU's code following that critical section.
In other words, when a CPU acquires a given lock, all of
that lock's previous critical sections will be protected from seeing
the accesses in all of that CPU's code following that lock
acquisition.
But what does it mean to ``see accesses'' and exactly what accesses
are seen?
To start, an access is either a load or a store, possibly occurring as part
of a read-modify-write operation.
If a CPU's code prior to its release of a given lock contains
an access A to a given variable, then for an access B to that same variable
contained in any CPU's code following a later acquisition
of that same lock:
\begin{enumerate}
\item If A and B are both loads, then B will return either the same
value that A did or some later value.
\item If A is a load and B is a store, then B will overwrite either the
value loaded by A or some later value.
\item If A is a store and B is a load, then B will return either the
value stored by A or some later value.
\item If A and B are both stores, then B will overwrite either the value
stored by A or some later value.
\end{enumerate}
Here, ``some later value'' is shorthand for ``the value stored by some
intervening access''.
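
To make this concrete, consider the following rough sketch, which is
not taken from the Linux kernel and whose lock, variable, and function
names are made up for illustration:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
DEFINE_SPINLOCK(mylock);
int myvar;

void cpu0(void)                 /* Releases mylock first. */
{
        WRITE_ONCE(myvar, 1);           /* Access A. */
        spin_lock(&mylock);
        /* Critical section. */
        spin_unlock(&mylock);
}

void cpu1(void)                 /* Acquires mylock later. */
{
        int r1;

        spin_lock(&mylock);
        r1 = READ_ONCE(myvar);          /* Access B. */
        spin_unlock(&mylock);
}
\end{verbatim}

If \co{cpu1()}'s acquisition of \co{mylock} follows \co{cpu0()}'s
release of that same lock, the third case above applies, so that
\co{r1} will be either the value~1 stored by access A or some later
value.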
\QuickQuiz{
But what about reader-writer locking?
}\QuickQuizAnswer{
Reader-writer locking works the same as does exclusive locking,
with the proviso that ordering is guaranteed only between a
pair of conflicting critical sections.
This means that a read-side critical section will be ordered
before a later write-side critical section and vice versa.
But this also means that there is no guarantee of ordering
between a pair of read-side critical sections unless there is
an intervening write-side critical section.
As of mid-2023, LKMM does not yet model reader-writer locking,
but this is on the to-do list.
}\QuickQuizEnd
Locking is strongly intuitive, which is one reason why it has survived
so many attempts to eliminate it.
This is also one reason why you should use it where it applies.
\subsubsection{Release-Acquire Intuitions}
\label{sec:memorder:Release-Acquire Intuitions}
Release-acquire chains also behave in a transitively intuitive manner
not unlike that of locking, except that release-acquire chains do not
provide mutual exclusion.
This section also contains a graphical description followed by a verbal
description.
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/relacqtrans}}
\caption{Release-Acquire Intuitions}
\label{fig:memorder:Release-Acquire Intuitions}
\end{figure*}
The graphical description is shown in
\cref{fig:memorder:Release-Acquire Intuitions},
which shows a release-acquire chain extending through CPUs~0, 1, and~2.
The solid black arrows depict the release-acquire ordering.
The dotted lines emanating from them to the wide green arrows show the
effects on ordering.
\begin{enumerate}
\item The fact that CPU~0's release of A is read by CPU~1's acquire of A
ensures that any accesses executed by CPU~0 prior to its release
will be seen by any accesses executed by CPU~1 after its acquire.
\item The fact that CPU~1's release of B is read by CPU~2's acquire of B
ensures that any accesses executed by CPU~1 prior to its release
will be seen by any accesses executed by CPU~2 after its acquire.
\item Note also that CPU~0's release of A is read by CPU~1's acquire of
	A, which precedes CPU~1's release of B, which is read by CPU~2's
acquire of B\@.
Taken together, all this ensures that any accesses executed by
CPU~0 prior to its release will be seen by any accesses executed
by CPU~2 after its acquire.
\end{enumerate}
This illustrates that properly constructed release-acquire ordering is
transitive through CPUs~0, 1, and~2, and in fact may be extended through
as many CPUs as needed.\footnote{
But please note that stray stores to either A or B can break
the release-acquire chain, as illustrated by
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}.}
For those who prefer words to diagrams, when an acquire loads the value
stored by a release, discussed in
\cref{sec:memorder:Release-Acquire Chains},
then the code following that acquire will see all accesses preceding
that release.
More precisely, if CPU~0 does an acquire that loads the value stored by
CPU~1's release, then all the subsequent accesses executed by CPU~0 will
see all of CPU~1's accesses prior to its release.
Similarly, the accesses preceding that release access will be protected
from seeing the accesses following the acquire access.
(More precision is left as an exercise to the reader.)
Releases and acquires can be chained, for example CPU~0's release
stores the value loaded by CPU~1's acquire, a later release by CPU~1
stores the value loaded by CPU~2's acquire, and so on.
The accesses following a given acquire will see the accesses preceding
each prior release in the chain, and, inversely, the accesses preceding a
given release will be protected from seeing the accesses following each
later acquire in the chain.
Some long-chain examples will be illustrated by
\cref{lst:memorder:Long LB Release-Acquire Chain,%
lst:memorder:Long ISA2 Release-Acquire Chain,%
lst:memorder:Long Z6.2 Release-Acquire Chain}
on
\cpagerefrange{lst:memorder:Long LB Release-Acquire Chain}{lst:memorder:Long Z6.2 Release-Acquire Chain},
respectively.
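
The following rough sketch, which uses made-up names and is in no way
taken from those listings, shows one form that such a chain might take
in Linux-kernel C:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x, a, b;

void cpu0(void)
{
        WRITE_ONCE(x, 1);          /* Ordered before the release. */
        smp_store_release(&a, 1);
}

void cpu1(void)
{
        while (!smp_load_acquire(&a))
                continue;          /* Reads cpu0()'s release. */
        smp_store_release(&b, 1);  /* Extends the chain. */
}

void cpu2(void)
{
        int r1;

        while (!smp_load_acquire(&b))
                continue;          /* Reads cpu1()'s release. */
        r1 = READ_ONCE(x);         /* Guaranteed to be 1. */
}
\end{verbatim}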
The seeing and not seeing of accesses works the same way as described in
\cref{sec:memorder:Locking Intuitions}.
However, as will be illustrated by
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
on
\cpageref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)},
the acquire access must load exactly what was stored by the release access.
Any intervening store that is not itself part of that same release-acquire
chain will break the chain.
And again, release-acquire chains do not provide mutual exclusion.
Nevertheless, properly constructed release-acquire chains are transitive,
intuitive, and useful.
\subsubsection{RCU Intuitions}
\label{sec:memorder:RCU Intuitions}
As noted in \cref{sec:defer:RCU Fundamentals} on
\cpageref{sec:defer:RCU Fundamentals}, RCU provides a number
of ordering guarantees.
The first is the publish-subscribe mechanism described in
\cref{sec:defer:Publish-Subscribe Mechanism}
on
\cpageref{sec:defer:Publish-Subscribe Mechanism}.
This resembles the acquire-release chains discussed in the previous
section, but substitutes a member of the \co{rcu_dereference()} family
of primitives for the \co{smp_load_acquire()}.
Unlike \co{smp_load_acquire()}, the ordering implied by
\co{rcu_dereference()} applies only to subsequent accesses that
dereference the pointer returned by that \co{rcu_dereference()}
as shown in
\cref{fig:defer:Publication/Subscription Constraints}
on
\cpageref{fig:defer:Publication/Subscription Constraints}.
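
As an informal illustration, with hypothetical structure, variable, and
function names and with error handling omitted, the publish-subscribe
pattern might look as follows:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
struct foo {
        int a;
};
struct foo *gp;                    /* RCU-protected pointer. */

void publish(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        p->a = 42;                 /* Initialize, then... */
        rcu_assign_pointer(gp, p); /* ...publish. */
}

void subscribe(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);   /* Subscribe. */
        if (p)
                do_something(p->a); /* Ordered: sees 42. */
        rcu_read_unlock();
}
\end{verbatim}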
The second guarantee says that if any part of an RCU read-side
critical section precedes the beginning of a grace period, then
the entirety of that critical section precedes the end of that
grace period, as shown in
\cref{fig:defer:RCU Reader and Later Grace Period}
on
\cpageref{fig:defer:RCU Reader and Later Grace Period}.
The third guarantee says that if any part of an RCU read-side critical
section follows the end of a grace period, then the entirety of that
critical section follows the beginning of that grace period, as shown in
\cref{fig:defer:RCU Reader and Earlier Grace Period}
on
\cpageref{fig:defer:RCU Reader and Earlier Grace Period}.
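
A rough sketch illustrating the combined effect of the second and third
guarantees, again with hypothetical variable and function names, follows:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x, y;

void updater(void)
{
        WRITE_ONCE(x, 1);
        synchronize_rcu();     /* Waits for pre-existing readers. */
        WRITE_ONCE(y, 1);
}

void reader(void)
{
        int r1, r2;

        rcu_read_lock();
        r1 = READ_ONCE(y);
        r2 = READ_ONCE(x);
        rcu_read_unlock();
        /* The outcome r1 == 1 && r2 == 0 is forbidden. */
}
\end{verbatim}

If the reader sees the updater's second store (\co{r1 == 1}), then part
of its critical section follows the end of the grace period, so the
entire critical section must follow that grace period's beginning,
which in turn means that \co{r2} must be~1.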
These two guarantees are discussed in
\cref{sec:defer:Wait For Pre-Existing RCU Readers}
on
\cpageref{sec:defer:Wait For Pre-Existing RCU Readers},
with more examples shown in
\cref{fig:defer:RCU Reader Within Grace Period,%
fig:defer:Summary of RCU Grace-Period Ordering Guarantees}
on
\cpageref{fig:defer:RCU Reader Within Grace Period,fig:defer:Summary of RCU Grace-Period Ordering Guarantees}.
These two guarantees have further version-maintenance consequences that
are discussed in
\cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}
on
\cpageref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}.
These guarantees will be discussed somewhat more formally in
\cref{sec:memorder:RCU}
on
\cpageref{sec:memorder:RCU}.
Much of the sophistication of RCU lies not in its guarantees, but in its
use cases, which are the subject of
\cref{sec:defer:RCU Usage}
starting on
\cpageref{sec:defer:RCU Usage}.
\subsubsection{Fully Ordered Intuitions}
\label{sec:memorder:Fully Ordered Intuitions}
A more extreme example of transitivity places at least one \co{smp_mb()}
between each pair of accesses.
All accesses seen by any given access will also be seen by all later
accesses.
The resulting program will be fully ordered, if somewhat slow.
Such programs will be sequentially consistent and much loved by
formal-verification experts who specialize in tried-and-true 1980s
proof techniques.
But slow or not, \co{smp_mb()} is always there when you need it!
Nevertheless, there are situations that cannot be addressed by these
intuitive approaches.
The next section therefore presents a more complete, if less transitive,
set of rules of thumb.
\subsection{Rules of Thumb}
\label{sec:memorder:Rules of Thumb}
The transitive intuitions presented in the previous section are
very appealing, at least as \IX{memory model}s go.
Unfortunately, hardware is under no obligation to provide temporal
cause-and-effect illusions when one thread's store overwrites a value either
loaded or stored by some other thread.
It is quite possible that, from the software's viewpoint, an earlier store
will overwrite a later store's value, but only if those two stores were
executed by different threads, as will be illustrated by
\cref{fig:memorder:Store-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Store-to-Store is Counter-Temporal}.
Similarly, a later load might well read a value overwritten by an
earlier store, but again only if that load and store were executed by
different threads, as will be illustrated by
\cref{fig:memorder:Load-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Load-to-Store is Counter-Temporal}.
This counter-intuitive behavior occurs due to the need to buffer
stores in order to achieve adequate performance, as will be discussed in
\cref{sec:memorder:Propagation}
on
\cpageref{sec:memorder:Propagation}.
\begin{figure}
\centering
\resizebox{.5\onecolumntextwidth}{!}{\includegraphics{memorder/ifthen}}
\caption{If-Then Nature of Memory Ordering}
\label{fig:memorder:If-Then Nature of Memory Ordering}
\end{figure}
As a result, situations where one thread reads a value written by some
other thread can make do with far weaker ordering than can situations
where one thread overwrites a value loaded or stored by some other thread.
These differences are captured by the following rules of thumb.
However, it is important to note that none of these rules involve
mutual exclusion.
There is instead an if-then nature to these rules, as illustrated by
\cref{fig:memorder:If-Then Nature of Memory Ordering}.
In the figure, if CPU~1's reference to Y follows that of CPU~0,
and if both CPUs provide sufficient ordering between their two
accesses, then CPU~0's access to X will also precede that of
CPU~1.
The first rule of thumb is that memory-ordering operations are only
required where there is a possibility of interaction between at least
two variables shared among at least two threads, which underlies the
singular intuitive bliss presented in
\cref{sec:memorder:Singular Intuitive Bliss}.
In light of the intervening material, this single sentence encapsulates
many of the basic rules of thumb that will be presented in
\cref{sec:memorder:Basic Rules of Thumb}
on
\cpageref{sec:memorder:Basic Rules of Thumb},
for example, keeping in mind that ``memory-barrier pairing'' is a
two-thread special case of ``cycle''.
And, as always, if a single-threaded program will provide sufficient
performance, why bother with parallelism?\footnote{
Hobbyists and researchers should of course feel free to ignore
this and many other cautions.}
After all, avoiding parallelism also avoids the added cost and complexity
of memory-ordering operations.
The second rule of thumb involves load-buffering situations:
If all thread-to-thread communication in a given cycle uses store-to-load
links (that is, the next thread's load returns the value stored by
the previous thread), minimal ordering suffices.
Minimal ordering includes dependencies and acquires as well as all stronger
ordering operations.
Because a lock acquisition must load the lock-word value stored by any
prior release of that lock, this rule of thumb underlies the locking
intuitions presented in
\cref{sec:memorder:Locking Intuitions}.
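
For example, in the following rough sketch (which uses hypothetical
names and is not one of the litmus tests presented later), both links
in the cycle are store-to-load links, so an acquire load on one side
and a data dependency on the other suffice to forbid the cycle:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x0, x1;

void cpu0(void)
{
        int r2;

        r2 = smp_load_acquire(&x1); /* Acquire orders later store. */
        WRITE_ONCE(x0, 2);
}

void cpu1(void)
{
        int r2;

        r2 = READ_ONCE(x0);
        WRITE_ONCE(x1, r2);         /* Data dependency from the load. */
}

/* Both instances of r2 ending up equal to 2 is forbidden. */
\end{verbatim}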
The third rule of thumb involves release-acquire chains:
If all but one of the links in a given cycle is a store-to-load
link, it is sufficient to use release-acquire pairs for each of
those store-to-load links, as will be illustrated by
\cref{lst:memorder:Long ISA2 Release-Acquire Chain,%
lst:memorder:Long Z6.2 Release-Acquire Chain}
starting on
\cpageref{lst:memorder:Long ISA2 Release-Acquire Chain}.
This rule underlies the release-acquire intuitions presented in
\cref{sec:memorder:Release-Acquire Intuitions}.
You can replace a given acquire with a dependency in environments permitting
this, keeping in mind that the
\IXalthmr{C11 standard's memory model}{C11}{memory model} does \emph{not}
fully respect dependencies.
Therefore, a dependency leading to a load must be headed by
a \co{READ_ONCE()} or an \co{rcu_dereference()}:
A plain C-language load is not sufficient.
In addition, carefully review
\cref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities},
on
\cpageref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities},
because a dependency broken by your compiler will not order anything.
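
For example, in the following rough sketch (hypothetical names, error
handling omitted, and closely paralleling the RCU publish-subscribe
pattern), the consumer's \co{READ_ONCE()} heads an address dependency
that substitutes for an acquire load, while the producer still needs
its release store:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
struct foo {
        int a;
};
struct foo *gp;

void producer(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        p->a = 1;
        smp_store_release(&gp, p); /* Release is still required. */
}

void consumer(void)
{
        struct foo *p;
        int r1;

        p = READ_ONCE(gp);         /* Heads the address dependency. */
        if (p)
                r1 = p->a;         /* Dependent load: sees 1. */
}
\end{verbatim}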
The two threads sharing the sole non-store-to-load link can
sometimes substitute \co{WRITE_ONCE()} plus \co{smp_wmb()} for
\co{smp_store_release()} on the one hand,
and \co{READ_ONCE()} plus \co{smp_rmb()} for \co{smp_load_acquire()}
on the other.
However, the wise developer will check such substitutions carefully,
for example, using the \co{herd} tool as described in
\cref{sec:formal:Axiomatic Approaches}
on
\cpageref{sec:formal:Axiomatic Approaches}.
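
For example, the following rough sketch (hypothetical names) applies
this substitution to a simple message-passing pattern; as noted above,
any such substitution deserves a check with \co{herd}:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int data, flag;

void producer(void)
{
        WRITE_ONCE(data, 42);
        smp_wmb();          /* Instead of smp_store_release(). */
        WRITE_ONCE(flag, 1);
}

void consumer(void)
{
        int r1, r2;

        r1 = READ_ONCE(flag);
        smp_rmb();          /* Instead of smp_load_acquire(). */
        r2 = READ_ONCE(data);
        /* The outcome r1 == 1 && r2 == 0 is forbidden. */
}
\end{verbatim}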
\QuickQuiz{
Why is it necessary to use heavier-weight ordering for
load-to-store and store-to-store links, but not for
store-to-load links?
What on earth makes store-to-load links so special???
}\QuickQuizAnswer{
Recall that load-to-store and store-to-store links can be
counter-temporal, as illustrated by
\cref{fig:memorder:Load-to-Store is Counter-Temporal,%
fig:memorder:Store-to-Store is Counter-Temporal} in
\cref{sec:memorder:Propagation}.
This counter-temporal nature of load-to-store and store-to-store
links necessitates strong ordering.
	In contrast, store-to-load links are temporal, as illustrated by
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test,%
lst:memorder:Load-Buffering Control-Dependency Litmus Test}.
This temporal nature of store-to-load links permits use of
minimal ordering.
}\QuickQuizEnd
The fourth and final rule of thumb identifies where full memory barriers
(or stronger) are required:
If a given cycle contains two or more non-store-to-load links (that is, a
total of two or more links that are either load-to-store or store-to-store
links), you will need at least one full barrier between each pair of
non-store-to-load links in that cycle, as will be illustrated by
\cref{lst:memorder:W+WRC Litmus Test With More Barriers}
on
\cpageref{lst:memorder:W+WRC Litmus Test With More Barriers}
and most especially by the example presented in
\cref{sec:memorder:A Counter-Intuitive Case Study}
on
\cpageref{sec:memorder:A Counter-Intuitive Case Study}.
% @@@ \cref{lst:memorder:W+WRC Litmus Test With More Barriers}
% @@@ as well as in the answer to
% @@@ \QuickQuizARef{\MemorderQQLitmusTestR}.
Full barriers include \co{smp_mb()}, successful full-strength non-\co{void}
atomic RMW operations, and other atomic RMW operations in conjunction with
either \co{smp_mb__before_atomic()} or \co{smp_mb__after_atomic()}.
Any of RCU's grace-period-wait primitives (\co{synchronize_rcu()} and
friends) also act as full barriers, but at far greater expense than
\co{smp_mb()}.
With strength comes expense, though full barriers
usually hurt performance more than they hurt scalability.
The extreme logical endpoint of this rule of thumb underlies the
fully ordered intuitions presented in
\cref{sec:memorder:Fully Ordered Intuitions}.
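
For example, the following rough sketch (hypothetical names) upgrades
an otherwise-unordered \co{atomic_inc()} to act as a full barrier:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
atomic_t ctr = ATOMIC_INIT(0);
int x, y;

void cpu0(void)
{
        int r1;

        WRITE_ONCE(x, 1);
        smp_mb__before_atomic(); /* Combined with the following */
        atomic_inc(&ctr);        /* RMW, orders the prior store */
        r1 = READ_ONCE(y);       /* against the later load. */
}
\end{verbatim}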
Recapping the rules:
\begin{enumerate}
\item Memory-ordering operations are required only if at least
two variables are shared by at least two threads.
\item If all links in a cycle are store-to-load links, then
minimal ordering suffices.
\item If all but one of the links in a cycle are store-to-load links,
then each store-to-load link may use a release-acquire pair.
\item Otherwise, at least one full barrier is required between
each pair of non-store-to-load links.
\end{enumerate}
Note that an architecture is permitted to provide stronger guarantees, as
will be discussed in
\cref{sec:memorder:Hardware Specifics},
but these guarantees may only be relied upon in code that runs only on
that architecture.
In addition, more accurate
\IX{memory model}s~\cite{Alglave:2018:FSC:3173162.3177156}
may give stronger guarantees with lower-overhead operations than do
these rules of thumb, albeit at the expense of greater complexity.
In these more formal memory-ordering papers, a store-to-load link is an
example of a reads-from (rf) link, a load-to-store link is an example
of a from-reads (fr) link, and a store-to-store link is an example of
a coherence (co) link.
One final word of advice:
Use of raw memory-ordering primitives is a last resort.
It is almost always better to use existing primitives, such as locking
or RCU, thus letting those primitives do the memory ordering for you.
With that said, the next section describes why hardware misorders memory
references and some of the things that you can do about it.
\section{Ordering:
Why and How?}
\label{sec:memorder:Ordering: Why and How?}
%
\epigraph{Nothing is orderly till people take hold of it.
Everything in creation lies around loose.}
{Henry Ward Beecher, updated}
One motivation for memory ordering can be seen in the trivial-seeming
litmus test in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
(\path{C-SB+o-o+o-o.litmus}),
which at first glance might
appear to guarantee that the \co{exists} clause never triggers.\footnote{
Purists would instead insist that the \co{exists} clause is
never \emph{satisfied}, but we use ``trigger'' here by
analogy with assertions.}
After all, if \nbco{0:r2=0} as shown in the \co{exists} clause,\footnote{
That is, Thread~\co{P0()}'s instance of local variable \co{r2}
equals zero.
See \cref{sec:formal:Anatomy of a Litmus Test}
for documentation of litmus-test nomenclature.}
we might hope that Thread~\co{P0()}'s load from~\co{x1} into \co{r2}
must have happened before Thread~\co{P1()}'s store to~\co{x1}, which
might raise further hopes that Thread~\co{P1()}'s load from~\co{x0}
into \co{r2} must happen after Thread~\co{P0()}'s store to~\co{x0},
so that \nbco{1:r2=2}, thus never triggering the \co{exists} clause.
The example is symmetric, so similar reasoning might lead
us to hope that \nbco{1:r2=0} guarantees that \nbco{0:r2=2}.
Unfortunately, the lack of memory barriers dashes these hopes.
The CPU is within its rights to reorder
the statements within both Thread~\co{P0()} and Thread~\co{P1()},
even on relatively strongly ordered systems such as x86.
\begin{listing}
\input{CodeSamples/formal/litmus/C-SB+o-o+o-o@whole.fcv}
\caption{Memory Misordering:
Store-Buffering Litmus Test}
\label{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
\end{listing}
\QuickQuiz{
The compiler can also reorder Thread~\co{P0()}'s and
Thread~\co{P1()}'s memory accesses in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test},
right?
}\QuickQuizAnswer{
In general, compiler optimizations carry out more extensive
and profound reorderings than CPUs can.
However, in this case, the volatile accesses in
\co{READ_ONCE()} and \co{WRITE_ONCE()}
prevent the compiler from reordering.
And also from doing much else as well, so the examples in this
chapter will be making heavy use of
\co{READ_ONCE()} and \co{WRITE_ONCE()}.
See \cref{sec:memorder:Compile-Time Consternation}
for more detail on the need for \co{READ_ONCE()} and \co{WRITE_ONCE()}.
}\QuickQuizEnd
This willingness to reorder can be confirmed using tools such as
\co{litmus7}~\cite{Alglave:2014:HCM:2594291.2594347},
which found that the counter-intuitive ordering happened
314 times out of 100,000,000 trials on an x86 laptop.
Oddly enough, the perfectly legal outcome where both loads return the
value 2 occurred less frequently, in this case, only 167 times.\footnote{
Please note that results are sensitive to the exact hardware
configuration,
how heavily the system is loaded, and much else besides.
So why not try it out on your own system?}
The lesson here is clear:
Increased counter-intuitivity does not necessarily imply decreased probability!
% Run on June 23, 2017:
% litmus7 -r 1000 -carch X86 C-SB+o-o+o-o.litmus
% Test C-SB+o-o+o-o Allowed
% Histogram (4 states)
% 314 *>0:r2=0; 1:r2=0;
% 49999625:>0:r2=2; 1:r2=0;
% 49999894:>0:r2=0; 1:r2=2;
% 167 :>0:r2=2; 1:r2=2;
The following sections show exactly how this intuition breaks down,
and then put forward some mental models of memory ordering that can help
you avoid these pitfalls.
\Cref{sec:memorder:Why Hardware Misordering?}
gives a brief overview of why hardware misorders memory accesses, and then
\cref{sec:memorder:How to Force Ordering?}
gives an equally brief overview of how you can thwart such misordering.
Finally, \cref{sec:memorder:Basic Rules of Thumb}
lists some basic rules of thumb, which will be further refined in
later sections.
These sections focus on hardware reordering, but rest assured that compilers
reorder much more aggressively than hardware ever dreamed of doing.
Compiler-induced reordering will be taken up in
\cref{sec:memorder:Compile-Time Consternation}.
\subsection{Why Hardware Misordering?}
\label{sec:memorder:Why Hardware Misordering?}
But why does memory misordering happen in the first place?
Can't CPUs keep track of ordering on their own?
Isn't that why we have computers in the first place, to keep track of things?
\begin{figure*}
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch}}
\caption{Intel Core 2 Architecture}
\ContributedBy{fig:memorder:Intel Core 2 Architecture}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted}
\end{figure*}
Many people do indeed expect their computers to keep track of things,
but many also insist that they keep track of things quickly.
In fact, so intense is the focus on performance that modern CPUs
are extremely complex, as can be seen in the simplified block diagram in
\cref{fig:memorder:Intel Core 2 Architecture}.
Those needing to squeeze the last few percent of performance from their
systems will in turn need to pay close attention to the fine details of this
figure when tuning their software.
Except that this close attention to detail means that when a given CPU
degrades with age, the software will no longer run quickly on it.
For example, if the leftmost ALU fails, software tuned to take full
advantage of all of the ALUs might well run more slowly than untuned
software.
One solution to this problem is to take systems out of service as soon
as any of their CPUs start degrading.
\begin{figure*}
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch-simplified}}
\caption{Intel Core 2 Architecture Simplified}
\ContributedBy{fig:memorder:Intel Core 2 Architecture Simplified}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted}
\end{figure*}
Another option is to recall the lessons of
\cref{chp:Hardware and its Habits},
especially the lesson that for many important workloads, main memory
cannot keep up with modern CPUs, which can execute hundreds of
instructions in the time required to fetch a single variable from memory.
For such workloads, the detailed internal structure of the CPU is
irrelevant, and the CPU can instead be approximated by the blue shapes in
\cref{fig:memorder:Intel Core 2 Architecture Simplified}
labeled CPU, store buffer, and cache.
Because of these data-intensive workloads, CPUs sport increasingly large
caches, as was seen back in
\cref{fig:cpu:System Hardware Architecture},
which means that although the first load by a given CPU from a given
variable will result in an expensive \emph{cache miss} as was discussed in
\cref{sec:cpu:Cache Misses}, subsequent
repeated loads from that variable by that CPU might execute
very quickly because the initial cache miss will have loaded that
variable into that CPU's cache.
However, it is also necessary to accommodate frequent concurrent stores
from multiple CPUs to a set of shared variables.
In cache-coherent systems, if the caches hold multiple copies of a given
variable, all the copies of that variable must have the same value.
This works extremely well for concurrent loads, but not so well for
concurrent stores:
Each store must do something about all copies of the old value
(another cache miss!), which, given the finite speed of light and
the atomic nature of matter, will be slower than impatient software
hackers would like.
And these strings of stores are the reason for the blue block
labelled store buffer in
\cref{fig:memorder:Intel Core 2 Architecture Simplified}.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{memorder/SystemArchSB}}
\caption{System Architecture With Store Buffers}
\label{fig:memorder:System Architecture With Store Buffers}
\end{figure}
Removing the internal CPU complexity from
\cref{fig:memorder:Intel Core 2 Architecture Simplified},
adding a second CPU, and showing main memory results in
\cref{fig:memorder:System Architecture With Store Buffers}.
When a given CPU stores to a variable
not present in that CPU's cache, then the new value
is instead placed in that CPU's store buffer.
The CPU can then proceed immediately, without having to wait for the
store to do something about all the old values of that variable
residing in other CPUs' caches.
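
Conceptually, and only conceptually (real hardware is far more
involved, and every name in the following sketch is made up), a store
might be handled as follows:

\begin{verbatim}
/* Conceptual sketch only: no real CPU works exactly this way. */
void cpu_store(struct cpu *cpu, long addr, long val)
{
        if (cacheline_writable(cpu, addr)) {
                cache_write(cpu, addr, val);       /* Fast path. */
        } else {
                store_buffer_add(cpu, addr, val);  /* Record store, */
                request_cacheline(cpu, addr);      /* fetch line,   */
        }                                          /* don't wait.   */
}
\end{verbatim}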
\begin{figure}
\centering
\resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}}
\caption{CPUs Can Do Things Out of Order}
\ContributedBy{fig:memorder:CPUs Can Do Things Out of Order}{Melissa Broussard}
\end{figure}
Although store buffers can greatly increase performance, they can cause
instructions and memory references to execute out of order, which can
in turn cause serious confusion, as fancifully illustrated in
\cref{fig:memorder:CPUs Can Do Things Out of Order}.
\begin{table*}
\rowcolors{6}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering\OneColumnHSpace{-0.1in}
\ebresizewidth{
\begin{tabular}{rllllll}
\toprule
& \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cmidrule(l){2-4} \cmidrule(l){5-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7}
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
3 & \tco{r2 = x1;} (0) & \tco{x0==2} & \tco{x1==0} &
\tco{r2 = x0;} (0) & \tco{x1==2} & \tco{x0==0} \\
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
\bottomrule
\end{tabular}
}
\caption{Memory Misordering:
Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
\end{table*}
In particular, store buffers cause the memory misordering
illustrated by
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\Cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
shows the steps leading to this misordering.
Row~1 shows the initial state, where CPU~0 has \co{x1} in its cache
and CPU~1 has \co{x0} in its cache, both variables having a value of zero.
\begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole]
Row~2 shows the state change due to each CPU's store (\clnref{st0,st1} of
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because neither CPU has the stored-to variable in its cache, both CPUs
record their stores in their respective store buffers.
\end{fcvref}
\QuickQuiz{
But wait!!!
On row~2 of
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
both \co{x0} and \co{x1} each have two values at the same time,
namely zero and two.
How can that possibly work???
}\QuickQuizAnswer{
	There is an underlying cache-coherence protocol that straightens
	things out, as discussed in
	\cref{sec:app:whymb:Cache-Coherence Protocols}.
But if you think that a given variable having two values at
the same time is surprising, just wait until you get to
\cref{sec:memorder:Variables With Multiple Values}!
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole]
Row~3 shows the two loads (\clnref{ld0,ld1} of
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because the variable being loaded by each CPU is in that CPU's cache,
each load immediately returns the cached value, which in both cases
is zero.
\end{fcvref}
But the CPUs are not done yet:
Sooner or later, they must empty their store buffers.
Because caches move data around in relatively large blocks called
\emph{cachelines}, and because each cacheline can hold several
variables, each CPU must get the cacheline into its own cache so
that it can update the portion of that cacheline corresponding
to the variable in its store buffer, but without disturbing any
other part of the cacheline.
Each CPU must also ensure that the cacheline is not present in any other
CPU's cache, for which a read-invalidate operation is used.
As shown on row~4, after both read-invalidate operations complete,
the two CPUs have traded cachelines, so that CPU~0's cache now contains
\co{x0} and CPU~1's cache now contains \co{x1}.
Once these two variables are in their new homes, each CPU can flush
its store buffer into the corresponding \IX{cache line}, leaving each
variable with its final value as shown on row~5.
\QuickQuiz{
But don't the values also need to be flushed from the cache
to main memory?
}\QuickQuizAnswer{
Perhaps surprisingly, not necessarily!
On some systems,
if the two variables are being used heavily, they might
be bounced back and forth between the CPUs' caches and never
land in main memory.
}\QuickQuizEnd
In summary, store buffers are needed to allow CPUs to handle
store instructions efficiently, but they can result in
counter-intuitive memory misordering.
But what do you do if your algorithm really needs its memory
references to be ordered?
For example, suppose that you are communicating with a driver using
a pair of flags, one that says whether or not the driver is running
and the other that says whether there is a request pending for that
driver.
The requester needs to set the request-pending flag, then check
the driver-running flag, and if false, wake the driver.
Once the driver has serviced all the pending requests that it knows about,
it needs to clear its driver-running flag, then check the request-pending
flag to see if it needs to restart.
This very reasonable approach cannot work unless there is some way
to make sure that the hardware processes the stores and loads in order.
This is the subject of the next section.
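
Anticipating that section, a rough sketch of this two-flag scheme
(hypothetical names, with the needed ordering supplied by
\co{smp_mb()}) might look like this:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int request_pending, driver_running;

void requester(void)
{
        WRITE_ONCE(request_pending, 1);
        smp_mb();                 /* Order store before load. */
        if (!READ_ONCE(driver_running))
                wake_driver();    /* Hypothetical wakeup. */
}

void driver(void)
{
        /* ... service all known pending requests ... */
        WRITE_ONCE(driver_running, 0);
        smp_mb();                 /* Order store before load. */
        if (READ_ONCE(request_pending))
                restart_driver(); /* Hypothetical restart. */
}
\end{verbatim}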
\subsection{How to Force Ordering?}
\label{sec:memorder:How to Force Ordering?}
It turns out that there are compiler directives and synchronization
primitives (such as locking and RCU) that are responsible for maintaining
the illusion of ordering through use of \emph{\IXBpl{memory barrier}} (for
example, \co{smp_mb()} in the Linux kernel).
These memory barriers can be explicit instructions, as they are on
\ARM, \Power{}, Itanium, and Alpha, or they can be implied by other instructions,
as they often are on x86.
Since these standard synchronization primitives preserve the illusion of
ordering, your path of least resistance is to simply use these primitives,
thus allowing you to stop reading this section.
\begin{listing}
\input{CodeSamples/formal/litmus/C-SB+o-mb-o+o-mb-o@whole.fcv}
\caption{Memory Ordering:
Store-Buffering Litmus Test}
\label{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
\end{listing}
However, if you need to implement the synchronization primitives
themselves, or if you are simply interested in understanding how memory
ordering works, read on!
The first stop on the journey is
\cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
(\path{C-SB+o-mb-o+o-mb-o.litmus}),
which places an \co{smp_mb()} Linux-kernel \IXh{full}{memory barrier} between
the store and load in both \co{P0()} and \co{P1()}, but is otherwise
identical to
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
% Test C-SB+o-mb-o+o-mb-o Allowed
% Histogram (3 states)
% 49553298:>0:r2=2; 1:r2=0;
% 49636449:>0:r2=0; 1:r2=2;
% 810253:>0:r2=2; 1:r2=2;
% No
These barriers prevent the counter-intuitive outcome from happening
on 100,000,000 trials on my x86 laptop.
Interestingly enough, the added overhead due to these barriers causes the
legal outcome where both loads return the value two to happen more
than 800,000 times, as opposed to only 167 times for the
barrier-free code in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\begin{table*}
\rowcolors{6}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering\OneColumnHSpace{-0.1in}
\ebresizewidth{
\begin{tabular}{rllllll}
\toprule
& \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cmidrule(l){2-4} \cmidrule(l){5-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7}
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
3 & \tco{smp_mb();} & \tco{x0==2} & \tco{x1==0} &
\tco{smp_mb();} & \tco{x1==2} & \tco{x0==0} \\
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
6 & \tco{r2 = x1;} (2) & & \tco{x1==2} &
\tco{r2 = x0;} (2) & & \tco{x0==2} \\
\bottomrule
\end{tabular}
}
\caption{Memory Ordering:
Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}
\end{table*}
These barriers have a profound effect on ordering, as can be seen in
\cref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}.
Although the first two rows are the same as in
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
and although the \co{smp_mb()} instructions on row~3
do not change state
in and of themselves, they do cause the stores to complete
(rows~4 and~5) before the
loads (row~6), which rules out the counter-intuitive outcome shown in
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}.
Note that variables \co{x0} and \co{x1} each still have more than one
value on row~2.
However, as promised earlier, the \co{smp_mb()}
invocations straighten things out in the end.
Although full barriers such as \co{smp_mb()} have extremely strong
ordering guarantees, their strength comes at a high price in terms
of foregone hardware and compiler optimizations.
A great many situations can be handled with much weaker ordering guarantees
that use much cheaper memory-ordering instructions, or, in some cases, no
memory-ordering instructions at all.
\begin{table*}
\small
\centering\OneColumnHSpace{-0.7in}
\renewcommand*{\arraystretch}{1.1}
\rowcolors{7}{lightgray}{}
\ebresizewidth{
\begin{tabular}{lcccccccccccc}\toprule
& & \multicolumn{4}{c}{Prior Ordered Operation} &
\multicolumn{7}{c}{Subsequent Ordered Operation} \\
\cmidrule(l){3-6} \cmidrule(l){7-13}
Operation Providing Ordering & C &
Self & R & W & RMW & Self & R & W & DR & DW & RMW & SV \\
\cmidrule(r){1-1} \cmidrule{2-2} \cmidrule(l){3-6} \cmidrule(l){7-13}
Store, for example, \tco{WRITE_ONCE()} & &
Y & & & & & & & & & & Y \\
Load, for example, \tco{READ_ONCE()} & &
Y & & & & & & & Y & Y & & Y \\
\tco{_relaxed()} RMW operation & &
Y & & & & & & & Y & Y & & Y \\
\tco{*_dereference()} & &
Y & & & & & & & Y & Y & & Y \\
Successful \tco{*_acquire()} & &
R & & & & & Y & Y & Y & Y & Y & Y \\
Successful \tco{*_release()} & C &
& Y & Y & Y & W & & & & & & Y \\
\tco{smp_rmb()} & &
& Y & & R & & Y & & Y & & R & \\
\tco{smp_wmb()} & &
& & Y & W & & & Y & & Y & W & \\
\tco{smp_mb()} and \tco{synchronize_rcu()} & CP &
& Y & Y & Y & & Y & Y & Y & Y & Y & \\
Successful full-strength non-\tco{void} RMW & CP &
Y & Y & Y & Y & Y & Y & Y & Y & Y & Y & Y \\
\tco{smp_mb__before_atomic()} & CP &
& Y & Y & Y & & a & a & a & a & Y & \\
\tco{smp_mb__after_atomic()} & CP &
& a & a & Y & & Y & Y & Y & Y & Y & \\
\bottomrule
\end{tabular}
}
\vspace{5pt}\hfill
\ebresizeverb{.8}{
\framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt}
\rowcolors{1}{}{}
\begin{tabular}{lrl}
Key: & C: & Ordering is cumulative \\
& P: & Ordering propagates \\
& R: & Read, for example, \tco{READ_ONCE()}, or read portion of RMW \\
& W: & Write, for example, \tco{WRITE_ONCE()}, or write portion of RMW \\
& Y: & Provides the specified ordering \\
& a: & Provides specified ordering given intervening RMW atomic operation \\
& DR: & Dependent read (address dependency, \cref{sec:memorder:Address Dependencies}) \\
& DW: & Dependent write (address, data, or control dependency, \crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}) \\
& RMW: & \IX{Atomic read-modify-write operation} \\
& Self: & Orders self, as opposed to accesses both before
and after \\
& SV: & Orders later accesses to the same variable \\
\multicolumn{3}{l}{\emph{Applies to Linux kernel v4.15 and later.}} \\
\end{tabular}
}\OneColumnHSpace{-0.9in}
}
\caption{Linux-Kernel Memory-Ordering Cheat Sheet}
\label{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
\end{table*}
\Cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
provides a cheatsheet of the Linux kernel's ordering primitives and their
guarantees.
Each row corresponds to a primitive or category of primitives that might
or might not provide ordering, with the columns labeled
``Prior Ordered Operation'' and ``Subsequent Ordered Operation''
being the operations that might (or might not) be ordered against.
Cells containing ``Y'' indicate that ordering is supplied unconditionally,
while other characters indicate that ordering is supplied only partially or
conditionally.
Blank cells indicate that no ordering is supplied.
The ``Store'' row also covers the store portion of an
\IXalt{atomic RMW operation}{atomic read-modify-write operation}.
In addition, the ``Load'' row covers the load
component of a successful value-returning \co{_relaxed()} RMW atomic
operation, although the combined ``\co{_relaxed()} RMW operation''
line provides a convenient combined reference in the value-returning case.
A CPU executing unsuccessful value-returning atomic RMW operations must
invalidate the corresponding variable from all other CPUs' caches.
Therefore, unsuccessful value-returning atomic RMW operations have many
of the properties of a store, which means that the ``\co{_relaxed()}
RMW operation'' line also applies to unsuccessful value-returning atomic
RMW operations.
The \co{*_acquire} row covers \co{smp_load_acquire()},
\co{cmpxchg_acquire()}, \co{xchg_acquire()}, and so on; the \co{*_release}
row covers \co{smp_store_release()}, \co{rcu_assign_pointer()},
\co{cmpxchg_release()}, \co{xchg_release()}, and so on; and
the ``Successful full-strength non-\co{void} RMW'' row covers
\co{atomic_add_return()}, \co{atomic_add_unless()}, \co{atomic_dec_and_test()},
\co{cmpxchg()}, \co{xchg()}, and so on.
The ``Successful'' qualifiers apply to primitives such as
\co{atomic_add_unless()}, \co{cmpxchg_acquire()}, and \co{cmpxchg_release()},
which have no effect on either memory or on ordering when they indicate
failure, as indicated by the earlier ``\co{_relaxed()} RMW operation'' row.
Column ``C'' indicates cumulativity and propagation, as explained in
\cref{sec:memorder:Cumulativity,%
sec:memorder:Propagation}.
In the meantime, this column can usually be ignored when there
are at most two threads involved.
\QuickQuizSeries{%
\QuickQuizB{
The rows in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
seem quite random and confused.
Whatever is the conceptual basis of this table???
}\QuickQuizAnswerB{
The rows correspond roughly to hardware mechanisms of increasing
power and overhead.
The \co{WRITE_ONCE()} row captures the fact that accesses to
a single variable are always fully ordered, as indicated by
the ``SV'' column.
Note that all other operations providing ordering against accesses
to multiple variables also provide this same-variable ordering.
The \co{READ_ONCE()} row captures the fact that (as of 2021) compilers
and CPUs do not indulge in user-visible speculative stores, so that
any store whose address, data, or execution depends on a prior load
is guaranteed to happen after that load completes.
However, this guarantee assumes that these dependencies have
been constructed carefully, as described in
\cref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities}.
The ``\co{_relaxed()} RMW operation'' row captures the fact
that a value-returning \co{_relaxed()} RMW has done a load and a
store, which are every bit as good as a \co{READ_ONCE()} and a
\co{WRITE_ONCE()}, respectively.
The \co{*_dereference()} row captures the address and data
dependency ordering provided by \co{rcu_dereference()} and friends.
	Again, these dependencies must be constructed carefully,
as described in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
The ``Successful \co{*_acquire()}'' row captures the fact that many
CPUs have special ``acquire'' forms of loads and of atomic RMW
instructions,
and that many other CPUs have lightweight memory-barrier
instructions that order prior loads against subsequent loads
and stores.
The ``Successful \co{*_release()}'' row captures the fact that many
CPUs have special ``release'' forms of stores and of atomic RMW
instructions, and that many other CPUs have lightweight memory-barrier
instructions that order prior loads and stores against
subsequent stores.
The \co{smp_rmb()} row captures the fact that many CPUs have
lightweight memory-barrier instructions that order prior loads against
subsequent loads.
Similarly,
the \co{smp_wmb()} row captures the fact that many CPUs have
lightweight memory-barrier instructions that order prior stores against
subsequent stores.
None of the ordering operations thus far require prior stores to be
ordered against subsequent loads, which means that these operations
need not interfere with store buffers, whose main purpose in life
is in fact to reorder prior stores against subsequent loads.
The lightweight nature of these operations is precisely due to
their policy of store-buffer non-interference.
However, as noted earlier, it is sometimes necessary to interfere
with the store buffer in order to prevent prior stores from being
reordered against later stores, which brings us to the remaining
rows in this table.
The \co{smp_mb()} row corresponds to the \IXh{full}{memory barrier}
available on most platforms, with Itanium being the exception
that proves the rule.
However, even on Itanium, \co{smp_mb()} provides full ordering
with respect to \co{READ_ONCE()} and \co{WRITE_ONCE()},
as discussed in \cref{sec:memorder:Itanium}.
The ``Successful full-strength non-\co{void} RMW'' row captures
the fact that on some platforms (such as x86) atomic RMW instructions
provide full ordering both before and after.
The Linux kernel therefore requires that full-strength non-\co{void}
atomic RMW operations provide full ordering in cases where these
operations succeed.
	(Full-strength atomic RMW operations' names do not end in
	\co{_relaxed}, \co{_acquire}, or \co{_release}.)
As noted earlier, the case where these operations do not succeed
is covered by the ``\co{_relaxed()} RMW operation'' row.
However, the Linux kernel does not require that either \co{void}
or \co{_relaxed()} atomic RMW operations provide any ordering
whatsoever, with the canonical example being \co{atomic_inc()}.
	Therefore, these operations, along with failing non-\co{void}
	atomic RMW operations, may be preceded by \co{smp_mb__before_atomic()}
and followed by \co{smp_mb__after_atomic()} to provide full
ordering for any accesses preceding or following both.
No ordering need be provided for accesses between the
\co{smp_mb__before_atomic()} (or, similarly, the
\co{smp_mb__after_atomic()}) and the atomic RMW operation, as
indicated by the ``a'' entries on the \co{smp_mb__before_atomic()}
and \co{smp_mb__after_atomic()} rows of the table.
In short, the structure of this table is dictated by the
properties of the underlying hardware, which are constrained by
nothing other than the laws of physics, which were covered back in
\cref{chp:Hardware and its Habits}.
That is, the table is not random, although it is quite possible
that you are confused.
}\QuickQuizEndB
%
\QuickQuizE{
Why is
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
missing \co{smp_mb__after_unlock_lock()} and
\co{smp_mb__after_spinlock()}?
}\QuickQuizAnswerE{
These two primitives are rather specialized, and at present
seem difficult to fit into
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
The \co{smp_mb__after_unlock_lock()} primitive is intended to be placed
immediately after a lock acquisition, and ensures that all CPUs
see all accesses in prior critical sections as happening before
all accesses following the \co{smp_mb__after_unlock_lock()}
and also before all accesses in later critical sections.
Here ``all CPUs'' includes those CPUs not holding that lock,
and ``prior critical sections'' includes all prior critical sections
for the lock in question as well as all prior critical sections
for all other locks that were released by the same CPU that executed
the \co{smp_mb__after_unlock_lock()}.
The \co{smp_mb__after_spinlock()} provides the same guarantees
as does \co{smp_mb__after_unlock_lock()}, but also provides
additional visibility guarantees for other accesses performed
by the CPU that executed the \co{smp_mb__after_spinlock()}.
Given any store S performed prior to any earlier lock acquisition
and any load L performed after the \co{smp_mb__after_spinlock()},
all CPUs will see S as happening before~L\@.
In other words, if a CPU performs a store S, acquires a lock,
executes an \co{smp_mb__after_spinlock()}, then performs a
load L, all CPUs will see S as happening before~L\@.
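	In code, that sequence might look like the following sketch,
	in which \co{mylock}, \co{x}, and \co{y} are illustrative:
\begin{VerbatimL}
WRITE_ONCE(x, 1);          /* Store S.                            */
spin_lock(&mylock);
smp_mb__after_spinlock();  /* Immediately after lock acquisition. */
r1 = READ_ONCE(y);         /* Load L: all CPUs see S preceding L. */
spin_unlock(&mylock);
\end{VerbatimL}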
}\QuickQuizEndE
}
It is important to note that this table is just a cheat sheet,
and is therefore in no way a replacement for a good understanding
of memory ordering.
To begin building such an understanding, the next section will
present some basic rules of thumb.
\subsection{Basic Rules of Thumb}
\label{sec:memorder:Basic Rules of Thumb}
This section presents some basic rules of thumb that are ``good and
sufficient'' for a great many situations.
In fact, you could write a great deal of concurrent code having
excellent performance and scalability without needing anything more
than these rules of thumb.
More sophisticated rules of thumb will be presented in
\cref{sec:memorder:Memory-Model Intuitions}.
\QuickQuiz{
But how can I know that a given project can be designed
and coded within the confines of these rules of thumb?
}\QuickQuizAnswer{
Much of the purpose of the remainder of this chapter is
to answer exactly that question!
}\QuickQuizEnd
\paragraph{A given thread sees its own accesses in order.}
This rule assumes that loads and stores from/to shared variables use
\co{READ_ONCE()} and \co{WRITE_ONCE()}, respectively.
Otherwise, the compiler can profoundly scramble\footnote{
Many compiler writers prefer the word ``optimize''.}
your code, and sometimes the CPU can do a bit of scrambling as well,
as discussed in \cref{sec:memorder:Itanium}.
\paragraph{Interrupts and signal handlers are part of a thread.}
Both interrupt and signal handlers happen between a pair of adjacent
instructions in a thread.
This means that a given handler appears to execute atomically
from the viewpoint of the interrupted thread, at least at the
assembly-language level.
However, the C and C++ languages do not define the results of handlers
and interrupted threads sharing plain variables.
Instead, such shared variables must be \co{sig_atomic_t}, lock-free
atomics, or \co{volatile}.
On the other hand, because the handler executes within the interrupted
thread's context, the memory ordering used to synchronize communication
between the handler and the thread can be extremely lightweight.
For example, the counterpart of an acquire load is a \co{READ_ONCE()}
followed by a \co{barrier()} compiler directive and the counterpart
of a release store is a \co{barrier()} followed by a \co{WRITE_ONCE()}.
The counterpart of a full memory barrier is \co{barrier()}.
Finally, disabling interrupts or signals (as the case may be) within
the thread excludes handlers.
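For example, a flag-based handoff from a thread to its own signal
handler might look like the following sketch, in which \co{data},
\co{ready}, and \co{handle()} are illustrative:
\begin{VerbatimL}
/* Thread: the release-store counterpart. */
WRITE_ONCE(data, 42);
barrier();
WRITE_ONCE(ready, 1);

/* Handler: the acquire-load counterpart. */
if (READ_ONCE(ready)) {
	barrier();
	handle(READ_ONCE(data));
}
\end{VerbatimL}
Because the handler executes within the interrupted thread's context,
\co{barrier()}'s compiler-only ordering suffices here.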
% @@@ Using flags to avoid expensive interrupt-disabling instructions
% @@@ and signal-blocking system calls. Harder than it looks!
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/memorybarrier}}
\caption{Memory Barriers Provide Conditional If-Then Ordering}
\label{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
\end{figure}
\paragraph{Ordering has conditional if-then semantics.}
\Cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
illustrates this for memory barriers.
Assuming that both memory barriers are strong enough, if CPU~1's access
Y1 happens after CPU~0's access Y0, then CPU~1's access X1 is guaranteed
to happen after CPU~0's access X0.
\QuickQuiz{
How can you tell which memory barriers are strong enough for
a given use case?
}\QuickQuizAnswer{
Ah, that is a deep question whose answer requires most of the
rest of this chapter.
But the short answer is that \co{smp_mb()} is almost always
strong enough, albeit at some cost.
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-SB+o-mb-o+o-mb-o:whole]
\Cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
is a case in point.
The \co{smp_mb()} on \clnref{P0:mb,P1:mb} serve as the barriers,
the store to \co{x0} on \clnref{P0:st} as X0, the load from \co{x1}
on \clnref{P0:ld} as Y0, the store to \co{x1} on \clnref{P1:st} as Y1,
and the load from \co{x0} on \clnref{P1:ld} as X1.
Applying the if-then rule step by step, we know that the store to
\co{x1} on \clnref{P1:st} happens after the load from \co{x1} on \clnref{P0:ld} if
\co{P0()}'s local variable \co{r2} is set to the value zero.
The if-then rule would then state that the load from \co{x0} on
\clnref{P1:ld} happens after the store to \co{x0} on \clnref{P0:st}.
In other words,
\co{P1()}'s local variable \co{r2} is guaranteed
to end up with the value two \emph{only if}
\co{P0()}'s local variable \co{r2} ends up with the value zero.
This underscores the point that memory ordering guarantees are
conditional, not absolute.
\end{fcvref}
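For reference, the structure of that store-buffering test is roughly
as follows; this is a sketch rather than the verbatim listing, which
appears with numbered lines elsewhere in this book:
\begin{VerbatimL}
P0(int *x0, int *x1)
{
	int r2;

	WRITE_ONCE(*x0, 2);
	smp_mb();
	r2 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
	int r2;

	WRITE_ONCE(*x1, 2);
	smp_mb();
	r2 = READ_ONCE(*x0);
}

exists (0:r2=0 /\ 1:r2=0)
\end{VerbatimL}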
Although
\cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
specifically mentions memory barriers, this same if-then rule applies
to the rest of the Linux kernel's ordering operations.
\paragraph{Ordering operations must be paired.}
If you carefully order the operations in one thread, but then fail to do
so in another thread, then there is no ordering.
Both threads must provide ordering for the if-then rule to apply.\footnote{
In \cref{sec:memorder:Propagation}, pairing will be
generalized to cycles.}
\paragraph{Ordering operations almost never speed things up.}
If you find yourself tempted to add a memory barrier in an attempt
to force a prior store to be flushed to memory faster, resist!
Adding ordering usually slows things down.
Of course, there are situations where adding instructions speeds things
up, as was shown by
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR} on
\cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
but careful benchmarking is required in such cases.
And even then, it is quite possible that although you sped things up
a little bit on \emph{your} system, you might well have slowed things
down significantly on your users' systems.
Or on your future system.
\paragraph{Ordering operations are not magic.}
When your program is failing due to some \IX{race condition}, it is often
tempting to toss in a few memory-ordering operations in an attempt
to barrier your bugs out of existence.
A far better reaction is to use higher-level primitives in a carefully
designed manner.
With concurrent programming, it is almost always better to design your
bugs out of existence than to hack them down to lower probabilities.
\paragraph{These are only rough rules of thumb.}
Although these rules of thumb cover the vast majority of situations
seen in actual practice, as with any set of rules of thumb, they
do have their limits.
The next section will demonstrate some of these limits by introducing
trick-and-trap litmus tests that are intended to insult your
intuition while increasing your understanding.
These litmus tests will also illuminate many of the concepts
represented by the Linux-kernel memory-ordering cheat sheet shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet},
and can be automatically analyzed given proper
tooling~\cite{Alglave:2018:FSC:3173162.3177156}.
\Cref{sec:memorder:Memory-Model Intuitions} will
circle back to this cheat sheet, presenting a more sophisticated set of
rules of thumb in light of the lessons learned from all the intervening tricks
and traps.
\QuickQuiz{
Wait!!!
Where do I find this tooling that automatically analyzes
litmus tests???
}\QuickQuizAnswer{
Get version v4.17 (or later) of the Linux-kernel source code,
then follow the instructions in \path{tools/memory-model/README}
to install the needed tools.
Then follow the further instructions to run these tools on the
litmus test of your choice.
}\QuickQuizEnd
\section{Tricks and Traps}
\label{sec:memorder:Tricks and Traps}
%
\epigraph{Knowing where the trap is---that's the first step in evading it.}
{Duke Leto Atreides, \emph{Dune}, Frank Herbert}
Now that you know that hardware can reorder memory accesses and that you
can prevent it from doing so, the next step is to get you to admit
that your intuition has a problem.
This painful task is taken up by
\cref{sec:memorder:Variables With Multiple Values},
which presents some code demonstrating that scalar variables can
take on multiple values simultaneously,
and by
\crefthro{sec:memorder:Memory-Reference Reordering}
{sec:memorder:Multicopy Atomicity},
which show a series of intuitively correct code fragments that fail miserably
on real hardware.
Once your intuition has made it through the grieving process, later
sections will summarize the basic rules that memory ordering follows.
But first, let's take a quick look at just how many values a single
variable might have at a single point in time.
\subsection{Variables With Multiple Values}
\label{sec:memorder:Variables With Multiple Values}
It is natural to think of a variable as taking on a well-defined
sequence of values in a well-defined, global order.
Unfortunately, the next stop on the journey says ``goodbye'' to this comforting fiction.
Hopefully, you already started to say ``goodbye'' in response to row~2 of
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events,%
tab:memorder:Memory Ordering: Store-Buffering Sequence of Events},
and if so, the purpose of this section is to drive this point home.
\begin{fcvref}[ln:memorder:Software Logic Analyzer]
To this end, consider the program fragment shown in
\cref{lst:memorder:Software Logic Analyzer}.
This code fragment is executed in parallel by several CPUs.
\Clnref{setid} sets a shared variable to the current CPU's ID, \clnref{init}
initializes several variables from a \co{gettb()} function that
delivers the value of a fine-grained hardware ``timebase'' counter that is
synchronized among all CPUs (not available from all CPU architectures,
unfortunately!), and the loop from \clnrefrange{loop:b}{loop:e}
records the length of
time that the variable retains the value that this CPU assigned to it.
Of course, one of the CPUs will ``win'', and would thus never exit
the loop if not for the check on \clnrefrange{iftmout}{break}.
\end{fcvref}
\QuickQuiz{
What assumption is the code fragment
in \cref{lst:memorder:Software Logic Analyzer}
making that might not be valid on real hardware?
}\QuickQuizAnswer{
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
The actual code used to produce the data in the figures
discussed later in this section was therefore somewhat more
complex.
}\QuickQuizEnd
\begin{listing}
\begin{fcvlabel}[ln:memorder:Software Logic Analyzer]
\begin{VerbatimL}[commandchars=\\\[\]]
state.variable = mycpu; \lnlbl[setid]
lasttb = oldtb = firsttb = gettb(); \lnlbl[init]
while (state.variable == mycpu) { \lnlbl[loop:b]
lasttb = oldtb;
oldtb = gettb();
if (lasttb - firsttb > 1000) \lnlbl[iftmout]
break; \lnlbl[break]
} \lnlbl[loop:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Software Logic Analyzer}
\label{lst:memorder:Software Logic Analyzer}
\end{listing}
Upon exit from the loop, \co{firsttb} will hold a timestamp
taken shortly after the assignment and \co{lasttb} will hold
a timestamp taken before the last sampling of the shared variable
that still retained the assigned value, or a value equal to \co{firsttb}
if the shared variable had changed before entry into the loop.
This allows us to plot each CPU's view of the value of \co{state.variable}
over a 532-nanosecond time period, as shown in
\cref{fig:memorder:A Variable With Multiple Simultaneous Values}.
This data was collected in 2006 on a 1.5\,GHz \Power{5} system with 8 cores,
each containing a pair of hardware threads.
CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test.
The timebase counter period was about 5.32\,ns, sufficiently fine-grained
to allow observations of intermediate cache states.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/MoreThanOneValue}}
\caption{A Variable With Multiple Simultaneous Values}
\label{fig:memorder:A Variable With Multiple Simultaneous Values}
\end{figure}
Each horizontal bar represents the observations of a given CPU over time,
with the gray regions to the left indicating the time before the
corresponding CPU's first measurement.
During the first 5\,ns, only CPU~3 has an opinion about the value of the
variable.
During the next 10\,ns, CPUs~2 and~3 disagree on the value of the variable,
but thereafter agree that the value is~``2'', which is in fact
the final agreed-upon value.
However, CPU~1 believes that the value is~``1'' for almost 300\,ns, and
CPU~4 believes that the value is~``4'' for almost 500\,ns.
\QuickQuizSeries{%
\QuickQuizB{
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
}\QuickQuizAnswerB{
As discussed in
\cref{sec:memorder:Why Hardware Misordering?},
many CPUs have store buffers that record the values of
recent stores, which do not become globally visible until
the corresponding cache line makes its way to the CPU\@.
Therefore, it is quite possible for each CPU to see its own value
for a given variable (in its own store buffer) at a single point
in time---and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
Fortunately, software rarely cares about the fact that multiple
CPUs might see multiple values for the same variable.
}\QuickQuizEndB
%
\QuickQuizE{
Why do CPUs~2 and~3 come to agreement so quickly, when it
takes so long for CPUs~1 and~4 to come to the party?
}\QuickQuizAnswerE{
CPUs~2 and~3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a \IXacr{numa}, or, more accurately, a \IXacr{nuca} effect.
This leads to the question of why CPUs~2 and~3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory-ordering operations in the code fragment.
}\QuickQuizEndE
}
And if you think that the situation with four CPUs was intriguing, consider
\cref{fig:memorder:A Variable With More Simultaneous Values},
which shows the same situation, but with 15~CPUs each assigning their
number to a single shared variable at time~$t=0$. Both diagrams in the
figure are drawn in the same way as
\cref{fig:memorder:A Variable With Multiple Simultaneous Values}.
The only difference is that the unit of the horizontal axis is timebase ticks,
with each tick lasting about 5.3~nanoseconds.
The entire sequence therefore lasts a bit longer than the events recorded in
\cref{fig:memorder:A Variable With Multiple Simultaneous Values},
consistent with the increase in number of CPUs.
The upper diagram shows the overall picture, while the lower one zooms
in on the first 50~timebase ticks.
Again, CPU~0 coordinates the test, so does not record any values.
\begin{figure*}
\centering
\IfEbookSize{
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}}
}{
\resizebox{5in}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}}
}
\caption{A Variable With More Simultaneous Values}
\ContributedBy{fig:memorder:A Variable With More Simultaneous Values}{Akira Yokosawa}
\end{figure*}
All CPUs eventually agree on the final value of~9, but not before
the values~15 and~12 take early leads.
Note that there are fourteen different opinions on the variable's value
at time~21, as indicated by the vertical line in the lower diagram.
Note also that all CPUs see sequences whose orderings are consistent with
the directed graph shown in
\cref{fig:memorder:Possible Global Orders With More Simultaneous Values}.
Nevertheless, these figures underscore the importance of
proper use of memory-ordering operations.
\begin{figure}
\centering
\resizebox{2.0in}{!}{\includegraphics{memorder/store15tred}}
\caption{Possible Global Orders With More Simultaneous Values}
\label{fig:memorder:Possible Global Orders With More Simultaneous Values}
\end{figure}
How many values can a single variable take on at a single point in
time?
As many as one per store buffer in the system!
We have therefore entered a regime where we must bid a fond farewell to
comfortable intuitions about values of variables and the passage of time.
This is the regime where memory-ordering operations are needed.
But remember well the lessons from
\cref{chp:Hardware and its Habits,%
chp:Partitioning and Synchronization Design}.
Having all CPUs store concurrently to the same variable
is no way to design a parallel program, at least
not if performance and scalability are at all important to you.
Unfortunately, memory ordering has many other ways of insulting your
intuition, and not all of these ways conflict with performance and
scalability.
The next section gives an overview of the reordering of unrelated memory references.
\subsection{Memory-Reference Reordering}
\label{sec:memorder:Memory-Reference Reordering}
\Cref{sec:memorder:Why Hardware Misordering?}
showed that even relatively strongly ordered systems like x86
can reorder prior stores with later loads, at least when the
store and load are to different variables.
This section builds on that result, looking at the other combinations of
loads and stores.
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-o@whole.fcv}
\caption{Message-Passing Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Load}
\label{sec:memorder:Load Followed By Load}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-o:whole]
\Cref{lst:memorder:Message-Passing Litmus Test (No Ordering)}
(\path{C-MP+o-wmb-o+o-o.litmus})
shows the classic \emph{message-passing} litmus test, where \co{x0} is
the message and \co{x1} is a flag indicating whether or not a message
is available.
In this test, the \co{smp_wmb()} forces \co{P0()}'s stores to be ordered,
but no ordering is specified for the loads.
Relatively strongly ordered architectures, such as x86, do enforce ordering.
However, weakly ordered architectures often do
not~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on \clnref{exists} of the listing \emph{can}
trigger.
\end{fcvref}
One rationale for reordering loads from different locations is that doing
so allows execution to proceed when an earlier load misses the cache,
but the values for later loads are already present.
\QuickQuiz{
But why make load-load reordering visible to the user?
Why not just use speculative execution to allow execution to
proceed in the common case where there are no intervening
stores, in which case the reordering cannot be visible anyway?
}\QuickQuizAnswer{
	They can, and many do; otherwise, systems containing
	strongly ordered CPUs would be slow indeed.
However, speculative execution does have its downsides, especially
if speculation must be rolled back frequently, particularly
on battery-powered systems.
Speculative execution can also introduce side channels, which
might in turn be exploited to exfiltrate information.
But perhaps future systems will be able to overcome these
disadvantages.
Until then, we can expect vendors to continue producing
weakly ordered CPUs.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-rmb-o@whole.fcv}
\caption{Enforcing Order of Message-Passing Litmus Test}
\label{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-rmb-o:whole]
Thus, portable code relying on ordered loads must
add explicit ordering, for example, the \co{smp_rmb()} shown on
\clnref{rmb} of
\cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
(\path{C-MP+o-wmb-o+o-rmb-o.litmus}), which prevents
the \co{exists} clause from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-o+o-o@whole.fcv}
\caption{Load-Buffering Litmus Test (No Ordering)}
\label{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Store}
\label{sec:memorder:Load Followed By Store}
\begin{fcvref}[ln:formal:C-LB+o-o+o-o:whole]
\Cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
(\path{C-LB+o-o+o-o.litmus})
shows the classic \emph{load-buffering} litmus test.
Although relatively strongly ordered systems such as x86
or the IBM Mainframe do not reorder prior loads with subsequent stores,
many weakly ordered architectures really do allow such
reordering~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on \clnref{exists} really can trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+a-o@whole.fcv}
\caption{Enforcing Ordering of Load-Buffering Litmus Test}
\label{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+a-o:whole]
Although it is rare for actual hardware to
exhibit this reordering~\cite{LucMaranget2017aarch64},
one situation where it might be desirable to do so is when a load
misses the cache, the store buffer is nearly full, and the cacheline for
a subsequent store is ready at hand.
Therefore, portable code must enforce any required ordering, for example,
as shown in
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
(\path{C-LB+o-r+a-o.litmus}).
The \co{smp_store_release()} and \co{smp_load_acquire()} guarantee that
the \co{exists} clause on \clnref{exists} never triggers.
\end{fcvref}
\QuickQuiz{
Why should it be rare for systems to reorder prior loads with
later stores?
}\QuickQuizAnswer{
Keep in mind that conventional computer systems are not permitted
to make a given CPU's speculative stores visible to other CPUs.
This constraint has consequences.
For example, suppose that address translation has not yet completed
for that prior load.
In that case, it is quite possible that this address translation
will result in a segmentation violation, which might in turn mean
that the later store would never execute.
Therefore, a given store cannot make its value visible to other
CPUs until address translation has completed for each and every
prior load.
Of course, memory latencies can be quite long compared to
address-translation delays, especially if all the information
required for a given address translation is already present
in the translation lookaside buffer (TLB)\@.
Although these long memory latencies should provide ample
	opportunity for load-to-store reordering, the fact is that such
reordering is demonstrably rare, especially on the large systems
that would be expected to have the longest memory latencies.
And the reason for that is that large systems are most likely
to have parity or error-correcting code (ECC) checking on memory.
An uncorrectable parity or ECC error would cause a load to take a
fault, which would prevent subsequent stores from ever executing
just as surely as would a segmentation violation.
Therefore, on systems featuring parity or ECC, a subsequent store
cannot expose its value to other CPUs until after parity or
ECC checking has completed for all prior loads.
A given load's checking clearly cannot be carried out until that
load's value has been returned (check bits and all), which has
the side effect of prohibiting reordering of prior loads with
later stores.
So the only systems that can exhibit load-to-store reordering are
those lacking both parity and ECC, for example, small embedded
systems and GPGPUs.
Given that load-to-store reordering is not possible on common-case
computer systems, why do hardware memory models continue to
permit it?
Only CPU architects know for sure, but here are some possible
reasons:
\begin{enumerate}
\item There might be increased demand for small systems that
do not require parity or ECC.
\item Memory might become more reliable, so that larger systems
no longer require parity or ECC.
\item Systems having multiple hardware threads per core might
permit cross-hardware-thread speculation, which would
permit speculative stores to make their values visible
to other threads in that core.
\item Operating systems and hardware might work together to
terminate any processes affected by an uncorrectable
parity or ECC error.
This might allow speculative stores to make their
values visible to other CPUs, give or take things
like debuggers and assertions.
\end{enumerate}
In all of these cases, there might well be compelling performance
advantages to once again permitting load-to-store reordering.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-o+o-rmb-o@whole.fcv}
\caption{Message-Passing Litmus Test, No Writer Ordering (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)}
\end{listing}
\subsubsection{Store Followed By Store}
\label{sec:memorder:Store Followed By Store}
\Cref{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)}
(\path{C-MP+o-o+o-rmb-o.litmus})
once again shows the classic message-passing litmus test, with the
\co{smp_rmb()} providing ordering for \co{P1()}'s loads, but without
any ordering for \co{P0()}'s stores.
Again, the relatively strongly ordered architectures do enforce ordering,
but weakly ordered architectures do not necessarily do
so~\cite{JadeAlglave2011ppcmem}, which means that the
\co{exists} clause can trigger.
One situation in which such reordering could be beneficial is when
the store buffer is full, another store is ready to execute, but the
cacheline needed by the oldest store is not yet available.
In this situation, allowing stores to complete out of order would
allow execution to proceed.
Therefore, portable code must explicitly order the stores, for
example, as shown in
\cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test},
thus preventing the \co{exists} clause from triggering.
\QuickQuiz{
Why should strongly ordered systems pay the performance price
of unnecessary \co{smp_rmb()} and \co{smp_wmb()} invocations?
Shouldn't weakly ordered systems shoulder the full cost of
their misordering choices???
}\QuickQuizAnswer{
That is in fact exactly what happens.
On strongly ordered systems, \co{smp_rmb()} and \co{smp_wmb()}
emit no instructions, but instead just constrain the compiler.
Thus, in this case, weakly ordered systems do in fact shoulder
the full cost of their memory-ordering choices.
}\QuickQuizEnd
\subsection{Address Dependencies}
\label{sec:memorder:Address Dependencies}
An \emph{\IXBh{address}{dependency}} occurs when the value returned by a load
instruction is used to compute the address used by a later memory-reference
instruction.
This means that the exact same sequence of instructions used to traverse
a linked data structure in single-threaded code provides weak but extremely
useful ordering in concurrent code.
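For example, the following fragment carries an address dependency from
the pointer load to the dependent load; \co{gp} and the \co{->val}
field are illustrative:
\begin{VerbatimL}
p = READ_ONCE(gp);        /* Load the pointer.                         */
r1 = READ_ONCE(p->val);   /* Dependent load: its address comes from p. */
\end{VerbatimL}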
\QuickQuiz{
Must an address dependency begin with a load instruction?
Why not something like \co{xchg_relaxed()}, which also loads
a value from memory?
}\QuickQuizAnswer{
Yes, pretty much any instruction that loads a value from memory
can start an address dependency, including certain atomic operations
such as \co{xchg_relaxed()}.
The same is true of the control and data dependencies discussed in
\cref{sec:memorder:Control Dependencies,sec:memorder:Data Dependencies}.
	However, in all three cases, if you are going to start your
	dependency with an atomic operation, it is often more convenient
	and maintainable to instead use an atomic operation with acquire
	semantics, in this example, \co{xchg_acquire()} or maybe even
	just the fully ordered \co{xchg()}.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-addr-o@whole.fcv}
\caption{Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
\label{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
\Cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
(\path{C-MP+o-wmb-o+o-addr-o.litmus})
shows a linked variant of the message-passing pattern.
The head pointer is \co{x1}, which initially
references the \co{int} variable \co{y} (\clnref{init:x1}), which is in turn
initialized to the value $1$ (\clnref{init:y}).
\co{P0()} updates head pointer \co{x1} to reference \co{x0} (\clnref{P0:x1}),
but only after initializing it to $2$ (\clnref{P0:x0}) and forcing ordering
(\clnref{P0:wmb}).
\co{P1()} picks up the head pointer \co{x1} (\clnref{P1:x1}), and then loads
the referenced value (\clnref{P1:ref}).
There is thus an address dependency from the load on \clnref{P1:x1} to the
load on \clnref{P1:ref}.
In this case, the value returned by \clnref{P1:x1} is exactly the address
used by \clnref{P1:ref}, but many variations are possible,
\end{fcvref}
including field access using the C-language \co{->} operator,
addition, subtraction, and array indexing.\footnote{
But note that in the Linux kernel, the address dependency must
be carried through the pointer to the array, not through the
array index.}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
One might hope that \clnref{P1:x1}'s load from the head pointer would be ordered
before \clnref{P1:ref}'s dereference, which is in fact the case on Linux v4.15
and later.
However, prior to v4.15, this was not the case on DEC Alpha, which could
in effect use a speculated value for the dependent load, as described
in more detail in \cref{sec:memorder:Alpha}.
Therefore, on older versions of Linux,
\cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}'s
\co{exists} clause \emph{can} trigger.
\end{fcvref}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
\begin{VerbatimL}[commandchars=\@\[\]]
C C-MP+o-wmb-o+ld-addr-o
{
y=1;
x1=y;
}
P0(int* x0, int** x1) {
WRITE_ONCE(*x0, 2);
smp_wmb();
WRITE_ONCE(*x1, x0);
}
P1(int** x1) {
int *r2;
int r3;
r2 = lockless_dereference(*x1); // Obsolete @lnlbl[deref]
r3 = READ_ONCE(*r2); @lnlbl[read]
}
exists (1:r2=x0 /\ 1:r3=1)
\end{VerbatimL}
\end{fcvlabel}
\caption{Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
\label{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
\Cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
% \path{C-MP+o-wmb-o+ld-addr-o.litmus} available at commit bc4b1c3f3b35
% ("styleguide: Loosen restriction on comment in litmus test")
shows how to make this work reliably on pre-v4.15 Linux kernels running on
DEC Alpha, by replacing \co{READ_ONCE()} on \clnref{P1:x1} of
\cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
with \apikh{lockless_dereference()},\footnote{
Note that \co{lockless_dereference()} is not needed on v4.15 and
later, and therefore is not available in these later Linux kernels.
Nor is it needed in versions of this book containing this footnote.}
which acts like \co{READ_ONCE()} on all platforms other than DEC Alpha,
where it acts like a \co{READ_ONCE()} followed by an \co{smp_mb()},
thereby forcing the required ordering on all platforms, in turn
preventing the \co{exists} clause from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-S+o-wmb-o+o-addr-o@whole.fcv}
\caption{S Address-Dependency Litmus Test}
\label{lst:memorder:S Address-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
But what happens if the dependent operation is a store rather than
a load, for example, in the \emph{S}
litmus test~\cite{JadeAlglave2011ppcmem} shown in
\cref{lst:memorder:S Address-Dependency Litmus Test}
(\path{C-S+o-wmb-o+o-addr-o.litmus})?
Because no production-quality platform speculates stores,
it is not possible for the \co{WRITE_ONCE()} on \clnref{P0:x0} to overwrite
the \co{WRITE_ONCE()} on \clnref{P1:r2}, meaning that the \co{exists}
clause on \clnref{exists} cannot trigger, even on DEC Alpha, even
in pre-v4.15 Linux kernels.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
But how do we know that \emph{all} platforms really avoid
triggering the \co{exists} clauses in
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
lst:memorder:S Address-Dependency Litmus Test}?
}\QuickQuizAnswerB{
Answering this requires identifying three major groups of platforms:
\begin{enumerate*}[(1)]
\item Total-store-order (TSO) platforms,
\item Weakly ordered platforms, and
\item DEC Alpha.
\end{enumerate*}
\begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
The TSO platforms order all pairs of memory references except for
prior stores against later loads.
Because the address dependency on \clnref{deref,read} of
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
is instead a load followed by another load, TSO platforms preserve
this address dependency.
\end{fcvref}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
They also preserve the address dependency on \clnref{P1:x1,P1:r2} of
\cref{lst:memorder:S Address-Dependency Litmus Test}
because this is a load followed by a store.
Because address dependencies must start with a load, TSO platforms
implicitly but completely respect them, give or take compiler
optimizations, hence the need for \co{READ_ONCE()}.
\end{fcvref}
	Weakly ordered platforms don't necessarily maintain ordering of
	unrelated accesses.
	However, the accesses in
	\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
	lst:memorder:S Address-Dependency Litmus Test} are not unrelated:
	They are linked by address dependencies, which the hardware tracks
	in order to maintain the needed ordering.
\begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
There is one (famous) exception to this rule for weakly ordered
platforms, and that exception is DEC Alpha for load-to-load
address dependencies.
And this is why, in Linux kernels predating v4.15, DEC Alpha
requires the explicit memory barrier supplied for it by the
now-obsolete \apikh{lockless_dereference()} on \clnref{deref} of
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}.
\end{fcvref}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
However, DEC Alpha does track load-to-store address dependencies,
which is why \clnref{P1:x1} of
\cref{lst:memorder:S Address-Dependency Litmus Test}
does not need a \co{lockless_dereference()}, even in Linux
kernels predating v4.15.
\end{fcvref}
To sum up, current platforms either respect address dependencies
implicitly, as is the case for TSO platforms (x86, mainframe,
SPARC,~\dots), have hardware tracking for address dependencies
(\ARM, PowerPC, MIPS,~\dots), have the required memory barriers
supplied by \co{READ_ONCE()} (DEC Alpha in Linux kernel v4.15 and
later), or supplied by
\co{rcu_dereference()} (DEC Alpha in Linux kernel v4.14 and earlier).
}\QuickQuizEndB
%
\QuickQuizM{
Why the use of \co{smp_wmb()} in
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
lst:memorder:S Address-Dependency Litmus Test}?
Wouldn't \co{smp_store_release()} be a better choice?
}\QuickQuizAnswerM{
In most cases, \co{smp_store_release()} is indeed a better choice.
However, \co{smp_wmb()} was there first in the Linux kernel,
so it is still good to understand how to use it.
}\QuickQuizEndM
%
\QuickQuizE{
	SB, MP, LB, and now~S\@.
Where do all these litmus-test abbreviations come from and
how can anyone keep track of them?
}\QuickQuizAnswerE{
The best scorecard is the infamous
\co{test6.pdf}~\cite{test6-pdf}.
Unfortunately, not all of the abbreviations have catchy
expansions like SB (store buffering), MP (message passing),
and LB (load buffering), but at least the list of abbreviations
is readily available.
}\QuickQuizEndE
}
However, it is important to note that address dependencies can
be fragile and easily broken by compiler optimizations, as discussed in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
\subsection{Data Dependencies}
\label{sec:memorder:Data Dependencies}
A \emph{\IXBh{data}{dependency}} occurs when the value returned by a load
instruction is used to compute the data stored by a later store
instruction.
Note well the ``data'' above:
If the value returned by a load was instead used to compute the address
used by a later store instruction, that would instead be an address
dependency, which was covered in
\cref{sec:memorder:Address Dependencies}.
However, the existence of data dependencies means that the exact same
sequence of instructions used to update a linked data structure in
single-threaded code provides weak but extremely useful ordering in
concurrent code.
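The distinction can be seen in the following fragment, in which
\co{pp}, \co{x}, and \co{y} are illustrative:
\begin{VerbatimL}
p = READ_ONCE(pp);
WRITE_ONCE(*p, 1);    /* Address dependency: p supplies the address.    */

r1 = READ_ONCE(*x);
WRITE_ONCE(*y, r1);   /* Data dependency: r1 supplies the stored value. */
\end{VerbatimL}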
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+o-data-o@whole.fcv}
\caption{Load-Buffering Data-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole]
\Cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
(\path{C-LB+o-r+o-data-o.litmus})
is similar to
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
except that \co{P1()}'s ordering between \clnref{ld,st} is
enforced not by an \IX{acquire load}, but instead by a data dependency:
The value loaded by \clnref{ld} is what \clnref{st} stores.
The ordering provided by this data dependency is sufficient to prevent
the \co{exists} clause from triggering.
\end{fcvref}
Just as with address dependencies, data dependencies are
fragile and can be easily broken by compiler optimizations, as discussed in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
In fact, data dependencies can be even more fragile than are address
dependencies.
The reason for this is that address dependencies normally involve
pointer values.
In contrast, as shown in
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test},
it is tempting to carry data dependencies through integral values,
which the compiler has much more freedom to optimize into nonexistence.
For but one example, if the integer loaded was multiplied by the constant
zero, the compiler would know that the result was zero, and could therefore
substitute the constant zero for the value loaded, thus breaking
the dependency.
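In such a case, the compiler is within its rights to replace
\co{r1 * 0} in the following fragment with the constant zero, leaving
no data dependency for the hardware to preserve:
\begin{VerbatimL}
r1 = READ_ONCE(*x);
WRITE_ONCE(*y, r1 * 0);   /* May be emitted as WRITE_ONCE(*y, 0). */
\end{VerbatimL}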
\QuickQuiz{
\begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole]
But wait!!!
\Clnref{ld} of
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
uses \co{READ_ONCE()}, which marks the load as volatile,
which means that the compiler absolutely must emit the load
instruction even if the value is later multiplied by zero.
So how can the compiler possibly break this data dependency?
\end{fcvref}
}\QuickQuizAnswer{
Yes, the compiler absolutely must emit a load instruction for
a volatile load.
But if you multiply the value loaded by zero, the compiler is
well within its rights to substitute a constant zero for the
result of that multiplication, which will break the data
dependency on many platforms.
Worse yet, if the dependent store does not use \co{WRITE_ONCE()},
the compiler could hoist it above the load, which would cause
even TSO platforms to fail to provide ordering.
}\QuickQuizEnd
In short, you can rely on data dependencies only if you prevent the
compiler from breaking them.
\subsection{Control Dependencies}
\label{sec:memorder:Control Dependencies}
A \emph{\IXBh{control}{dependency}} occurs when the value returned by a load
instruction is tested to determine whether or not a later store instruction
is executed.
In other words, a simple conditional branch or conditional-move
instruction can act as a weak but low-overhead memory-barrier instruction.
However, note well the ``later store instruction'':
Although all platforms respect load-to-store dependencies, many platforms
do \emph{not} respect load-to-load control dependencies.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+o-ctrl-o@whole.fcv}
\caption{Load-Buffering Control-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+o-ctrl-o:whole]
\Cref{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
(\path{C-LB+o-r+o-ctrl-o.litmus})
shows another load-buffering example, this time using a control
dependency (\clnref{if}) to order the load on \clnref{ld} and the store on
\clnref{st}.
This ordering is sufficient to prevent the \co{exists} clause from triggering.
\end{fcvref}
However, control dependencies are even more susceptible to being optimized
out of existence than are data dependencies, and
\cref{sec:memorder:Control-Dependency Calamities}
describes some of the rules that must be followed in order to prevent
your compiler from breaking your control dependencies.
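For example, if both branches store the same value, as in the following
sketch, the compiler may eliminate the conditional branch entirely:
\begin{VerbatimL}
q = READ_ONCE(*x);
if (q)
	WRITE_ONCE(*y, 1);
else
	WRITE_ONCE(*y, 1);
/* Both branches store the same value, so the compiler may  */
/* eliminate the branch, leaving the CPU free to reorder    */
/* the store to y with the load from x.                     */
\end{VerbatimL}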
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-r+o-ctrl-o@whole.fcv}
\caption{Message-Passing Control-Dependency Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-r+o-ctrl-o:whole]
It is worth reiterating that control dependencies provide ordering only
from loads to stores.
Therefore, the load-to-load control dependency shown on \clnrefrange{ld1}{ld2} of
\cref{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
(\path{C-MP+o-r+o-ctrl-o.litmus})
does \emph{not} provide ordering, and therefore does \emph{not}
prevent the \co{exists} clause from triggering.
\end{fcvref}
In summary, control dependencies can be useful, but they are
high-maintenance items.
You should therefore use them only when performance considerations
permit no other solution.
\QuickQuiz{
Wouldn't control dependencies be more robust if they were
mandated by language standards???
}\QuickQuizAnswer{
But of course!
And perhaps in the fullness of time they will be so mandated.
}\QuickQuizEnd
\subsection{Cache Coherence}
\label{sec:memorder:Cache Coherence}
On cache-coherent platforms, all CPUs agree on the order of loads and
stores to a given variable.
Fortunately, when \co{READ_ONCE()} and \co{WRITE_ONCE()} are used,
almost all platforms are cache-coherent, as indicated by the ``SV''
column of the cheat sheet shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
Unfortunately, this property is so popular that it has been named
multiple times, with ``single-variable SC'',\footnote{
Recall that SC stands for sequentially consistent.}
``single-copy atomic''~\cite{Stone:1995:SP:623262.623912},
and just plain ``coherence''~\cite{JadeAlglave2011ppcmem}
having seen use.
Rather than further compound the confusion by inventing yet another term
for this concept, this book uses ``\IX{cache coherence}'' and ``coherence''
interchangeably.
\begin{listing}
\input{CodeSamples/formal/litmus/C-CCIRIW+o+o+o-o+o-o@whole.fcv}
\caption{Cache-Coherent IRIW Litmus Test}
\label{lst:memorder:Cache-Coherent IRIW Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-CCIRIW+o+o+o-o+o-o:whole]
\Cref{lst:memorder:Cache-Coherent IRIW Litmus Test}
(\path{C-CCIRIW+o+o+o-o+o-o.litmus})
shows a litmus test that tests for cache coherence,
where ``IRIW'' stands
for ``independent reads of independent writes''.
Because this litmus test uses only one variable,
\co{P2()} and \co{P3()} must agree
on the order of \co{P0()}'s and \co{P1()}'s stores.
In other words, if \co{P2()} believes that \co{P0()}'s store
came first, then \co{P3()} had better not believe that
\co{P1()}'s store came first.
And in fact the \co{exists} clause on \clnref{exists} will trigger if this
situation arises.
\end{fcvref}
\QuickQuiz{
But in
\cref{lst:memorder:Cache-Coherent IRIW Litmus Test},
	wouldn't it be just as bad if \co{P2()}'s \co{r1} and \co{r2}
obtained the values~2 and~1, respectively, while \co{P3()}'s
\co{r3} and \co{r4} obtained the values~1 and~2, respectively?
}\QuickQuizAnswer{
Yes, it would.
Feel free to modify the \co{exists} clause to
check for that outcome and see what happens.
}\QuickQuizEnd
It is tempting to speculate that different-sized overlapping loads
and stores to a single region of memory (as might be set up using
the C-language \co{union} keyword) would provide similar ordering
guarantees.
However, Flur et al.~\cite{Flur:2017:MCA:3093333.3009839} discovered some
surprisingly simple litmus tests that demonstrate that such guarantees
can be violated on real hardware.
It is therefore necessary to restrict code to non-overlapping
same-sized aligned accesses to a given variable, at least if portability
is a consideration.\footnote{
There is reason to believe that using atomic RMW operations
(for example, \co{xchg()}) for all the stores will
provide sequentially consistent ordering, but this has not
yet been proven either way.}
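For example, a region of memory admitting mixed-size accesses might be
set up as in the following sketch, in which the \co{union mixed} type
is illustrative; portable code should not expect coherence-style
guarantees when loads of \co{->full} race with stores to \co{->byte[]}:
\begin{VerbatimL}
union mixed {
	unsigned long full;
	unsigned char byte[sizeof(unsigned long)];
};
\end{VerbatimL}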
Adding more variables and threads increases the scope for reordering
and other counter-intuitive behavior, as discussed in the next section.
\subsection{Multicopy Atomicity}
\label{sec:memorder:Multicopy Atomicity}
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/SystemArchBus}}
\caption{Global System Bus And Multi-Copy Atomicity}
\label{fig:memorder:Global System Bus And Multi-Copy Atomicity}
\end{figure}
Threads running on a fully
\emph{multicopy atomic}~\cite{Stone:1995:SP:623262.623912}
platform are guaranteed
to agree on the order of stores, even to different variables.
A useful mental model of such a system is the single-bus architecture
shown in
\cref{fig:memorder:Global System Bus And Multi-Copy Atomicity}.
If each store resulted in a message on the bus, and if the bus could
accommodate only one store at a time, then any pair of CPUs would
agree on the order of all stores that they observed.
Unfortunately, building a computer system as shown in the figure, without
store buffers or even caches, would result in glacially slow computation.
Most CPU vendors interested in providing
\IXBalt{multicopy atomicity}{multi-copy atomic} therefore
instead provide the slightly weaker
\emph{other-multicopy atomicity}~\cite[Section B2.3]{ARMv8A:2017},
which excludes the CPU doing a given store from the requirement that all
CPUs agree on the order of all stores.\footnote{
As of early 2021, \ARMv8 and x86 provide other-multicopy atomicity,
IBM mainframe provides full multicopy atomicity, and PPC
provides no multicopy atomicity at all.
More detail is shown in
\cref{tab:memorder:Summary of Memory Ordering}
on
\cpageref{tab:memorder:Summary of Memory Ordering}.}
This means that if only a subset of CPUs are doing stores, the
other CPUs will agree on the order of stores, hence the ``other''
in ``\IXBalthy{other-multicopy atomicity}{other-}{multi-copy atomic}''.
In contrast with multicopy-atomic platforms, other-multicopy-atomic
platforms permit the CPU doing a given store to observe that
store early, which allows its later loads to obtain the newly stored
value directly from the store buffer, thus improving performance.
\QuickQuiz{
Can you give a specific example showing different behavior for
multicopy atomic on the one hand and other-multicopy atomic
on the other?
}\QuickQuizAnswer{
\Cref{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
(\path{C-MP-OMCA+o-o-o+o-rmb-o.litmus})
shows such a test.
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP-OMCA+o-o-o+o-rmb-o@whole.fcv}
\caption{Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\label{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\end{listing}
\begin{fcvref}[ln:formal:C-MP-OMCA+o-o-o+o-rmb-o:whole]
On a multicopy-atomic platform, \co{P0()}'s store to \co{x} on
\clnref{P0:st} must become visible to both \co{P0()} and \co{P1()}
simultaneously.
Because this store becomes visible to \co{P0()} on \clnref{P0:ld}, before
\co{P0()}'s store to \co{y} on \clnref{P0:y}, \co{P0()}'s store to
\co{x} must become visible before its store to \co{y} everywhere,
including \co{P1()}.
Therefore, if \co{P1()}'s load from \co{y} on \clnref{P1:y} returns the
value 1, so must its load from \co{x} on \clnref{P1:x}, given that
the \co{smp_rmb()} on \clnref{P1:rmb} forces these two loads to execute
in order.
Therefore, the \co{exists} clause on \clnref{exists} cannot trigger on a
multicopy-atomic platform.
\end{fcvref}
	In contrast, on an other-multicopy-atomic platform, \co{P0()}
	could see its own store early, so that there would be no constraint
	on the order in which the two stores become visible to \co{P1()},
	which in turn allows the \co{exists} clause to trigger.
}\QuickQuizEnd
Perhaps there will come a day when all platforms provide some flavor
of multi-copy atomicity, but
in the meantime, \IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic}
platforms do exist, and so software must deal with them.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WRC+o+o-data-o+o-rmb-o@whole.fcv}
\caption{WRC Litmus Test With Dependencies (No Ordering)}
\label{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
\Cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
(\path{C-WRC+o+o-data-o+o-rmb-o.litmus})
demonstrates multicopy atomicity, that is, on a multicopy-atomic platform,
the \co{exists} clause on \clnref{exists} cannot trigger.
In contrast, on a non-multicopy-atomic
platform this \co{exists} clause can trigger, despite
\co{P1()}'s accesses being ordered by a data dependency and \co{P2()}'s
accesses being ordered by an \co{smp_rmb()}.
Recall that the definition of multicopy atomicity requires that all
threads agree on the order of stores, which can be thought of as
all stores reaching all threads at the same time.
Therefore, a non-multicopy-atomic platform can have a store reach
different threads at different times.
In particular, \co{P0()}'s store might reach \co{P1()} long before it
reaches \co{P2()}, which raises the possibility that \co{P1()}'s store
might reach \co{P2()} before \co{P0()}'s store does.
\end{fcvref}
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/NonMCAplatform}}
\caption{Shared Store Buffers And Multi-Copy Atomicity}
\label{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}
\end{figure}
This leads to the question of why a real system constrained by the
usual laws of physics would ever trigger the \co{exists} clause of
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}.
A cartoonish diagram of such a real system is shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
CPU~0 and CPU~1 share a store buffer, as do CPUs~2 and~3.
This means that CPU~1 can load a value out of the store buffer, thus
potentially immediately seeing a value stored by CPU~0.
In contrast, CPUs~2 and~3 will have to wait for the corresponding \IX{cache
line} to carry this new value to them.
\QuickQuiz{
Then who would even \emph{think} of designing a system with shared
store buffers???
}\QuickQuizAnswer{
This is in fact a very natural design for any system having
multiple hardware threads per core.
Natural from a hardware point of view, that is!
}\QuickQuizEnd
\begin{table*}
\small
\centering\OneColumnHSpace{-0.8in}
\renewcommand*{\arraystretch}{1.1}
\rowcolors{10}{}{lightgray}
\ebresizewidth{
\begin{tabular}{rlllllll}\toprule
& \multicolumn{1}{c}{\tco{P0()}} & \multicolumn{2}{c}{\tco{P0()} \& \tco{P1()}} &
\multicolumn{1}{c}{\tco{P1()}} & \multicolumn{3}{c}{\tco{P2()}} \\
\cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(lr){5-5} \cmidrule(l){6-8}
& Instruction & Store Buffer & Cache & Instruction &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4}
\cmidrule(lr){5-5} \cmidrule(l){6-8}
1 & (Initial state) & & \tco{y==0} &
(Initial state) &
(Initial state) & & \tco{x==0} \\
2 & \tco{x = 1;} & \tco{x==1} & \tco{y==0} &
& & & \tco{x==0} \\
3 & (Read-Invalidate \tco{x}) & \tco{x==1} & \tco{y==0} & \tco{r1 = x} (1)
& & & \tco{x==0} \\
4 & & \tco{x==1} \tco{y==1} & \tco{y==0} & \tco{y = r1}
& \tco{r2 = y} & & \tco{x==0} \\
5 & & \tco{x==1} & \tco{y==1} & (Finish store)
& (Read \tco{y}) & & \tco{x==0} \\
6 & (Respond \tco{y}) & \tco{x==1} & \tco{y==1} &
& (\tco{r2==1}) & & \tco{x==0} \tco{y==1} \\
7 & & \tco{x==1} & \tco{y==1} &
& \tco{smp_rmb()} & & \tco{x==0} \tco{y==1} \\
8 & & \tco{x==1} & \tco{y==1} &
& \tco{r3 = x} (0) & & \tco{x==0} \tco{y==1} \\
9 & & \tco{x==1} & \tco{x==0} \tco{y==1} &
& (Respond \tco{x}) & & \tco{y==1} \\
10 & (Finish store) & & \tco{x==1} \tco{y==1} &
& & & \tco{y==1} \\
\bottomrule
\end{tabular}
}
\caption{Memory Ordering:
WRC Sequence of Events}
\label{tab:memorder:Memory Ordering: WRC Sequence of Events}
\end{table*}
\Cref{tab:memorder:Memory Ordering: WRC Sequence of Events}
shows one sequence of events that can result in the \co{exists} clause in
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
triggering.
This sequence of events will depend critically on \co{P0()} and
\co{P1()} sharing both cache and a store buffer in the manner shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
\QuickQuiz{
But just how is it fair that \co{P0()} and \co{P1()} must share a store
buffer and a cache, but \co{P2()} gets one each of its very own???
}\QuickQuizAnswer{
Presumably there is a \co{P3()}, as is in fact shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity},
that shares \co{P2()}'s store buffer and cache.
But not necessarily.
Some platforms allow different cores to disable different numbers
of threads, allowing the hardware to adjust to the needs of the
workload at hand.
For example, a single-threaded critical-path portion of the workload
might be assigned to a core with only one thread enabled, thus
allowing the single thread running that portion of the workload
to use the entire capabilities of that core.
Other more highly parallel but cache-miss-prone portions of the
workload might be assigned to cores with all hardware threads
enabled to provide improved throughput.
This improved throughput could be due to the fact that while one
hardware thread is stalled on a cache miss, the other hardware
threads can make forward progress.
In such cases, performance requirements override quaint human
notions of fairness.
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
Row~1 shows the initial state, with the initial value of \co{y} in
\co{P0()}'s and \co{P1()}'s shared cache, and the initial value of \co{x} in
\co{P2()}'s cache.
Row~2 shows the immediate effect of \co{P0()} executing its store on \clnref{P0:x}.
Because the cacheline containing \co{x} is not in \co{P0()}'s and \co{P1()}'s
shared cache, the new value (\co{1}) is stored in the shared store buffer.
Row~3 shows two transitions.
First, \co{P0()} issues a read-invalidate operation to fetch the cacheline
containing \co{x} so that it can flush the new value for \co{x} out of
the shared store buffer.
Second, \co{P1()} loads from \co{x} (\clnref{P1:x}), an operation that completes
immediately because the new value of \co{x} is immediately available
from the shared store buffer.
Row~4 also shows two transitions.
First, it shows the immediate effect of \co{P1()} executing its store to
\co{y} (\clnref{P1:y}), placing the new value into the shared store buffer.
Second, it shows the start of \co{P2()}'s load from \co{y} (\clnref{P2:y}).
Row~5 continues the tradition of showing two transitions.
First, it shows \co{P1()} complete its store to \co{y}, flushing
from the shared store buffer to the cache.
Second, it shows \co{P2()} request the cacheline containing \co{y}.
Row~6 shows \co{P2()} receive the cacheline containing \co{y}, allowing
it to finish its load into \co{r2}, which takes on the value \co{1}.
Row~7 shows \co{P2()} execute its \co{smp_rmb()} (\clnref{P2:rmb}), thus keeping
its two loads ordered.
Row~8 shows \co{P2()} execute its load from \co{x}, which immediately
returns with the value zero from \co{P2()}'s cache.
Row~9 shows \co{P2()} \emph{finally} responding to \co{P0()}'s request for
the cacheline containing \co{x}, which was made way back up on row~3.
Finally, row~10 shows \co{P0()} finish its store, flushing its value of
\co{x} from the shared store buffer to the shared cache.
Note well that the \co{exists} clause on \clnref{exists} has triggered.
The values of \co{r1} and \co{r2} are both the value one, and
the final value of \co{r3} is the value zero.
This strange result occurred because \co{P0()}'s new value of \co{x} was
communicated to \co{P1()} long before it was communicated to \co{P2()}.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
Referring to
\cref{tab:memorder:Memory Ordering: WRC Sequence of Events},
why on earth would \co{P0()}'s store take so long to complete when
	\co{P1()}'s store completes so quickly?
In other words, does the \co{exists} clause on \clnref{exists} of
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
really trigger on real systems?
\end{fcvref}
}\QuickQuizAnswer{
You need to face the fact that it really can trigger.
Akira Yokosawa used the \co{litmus7} tool to run this litmus test
on a \Power{8} system.
Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause.
	Thus, triggering the \co{exists} clause is not merely a one-in-a-million
	occurrence, but rather roughly a one-in-250-million occurrence.
But it nevertheless really does trigger on real systems.
}\QuickQuizEnd
This counter-intuitive result happens because although dependencies
do provide ordering, they provide it only within the confines of their
own thread.
This three-thread example requires stronger ordering, which
is the subject of
\crefthro{sec:memorder:Cumulativity}
{sec:memorder:Release-Acquire Chains}.
\subsubsection{Cumulativity}
\label{sec:memorder:Cumulativity}
The three-thread example shown in
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
requires \emph{cumulative} ordering, or \emph{cumulativity}.
A cumulative memory-ordering operation orders not just any given
access preceding it, but also earlier accesses by any thread to that
same variable.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WRC+o+o-r+a-o@whole.fcv}
\caption{WRC Litmus Test With Release}
\label{lst:memorder:WRC Litmus Test With Release}
\end{listing}
Dependencies do not provide cumulativity,
which is why the ``C'' column is blank for the \co{READ_ONCE()} row
of \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
on
\cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
However, as indicated by the ``C'' in their ``C'' column,
release operations do provide cumulativity.
Therefore,
\cref{lst:memorder:WRC Litmus Test With Release}
(\path{C-WRC+o+o-r+a-o.litmus})
substitutes a release operation for
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}'s
data dependency.
\begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole]
Because the release operation is cumulative, its ordering applies not only to
\cref{lst:memorder:WRC Litmus Test With Release}'s
load from \co{x} by \co{P1()} on \clnref{P1:x}, but also to the store to \co{x}
by \co{P0()} on \clnref{P0:x}---but only if that load returns the value stored,
which matches the \co{1:r1=1} in the \co{exists} clause on \clnref{exists}.
This means that \co{P2()}'s load-acquire suffices to force the
load from \co{x} on \clnref{P2:x} to happen after the store on \clnref{P0:x}, so
the value returned is one, which does not match \co{2:r3=0}, which
in turn prevents the \co{exists} clause from triggering.
\end{fcvref}
\begin{figure*}
\centering
\includegraphics{memorder/memorybarriercum}
\caption{Cumulativity}
\label{fig:memorder:Cumulativity}
\end{figure*}
\begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole]
These ordering constraints are depicted graphically in
\cref{fig:memorder:Cumulativity}.
Note also that cumulativity is not limited to a single step back in time.
If there were another load from \co{x} or store to \co{x} from any thread
that came before the store on \clnref{P0:x}, that prior load or store would also
be ordered before the load on \clnref{P2:x}, though only if both \co{r1} and
\co{r2} end up containing the value \co{1}.
\end{fcvref}
In short, use of cumulative ordering operations can suppress
non-multicopy-atomic behaviors in some situations.
Cumulativity nevertheless has limits, which are examined in the next section.
\subsubsection{Propagation}
\label{sec:memorder:Propagation}
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
\Cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
(\path{C-W+RWC+o-r+a-o+o-mb-o.litmus})
shows the limitations of cumulativity and store-release,
even with a full memory barrier.
The problem is that although the \co{smp_store_release()} on
\clnref{P0:sr} has cumulativity, and although that cumulativity does
order \co{P2()}'s load on \clnref{P2:ld}, the \co{smp_store_release()}'s
ordering cannot propagate through the combination of \co{P1()}'s
load (\clnref{P1:ld}) and \co{P2()}'s store (\clnref{P2:st}).
This means that the \co{exists} clause on \clnref{exists} really can trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-W+RWC+o-r+a-o+o-mb-o@whole.fcv}
\caption{W+RWC Litmus Test With Release (No Ordering)}
\label{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
\end{listing}
\QuickQuiz{
But it is not necessary to worry about propagation unless
there are at least three threads in the litmus test, right?
}\QuickQuizAnswer{
Wrong.
\begin{listing}
\input{CodeSamples/formal/litmus/C-R+o-wmb-o+o-mb-o@whole.fcv}
\caption{R Litmus Test With Write Memory Barrier (No Ordering)}
\label{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-R+o-wmb-o+o-mb-o:whole]
\Cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
(\path{C-R+o-wmb-o+o-mb-o.litmus})
shows a two-thread litmus test that requires propagation due to
the fact that it only has store-to-store and load-to-store
links between its pair of threads.
Even though \co{P0()} is fully ordered by the \co{smp_wmb()} and
\co{P1()} is fully ordered by the \co{smp_mb()}, the
counter-temporal nature of the links means that
the \co{exists} clause on \clnref{exists} really can trigger.
To prevent this triggering, the \co{smp_wmb()} on \clnref{wmb}
must become an \co{smp_mb()}, bringing propagation into play
twice, once for each non-temporal link.
\end{fcvref}
}\QuickQuizEnd
\QuickQuizLabel{\MemorderQQLitmusTestR}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/fr}}
\caption{Load-to-Store is Counter-Temporal}
\label{fig:memorder:Load-to-Store is Counter-Temporal}
\end{figure}
This situation might seem completely counter-intuitive, but keep
in mind that the speed of light is finite and computers are of
non-zero size.
It therefore takes time for the effect of \co{P2()}'s store to
\co{x} to propagate to \co{P1()}, which in turn means that it is possible
that \co{P1()}'s read from \co{x} happens much later in time, but
nevertheless still sees the old value of zero.
This situation is depicted in
\cref{fig:memorder:Load-to-Store is Counter-Temporal}:
Just because a load sees the old value does \emph{not} mean that
this load executed at an earlier time than did the store of the
new value.
Note that
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
also shows the limitations of memory-barrier pairing, given that
there are not two but three processes.
These more complex litmus tests can instead be said to have \emph{cycles},
where memory-barrier pairing is the special case of a two-thread cycle.
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
The cycle in
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
goes through \co{P0()} (\clnref{P0:st,P0:sr}), \co{P1()} (\clnref{P1:la,P1:ld}),
\co{P2()} (\clnref{P2:st,P2:mb,P2:ld}), and back to \co{P0()} (\clnref{P0:st}).
The \co{exists} clause delineates this cycle:
The \co{1:r1=1} indicates that the \co{smp_load_acquire()} on \clnref{P1:la}
returned the value stored by the \co{smp_store_release()} on \clnref{P0:sr},
the \co{1:r2=0} indicates that the \co{WRITE_ONCE()} on \clnref{P2:st} came
too late to affect the value returned by the \co{READ_ONCE()} on \clnref{P1:ld},
and finally the \co{2:r3=0} indicates that the
\co{WRITE_ONCE()} on \clnref{P0:st} came too late to affect the value returned
by the \co{READ_ONCE()} on \clnref{P2:ld}.
In this case, the fact that the \co{exists} clause can trigger means that
the cycle is said to be \emph{allowed}.
In contrast, in cases where the \co{exists} clause cannot trigger,
the cycle is said to be \emph{prohibited}.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-W+RWC+o-mb-o+a-o+o-mb-o@whole.fcv}
\caption{W+RWC Litmus Test With More Barriers}
\label{lst:memorder:W+WRC Litmus Test With More Barriers}
\end{listing}
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
But what if we need to prohibit the cycle corresponding to the \co{exists}
clause on \clnref{exists} of
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}?
One solution is to replace \co{P0()}'s \co{smp_store_release()}
with an \co{smp_mb()}, which
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
shows to have not only cumulativity, but also propagation.
\end{fcvref}
The result is shown in
\cref{lst:memorder:W+WRC Litmus Test With More Barriers}
(\path{C-W+RWC+o-mb-o+a-o+o-mb-o.litmus}).
\QuickQuiz{
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
But given that \co{smp_mb()} has the propagation property,
why doesn't the \co{smp_mb()} on \clnref{P2:mb} of
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
prevent the \co{exists} clause from triggering?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
As a rough rule of thumb, the \co{smp_mb()} barrier's
propagation property is sufficient to maintain ordering
through only one load-to-store link between
processes.
Unfortunately,
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
has not one but two load-to-store links, with the
first being from the \co{READ_ONCE()} on \clnref{P1:ld} to the
\co{WRITE_ONCE()} on \clnref{P2:st} and the second being from
the \co{READ_ONCE()} on \clnref{P2:ld} to the \co{WRITE_ONCE()}
on \clnref{P0:st}.
Therefore, preventing the \co{exists} clause from triggering
should be expected to require not one but two
instances of \co{smp_mb()}.
\end{fcvref}
As a special exception to this rule of thumb, a release-acquire
chain can have one load-to-store link between processes
and still prohibit the cycle.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/co}}
\caption{Store-to-Store is Counter-Temporal}
\label{fig:memorder:Store-to-Store is Counter-Temporal}
\end{figure}
For completeness,
\cref{fig:memorder:Store-to-Store is Counter-Temporal}
shows that the ``winning'' store among a group of stores to the
same variable is not necessarily the store that started last.
This should not come as a surprise to anyone who carefully examined
\cref{fig:memorder:A Variable With More Simultaneous Values}
on
\cpageref{fig:memorder:A Variable With More Simultaneous Values}.
One way to rationalize the counter-temporal properties of both
load-to-store and store-to-store ordering is to clearly distinguish
between the temporal order in which the store instructions executed on
the one hand, and the order in which the corresponding cacheline visited
the CPUs that executed those instructions on the other.
It is the cacheline-visitation order that defines the externally
visible ordering of the actual stores.
This cacheline-visitation order is not directly visible to the code
executing the store instructions, which results in the counter-intuitive
counter-temporal nature of load-to-store and store-to-store ordering.\footnote{
In some hardware-multithreaded systems, the store would become
visible to other CPUs in that same core as soon as the store
reached the shared store buffer.
As a result, such systems are non-multicopy atomic.}
\begin{listing}
\input{CodeSamples/formal/litmus/C-2+2W+o-wmb-o+o-wmb-o@whole.fcv}
\caption{2+2W Litmus Test With Write Barriers}
\label{lst:memorder:2+2W Litmus Test With Write Barriers}
\end{listing}
\QuickQuiz{
But for litmus tests having only ordered stores, as shown in
\cref{lst:memorder:2+2W Litmus Test With Write Barriers}
(\path{C-2+2W+o-wmb-o+o-wmb-o.litmus}),
research shows that the cycle is prohibited, even in weakly
ordered systems such as \ARM\ and Power~\cite{test6-pdf}.
Given that, is store-to-store ordering really \emph{always}
counter-temporal???
}\QuickQuizAnswer{
This litmus test is indeed a very interesting curiosity.
Its ordering apparently occurs naturally given typical
weakly ordered hardware design, which would normally be
considered a great gift from the relevant laws of physics
and cache-coherency-protocol mathematics.
\begin{fcvref}[ln:formal:C-2+2W+o-wmb-o+o-wmb-o:whole]
Unfortunately, no one has been able to come up with a software use
case for this gift that does not have a much better alternative
implementation.
	Therefore, neither the C11 nor the Linux-kernel memory model
	provides any guarantee corresponding to
\cref{lst:memorder:2+2W Litmus Test With Write Barriers}.
This means that the \co{exists} clause on \clnref{exists} can
trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-2+2W+o-o+o-o@whole.fcv}
\caption{2+2W Litmus Test (No Ordering)}
\label{lst:memorder:2+2W Litmus Test (No Ordering)}
\end{listing}
Of course, without the barrier, there are no ordering
guarantees, even on real weakly ordered hardware, as shown in
\cref{lst:memorder:2+2W Litmus Test (No Ordering)}
(\path{C-2+2W+o-o+o-o.litmus}).
}\QuickQuizEnd
But sometimes time really is on our side.
Read on!
\subsubsection{Happens-Before}
\label{sec:memorder:Happens-Before}
As shown in
\cref{fig:memorder:Store-to-Load is Temporal},
on platforms without user-visible speculation, if a load returns the value
from a particular store, then, courtesy of the finite speed of light and
the non-zero size of modern computing systems, the store absolutely has
to have executed at an earlier time than did the load.
This means that carefully constructed programs can rely on the
passage of time itself as a memory-ordering operation.
\QuickQuiz{
Why don't we just stick to sanely ordered CPU families like x86,
so that time will \emph{always} be on our side???
}\QuickQuizAnswer{
Sorry to be the one to break it to you, but x86 CPUs have
store buffers and they are not afraid to use them.
The data shown in
\crefrange{fig:memorder:x86 CPUs Can Disagree}{fig:memorder:Store-to-Load is Temporal on x86}
was obtained from a series of torture tests
(\path{perftemporal.sh}) running on a two-socket x86 system
having a total of 80 hardware
threads.\footnote{
\co{Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz},
for those wanting more details.}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe-nvals}}
\caption{x86 CPUs Can Disagree}
\label{fig:memorder:x86 CPUs Can Disagree}
\end{figure}
\Cref{fig:memorder:x86 CPUs Can Disagree}
is from a torture-test run similar to the one that generated
\cref{fig:memorder:A Variable With More Simultaneous Values}
on
\cpageref{fig:memorder:A Variable With More Simultaneous Values},
but showing the number of distinct opinions per unit time, where
each timestamp period is 0.5~nanoseconds.\footnote{
Recall that each CPU writes its own number to a single
shared variable, which each CPU repeatedly polls,
recording time and value at each change in value.}
As you can see, there is a significant period of time during
which there are more than 40 distinct opinions as to the value
of a single shared variable.
Even on x86.
\begin{figure}
\centering
\resizebox{.85\twocolumnwidth}{!}{\includegraphics{memorder/co-hopes}}
\caption{Is Store-to-Store Counter-Temporal on x86?}
\label{fig:memorder:Is Store-to-Store Counter-Temporal on x86?}
\end{figure}
But perhaps we might hope that on x86, the last CPU to execute its
store in global time order would ``win'', that is, the final value
of the shared variable would be that of that last store.
If so, any store starting after the ``winning'' store finished would
overwrite the winning store.
In contrast, we know that on weakly ordered systems, a
counter-intuitive store such as that shown at the bottom of
\cref{fig:memorder:Is Store-to-Store Counter-Temporal on x86?}
could be overwritten by the old value, despite the fact that it
started $t$ timestamp periods after the winning store completed,
as depicted in the figure by the time interval that is labelled
$-t$.\footnote{
That is completed from the viewpoint of the instruction
stream containing that store.
The value stored might well remain in the store buffer
long afterwards.}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe}}
\caption{Store-to-Store is Counter-Temporal on x86}
\label{fig:memorder:Store-to-Store is Counter-Temporal on x86}
\end{figure}
However,
\cref{fig:memorder:Store-to-Store is Counter-Temporal on x86}
dashes any fond hope of x86 refusing to indulge in
counter-intuitive counter-temporal stores.
The data in this figure summarizes the results from 79,000
torture-test runs of the type that generated
\cref{fig:memorder:x86 CPUs Can Disagree},
and is a histogram of the minimum time elapsed between the start of
a non-winning CPU's store and the end of the winning CPU's store.
Of course, if this value is negative, the winning store completed
(though its value might not have propagated past the store buffer)
before the other store even started.
And the negative-time data points in that figure show that this
counter-temporal behavior really can be made to happen most of
the time, even on x86.
\begin{figure}
\centering
\resizebox{.6\twocolumnwidth}{!}{\includegraphics{memorder/fr-hopes}}
\caption{Is Load-to-Store Counter-Temporal on x86?}
\label{fig:memorder:Is Load-to-Store Counter-Temporal on x86?}
\end{figure}
But perhaps we could instead hope that once a store had
	executed, any future load from that same variable would
be guaranteed to return the new value.
In contrast, we know that on weakly ordered systems, a
counter-intuitive load such as that shown at the bottom
of
\cref{fig:memorder:Is Load-to-Store Counter-Temporal on x86?}
could return the old value, despite having started $t$
timestamp periods after the end of the store, again, as
depicted in the figure by the $-t$.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/fre}}
\caption{Load-to-Store is Counter-Temporal on x86}
\label{fig:memorder:Load-to-Store is Counter-Temporal on x86}
\end{figure}
However,
\cref{fig:memorder:Load-to-Store is Counter-Temporal on x86}
dashes any fond hopes that loads executing after a given store
would see that store's value (or some later value).
This data is generated by 1,000 runs of a program in which the
parent thread spawns the children, each of which polls a shared
variable.
After all the children are running, the parent writes a new
value to the shared variable.
The quantity histogrammed in the figure is the time from just
after the store until just before the last load of the old value.
And most of these data points lie on the negative x~axis,
clearly demonstrating that a torture test can easily provoke
the counter-temporal nature of cross-thread load-to-store.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf-hopes}}
\caption{Is Store-to-Load Counter-Temporal on x86?}
\label{fig:memorder:Is Store-to-Load Counter-Temporal on x86?}
\end{figure}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/rfe}}
\caption{Store-to-Load is Temporal on x86}
\label{fig:memorder:Store-to-Load is Temporal on x86}
\end{figure}
It is only reasonable to assume that a load that ends before
a store starts will be unable to load that store's value,
as shown in
\cref{fig:memorder:Is Store-to-Load Counter-Temporal on x86?},
even on weakly ordered systems.
And the data points from all 1,000 runs shown in
	\cref{fig:memorder:Store-to-Load is Temporal on x86}
lie on the positive x~axis, so in at least this case, our temporal
intuitions, hopes, and dreams have been fulfilled.
These results should be no surprise.
Even on x86, the cache line shuttles between cores and sockets, and
\cref{fig:memorder:x86 CPUs Can Disagree}
indicates that a given store can remain in its CPU's store buffer
for more than a microsecond, which is quite enough time for
today's multi-GHz CPUs to detect the resulting counter-temporal
behavior.
However, for the store-to-load case, the laws of physics
guarantee temporal ordering because the finite speed of light
and the non-zero size of atoms work together to guarantee that
some time will pass between a store on one CPU and a load of
that store's value on some other CPU.
Alert readers may have noticed that the distribution shown in
\cref{fig:memorder:Store-to-Store is Counter-Temporal on x86}
	is nearly monomodal, while those in
\cref{fig:memorder:Load-to-Store is Counter-Temporal on x86}
and
\cref{fig:memorder:Store-to-Load is Temporal on x86}
are decidedly trimodal.
Such readers are encouraged to consider why that might be,
perhaps referring to \path{CodeSamples/cpu/perftemporal.sh} and
the scripts and programs that it invokes.
And all readers are encouraged to note that even on relatively
strongly ordered CPUs such as x86, store-to-store and
load-to-store IPC links can be strongly counter-temporal.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf}}
\caption{Store-to-Load is Temporal}
\label{fig:memorder:Store-to-Load is Temporal}
\end{figure}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+a-o+o-data-o+o-data-o@whole.fcv}
\caption{LB Litmus Test With One Acquire}
\label{lst:memorder:LB Litmus Test With One Acquire}
\end{listing}
Of course, just the passage of time by itself is not enough, as
was seen in
\cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
on
\cpageref{lst:memorder:Load-Buffering Litmus Test (No Ordering)},
which has nothing but store-to-load links and, because it provides
absolutely no ordering, still can trigger its \co{exists} clause.
However, as long as each thread provides even the weakest possible
ordering, the \co{exists} clause would not be able to trigger.
For example,
\cref{lst:memorder:LB Litmus Test With One Acquire}
(\path{C-LB+a-o+o-data-o+o-data-o.litmus})
shows \co{P0()} ordered with an \co{smp_load_acquire()} and
both \co{P1()} and \co{P2()} ordered with data dependencies.
These orderings, which are close to the top of
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet},
suffice to prevent the \co{exists} clause from triggering.
\QuickQuiz{
Can you construct a litmus test like that in
\cref{lst:memorder:LB Litmus Test With One Acquire}
that uses \emph{only} dependencies?
}\QuickQuizAnswer{
\Cref{lst:memorder:LB Litmus Test With No Acquires}
shows a somewhat nonsensical but very real example.
Creating a more useful (but still real) litmus test is left
as an exercise for the reader.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-data-o+o-data-o+o-data-o@whole.fcv}
\caption{LB Litmus Test With No Acquires}
\label{lst:memorder:LB Litmus Test With No Acquires}
\end{listing}
}\QuickQuizEnd
An important use of time for ordering memory accesses is covered in the
next section.
\subsubsection{Release-Acquire Chains}
\label{sec:memorder:Release-Acquire Chains}
A minimal release-acquire chain was shown in
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
on
\cpageref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
but these chains can be much longer, as shown in
\cref{lst:memorder:Long LB Release-Acquire Chain}
(\path{C-LB+a-r+a-r+a-r+a-r.litmus}).
The longer the release-acquire chain, the more ordering is gained
from the passage of time, so that no matter how many threads are
involved, the corresponding \co{exists} clause cannot trigger.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+a-r+a-r+a-r+a-r@whole.fcv}
\caption{Long LB Release-Acquire Chain}
\label{lst:memorder:Long LB Release-Acquire Chain}
\end{listing}
Although release-acquire chains are inherently store-to-load creatures,
it turns out that they can tolerate one load-to-store step, despite
such steps being counter-temporal, as shown in
\cref{fig:memorder:Load-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Load-to-Store is Counter-Temporal}.
For example,
\cref{lst:memorder:Long ISA2 Release-Acquire Chain}
(\path{C-ISA2+o-r+a-r+a-r+a-o.litmus})
shows a three-step release-acquire chain, but where \co{P3()}'s
final access is a \co{READ_ONCE()} from \co{x0}, which is
accessed via \co{WRITE_ONCE()} by \co{P0()}, forming a non-temporal
load-to-store link between these two processes.
\begin{fcvref}[ln:formal:litmus:C-ISA2+o-r+a-r+a-r+a-o:whole]
However, because \co{P0()}'s \co{smp_store_release()} (\clnref{P0:rel})
is cumulative, if \co{P3()}'s \co{READ_ONCE()} returns zero,
this cumulativity will force the \co{READ_ONCE()} to be ordered
before \co{P0()}'s \co{smp_store_release()}.
In addition, the release-acquire chain
(\clnref{P0:rel,P1:acq,P1:rel,P2:acq,P2:rel,P3:acq})
forces \co{P3()}'s \co{READ_ONCE()} to be ordered after \co{P0()}'s
\co{smp_store_release()}.
Because \co{P3()}'s \co{READ_ONCE()} cannot be both before and after
\co{P0()}'s \co{smp_store_release()}, either or both of two things must
be true:
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-ISA2+o-r+a-r+a-r+a-o@whole.fcv}
\caption{Long ISA2 Release-Acquire Chain}
\label{lst:memorder:Long ISA2 Release-Acquire Chain}
\end{listing}
\begin{enumerate}
\item \co{P3()}'s \co{READ_ONCE()} came after \co{P0()}'s
\co{WRITE_ONCE()}, so that the \co{READ_ONCE()} returned
the value two, so that the \co{exists} clause's \co{3:r2=0}
is false.
\item The release-acquire chain did not form, that is, one or more
of the \co{exists} clause's \co{1:r2=2}, \co{2:r2=2}, or \co{3:r1=2}
is false.
\end{enumerate}
Either way, the \co{exists} clause cannot trigger, despite this litmus
test containing a notorious load-to-store link between
\co{P3()} and \co{P0()}.
But never forget that release-acquire chains can tolerate only one
load-to-store link, as was seen in
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}.
\begin{listing}
\input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-r+a-r+a-o@whole.fcv}
\caption{Long Z6.2 Release-Acquire Chain}
\label{lst:memorder:Long Z6.2 Release-Acquire Chain}
\end{listing}
Release-acquire chains can also tolerate a single store-to-store step,
as shown in
\cref{lst:memorder:Long Z6.2 Release-Acquire Chain}
(\path{C-Z6.2+o-r+a-r+a-r+a-o.litmus}).
\begin{fcvref}[ln:formal:C-Z6.2+o-r+a-r+a-r+a-o:whole]
As with the previous example, \co{smp_store_release()}'s cumulativity
combined with the temporal nature of the release-acquire chain
prevents the \co{exists} clause on \clnref{exists} from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-o+o-mb-o@whole.fcv}
\caption{Z6.2 Release-Acquire Chain (Ordering?)}
\label{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}
\end{listing}
\QuickQuiz{
Suppose we have a short release-acquire chain along with one
load-to-store link and one store-to-store link, like that shown in
\cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}.
Given that there is only one of each type of non-store-to-load
	link, the \co{exists} clause cannot trigger, right?
}\QuickQuizAnswer{
Wrong.
It is the number of non-store-to-load links that matters.
If there is only one non-store-to-load link, a release-acquire
chain can prevent the \co{exists} clause from triggering.
However, if there is more than one non-store-to-load link,
be they store-to-store, load-to-store, or any combination
thereof, it is necessary to have at least one full barrier
(\co{smp_mb()} or better) between each non-store-to-load link.
In
\cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)},
preventing the \co{exists} clause from triggering therefore requires
an additional full barrier between either \co{P0()}'s or
\co{P1()}'s accesses.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-r+a-o@whole.fcv}
\caption{A Release-Acquire Chain Ordering Multiple Accesses}
\label{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/litmus/C-MPO+o-r+a-o+o@whole.fcv}
\caption{A Release-Acquire Chain With Added Store (Ordering?)}
\label{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
\end{listing}
But beware:
Adding a second store-to-store link allows the correspondingly updated
\co{exists} clause to trigger.
To see this, review
\cref{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses,%
lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)},
which have identical \co{P0()} and \co{P1()} processes.
The only code difference is that
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
has an additional \co{P2()} that does an \co{smp_store_release()} to
the \co{x2} variable that \co{P0()} releases and \co{P1()} acquires.
The \co{exists} clause is also adjusted to exclude executions in which
\co{P2()}'s \co{smp_store_release()} precedes that of \co{P0()}.
Running the litmus test in
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
shows that the addition of \co{P2()} can totally destroy the
ordering from the release-acquire chain.
Therefore, when constructing release-acquire chains, please take care
to construct them properly.
\QuickQuiz{
There are store-to-load links, load-to-store links, and
store-to-store links.
But what about load-to-load links?
}\QuickQuizAnswer{
The problem with the concept of load-to-load links is that
if the two loads from the same variable return the same
value, there is no way to determine their ordering.
The only way to determine their ordering is if they return
different values, in which case there had to have been an
intervening store.
And that intervening store means that there is no load-to-load
link, but rather a load-to-store link followed by a
store-to-load link.
}\QuickQuizEnd
In short, properly constructed release-acquire chains form a peaceful
(if rather tightly constrained)
island of intuitive bliss surrounded by a strongly counter-intuitive
sea of more complex memory-ordering constraints.
\subsection{A Counter-Intuitive Case Study}
\label{sec:memorder:A Counter-Intuitive Case Study}
This section will revisit
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
on \cpageref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)},
which was presented in the answer to
\QuickQuizARef{\MemorderQQLitmusTestR}.
This litmus test has only two threads, with the stores in \co{P0()}
being ordered by \co{smp_wmb()} and the accesses in \co{P1()} being
ordered by \co{smp_mb()}.
Despite this litmus test's small size and heavy ordering, the
counter-intuitive outcome shown in the \co{exists} clause is in fact
allowed.
One way to look at this was presented in the answer to
\QuickQuizARef{\MemorderQQLitmusTestR}, namely that the link from
\co{P0()} to \co{P1()} is a store-to-store link, and that back
from \co{P1()} to \co{P0()} is a load-to-store link.
Both links are counter-temporal, thus requiring full memory barriers
in both processes.
Revisiting
\cref{fig:memorder:Store-to-Store is Counter-Temporal,%
fig:memorder:Store-to-Load is Temporal}
shows that these counter-temporal links give the hardware considerable
latitude.
But that raises the question of exactly how hardware would go about using
this latitude to satisfy the \co{exists} clause in
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}.
There is no known ``toy'' hardware implementation that can do this, so
let us instead study the sequence of steps that the PowerPC architecture
goes through to make this happen.
The first step in this study is to translate
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
to a PowerPC assembly language litmus test
(\cref{sec:formal:Anatomy of a Litmus Test} on
\cpageref{sec:formal:Anatomy of a Litmus Test}):
\begin{fcvlabel}[ln:memorder:inline:ppcasmr]
\begin{VerbatimN}[samepage=true,commandchars=\@\[\]]
PPC R+lwsync+sync
{
0:r1=1; 0:r2=x; 0:r4=y; @lnlbl[init0]
1:r1=2; 1:r2=y; 1:r4=x; @lnlbl[init1]
}
P0 | P1 ; @lnlbl[procs]
stw r1,0(r2) | stw r1,0(r2) ; @lnlbl[stores]
lwsync | sync ; @lnlbl[barriers]
stw r1,0(r4) | lwz r3,0(r4) ; @lnlbl[storeload]
exists (y=2 /\ 1:r3=0) @lnlbl[exists]
\end{VerbatimN}
\end{fcvlabel}
\begin{fcvref}[ln:memorder:inline:ppcasmr]
The first line identifies the type of test (\co{PPC}) and gives
the test's name.
\Clnref{init0,init1} initialize \co{P0()}'s and \co{P1()}'s registers,
respectively.
\Clnrefrange{procs}{storeload} show the PowerPC assembly
statements corresponding to the C code from
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)},
with the first column being the code for \co{P0()} and the second column
being the code for \co{P1()}.
\Clnref{stores} shows the initial \co{WRITE_ONCE()} calls in both columns;
the columns of \clnref{barriers} show the \co{smp_wmb()} and \co{smp_mb()}
for \co{P0()} and \co{P1()}, respectively;
the columns of \clnref{storeload} show \co{P0()}'s \co{WRITE_ONCE()} and
\co{P1()}'s \co{READ_ONCE()}, respectively;
and finally \clnref{exists} shows the \co{exists} clause.
\end{fcvref}
In order for this \co{exists} clause to be satisfied, \co{P0()}'s
\co{stw} to \co{y} must precede that of \co{P1()}, but \co{P1()}'s
later \co{lwz} from \co{x} must precede \co{P0()}'s \co{stw} to \co{x}.
Seeing how this can happen requires a rough understanding of the
following PowerPC terminology.
\begin{description}[style=nextline]
\item[Instruction commit:]
This can be thought of as the execution of that instruction as opposed
to the memory-system consequences of having executed that instruction.
\item[Write reaching coherence point:]
This can be thought of as the value written being deposited into the
corresponding cache line.
\item[Partial coherence commit:]
This can be thought of as the system having worked out the order in which
a pair of values written will be deposited into the corresponding cache
line, but potentially well before that cache line arrives.
Some might argue that the data in
\cref{fig:memorder:A Variable With More Simultaneous Values}
suggests that real PowerPC hardware does in fact use partial coherence
commits to handle concurrent stores by multiple hardware threads within
a single core.
\item[Write propagate to thread:]
This occurs when a second hardware thread becomes aware of the first
hardware thread's write.
The time at which a write propagates to a given thread might not have
any relation to cache-line movement.
For example, if a pair of threads share a store buffer, they might see
each others' writes long before the cache line gets involved.
On the other hand, if a pair of hardware threads are widely separated,
the first thread's write's value might have been deposited into the
corresponding cache line long before the second thread learns of that
write.
\item[Barrier propagate to thread:]
Hardware threads make each other aware of memory-barrier instructions
as needed by propagating them to each other.
\item[Acknowledge \tco{sync}:]
The PowerPC \co{sync} instruction implements the Linux kernel's
\co{smp_mb()} full barrier.
And one reason that the \co{sync} instruction provides such strong
ordering is that each \co{sync} is not only propagated to other hardware
threads, but these other threads must also acknowledge each \co{sync}.
This two-way communication allows the hardware threads to cooperate
to produce the required strong global ordering.
\end{description}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM0.png}}
\caption{PPCMEM Initial R State}
\label{fig:memorder:PPCMEM Initial R State}
\end{figure*}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM1.png}}
\caption{PPCMEM First R Step}
\label{fig:memorder:PPCMEM First R Step}
\end{figure*}
We are now ready to step through the PowerPC sequence of events that
satisfies the above \co{exists} clause.
To best understand this, please follow along at
\url{https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html},
carefully copying the above assembly-language litmus test into the pane.
The result should look as shown in
\cref{fig:memorder:PPCMEM Initial R State}, give or take space characters.
Click on the ``Interactive'' button in the lower left, which, after a
short delay, should produce a display as shown in
\cref{fig:memorder:PPCMEM First R Step}.
If the ``Interactive'' button refuses to do anything, this usually means
that there is a syntax error, for example, a spurious newline character
might have been introduced during the copy-paste operation.
This display has one clickable link in each section displaying thread
state, and as the ``Commit'' in each link suggests, these links commit
each thread's first \co{stw} instruction.
If you prefer, you can instead click on the corresponding links listed
under ``Enabled transitions'' near the bottom of the screen.
Note well that some of the later memory-system transitions will appear
in the upper ``Storage subsystem state'' section of this display.
The following sequence of clicks demonstrates how the \co{exists} clause
can be satisfied:
\begin{enumerate}
\item Commit \co{P0()}'s first \co{stw} instruction (to \co{x}).
\item Commit \co{P1()}'s \co{stw} instruction.
\item Commit \co{P0()}'s \co{lwsync} instruction.
\item Commit \co{P0()}'s second \co{stw} instruction (to \co{y}).
\item Commit \co{P1()}'s \co{sync} instruction.
\item At this point, there should be no clickable links in either of
the two sections displaying thread state, but there should be
quite a few of them up in the ``Storage subsystem state''.
The following steps tell you which of them to click on.
\item \co{Partial coherence commit: c:W y=1 -> d:W y=2}.
This commits the system to processing \co{P0()}'s store to
\co{y} before \co{P1()}'s store even though neither store
has reached either the coherence point or any other thread.
One might imagine partial coherence commits happening within a
store buffer that is shared by multiple hardware threads
that are writing to the same variable.
\item \co{Write propagate to thread: d:W y=2 to Thread 0}.
This is necessary to allow \co{P1()}'s \co{sync} instruction
to propagate to \co{P0()}.
\item \co{Barrier propagate to thread: e:Sync to Thread 0}.
\item \co{Write reaching coherence point: a:W x=1}.
\item \co{Write reaching coherence point: c:W y=1}.
\item \co{Write reaching coherence point: d:W y=2}.
These three operations were required in order to allow \co{P0()}
to acknowledge \co{P1()}'s \co{sync} instruction.
\item \co{Acknowledge sync: Sync e:Sync}.
\item Back down in thread \co{P1()}'s state, click on \co{Read i:W
x=0}, which loads the value zero, thus satisfying the \co{exists}
clause.
All that remains is cleanup, which can be carried out in any order.
\item Commit \co{P1()}'s \co{lwz} instruction.
\item \co{Write propagate to thread: a:W x=1 to Thread 1}.
\item \co{Barrier propagate to thread: b:Lwsync to Thread 1}.
\end{enumerate}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEMfinal.png}}
\caption{PPCMEM Final R State}
\label{fig:memorder:PPCMEM Final R State}
\end{figure*}
At this point, you should see something like
\cref{fig:memorder:PPCMEM Final R State}.
Note that the satisfied \co{exists} clause is shown in blue near the
bottom, confirming that this counter-intuitive outcome really can happen.
If you wish, you can click on ``Undo'' to explore other options or
click on ``Reset'' to start over.
It can be very helpful to carry out these steps in different orders
to better understand how a
\IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic}
architecture operates.
\QuickQuiz{
What happens if that \co{lwsync} instruction is instead a
\co{sync} instruction?
}\QuickQuizAnswer{
The counter-intuitive outcome cannot happen.
(Try it!)
}\QuickQuizEnd
Although a full understanding of how this counter-intuitive outcome
happens would require hardware details that are beyond the scope of
this book, this exercise should provide some helpful intuitions.
Or perhaps more accurately, destroy some counter-productive intuitions.
% @@@ Exercises?
% @@@ Hardware details from Appendix?
\section{Compile-Time Consternation}
\label{sec:memorder:Compile-Time Consternation}
%
\epigraph{Science increases our power in proportion as it lowers our pride.}
{Claude Bernard}
Most languages, including C, were developed on uniprocessor systems
by people with little or no parallel-programming experience.
As a result, unless explicitly told otherwise, these languages assume
that the current CPU is the only thing that is reading or writing memory.
This in turn means that these languages' compilers' optimizers
are ready, willing, and oh so able to make dramatic changes to the
order, number, and sizes of memory references that your program
executes.
In fact, the reordering carried out by hardware can seem quite tame
by comparison.
This section will help you tame your compiler, thus avoiding a great
deal of compile-time consternation.
\Cref{sec:memorder:Memory-Reference Restrictions}
describes how to keep the compiler from destructively optimizing
your code's memory references,
\cref{sec:memorder:Address- and Data-Dependency Difficulties}
describes how to protect address and data dependencies,
and finally,
\cref{sec:memorder:Control-Dependency Calamities}
describes how to protect those delicate control dependencies.
\subsection{Memory-Reference Restrictions}
\label{sec:memorder:Memory-Reference Restrictions}
As noted in \cref{sec:toolsoftrade:Accessing Shared Variables},
unless told otherwise, compilers assume that nothing else
is affecting the variables that the code is accessing.
Furthermore, this assumption is not simply some design error, but is
instead enshrined in various standards.\footnote{
Or perhaps it is a standardized design error.}
It is worth summarizing this material in preparation for the following
sections.
\IXplx{Plain access}{es}, as in plain-access C-language assignment statements such
as \qco{r1 = a} or \qco{b = 1}, are subject to the
shared-variable shenanigans described in
\cref{sec:toolsoftrade:Shared-Variable Shenanigans}.
Ways of avoiding these shenanigans are described in
\crefrange{sec:toolsoftrade:A Volatile Solution}{sec:toolsoftrade:Avoiding Data Races}
starting on
\cpageref{sec:toolsoftrade:A Volatile Solution}:
\begin{enumerate}
\item Plain accesses can tear, for example, the compiler could choose
to access an eight-byte pointer one byte at a time.
Tearing of aligned machine-sized accesses can be prevented by
using \co{READ_ONCE()} and \co{WRITE_ONCE()}.
\item Plain loads can fuse, for example, if the results of an earlier
load from that same object are still in a machine register,
the compiler might opt to reuse the value in that register
instead of reloading from memory.
Load fusing can be prevented by using \co{READ_ONCE()} or by
enforcing ordering between the two loads using \co{barrier()},
\co{smp_rmb()}, and other means shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
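	A brief sketch of this load-fusing hazard and of the
	\co{READ_ONCE()} fix appears just after this list.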
\item Plain stores can fuse, so that a store can be omitted entirely
if there is a later store to that same variable.
Store fusing can be prevented by using \co{WRITE_ONCE()} or by
enforcing ordering between the two stores using \co{barrier()},
\co{smp_wmb()}, and other means shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
\item Plain accesses can be reordered in surprising ways by modern
optimizing compilers.
This reordering can be prevented by enforcing ordering as
called out above.
\item Plain loads can be invented, for example, register pressure might
cause the compiler to discard a previously loaded value from
its register, and then reload it later on.
Invented loads can be prevented by using \co{READ_ONCE()} or by
enforcing ordering as called out above between the load and a
later use of its value using \co{barrier()}.
\item Stores can be invented before a plain store, for example, by
using the stored-to location as temporary storage.
This can be prevented by use of \co{WRITE_ONCE()}.
\item Stores can be transformed into a load-check-store sequence,
which can defeat control dependencies.
This can be prevented by use of \co{smp_load_acquire()}.
\end{enumerate}
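
To make the load-fusing hazard concrete, consider the following sketch,
which is not taken from any real code base and which assumes a flag
named \co{need_to_stop} that some other thread sets to a nonzero value.
The compiler is permitted to fuse the plain loads in \co{wait_plain()}
into a single load hoisted out of the loop, so that the loop spins
forever on a stale register value, while the \co{READ_ONCE()} in
\co{wait_marked()} forces a fresh load from memory on each pass through
the loop.

\begin{VerbatimN}[samepage=true]
int need_to_stop;   /* Assumed flag, set by some other thread. */

void wait_plain(void)
{
        /* Plain loads may be fused into one load hoisted out of
         * the loop, which then spins on a stale register value. */
        while (!need_to_stop)
                continue;
}

void wait_marked(void)
{
        /* READ_ONCE() forces a fresh load on every iteration. */
        while (!READ_ONCE(need_to_stop))
                continue;
}
\end{VerbatimN}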
\QuickQuiz{
Why not place a \co{barrier()} call immediately before
a plain store to prevent the compiler from inventing stores?
}\QuickQuizAnswer{
Because it would not work.
Although the compiler would be prevented from inventing a
store prior to the \co{barrier()}, nothing would prevent
it from inventing a store between that \co{barrier()} and
the plain store.
}\QuickQuizEnd
Please note that all of these shared-memory shenanigans can instead be
avoided by avoiding \IXpl{data race} on plain accesses, as described in
\cref{sec:toolsoftrade:Avoiding Data Races}.
After all, if there are no data races, then each and every one of the
compiler optimizations mentioned above is perfectly safe.
But for code containing data races, this list is subject to change
without notice as compiler optimizations continue becoming increasingly
aggressive.
In short, use of \co{READ_ONCE()}, \co{WRITE_ONCE()}, \co{barrier()},
\co{volatile}, and other primitives called out in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
on
\cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
are valuable tools in preventing the compiler from
optimizing your parallel algorithm out of existence.
Compilers are starting to provide other mechanisms for avoiding
load and store tearing, for example, \co{memory_order_relaxed}
atomic loads and stores; however, work is still
needed~\cite{JonathanCorbet2016C11atomics}.
In addition, compiler issues aside, \co{volatile} is still needed
to avoid fusing and invention of accesses, including C11 atomic accesses.
Please note that it is possible to overdo the use of \co{READ_ONCE()} and
\co{WRITE_ONCE()}.
For example, if you have prevented a given variable from changing
(perhaps by holding the lock guarding all updates to that
variable), there is no point in using \co{READ_ONCE()}.
Similarly, if you have prevented any other CPUs or threads from
reading a given variable (perhaps because you are initializing
that variable before any other CPU or thread has access to it),
there is no point in using \co{WRITE_ONCE()}.
However, in my experience, developers need to use things like
\co{READ_ONCE()} and \co{WRITE_ONCE()} more often than they think that
they do, and the overhead of unnecessary uses is quite low.
In contrast, the penalty for failing to use them when needed can be quite high.
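
For example, the following sketch, which uses made-up names and
kernel-style primitives, shows one case of each:
\co{read_while_locked()} holds the lock that guards all updates to
\co{myvalue}, so its plain load suffices, and \co{init_myobj()} runs
before its object has been made visible to any other thread, so its
plain store suffices, provided that the object is later published with
appropriate ordering, for example, via \co{rcu_assign_pointer()} or
\co{smp_store_release()}.

\begin{VerbatimN}[samepage=true]
struct myobj {
        int myfield;
};

DEFINE_SPINLOCK(mylock);
int myvalue;   /* Updated only while holding mylock. */

int read_while_locked(void)
{
        int ret;

        spin_lock(&mylock);
        ret = myvalue;   /* No concurrent updates, so no READ_ONCE(). */
        spin_unlock(&mylock);
        return ret;
}

void init_myobj(struct myobj *p)   /* p not yet visible elsewhere. */
{
        p->myfield = 42;   /* So no WRITE_ONCE() needed here. */
}
\end{VerbatimN}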
\subsection{Address- and Data-Dependency Difficulties}
\label{sec:memorder:Address- and Data-Dependency Difficulties}
\OriginallyPublished{sec:memorder:Address- and Data-Dependency Difficulties}{Address- and Data-Dependency Difficulties}{the Linux kernel}{PaulEMcKenney2014rcu-dereference}
The low overheads of the \IXalth{address}{address}{dependency}
and \IXalth{data dependencies}{data}{dependency} discussed in
\cref{sec:memorder:Address Dependencies,%
sec:memorder:Data Dependencies},
respectively, make their use extremely attractive.
Unfortunately, compilers do not understand either address or data
dependencies, although there are efforts underway to teach them, or at
the very least, standardize the process of teaching
them~\cite{PaulEMcKennneyConsumeP0190R4,PaulEMcKenney2017markconsumeP0462R1}.
In the meantime, it is necessary to be very careful in order to prevent
your compiler from breaking your dependencies.
\subsubsection{Give your dependency chain a good start}
The load that heads your dependency chain must use proper
ordering, for example \co{rcu_dereference()} or \co{READ_ONCE()}.
Failure to follow this rule can have serious side effects:
\begin{enumerate}
\item On DEC Alpha, a dependent load might not be ordered with
the load heading the dependency chain, as described in
\cref{sec:memorder:Alpha}.
\item If the load heading the dependency chain is a
C11 non-volatile \co{memory_order_relaxed} load,
the compiler could omit the load, for example, by using a value
that it loaded in the past.
\item If the load heading the dependency chain is a plain load,
the compiler can omit the load, again by using a value
that it loaded in the past.
Worse yet, it could load twice instead of once, so that
different parts of your code use different values---and
compilers really do this, especially when under register
pressure.
\item The value loaded by the head of the dependency chain must
be a pointer.
In theory, yes, you could load an integer, perhaps to use
it as an array index.
In practice, the compiler knows too much about integers,
and thus has way too many opportunities to break your
dependency chain~\cite{PaulEMcKennneyConsumeP0190R4}.
\end{enumerate}
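
Putting this rule into practice, the following sketch (with an assumed
\co{struct foo}, an assumed global pointer \co{gp} that is published
elsewhere via \co{rcu_assign_pointer()}, and a field \co{->a} that is
assumed not to change after publication) heads its dependency chain
with \co{rcu_dereference()}.
The resulting address dependency orders the access to \co{p->a} after
the load from \co{gp} without any explicit reader-side memory barrier.

\begin{VerbatimN}[samepage=true]
struct foo {
        int a;
};
struct foo *gp;   /* Published via rcu_assign_pointer(). */

int read_foo_a(void)
{
        struct foo *p;
        int ret = -1;   /* Assumed "not present" return value. */

        rcu_read_lock();
        p = rcu_dereference(gp);   /* Marked load heads the chain. */
        if (p)
                ret = p->a;        /* Dependent access, ordered    */
        rcu_read_unlock();         /* after the load from gp.      */
        return ret;
}
\end{VerbatimN}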
\subsubsection{Avoid arithmetic dependency breakage}
Although it is just fine to do some arithmetic operations on a pointer in
your dependency chain, you need to be careful to avoid giving the
compiler too much information.
After all, if the compiler learns enough to determine the exact value
of the pointer, it can use that exact value instead of the pointer itself.
As soon as the compiler does that, the dependency is broken and all
ordering is lost.
\begin{listing}
\begin{fcvlabel}[ln:memorder:Breakable Dependencies With Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int reserve_int;
int *gp;
int *p;
p = rcu_dereference(gp);
if (p == &reserve_int) \lnlbl[cmp]
handle_reserve(p); \lnlbl[handle]
do_something_with(*p); /* buggy! */
\end{VerbatimL}
\end{fcvlabel}
\caption{Breakable Dependencies With Comparisons}
\label{lst:memorder:Breakable Dependencies With Comparisons}
\end{listing}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int reserve_int;
int *gp;
int *p;
p = rcu_dereference(gp); \lnlbl[deref1]
if (p == &reserve_int) {
handle_reserve(&reserve_int);
do_something_with(reserve_int); /* buggy! */ \lnlbl[deref2]
} else {
do_something_with(*p); /* OK! */
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Comparisons}
\label{lst:memorder:Broken Dependencies With Comparisons}
\end{listing}
\begin{enumerate}
\item Although it is permissible to compute offsets from a
pointer, these offsets must not result in total cancellation.
For example, given a \co{char} pointer \co{cp},
\co{cp-(uintptr_t)cp} will cancel and can allow the compiler
to break your dependency chain.
On the other hand, canceling offset values with each other
is perfectly safe and legal.
For example, if \co{a} and \co{b} are equal, \co{cp+a-b}
is an identity function, including preserving the dependency.
\item Comparisons can break dependencies.
\Cref{lst:memorder:Breakable Dependencies With Comparisons}
shows how this can happen.
Here global pointer \co{gp} points to a dynamically allocated
integer, but if memory is low, it might instead point to
the \co{reserve_int} variable.
\begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons]
This \co{reserve_int} case might need special handling, as
shown on \clnref{cmp,handle} of the listing.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Comparisons]
But the compiler could reasonably transform this code into
the form shown in
\cref{lst:memorder:Broken Dependencies With Comparisons},
especially on systems where instructions with absolute
addresses run faster than instructions using addresses
supplied in registers.
However, there is clearly no ordering between the pointer
load on \clnref{deref1} and the dereference on \clnref{deref2}.
Please note that this is simply an example:
There are a great many other ways to break dependency chains
with comparisons.
\end{fcvref}
\end{enumerate}
\QuickQuizSeries{%
\QuickQuizB{
\begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons]
Why can't you simply dereference the pointer before comparing it
to \co{&reserve_int} on \clnref{cmp} of
\cref{lst:memorder:Breakable Dependencies With Comparisons}?
\end{fcvref}
}\QuickQuizAnswerB{
	First, it might be necessary to invoke
\co{handle_reserve()} before \co{do_something_with()}.
But more relevant to memory ordering, the compiler is often within
its rights to hoist the comparison ahead of the dereferences,
which would allow the compiler to use \co{&reserve_int} instead
of the variable \co{p} that the hardware has tagged with
a dependency.
}\QuickQuizEndB
%
\QuickQuizE{
But it should be safe to compare two pointer variables, right?
After all, the compiler doesn't know the value
of either, so how can it possibly learn anything from the
comparison?
}\QuickQuizAnswerE{
%
\begin{listing}
\begin{fcvlabel}[ln:memorder:Breakable Dependencies With Non-Constant Comparisons]
\begin{VerbatimL}
int *gp1;
int *p;
int *q;
p = rcu_dereference(gp1);
q = get_a_pointer();
if (p == q)
handle_equality(p);
do_something_with(*p);
\end{VerbatimL}
\end{fcvlabel}
\caption{Breakable Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}
\end{listing}%
%
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Non-Constant Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int *gp1;
int *p;
int *q;
p = rcu_dereference(gp1); \lnlbl[p]
q = get_a_pointer();
if (p == q) {
handle_equality(q);
do_something_with(*q); \lnlbl[q]
} else {
do_something_with(*p);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Broken Dependencies With Non-Constant Comparisons}
\end{listing}%
%
Unfortunately, the compiler really can learn enough to
break your dependency chain, for example, as shown in
\cref{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}.
The compiler is within its rights to transform this code
into that shown in
\cref{lst:memorder:Broken Dependencies With Non-Constant Comparisons},
and might well make this transformation due to register pressure
if \co{handle_equality()} was inlined and needed a lot of registers.
\begin{fcvref}[ln:memorder:Broken Dependencies With Non-Constant Comparisons]
\Clnref{q} of this transformed code uses \co{q}, which although
equal to \co{p}, is not necessarily tagged by the hardware as
carrying a dependency.
Therefore, this transformed code does not necessarily guarantee
that \clnref{q} is ordered after \clnref{p}.\footnote{
Kudos to \ppl{Linus}{Torvalds} for providing this example.}
\end{fcvref}
}\QuickQuizEndE
}
Note that a series of inequality comparisons might, when taken together,
give the compiler enough information to determine the exact value of
the pointer, at which point the dependency is broken.
Furthermore, the compiler might be able to combine information from
even a single inequality comparison with other information to learn
the exact value, again breaking the dependency.
Pointers to elements in arrays are especially susceptible to this latter
form of dependency breakage.
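
For example, consider the following sketch, which assumes a two-element
array \co{arr[]} and further assumes that the compiler can somehow see
(perhaps via whole-program or link-time analysis) that \co{gp} always
references one of \co{arr[]}'s elements:

\begin{VerbatimN}[samepage=true]
struct foo arr[2];   /* Assumed two-element array. */
struct foo *gp;      /* Assumed to always point into arr[]. */
struct foo *p;

p = rcu_dereference(gp);
if (p != &arr[0])                  /* Compiler may deduce that p    */
        do_something_with(p->a);   /* equals &arr[1], use that      */
                                   /* constant address, and thereby */
                                   /* break the dependency chain.   */
\end{VerbatimN}

Any dereference that the compiler carries out through the deduced
constant address is no longer ordered after the \co{rcu_dereference()}.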
\subsubsection{Safe comparison of dependent pointers}
It turns out that there are several safe ways to compare dependent
pointers:
\begin{enumerate}
\item Comparisons against the \co{NULL} pointer.
In this case, all the compiler can learn is that the pointer
is \co{NULL}, in which case you are not allowed to
dereference it anyway.
\item The dependent pointer is never dereferenced, whether before or
after the comparison.
\item The dependent pointer is compared to a pointer that references
objects that were last modified a very long time ago, where
the only unconditionally safe value of ``a very long time ago'' is
``at compile time''.
The key point is that something other than the address or data
dependency guarantees ordering.
\item Comparisons between two pointers, each of which carries
an appropriate dependency.
For example, you have a pair of pointers, each carrying a
dependency, to data structures each containing a lock, and you
want to avoid \IX{deadlock} by acquiring the locks in address order.
\item The comparison is not-equal, and the compiler does not have
enough other information to deduce the value of the
pointer carrying the dependency.
\end{enumerate}
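For example, the \co{NULL}-pointer rule permits the familiar
check-then-dereference idiom, as in the following minimal sketch, which
assumes an RCU-protected global pointer \co{gp}:
\begin{VerbatimN}
struct foo *p;

p = rcu_dereference(gp);
if (p == NULL)            /* Compiler learns only that p is NULL here, */
	return;           /* and NULL may not be dereferenced anyway.  */
do_something_with(p->a);  /* So the address dependency stays intact.   */
\end{VerbatimN}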
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Pointer Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
struct foo { \lnlbl[foo:b]
int a;
int b;
int c;
}; \lnlbl[foo:e]
struct foo *gp1; \lnlbl[gp1]
struct foo *gp2; \lnlbl[gp2]
void updater(void) \lnlbl[upd:b]
{
struct foo *p;
p = malloc(sizeof(*p)); \lnlbl[upd:alloc]
BUG_ON(!p); \lnlbl[upd:bug]
p->a = 42; \lnlbl[upd:init:a]
p->b = 43;
p->c = 44; \lnlbl[upd:init:c]
rcu_assign_pointer(gp1, p); \lnlbl[upd:assign1]
WRITE_ONCE(p->b, 143); \lnlbl[upd:upd:b]
WRITE_ONCE(p->c, 144); \lnlbl[upd:upd:c]
rcu_assign_pointer(gp2, p); \lnlbl[upd:assign2]
} \lnlbl[upd:e]
void reader(void) \lnlbl[read:b]
{
struct foo *p;
struct foo *q;
int r1, r2 = 0;
p = rcu_dereference(gp2); \lnlbl[read:gp2]
if (p == NULL) \lnlbl[read:nulchk]
return; \lnlbl[read:nulret]
r1 = READ_ONCE(p->b); \lnlbl[read:pb]
q = rcu_dereference(gp1); \lnlbl[read:gp1]
if (p == q) { \lnlbl[read:equ]
r2 = READ_ONCE(p->c); \lnlbl[read:pc]
}
do_something_with(r1, r2);
} \lnlbl[read:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Pointer Comparisons}
\label{lst:memorder:Broken Dependencies With Pointer Comparisons}
\end{listing}
Pointer comparisons can be quite tricky, and so it is well worth working
through the example shown in
\cref{lst:memorder:Broken Dependencies With Pointer Comparisons}.
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons]
This example uses a simple \co{struct foo} shown on \clnrefrange{foo:b}{foo:e}
and two global pointers, \co{gp1} and \co{gp2}, shown on \clnref{gp1,gp2},
respectively.
This example uses two threads, namely \co{updater()} on
\clnrefrange{upd:b}{upd:e} and \co{reader()} on \clnrefrange{read:b}{read:e}.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:upd]
The \co{updater()} thread allocates memory on \clnref{alloc}, and complains
bitterly on \clnref{bug} if none is available.
\Clnrefrange{init:a}{init:c} initialize the newly allocated structure,
and then \clnref{assign1} assigns the pointer to \co{gp1}.
\Clnref{upd:b,upd:c} then update two of the structure's fields, and do
so \emph{after} \clnref{assign1} has made those fields visible to readers.
Please note that unsynchronized update of reader-visible fields
often constitutes a bug.
Although there are legitimate use cases doing just this, such use cases
require more care than is exercised in this example.
Finally, \clnref{assign2} assigns the pointer to \co{gp2}.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
The \co{reader()} thread first fetches \co{gp2} on \clnref{gp2}, with
\clnref{nulchk,nulret} checking for \co{NULL} and returning if so.
\Clnref{pb} fetches field \co{->b} and
\clnref{gp1} fetches \co{gp1}.
If \clnref{equ} sees that the pointers fetched on \clnref{gp2,gp1}
are equal, \clnref{pc} fetches \co{p->c}.
Note that \clnref{pc} uses pointer \co{p} fetched on \clnref{gp2}, not
pointer \co{q} fetched on \clnref{gp1}.
But this difference might not matter.
An equals comparison on \clnref{equ} might lead the compiler to (incorrectly)
conclude that both pointers are equivalent, when in fact they carry
different dependencies.
This means that the compiler might well transform \clnref{pc} to instead
be \co{r2 = READ_ONCE(q->c)}, which might well cause the value 44 to be loaded
instead of the expected value 144.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
But doesn't the condition in \clnref{equ} supply a control dependency
that would keep \clnref{pc} ordered after \clnref{gp1}?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
Yes, but no.
Yes, there is a control dependency, but control dependencies do
not order later loads, only later stores.
If you really need ordering, you could place an \co{smp_rmb()}
between \clnref{equ,pc}.
Or better yet, have \co{updater()}
allocate two structures instead of reusing the structure.
For more information, see
\cref{sec:memorder:Control-Dependency Calamities}.
\end{fcvref}
}\QuickQuizEnd
In short, great care is required to ensure that dependency
chains in your source code are still dependency chains in the
compiler-generated assembly code.
\subsection{Control-Dependency Calamities}
\label{sec:memorder:Control-Dependency Calamities}
The \IXalth{control dependencies}{control}{dependency} described in
\cref{sec:memorder:Control Dependencies}
are attractive due to their low overhead, but are also especially
tricky because current compilers do not understand them and can easily
break them.
The rules and examples in this section are intended to help you
prevent your compiler's ignorance from breaking your code.
A load-load control dependency requires a full \IXh{read}{memory barrier},
not simply a data dependency barrier.
Consider the following bit of code:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
<data dependency barrier>
q = READ_ONCE(y);
}
\end{VerbatimN}
This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from~\co{y} as having happened before the load from~\co{x}.
In such a case what's actually required is:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
<read barrier>
q = READ_ONCE(y);
}
\end{VerbatimN}
However, stores are not speculated.
This means that ordering \emph{is} provided for load-store control
dependencies, as in the following example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
WRITE_ONCE(y, 1);
\end{VerbatimN}
Control dependencies pair normally with other types of ordering operations.
That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()}
are optional!
Without the \co{READ_ONCE()}, the compiler might fuse the load
from~\co{x} with other loads from~\co{x}.
Without the \co{WRITE_ONCE()}, the compiler might fuse the store
to~\co{y} with other stores to~\co{y}, or, worse yet, read the
value, compare it, and only conditionally do the store.
Any of these can result in highly counter-intuitive effects on ordering.
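For example, had the load-store example above used plain C-language
accesses, the compiler would have been within its rights to rewrite it
along the following lines (a purely hypothetical transformation, shown
only to illustrate the hazard):
\begin{VerbatimN}
q = x;           /* Plain load, which might also be fused    */
                 /* with other loads from x.                 */
if (q) {
	if (y != 1)  /* Compiler-invented load and check of y,   */
		y = 1;   /* so the store might never execute at all, */
}                /* with counter-intuitive effects on the    */
                 /* intended load-to-store ordering.         */
\end{VerbatimN}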
Worse yet, if the compiler is able to prove (say) that the value of
variable~\co{x} is always non-zero, it would be well within its rights
to optimize the original example by eliminating the \qco{if} statement
as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
\end{VerbatimN}
\QuickQuiz{
But there is a \co{READ_ONCE()}, so how can the compiler
prove anything about the value of \co{q}?
}\QuickQuizAnswer{
Given the simple \co{if} statement comparing against zero,
it is hard to imagine the compiler proving anything.
But suppose that later code executed a division by \co{q}.
Because division by zero is undefined behavior, as of 2023,
many compilers will assume that the value of \co{q} must
be non-zero, and will thus remove that \co{if} statement,
thus unconditionally executing the \co{WRITE_ONCE()}, in
turn destroying the control dependency.
There are some who argue (correctly, in Paul's view) that
back-propagating undefined behavior across volatile accesses
constitutes a compiler bug, but many compiler writers insist
that this is not a bug, but rather a valuable optimization.
}\QuickQuizEnd
It is tempting to try to enforce ordering on identical stores on both
branches of the \qco{if} statement as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
barrier();
WRITE_ONCE(y, 1);
do_something();
} else {
barrier();
WRITE_ONCE(y, 1);
do_something_else();
}
\end{VerbatimN}
Unfortunately, current compilers will transform this as follows at high
optimization levels:
\begin{VerbatimN}
q = READ_ONCE(x);
barrier();
WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
if (q)
do_something();
else
do_something_else();
\end{VerbatimN}
Now there is no conditional between the load from~\co{x} and the store
to~\co{y}, which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory-ordering operations, for example, a \IX{release store}:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
smp_store_release(&y, 1);
do_something();
} else {
smp_store_release(&y, 1);
do_something_else();
}
\end{VerbatimN}
The initial \co{READ_ONCE()} is still required to prevent the compiler from
guessing the value of~\co{x}.
In addition, you need to be careful what you do with the local variable~%
\co{q},
otherwise the compiler might be able to guess its value and again remove
the needed conditional.
For example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q % MAX) {
WRITE_ONCE(y, 1);
do_something();
} else {
WRITE_ONCE(y, 2);
do_something_else();
}
\end{VerbatimN}
If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 2);
do_something_else();
\end{VerbatimN}
Given this transformation, the CPU is not required to respect the ordering
between the load from variable~\co{x} and the store to variable~\co{y}.
It is tempting to add a \co{barrier()} to constrain the compiler,
but this does not help.
The conditional is gone, and the \co{barrier()} won't bring it back.
Therefore, if you are relying on this ordering, you should make sure
that \co{MAX} is greater than one, perhaps as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
BUILD_BUG_ON(MAX <= 1);
if (q % MAX) {
WRITE_ONCE(y, 1);
do_something();
} else {
WRITE_ONCE(y, 2);
do_something_else();
}
\end{VerbatimN}
Please note once again that the stores to~\co{y} differ.
If they were identical, as noted earlier, the compiler could pull this
store outside of the \qco{if} statement.
You must also avoid excessive reliance on boolean short-circuit evaluation.
Consider this example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q || 1 > 0)
WRITE_ONCE(y, 1);
\end{VerbatimN}
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 1);
\end{VerbatimN}
This example underscores the need to ensure that the compiler cannot
out-guess your code.
Never forget that, although \co{READ_ONCE()} does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the value loaded.
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do
not necessarily apply to code following the if-statement:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
WRITE_ONCE(y, 1);
else
WRITE_ONCE(y, 2);
WRITE_ONCE(z, 1); /* BUG: No ordering. */
\end{VerbatimN}
It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to~\co{y} with the condition.
Unfortunately for this line
of reasoning, the compiler might compile the two writes to~\co{y} as
conditional-move instructions, as in this fanciful pseudo-assembly
language:
\begin{VerbatimN}
ld r1,x
cmp r1,$0
cmov,ne r4,$1
cmov,eq r4,$2
st r4,y
st $1,z
\end{VerbatimN}
A weakly ordered CPU would have no dependency of any sort between the load
from~\co{x} and the store to~\co{z}.
The control dependencies would extend
only to the pair of \co{cmov} instructions and the store depending on them.
In short, control dependencies apply only to the stores in the \qco{then}
and \qco{else} of the \qco{if} in question (including functions invoked by
those two clauses), and not necessarily to code following that \qco{if}.
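For example, in the following minimal sketch, the hypothetical
\co{do_store()} helper is assumed to do nothing but execute
\co{WRITE_ONCE(y, 1)}:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
	do_store();        /* The WRITE_ONCE(y, 1) within do_store()    */
                       /* is ordered after the load from x.         */
WRITE_ONCE(z, 1);      /* This store after the "if" is not ordered. */
\end{VerbatimN}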
Finally, control dependencies do \emph{not} provide cumulativity.\footnote{
Refer to \cref{sec:memorder:Cumulativity} for
the meaning of cumulativity.}
This is demonstrated by two related litmus tests, namely
\cref{lst:memorder:LB Litmus Test With Control Dependency,%
lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
with the initial values
of~\co{x} and~\co{y} both being zero.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-cgt-o+o-cgt-o@whole.fcv}
\caption{LB Litmus Test With Control Dependency}
\label{lst:memorder:LB Litmus Test With Control Dependency}
\end{listing}
The \co{exists} clause in the two-thread example of
\cref{lst:memorder:LB Litmus Test With Control Dependency}
(\path{C-LB+o-cgt-o+o-cgt-o.litmus})
will never trigger.
If control dependencies guaranteed cumulativity (which they do
not), then adding a thread to the example as in
\cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
(\path{C-WWC+o-cgt-o+o-cgt-o+o.litmus})
would guarantee the related \co{exists} clause never to trigger.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WWC+o-cgt-o+o-cgt-o+o@whole.fcv}
\caption{WWC Litmus Test With Control Dependency (Cumulativity?)}
\label{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
\end{listing}
But because control dependencies do \emph{not} provide cumulativity, the
\co{exists} clause in the three-thread litmus test can trigger.
If you need the three-thread example to provide ordering, you will need
\co{smp_mb()} between the load and store in \co{P0()},
that is, just before or just after the \qco{if} statements.
Furthermore, the original two-thread example is very fragile and should be avoided.
\QuickQuiz{
Can't you instead add an \co{smp_mb()} to \co{P1()} in
\cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}?
}\QuickQuizAnswer{
Not given the Linux kernel memory model.
(Try it!)
However, you can instead replace \co{P0()}'s
\co{WRITE_ONCE()} with \co{smp_store_release()},
which usually has less overhead than does adding an \co{smp_mb()}.
}\QuickQuizEnd
The following list of rules summarizes the lessons of this section:
\begin{enumerate}
\item Compilers do not understand control dependencies, so it is
your job to make sure that the compiler cannot break your code.
\item Control dependencies can order prior loads against later stores.
However, they do \emph{not} guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
later anything.
If you need these other forms of ordering, use \co{smp_rmb()},
\co{smp_wmb()}, or, in the case of prior stores and later loads,
\co{smp_mb()}.
\item If both legs of the \qco{if} statement begin with identical stores
to the same variable, then the control dependency will not order
those stores.
If ordering is needed, precede both of them with \co{smp_mb()} or
use \co{smp_store_release()}.
Please note that it is \emph{not} sufficient to use \co{barrier()}
at the beginning of each leg of the \qco{if} statement because, as shown
by the example above, optimizing compilers can destroy the control
dependency while respecting the letter of the \co{barrier()} law.
\item Control dependencies require at least one run-time conditional
between the prior load and the subsequent store, and this
conditional must involve the prior load.
If the compiler is able to optimize the conditional away, it
will have also optimized away the ordering.
Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help
to preserve the needed conditional.
\item Control dependencies require that the compiler avoid reordering
the dependency into nonexistence.
Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or
\co{atomic64_read()} can help to preserve your control
dependency.
\item Control dependencies apply only to the \qco{then} and
\qco{else} of the \qco{if} containing the control
dependency, including any functions that these two clauses call.
Control dependencies do \emph{not} apply to code following the
end of the \qco{if} statement containing the control dependency.
\item Control dependencies pair normally with other types of
memory-ordering operations.
\item Control dependencies do \emph{not} provide cumulativity.
If you need cumulativity, use something that provides it,
such as \co{smp_store_release()} or \co{smp_mb()}.
\end{enumerate}
Again, many popular languages were designed with single-threaded use
in mind.
Successful multithreaded use of these languages requires you to pay
special attention to your memory references and dependencies.
\section{Higher-Level Primitives}
\label{sec:memorder:Higher-Level Primitives}
%
\epigraph{Method will teach you to win time.}
{Johann Wolfgang von Goethe}
The answer to one of the quick quizzes in
\cref{sec:formal:Axiomatic Approaches and Locking}
demonstrated exponential speedups due to verifying programs
modeled at higher levels of abstraction.
This section will look into how higher levels of abstraction can
also provide a deeper understanding of the synchronization primitives
themselves.
\Cref{sec:memorder:Memory Allocation}
takes a look at memory allocation,
\cref{sec:memorder:Locking}
examines the surprisingly varied semantics of locking, and
\cref{sec:memorder:RCU}
digs more deeply into RCU\@.
\subsection{Memory Allocation}
\label{sec:memorder:Memory Allocation}
\Cref{sec:SMPdesign:Parallel Fastpath for Resource Allocation}
touched upon memory allocation, and this section expands upon the relevant
memory-ordering issues.
The key requirement is that any access executed on a given block of
memory before freeing that block must be ordered before any access
executed after that same block is reallocated.
It would after all be a cruel and unusual memory-allocator bug if a store
preceding the free were to be reordered after another store following
the reallocation!
However, it would also be cruel and unusual to require developers to use
\co{READ_ONCE()} and \co{WRITE_ONCE()} to access dynamically allocated
memory.
Full ordering must therefore be provided for plain accesses, in spite of
all the shared-variable shenanigans called out in
\cref{sec:toolsoftrade:Shared-Variable Shenanigans}.
Of course, each CPU sees its own accesses in order, and the compiler
has always fully accounted for intra-CPU shenanigans, give or take
the occasional compiler bug.
These facts are what enables the lockless fastpaths in
\co{memblock_alloc()} and \co{memblock_free()}, which are shown in
\cref{lst:SMPdesign:Allocator-Cache Allocator Function,%
lst:SMPdesign:Allocator-Cache Free Function},
respectively.
However, this is also why the developer is responsible for providing
appropriate ordering (for example, by using \co{smp_store_release()})
when publishing a pointer to a newly allocated block of memory.
After all, in the CPU-local case, the allocator has not necessarily
provided any cross-CPU ordering.
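For example, publication might look something like the following
sketch, in which the global pointer \co{gp}, the \co{initialize_block()}
helper, and the \co{struct memblock} element type are illustrative
assumptions rather than part of the allocator of
\cref{sec:SMPdesign:Parallel Fastpath for Resource Allocation}:
\begin{VerbatimN}
struct memblock *p;

p = memblock_alloc();          /* Lockless fastpath: no cross-CPU  */
if (p) {                       /* ordering is provided here.       */
	initialize_block(p);       /* Plain accesses suffice ...       */
	smp_store_release(&gp, p); /* ... but publication must supply  */
}                              /* the ordering that readers need.  */
\end{VerbatimN}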
For its part, the allocator must provide ordering when rebalancing
its per-thread pools.
This ordering is provided by the calls to \co{spin_lock()} and
\co{spin_unlock()} from \co{memblock_alloc()} and \co{memblock_free()}.
For any block that has migrated from one thread to another, the old
thread will have executed \co{spin_unlock(&globalmem.mutex)} after
placing the block in the \co{globalmem} pool, and the new thread will
have executed \co{spin_lock(&globalmem.mutex)} before moving that
block to its per-thread pool.
This \co{spin_unlock()}/\co{spin_lock()} pair ensures that both the
old and new threads see the old thread's accesses as having happened
before those of the new thread.
\QuickQuiz{
But doesn't PowerPC have weak unlock-lock ordering properties
within the Linux kernel, allowing a write before the unlock to
be reordered with a read after the lock?
}\QuickQuizAnswer{
Yes, but only from the perspective of a third thread not holding
that lock.
In contrast, memory allocators need only concern themselves with
the two threads migrating the memory.
It is after all the developer's responsibility to properly
synchronize with any other threads that need access to the newly
migrated block of memory.
}\QuickQuizEnd
Therefore, the ordering required by conventional uses of memory allocation
can be provided solely by non-fastpath locking, allowing the fastpath to
remain synchronization-free.
\subsection{Locking}
\label{sec:memorder:Locking}
Locking is a well-known synchronization primitive with which the
parallel-programming community has had decades of experience.
As such, locking's semantics are quite simple.
That is, they are quite simple until you start trying to mathematically
model them.
The simple part is that any CPU or thread holding a given lock is
guaranteed to see any accesses executed by CPUs or threads while they
were previously holding that same lock.
Similarly, any CPU or thread holding a given lock is guaranteed not
to see accesses that will be executed by other CPUs or threads while
subsequently holding that same lock.
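For example, in the following sketch (which assumes a spinlock
\co{mylock} and a shared variable \co{x}), if \co{t1()} acquires
\co{mylock} after \co{t0()} has released it, then \co{t1()} is
guaranteed to observe \co{t0()}'s store:
\begin{VerbatimN}
void t0(void)
{
	spin_lock(&mylock);
	WRITE_ONCE(x, 1);
	spin_unlock(&mylock);
}

void t1(void)
{
	int r1;

	spin_lock(&mylock);    /* If this acquisition follows t0()'s */
	r1 = READ_ONCE(x);     /* release, r1 is guaranteed to be 1. */
	spin_unlock(&mylock);
}
\end{VerbatimN}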
And what else is there?
As it turns out, quite a bit:
\begin{enumerate}
\item Are CPUs, threads, or compilers allowed to pull memory accesses
into a given lock-based critical section?
\item Will a CPU or thread holding a given lock also be guaranteed
to see accesses executed by CPUs and threads before they last
acquired that same lock, and vice versa?
\item Suppose that a given CPU or thread executes one access
(call it ``A''), releases a lock, reacquires that same lock,
then executes another access (call it ``B'')\@.
Is some other CPU or thread not holding that lock guaranteed to
see A and B in order?
\item As above, but with the lock reacquisition carried out by some
other CPU or thread?
\item As above, but with the second acquisition being of some other lock?
\item What ordering guarantees are provided by \co{spin_is_locked()}?
\end{enumerate}
The reaction to some or even all of these questions might well be ``Why
would anyone do \emph{that}?''
However, any complete mathematical definition of locking must have
answers to all of these questions.
Therefore, the following sections address these questions in the context
of the Linux kernel.
\subsubsection{Accesses Into Critical Sections?}
\label{sec:memorder:Accesses Into Critical Sections?}
Can memory accesses be reordered into lock-based critical sections?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-before-into@whole.fcv}
\caption{Prior Accesses Into Critical Section (Ordering?)}
\label{lst:memorder:Prior Accesses Into Critical Section (Ordering?)}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-after-into@whole.fcv}
\caption{Subsequent Accesses Into Critical Section (Ordering?)}
\label{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)}
\end{listing}
Within the context of the Linux-kernel memory model, the simple answer
is ``yes''.
This may be verified by running the litmus tests shown in
\cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?),%
lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)}
(\path{C-Lock-before-into.litmus} and \path{C-Lock-after-into.litmus},
respectively), both of which will yield the \co{Sometimes} result.
This result indicates that the \co{exists} clause can be satisfied, that
is, that the final value of both \co{P0()}'s and \co{P1()}'s \co{r1} variable
can be zero.
This means that neither \co{spin_lock()} nor \co{spin_unlock()}
are required to act as a \IXh{full}{memory barrier}.
However, other environments might make other choices.
For example, locking implementations that run only on the x86 CPU
family will have lock-acquisition primitives that fully order the lock
acquisition with any prior and any subsequent accesses.
Therefore, on such systems the ordering shown in
\cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?)}
comes for free.
There are x86 lock-release implementations that are weakly ordered,
thus failing to provide the ordering shown in
\cref{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)},
but an implementation could nevertheless choose to guarantee this ordering.
For their part, weakly ordered systems might well choose to execute
the memory-barrier instructions required to guarantee both orderings,
possibly simplifying code making advanced use of combinations of locked
and lockless accesses.
However, as noted earlier, \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}
chooses not to provide these additional
orderings, in part to avoid imposing performance penalties on the simpler
and more prevalent locking use cases.
Instead, the \co{smp_mb__after_spinlock()} and \co{smp_mb__after_unlock_lock()}
primitives are provided for those more complex use cases, as discussed
in \cref{sec:memorder:Hardware Specifics}.
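For example, one common pattern (shown here as a minimal sketch with
assumed variables \co{a} and \co{b}) places
\co{smp_mb__after_spinlock()} immediately after the acquisition in
order to obtain full ordering:
\begin{VerbatimN}
WRITE_ONCE(a, 1);          /* Access preceding the critical section. */
spin_lock(&mylock);
smp_mb__after_spinlock();  /* Upgrade the acquisition to provide     */
r1 = READ_ONCE(b);         /* full ordering, so that this load is    */
spin_unlock(&mylock);      /* ordered after the store to a.          */
\end{VerbatimN}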
Thus far, this section has discussed only hardware reordering.
Can the compiler also reorder memory references into lock-based
critical sections?
The answer to this question in the context of the Linux kernel is a
resounding ``No!''
One reason for this otherwise inexplicable favoring of hardware reordering
over compiler optimizations is that the hardware will avoid reordering
a page-faulting access into a lock-based critical section.
In contrast, compilers have no clue about page faults, and would
therefore happily reorder a page fault into a critical section, which
could crash the kernel.
The compiler is also unable to reliably determine which accesses
will result in cache misses, so that compiler reordering into critical
sections could also result in excessive lock contention.
Therefore, the Linux kernel prohibits the compiler (but not the CPU)
from moving accesses into lock-based critical sections.
\subsubsection{Accesses Outside of Critical Section?}
\label{sec:memorder:Accesses Outside of Critical Section?}
If a given CPU or thread holds a given lock, it is guaranteed to see
accesses executed during all prior critical sections for that same
lock.
Similarly, such a CPU or thread is guaranteed not to see accesses
that will be executed during all subsequent critical sections for
that same lock.
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-outside-across@whole.fcv}
\caption{Accesses Outside of Critical Sections}
\label{lst:memorder:Accesses Outside of Critical Sections}
\end{listing}
But what about accesses preceding prior critical sections and
following subsequent critical sections?
This question can be answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Outside of Critical Sections}
(\path{C-Lock-outside-across.litmus}).
Running this litmus test yields the \co{Never} result,
which means that accesses in code leading up to a prior critical section
are also visible to the current CPU or thread holding that same lock.
Similarly, accesses in code placed after a subsequent critical section
are never visible to the current CPU or thread holding that same lock.
As a result, the Linux kernel cannot allow accesses to be moved
across the entirety of a given critical section.
Other environments might well wish to allow such code motion, but please
be advised that doing so is likely to yield profoundly counter-intuitive
results.
In short, the ordering provided by \co{spin_lock()} extends not only
throughout the critical section, but also indefinitely beyond the end
of that critical section.
Similarly, the ordering provided by \co{spin_unlock()} extends not
only throughout the critical section, but also indefinitely beyond the
beginning of that critical section.
\subsubsection{Ordering for Non-Lock Holders?}
\label{sec:memorder:Ordering for Non-Lock Holders?}
Does a CPU or thread that is not holding a given lock see that lock's
critical sections as being ordered?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-1@whole.fcv}
\caption{Accesses Between Same-CPU Critical Sections (Ordering?)}
\label{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)}
\end{listing}
This question can be answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)}
(\path{C-Lock-across-unlock-lock-1.litmus}), which
shows an example where \co{P(0)} places its write and read in two
different critical sections for the same lock.
Running this litmus test shows that the \co{exists} clause can be satisfied,
which means that the answer is ``no'', and that CPUs can reorder accesses
across consecutive critical sections.
In other words, not only are \co{spin_lock()} and \co{spin_unlock()}
weaker than a full barrier when considered separately, they are also
weaker than a full barrier when taken together.
If the ordering of a given lock's critical sections is to be observed,
then either the observer must hold that lock, or either
\co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()}
must be executed just after the second lock acquisition.
But what if the two critical sections run on different CPUs or threads?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-2@whole.fcv}
\caption{Accesses Between Different-CPU Critical Sections (Ordering?)}
\label{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)}
\end{listing}
This question is answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)}
(\path{C-Lock-across-unlock-lock-2.litmus}),
in which the first lock acquisition is executed by \co{P0()} and the
second lock acquisition is executed by \co{P1()}.
Note that \co{P1()} must read \co{x} to reject executions in which
\co{P1()} executes before \co{P0()} does.
Running this litmus test shows that the \co{exists} clause can be satisfied,
which means that the answer is ``no'', and that CPUs can reorder accesses
across consecutive critical sections, even if each of those critical
sections runs on a different CPU or thread.
\QuickQuiz{
But if there are three critical sections, isn't it true that
CPUs not holding the lock will observe the accesses from the
first and the third critical section as being ordered?
}\QuickQuizAnswer{
No.
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-3@whole.fcv}
\caption{Accesses Between Multiple Different-CPU Critical Sections}
\label{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections}
\end{listing}
\Cref{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections}
shows an example three-critical-section chain
(\path{Lock-across-unlock-lock-3.litmus}).
Running this litmus test shows that the \co{exists} clause can
still be satisfied, so this additional critical section is still
not sufficient to force ordering.
However, as the reader can verify, placing an
\co{smp_mb__after_spinlock()} after either \co{P1()}'s or
\co{P2()}'s lock acquisition does suffice to force ordering.
}\QuickQuizEnd
As before, if the ordering of a given lock's critical sections is to
be observed, then either the observer must hold that lock or either
\co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()} must
be executed just after \co{P1()}'s lock acquisition.
Given that ordering is not guaranteed when both critical sections are
protected by the same lock, there is no hope of any ordering guarantee
when different locks are used.
However, readers are encouraged to construct the corresponding litmus
test and see this for themselves.
This situation can seem counter-intuitive, but it is rare for code to
care.
This approach also allows certain weakly ordered systems to implement
locks more efficiently.
\subsubsection{Ordering for \tco{spin_is_locked()}?}
\label{sec:memorder:Ordering for spin-is-locked()?}
The Linux kernel's \co{spin_is_locked()} primitive returns
\co{true} if the specified lock is held and \co{false} otherwise.
Note that \co{spin_is_locked()} returns \co{true} when some other
CPU or thread holds the lock, not just when the current CPU or thread
holds that lock.
This raises the question of what ordering guarantees \co{spin_is_locked()}
might provide.
In the Linux kernel, the answer has varied over time.
Initially, \co{spin_is_locked()} was unordered, but a few interesting
use cases motivated strong ordering.
Later discussions surrounding the Linux-kernel memory model concluded
that \co{spin_is_locked()} should be used only for debugging.
Part of the reason for this is that even a fully ordered
\co{spin_is_locked()} might return \co{true} because some other CPU or
thread was just about to release the lock in question.
In this case, there is little that can be learned from that return value
of \co{true}, which means that reliable use of \co{spin_is_locked()}
is surprisingly complex.
Other approaches almost always work better, for example, use of explicit
shared variables or the \co{spin_trylock()} primitive.
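For example, code tempted to check \co{spin_is_locked()} before doing
lock-protected work can often instead be structured around
\co{spin_trylock()}, as in this minimal sketch with hypothetical
\co{do_locked_work()} and \co{do_unlocked_fallback()} helpers:
\begin{VerbatimN}
if (spin_trylock(&mylock)) {
	do_locked_work();       /* Lock held: these accesses are ordered   */
	spin_unlock(&mylock);   /* against other critical sections.        */
} else {
	do_unlocked_fallback(); /* Some other CPU or thread holds (or just */
}                           /* held) the lock.                         */
\end{VerbatimN}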
This situation resulted in the current state, namely that
\co{spin_is_locked()} provides no ordering guarantees, except that if
it returns \co{false}, the current CPU or thread cannot be holding the
corresponding lock.
\QuickQuiz{
But if \co{spin_is_locked()} returns \co{false}, don't we also
know that no other CPU or thread is holding the corresponding
lock?
}\QuickQuizAnswer{
No.
By the time that the code inspects the return value from
\co{spin_is_locked()}, some other CPU or thread might well have
acquired the corresponding lock.
}\QuickQuizEnd
\subsubsection{Why Mathematically Model Locking?}
\label{sec:memorder:Why Mathematically Model Locking?}
Given all these possible choices, why model locking in general?
Why not simply model a simple implementation?
One reason is modeling performance, as shown in
\cref{tab:formal:Locking: Modeling vs. Emulation Time (s)}
on
\cpageref{tab:formal:Locking: Modeling vs. Emulation Time (s)}.
Directly modeling locking in general is orders of magnitude faster
than emulating even a trivial implementation.
This should be no surprise, given the combinatorial explosion experienced
by present-day formal-verification tools with increases in the number of
memory accesses executed by the code being modeled.
Splitting the modeling at API boundaries can therefore result in
combinatorial implosion.
Another reason is that a trivial implementation might needlessly constrain
either real implementations or real use cases.
In contrast, modeling a platonic lock allows the widest variety of
implementations while providing specific guidance to locks' users.
\subsection{RCU}
\label{sec:memorder:RCU}
As described in
\cref{sec:defer:RCU Fundamentals},
the fundamental property of RCU grace periods is this straightforward
two-part guarantee:
\begin{enumerate*}[(1)]
\item If any part of a given RCU read-side critical section precedes
the beginning of a given \IX{grace period}, then the entirety of that
critical section precedes the end of that grace period.
\item If any part of a given RCU read-side critical section follows
the end of a given grace period, then the entirety of that
critical section follows the beginning of that grace period.
\end{enumerate*}
These guarantees are summarized in
\cref{fig:memorder:RCU Grace-Period Ordering Guarantees},
where the grace period is denoted by the dashed arrow between the
\co{call_rcu()} invocation in the upper right and the corresponding
RCU callback invocation in the lower left.\footnote{
For more detail, please see
\crefrange{fig:defer:RCU Reader and Later Grace Period}{fig:defer:RCU Reader Within Grace Period}
starting on
\cpageref{fig:defer:RCU Reader and Later Grace Period}.}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/RCUGPordering}}
\caption{RCU Grace-Period Ordering Guarantees}
\label{fig:memorder:RCU Grace-Period Ordering Guarantees}
\end{figure}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul@whole.fcv}
\caption{RCU Fundamental Property}
\label{lst:memorder:RCU Fundamental Property}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+i-rl-o-o-rul@whole.fcv}
\caption{RCU Fundamental Property and Reordering}
\label{lst:memorder:RCU Fundamental Property and Reordering}
\end{listing}
In short, an RCU read-side critical section is guaranteed never to
completely overlap an RCU grace period, as demonstrated by
\cref{lst:memorder:RCU Fundamental Property}
(\path{C-SB+o-rcusync-o+rl-o-o-rul.litmus}).
Either or neither of the \co{r2} registers can have the final value of zero,
but at least one of them must be non-zero (that is, the cycle identified
by the \co{exists} clause is prohibited), courtesy of RCU's fundamental
grace-period guarantee, as can be seen by running \co{herd} on this litmus test.
Note that this guarantee is insensitive to the ordering of the accesses
within \co{P1()}'s critical section, so the litmus test shown in
\cref{lst:memorder:RCU Fundamental Property and Reordering}\footnote{
Dependencies can of course limit the ability to reorder accesses
within RCU read-side critical sections.}
also forbids this same cycle.
However, this definition is incomplete, as can be seen from the following
list of questions:\footnote{
Several of which were introduced to Paul by \ppl{Jade}{Alglave} during
early work on LKMM, and a few more of which came from other
LKMM participants~\cite{Alglave:2018:FSC:3173162.3177156}.}
\begin{enumerate}
\item What ordering is provided by \co{rcu_read_lock()}
and \co{rcu_read_unlock()}, independent of RCU grace periods?
\item What ordering is provided by \co{synchronize_rcu()}
and \co{synchronize_rcu_expedited()}, independent of RCU read-side
critical sections?
\item If the entirety of a given RCU read-side critical section
precedes the end of a given RCU grace period, what about
accesses preceding that critical section?
\item If the entirety of a given RCU read-side critical section
follows the beginning of a given RCU grace period, what about
accesses following that critical section?
\item What happens in situations involving more than one RCU read-side
critical section and/or more than one RCU grace period?
\item What happens when RCU is combined with other memory-ordering
mechanisms?
\end{enumerate}
These questions are addressed in the following sections.
\subsubsection{RCU Read-Side Ordering}
\label{sec:memorder:RCU Read-Side Ordering}
On their own, RCU's read-side primitives \co{rcu_read_lock()} and
\co{rcu_read_unlock()} provide no ordering whatsoever.
In particular, despite their names, they do not act like locks, as can
be seen in
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}
(\path{C-LB+rl-o-o-rul+rl-o-o-rul.litmus}).
This litmus test's cycle is allowed:
Both instances of the \co{r1} register can have final values of 1.
\begin{listing}
\input{CodeSamples/formal/herd/C-LB+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{RCU Readers Provide No Lock-Like Ordering}
\label{lst:memorder:RCU Readers Provide No Lock-Like Ordering}
\end{listing}
Nor do these primitives have barrier-like ordering properties,
at least not unless there is a grace period in the mix, as can be seen in
\cref{lst:memorder:RCU Readers Provide No Barrier-Like Ordering}
(\path{C-LB+o-rl-rul-o+o-rl-rul-o.litmus}).
This litmus test's cycle is also allowed.
(Try it!)
\begin{listing}
\input{CodeSamples/formal/herd/C-LB+o-rl-rul-o+o-rl-rul-o@whole.fcv}
\caption{RCU Readers Provide No Barrier-Like Ordering}
\label{lst:memorder:RCU Readers Provide No Barrier-Like Ordering}
\end{listing}
Of course, lack of ordering in both these litmus tests should be absolutely
no surprise, given that both \co{rcu_read_lock()} and \co{rcu_read_unlock()}
are no-ops in the \IXacr{qsbr} implementation of RCU\@.
\subsubsection{RCU Update-Side Ordering}
\label{sec:memorder:RCU Update-Side Ordering}
In contrast with RCU readers, the RCU update-side functions
\co{synchronize_rcu()} and \co{synchronize_rcu_expedited()}
provide memory ordering at least as strong as \co{smp_mb()},\footnote{
And also way more expensive!}
as can be seen by running \co{herd} on the litmus test shown in
\cref{lst:memorder:RCU Updaters Provide Full Ordering}.
This test's cycle is prohibited, just as it would with \co{smp_mb()}.
This should be no surprise given the information presented in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o@whole.fcv}
\caption{RCU Updaters Provide Full Ordering}
\label{lst:memorder:RCU Updaters Provide Full Ordering}
\end{listing}
\subsubsection{RCU Readers:
Before and After}
\label{sec:memorder:RCU Readers: Before and After}
Before reading this section, it would be well to reflect on the distinction
between guarantees that are available and guarantees that maintainable
software should rely on.
Keeping that firmly in mind, this section presents a few of the
more exotic RCU guarantees.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-o-rul@whole.fcv}
\caption{What Happens Before RCU Readers?}
\label{lst:memorder:What Happens Before RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens Before RCU Readers?}
(\path{C-SB+o-rcusync-o+o-rl-o-rul.litmus})
shows a litmus test similar to that in
\cref{lst:memorder:RCU Fundamental Property},
but with the RCU reader's first access preceding the RCU read-side critical
section, rather than the more conventional (and maintainable!\@) approach of
being contained within it.
Perhaps surprisingly, running \co{herd} on this litmus test gives the
same result as for that in
\cref{lst:memorder:RCU Fundamental Property}:
The cycle is forbidden.
Why would this be the case?
Because both of \co{P1()}'s accesses are volatile,
as discussed in
\cref{sec:toolsoftrade:A Volatile Solution},
the compiler is not permitted to reorder them.
This means that the code emitted for \co{P1()}'s \co{WRITE_ONCE()} will
precede that of \co{P1()}'s \co{READ_ONCE()}.
Therefore, RCU implementations that place memory-barrier instructions in
\co{rcu_read_lock()} and \co{rcu_read_unlock()} will preserve the ordering
of \co{P1()}'s two accesses all the way down to the hardware level.
On the other hand, RCU implementations that rely on interrupt-based
state machines will also fully preserve this ordering
\emph{relative to the grace period} due to the fact that interrupts take
place at a precise location in the execution of the interrupted code.
This in turn means that if the \co{WRITE_ONCE()} follows the end of a
given RCU grace period, then the accesses within \emph{and following}
that RCU read-side critical section must follow the beginning of that
same grace period.
Similarly, if the \co{READ_ONCE()} precedes the beginning of the grace
period, everything within \emph{and preceding} that critical section
must precede the end of that same grace period.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-rul-o@whole.fcv}
\caption{What Happens After RCU Readers?}
\label{lst:memorder:What Happens After RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens After RCU Readers?}
(\path{C-SB+o-rcusync-o+rl-o-rul-o.litmus})
is similar, but instead looks at accesses after the RCU read-side
critical section.
This test's cycle is also forbidden, as can be checked with the \co{herd}
tool.
The reasoning is similar to that for
\cref{lst:memorder:What Happens Before RCU Readers?},
and is left as an exercise for the reader.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-rul-o@whole.fcv}
\caption{What Happens With Empty RCU Readers?}
\label{lst:memorder:What Happens With Empty RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens With Empty RCU Readers?}
(\path{C-SB+o-rcusync-o+o-rl-rul-o.litmus})
takes things one step farther, moving \co{P1()}'s \co{WRITE_ONCE()}
to precede the RCU read-side critical section and moving
\co{P1()}'s \co{READ_ONCE()} to follow it, resulting in an
empty RCU read-side critical section.
Perhaps surprisingly, despite the empty critical section, RCU nevertheless
still manages to forbid the cycle.
This can again be checked using the \co{herd} tool.
Furthermore, the reasoning is once again similar to that for
\cref{lst:memorder:What Happens Before RCU Readers?}.
Recapping, if \co{P1()}'s \co{WRITE_ONCE()} follows the end of a given
grace period, then \co{P1()}'s RCU read-side critical section---and
everything following it---must follow the beginning of that same grace
period.
Similarly, if \co{P1()}'s \co{READ_ONCE()} precedes the beginning of a
given grace period, then \co{P1()}'s RCU read-side critical section---and
everything preceding it---must precede the end of that same grace period.
In both cases, the critical section's emptiness is irrelevant.
\QuickQuiz{
Wait a minute!
In QSBR implementations of RCU, no code is emitted for
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
This means that the RCU read-side critical section in
\cref{lst:memorder:What Happens With Empty RCU Readers?}
isn't just empty, it is completely nonexistent!!!
So how can something that doesn't exist at all possibly have
any effect whatsoever on ordering???
}\QuickQuizAnswer{
Because in QSBR, RCU read-side critical sections don't
actually disappear.
Instead, they are extended in both directions until a quiescent
state is encountered.
For example, in the Linux kernel, the critical section might
be extended back to the most recent \co{schedule()} call and
ahead to the next \co{schedule()} call.
Of course, in non-QSBR implementations, \co{rcu_read_lock()}
and \co{rcu_read_unlock()} really do emit code, which can clearly
provide ordering.
And within the Linux kernel, even the QSBR implementation
has a compiler \co{barrier()} in \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, which is necessary to prevent
the compiler from moving memory accesses that might result
in page faults into the RCU read-side critical section.
Therefore, strange though it might seem, empty RCU read-side
critical sections really can and do provide some degree of
ordering.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-o@whole.fcv}
\caption{What Happens With No RCU Readers?}
\label{lst:memorder:What Happens With No RCU Readers?}
\end{listing}
This situation leads to the question of what happens if
\co{rcu_read_lock()} and \co{rcu_read_unlock()} are omitted
entirely, as shown in
\cref{lst:memorder:What Happens With No RCU Readers?}
(\path{C-SB+o-rcusync-o+o-o.litmus}).
As can be checked with \co{herd}, this litmus test's cycle is allowed,
that is, both instances of \co{r2} can have final values of zero.
This might seem strange in light of the fact that empty RCU
read-side critical sections can provide ordering.
And it is true that QSBR implementations of RCU would in fact forbid
this outcome, due to the fact that there is no quiescent state anywhere
in \co{P1()}'s function body, so that \co{P1()} would run
within an implicit RCU read-side critical section.
However, RCU also has non-QSBR implementations, which have no implied
RCU read-side critical section, and in turn no way for RCU to enforce
ordering.
Therefore, this litmus test's cycle is allowed.
\QuickQuiz{
Can \co{P1()}'s accesses be reordered in the litmus tests shown in
\cref{lst:memorder:What Happens Before RCU Readers?,%
lst:memorder:What Happens After RCU Readers?,%
lst:memorder:What Happens With Empty RCU Readers?}
in the same way that they were reordered going from
\cref{lst:memorder:RCU Fundamental Property}
to
\cref{lst:memorder:RCU Fundamental Property and Reordering}?
}\QuickQuizAnswer{
No, because none of these later litmus tests have more than one
access within their RCU read-side critical sections.
But what about swapping the accesses, for example, in
\cref{lst:memorder:What Happens Before RCU Readers?},
placing \co{P1()}'s \co{WRITE_ONCE()} within its critical
section and the \co{READ_ONCE()} before its critical section?
Swapping the accesses allows both instances of \co{r2} to
have a final value of zero, in other words, although RCU read-side
critical sections' ordering properties can extend outside of
those critical sections, the same is not true of their
reordering properties.
Checking this with \co{herd} and explaining why is left as an
exercise for the reader.
}\QuickQuizEnd
\subsubsection{Multiple RCU Readers and Updaters}
\label{sec:memorder:Multiple RCU Readers and Updaters}
Because \co{synchronize_rcu()} has ordering semantics that are at least
as strong as \co{smp_mb()}, no matter how many processes there are in
an SB litmus test
(such as \cref{lst:memorder:RCU Updaters Provide Full Ordering}),
placing \co{synchronize_rcu()} between each process's
accesses prohibits the cycle.
In addition, the cycle is prohibited in an SB test where one process
uses \co{synchronize_rcu()} and the other uses \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, as shown by
\cref{lst:memorder:RCU Fundamental Property}.
However, if both processes use \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, the cycle will be allowed, as shown by
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}.
Is it possible to say anything general about which RCU-protected
litmus tests will be prohibited and which will be allowed?
This section takes up that question.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{One RCU Grace Period and Two Readers}
\label{lst:memorder:One RCU Grace Period and Two Readers}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{Two RCU Grace Periods and Two Readers}
\label{lst:memorder:Two RCU Grace Periods and Two Readers}
\end{listing}
More specifically, what if the litmus test has one RCU grace
period and two RCU readers, as shown in
\cref{lst:memorder:One RCU Grace Period and Two Readers}?
The \co{herd} tool says that this cycle is allowed, but it would be
good to know \emph{why}.\footnote{
Especially given that Paul changed his mind several times about
this particular litmus test when working with \ppl{Jade}{Alglave} to
generalize RCU ordering semantics.}
\begin{figure*}
\centering
\resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2R}}
\caption{Cycle for One RCU Grace Period and Two RCU Readers}
\label{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers}
\end{figure*}
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/RCU2G2R}}
\caption{No Cycle for Two RCU Grace Periods and Two RCU Readers}
\label{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}
\end{figure*}
The key point is that even strongly ordered CPUs such as x86 can
and will reorder \co{P1()}'s and \co{P2()}'s \co{WRITE_ONCE()} and
\co{READ_ONCE()}.
With that reordering,
\cref{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers}
shows how the cycle forms:
\begin{enumerate}
\item \co{P0()}'s read from \co{x1} precedes \co{P1()}'s write, as
depicted by the dashed arrow near the bottom of the diagram.
\item Because \co{P1()}'s write follows the end of \co{P0()}'s grace period,
\co{P1()}'s read from \co{x2} cannot precede the beginning of
\co{P0()}'s grace period.
\item \co{P1()}'s read from \co{x2} precedes \co{P2()}'s write.
\item Because \co{P2()}'s write to \co{x2} precedes the end of
\co{P0()}'s grace period, it is completely legal for \co{P2()}'s
read from \co{x0} to precede the beginning of \co{P0()}'s grace period.
\item Therefore, \co{P2()}'s read from \co{x0} can precede \co{P0()}'s
write, thus allowing the cycle to form.
\end{enumerate}
But what happens when another grace period is added?
This situation is shown in
\cref{lst:memorder:Two RCU Grace Periods and Two Readers},
an SB litmus test in which \co{P0()} and \co{P1()} have RCU grace periods
and \co{P2()} and \co{P3()} have RCU readers.
Again, the CPUs can reorder the accesses within RCU read-side critical
sections, as shown in
\cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}.
For this cycle to form, \co{P2()}'s critical section must
end after \co{P1()}'s grace period and \co{P3()}'s must end after the
beginning of that same grace period, which happens to also be after the
end of \co{P0()}'s grace period.
Therefore, \co{P3()}'s critical section must start after the beginning
of \co{P0()}'s grace period, which in turn means that \co{P3()}'s
read from \co{x0} cannot possibly precede \co{P0()}'s write.
Therefore, the cycle is forbidden because RCU read-side critical sections
cannot span full RCU grace periods.
However, a closer look at
\cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}
makes it clear that adding a third reader would allow the cycle.
This is because this third reader could end before the end of \co{P0()}'s
grace period, and thus start before the beginning of that same grace
period.
This in turn suggests the general rule, which is:
In these sorts of RCU-only litmus tests, if there are at least as many
RCU grace periods as there are RCU read-side critical sections,
the cycle is forbidden.\footnote{
Interestingly enough, Alan Stern proved that within the context
of LKMM, the two-part fundamental property of RCU expressed
in \cref{sec:defer:RCU Fundamentals} actually implies
this seemingly more general result, which is called the RCU
axiom~\cite{Alglave:2018:FSC:3173162.3177156}.}
\subsubsection{RCU and Other Ordering Mechanisms}
\label{sec:memorder:RCU and Other Ordering Mechanisms}
But what about litmus tests that combine RCU with other ordering
mechanisms?
The general rule is that it takes only one mechanism to forbid a cycle.
For example, refer back to
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}.
Applying the general rule from the previous section, because this litmus
test has two RCU read-side critical sections and no RCU grace periods,
the cycle is allowed.
But what if \co{P0()}'s \co{WRITE_ONCE()} is replaced by an
\co{smp_store_release()} and \co{P1()}'s \co{READ_ONCE()} is replaced
by an \co{smp_load_acquire()}?
RCU would still allow the cycle, but the release-acquire pair would
forbid it.
Because it only takes one mechanism to forbid a cycle, the release-acquire
pair would prevail, thus forbidding the cycle.
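In other words, the modified test would look something like the
following hand-written sketch, which uses assumed variable names
\co{x0} and \co{x1} and approximates, but is not, one of the litmus
tests distributed with this book's code samples:
\begin{VerbatimN}
void P0(void)
{
	int r1;

	rcu_read_lock();
	r1 = READ_ONCE(x0);
	smp_store_release(&x1, 1);  /* This release ...               */
	rcu_read_unlock();
}

void P1(void)
{
	int r1;

	rcu_read_lock();
	r1 = smp_load_acquire(&x1); /* ... pairs with this acquire,   */
	WRITE_ONCE(x0, 1);          /* forbidding the cycle in which  */
	rcu_read_unlock();          /* both instances of r1 equal 1.  */
}
\end{VerbatimN}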
\begin{figure*}
\centering
\resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2Rmb}}
\caption{Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier}
\label{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier}
\end{figure*}
For another example, refer back to
\cref{lst:memorder:One RCU Grace Period and Two Readers}.
Because this litmus test has two RCU readers but only one grace period,
its cycle is allowed.
But suppose that an \co{smp_mb()} was placed between \co{P1()}'s
pair of accesses.
In this new litmus test, because of the addition of the \co{smp_mb()},
\co{P2()}'s as well as \co{P1()}'s critical sections would extend beyond the
end of \co{P0()}'s grace period, which in turn would prevent \co{P2()}'s
read from \co{x0} from preceding \co{P0()}'s write, as depicted by the
red dashed arrow in
\cref{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier}.
In this case, RCU and the \IXh{full}{memory barrier} work together to forbid
the cycle, with RCU preserving ordering between \co{P0()} and both
\co{P1()} and \co{P2()}, and with the \co{smp_mb()} preserving
ordering between \co{P1()} and \co{P2()}.
\QuickQuiz{
What would happen if the \co{smp_mb()} was instead added between
\co{P2()}'s accesses in
\cref{lst:memorder:One RCU Grace Period and Two Readers}?
}\QuickQuizAnswer{
The cycle would again be forbidden.
Further analysis is left as an exercise for the reader.
}\QuickQuizEnd
In short, where RCU's semantics were once purely pragmatic, they are
now fully
formalized~\cite{PaulMcKenney2005RCUSemantics,MathieuDesnoyers2012URCU,AlexeyGotsman2013ESOPRCU,Alglave:2018:FSC:3173162.3177156}.
% \subsection{SRCU}
% \label{sec:memorder:SRCU}
% @@@ After LWN article
% Nesting vs. value passed from \co{srcu_read_lock()} to
% \co{srcu_read_unlock()}.
% When augmented by \co{smp_mb__after_srcu_read_unlock()}.
\subsection{Higher-Level Primitives:
Discussion}
\label{sec:memorder:Higher-Level Primitives: Discussion}
It is quite beneficial to verify code in terms of a higher-level primitive
instead of in terms of the low-level memory accesses used in a particular
implementation of that primitive.
First, this allows code using those primitives to be
verified against an abstract representation of those primitives,
thus making that code less vulnerable to implementation changes.
Second, partitioning the verification at API boundaries results in
combinatorial implosion, greatly reducing the overhead of formal
verification.
It is hoped that verifying against detailed semantics for higher-level
primitives will greatly increase the effectiveness of static analysis
and model checking.
\section{Hardware Specifics}
\label{sec:memorder:Hardware Specifics}
\OriginallyPublished{sec:memorder:Hardware Specifics}{Memory-Barrier Instructions For Specific CPUs}{Linux Journal}{PaulMcKenney2005i,PaulMcKenney2005j}
%
\epigraph{Rock beats paper!}{Derek Williams}
Each CPU family has its own peculiar approach to memory ordering, which
can make portability a challenge, as you can see in
\cref{tab:memorder:Summary of Memory Ordering}.
\begin{table*}[tb] % @@@ Omitting 'p' prevents unordered floats in 2c builds
\rowcolors{4}{}{lightgray}
\small
\centering
\newcommand{\cpufml}[1]{\begin{picture}(6,50)(0,0)\rotatebox{90}{#1}\end{picture}}
\renewcommand*{\arraystretch}{1.2}\OneColumnHSpace{-.35in}
\ebresizewidth{
\begin{tabular}{llccccccccc}
\toprule
\multicolumn{2}{l}{~} & \multicolumn{9}{c}{CPU Family} \\
\cmidrule{3-11}
\multicolumn{2}{c}{\raisebox{.5ex}{Property}}
& \cpufml{Alpha}
& \cpufml{\ARMv7-A/R}
& \cpufml{\ARMv8}
& \cpufml{Itanium}
& \cpufml{MIPS}
& \cpufml{\Power{}}
& \cpufml{SPARC TSO}
& \cpufml{x86}
& \cpufml{z~Systems}
\\
\cmidrule(r){1-2} \cmidrule{3-11}
% Alpha ARMv8 ARMv7 Itanium MIPS PPC SPARC x86 z Systems
\cellcolor{white}
Memory Ordering
& Loads Reordered After Loads or Stores?
& Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\
& Stores Reordered After Stores?
& Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\
\cellcolor{white}
& Stores Reordered After Loads?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
& \parbox[c][6ex]{2in}{\raggedright Atomic Instructions Reordered With\par Loads or Stores?}
& Y & Y & Y & ~ & Y & Y & ~ & ~ & ~ \\
\cellcolor{white}
& Dependent Loads Reordered?
& Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
& Dependent Stores Reordered?
& ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
\cellcolor{white}
& Non-Sequentially Consistent?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
& Non-Multicopy Atomic?
& Y & Y & Y & Y & Y & Y & Y & Y & ~ \\
\cellcolor{white}
& Non-Other-Multicopy Atomic?
& Y & Y & ~ & Y & Y & Y & ~ & ~ & ~ \\
& Non-Cache Coherent?
& ~ & ~ & ~ & Y & ~ & ~ & ~ & ~ & ~ \\
\cmidrule(r){1-2} \cmidrule{3-11}
\cellcolor{white}
Instructions
& Load-Acquire/Store-Release?
& F & F & i & I & F & b & ~ & ~ & ~ \\
& Atomic RMW Instruction Type?
& L & L & L & C & L & L & C & C & C \\
\cellcolor{white}
& Incoherent Instruction Cache/Pipeline?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
\bottomrule
\end{tabular}
}
\vspace{5pt}\hfill
\ebresizeverb{.6}{
\framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt}
\rowcolors{1}{}{}
\renewcommand*{\arraystretch}{1}
\begin{tabular}{llcl}
{ \bf Key: }
& \multicolumn{3}{l}{Load-Acquire/Store-Release?} \\
~ & ~ & b: & Lightweight memory barrier \\
~ & ~ & F: & Full memory barrier \\
~ & ~ & i: & Instruction with lightweight ordering \\
~ & ~ & I: & Instruction with heavyweight ordering \\
~ &\multicolumn{3}{l}{Atomic RMW Instruction Type?} \\
~ & ~ & C: & Compare-and-exchange instruction \\
~ & ~ & L: & Load-linked/store-conditional instruction \\
\end{tabular}
}
}\OneColumnHSpace{-0.4in}
\caption{Summary of Memory Ordering}
\label{tab:memorder:Summary of Memory Ordering}
\end{table*}
In fact, some software environments simply prohibit
direct use of memory-ordering operations, restricting the programmer
to mutual-exclusion primitives that incorporate them to the extent that
they are required.
Please note that this section is not intended to be a reference manual
covering all (or even most) aspects of each CPU family, but rather
a high-level overview providing a rough comparison.
For full details, see the reference manual for the CPU of interest.
Getting back to
\cref{tab:memorder:Summary of Memory Ordering},
the first group of rows looks at memory-ordering
properties and the second group looks at instruction properties.
Please note that these properties hold at the machine-instruction
level.
Compilers can and do reorder far more aggressively than does hardware.
Use marked accesses such as \co{READ_ONCE()} and \co{WRITE_ONCE()}
to constrain the compiler's optimizations and prevent undesirable
reordering.
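For example, consider the following busy-wait sketch, in which the
flag name and the \co{do_work()} function are purely illustrative.
The \co{READ_ONCE()} forces the compiler to reload \co{stop_flag} on
each pass through the loop, and the \co{WRITE_ONCE()} prevents the
compiler from tearing or omitting the store:
\begin{VerbatimN}
int stop_flag;

void worker(void)
{
	while (!READ_ONCE(stop_flag))
		do_work();  /* Illustrative placeholder. */
}

void stop_worker(void)
{
	WRITE_ONCE(stop_flag, 1);
}
\end{VerbatimN}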
The first three rows indicate whether a given CPU allows the four
possible combinations of loads and stores to be reordered, as discussed
in
\cref{sec:memorder:Ordering: Why and How?} and
\crefrange{sec:memorder:Load Followed By Load}{sec:memorder:Store Followed By Store}.
The next row (``Atomic Instructions Reordered With Loads or Stores?\@'')
indicates whether a given CPU allows loads and stores
to be reordered with atomic instructions.
The fifth and sixth rows cover reordering of dependent loads and stores,
which were discussed in
\crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}
and which are explained in more detail in
\cref{sec:memorder:Alpha}.
The short version is that Alpha requires memory barriers for readers
as well as updaters of linked data structures.
However, these memory barriers are provided by the Alpha
architecture-specific code in v4.15 and later Linux kernels.
The next row, ``Non-Sequentially Consistent'', indicates whether
the CPU's normal load and store instructions are constrained by
sequential consistency.
Performance considerations have dictated that no modern mainstream
system is sequentially consistent.
The next three rows cover \IXalt{multicopy atomicity}{multi-copy atomic},
which was defined in
\cref{sec:memorder:Multicopy Atomicity}.
The first is full-up (and rare)
\IXalt{multicopy atomicity}{multi-copy atomic}, the second is the
weaker \IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic},
and the third is the weakest
\IXalthy{non-multicopy atomicity}{non-}{multi-copy atomic}.
The next row, ``Non-Cache Coherent'', covers accesses from multiple
threads to a single variable, which was discussed in
\cref{sec:memorder:Cache Coherence}.
The final three rows cover instruction-level choices and issues.
The first row indicates how each CPU implements load-acquire
and store-release, the second row classifies CPUs by atomic-instruction
type, and the third and final row
indicates whether a given CPU has an incoherent
instruction cache and pipeline.
Such CPUs require special instructions be executed for self-modifying
code.
%Parenthesized CPU names indicate modes that are architecturally allowed,
%but rarely used in practice.
The common ``just say no'' approach to memory-ordering operations
can be eminently reasonable where it applies,
but there are environments, such as the Linux kernel, where direct
use of memory-ordering operations is required.
Therefore,
Linux provides a carefully chosen least-common-denominator
set of memory-ordering primitives, which are as follows:
\begin{description}
\item [\tco{smp_mb()}] (\IXh{full}{memory barrier}) that orders both loads and
stores.
This means that loads and stores preceding the memory barrier
will be committed to memory before any loads and stores
following the memory barrier.
\item [\tco{smp_rmb()}] (\IXh{read}{memory barrier}) that orders only loads.
\item [\tco{smp_wmb()}] (\IXh{write}{memory barrier}) that orders only stores.
\item [\tco{smp_mb__before_atomic()}] that forces ordering of accesses
preceding the \co{smp_mb__before_atomic()} against accesses following
a later RMW atomic operation.
This is a noop on systems that fully order atomic RMW operations.
\item [\tco{smp_mb__after_atomic()}] that forces ordering of accesses
preceding an earlier RMW atomic operation against accesses
following the \co{smp_mb__after_atomic()}.
This is also a noop on systems that fully order atomic RMW operations.
\item [\tco{smp_mb__after_spinlock()}] that forces ordering of accesses
preceding a lock acquisition against accesses
following the \co{smp_mb__after_spinlock()}.
This is also a noop on systems that fully order lock acquisitions.
\item [\tco{mmiowb()}] that forces ordering on MMIO writes that are guarded
by global spinlocks, and is more thoroughly described
in a 2016 LWN article on MMIO~\cite{PaulEMcKenney2016LinuxKernelMMIO}.
\end{description}
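For example, the following sketch uses \co{smp_wmb()} and
\co{smp_rmb()} to pass a message from one CPU to another, with
\co{struct message} and the function names being purely illustrative:
\begin{VerbatimN}
struct message {
	int data;
	int ready;
};

void produce(struct message *m)
{
	WRITE_ONCE(m->data, 42);
	smp_wmb();  /* Order data store before flag store. */
	WRITE_ONCE(m->ready, 1);
}

int consume(struct message *m)
{
	while (!READ_ONCE(m->ready))
		cpu_relax();
	smp_rmb();  /* Order flag load before data load. */
	return READ_ONCE(m->data);
}
\end{VerbatimN}
In modern Linux-kernel code, this pattern is more often expressed using
\co{smp_store_release()} and \co{smp_load_acquire()}, which combine the
marked access with the needed ordering.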
The \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()}
primitives also force
the compiler to eschew any optimizations that would have the effect
of reordering memory accesses across the barriers.
\QuickQuiz{
What happens to code between an atomic operation and an
\co{smp_mb__after_atomic()}?
}\QuickQuizAnswer{
First, please don't do this!
But if you do, this intervening code will either be ordered
after the atomic operation or before the
\co{smp_mb__after_atomic()}, depending on the architecture,
but not both.
This also applies to \co{smp_mb__before_atomic()} and
\co{smp_mb__after_spinlock()}, that is, both the uncertain
ordering of the intervening code and the plea to avoid such code.
}\QuickQuizEnd
These primitives generate code only in SMP kernels; however, several
have counterparts (\co{mb()}, \co{rmb()}, and \co{wmb()},
respectively) that generate a memory barrier even in UP kernels.
The \co{smp_} versions should be used in most cases.
However, these latter primitives are useful when writing drivers,
because MMIO accesses must remain ordered even in UP kernels.
In the absence of memory-ordering operations, both CPUs and compilers would
happily rearrange these accesses, which at best would make the device
act strangely, and could crash your kernel or even damage your hardware.
So most kernel programmers need not worry about the memory-ordering
peculiarities of each and every CPU, as long as they stick to these
interfaces and to the fully ordered atomic operations.\footnote{
For a full list, expand the patterns in
\path{Documentation/atomic_t.txt}.}
If you are working deep in a given CPU's architecture-specific code,
of course, all bets are off.
Furthermore,
all of Linux's locking primitives (spinlocks, reader-writer locks,
semaphores, RCU, \ldots) include any needed ordering primitives.
So if you are working with code that uses these primitives properly,
you need not worry about Linux's memory-ordering primitives.
That said, deep knowledge of each CPU's
\IXalt{memory-consistency model}{memory model}
can be very helpful when debugging, to say nothing of when writing
architecture-specific code or synchronization primitives.
Besides, they say that a little knowledge is a very dangerous thing.
Just imagine the damage you could do with a lot of knowledge!
For those who wish to understand more about individual CPUs'
memory consistency models, the next sections describe those of a few
popular and prominent CPUs.
Although there is no substitute for actually reading a given CPU's
documentation, these sections do give a good overview.
\subsection{Alpha}
\label{sec:memorder:Alpha}
It may seem strange to say much of anything about a CPU whose end of life
has long since passed, but Alpha is interesting because it is the only
mainstream CPU that reorders dependent loads, and has thus had outsized
influence on concurrency APIs, including within the Linux kernel.
The need for core Linux-kernel code to accommodate Alpha ended
with version v4.15 of the Linux kernel, and all traces of this
accommodation were removed in v5.9 with the removal of the
\apikh{smp_read_barrier_depends()} and \co{read_barrier_depends()} APIs.
This section is nevertheless retained in the Third Edition
because here in early 2023 a few Linux kernel hackers are still
working on pre-v4.15 versions of the Linux kernel.
In addition, the modifications to \co{READ_ONCE()} that permitted
these APIs to be removed have not necessarily propagated to all
userspace projects that might still support Alpha.
\begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
The dependent-load difference between Alpha and the other CPUs is
illustrated by the code shown in
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}.
The \co{smp_store_release()} on \clnref{add}
guarantees that the element initialization
in \clnrefrange{init:b}{init:e} is executed before the element is added to the
list, so that the lock-free search will work correctly.
That is, it makes this guarantee on all CPUs {\em except} Alpha.
\end{fcvref}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
\begin{VerbatimL}[commandchars=\\\[\]]
void insert(long key, long data)
{
struct el *p;
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_lock(&mutex);
p->next = head.next; \lnlbl[init:b]
p->key = key;
p->data = data; \lnlbl[init:e]
smp_store_release(&head.next, p); \lnlbl[add]
spin_unlock(&mutex);
}
struct el *search(long searchkey)
{
struct el *p;
p = READ_ONCE_OLD(head.next); \lnlbl[h:next]
while (p != &head) {
/* Prior to v4.15, BUG ON ALPHA!!! */ \lnlbl[BUG]
if (p->key == searchkey) { \lnlbl[key]
return (p);
}
p = READ_ONCE_OLD(p->next); \lnlbl[next]
};
return (NULL);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Insert and Lock-Free Search (No Ordering)}
\label{lst:memorder:Insert and Lock-Free Search (No Ordering)}
\end{listing}
\begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
Given the pre-v4.15 implementation of \co{READ_ONCE()}, indicated by
\co{READ_ONCE_OLD()} in the listing, Alpha actually allows the code on
\clnref{key} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
to see the old garbage values that were present before the initialization
on \clnrefrange{init:b}{init:e}.
\Cref{fig:memorder:fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
shows how this can happen on
an aggressively parallel machine with partitioned caches, so that
alternating \IXpl{cache line} are processed by the different partitions
of the caches.
For example, the load of \co{head.next} on \clnref{h:next} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
might access cache bank~0,
and the load of \co{p->key} on \clnref{key} and of \co{p->next} on \clnref{next}
might access cache bank~1.
On Alpha, the \co{smp_store_release()} will guarantee that the cache
invalidations performed by \clnrefrange{init:b}{init:e} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
(for \co{p->next}, \co{p->key}, and \co{p->data}) will reach
the interconnect before that of \clnref{add} (for \co{head.next}), but
makes absolutely no guarantee about the order of
propagation through the reading CPU's cache banks.
For example, it is possible that the reading CPU's cache bank~1 is very
busy, but cache bank~0 is idle.
This could result in the cache invalidations for the new element
(\co{p->next}, \co{p->key}, and \co{p->data}) being
delayed, so that the reading CPU loads the new value for \co{head.next},
but loads the old cached values for \co{p->key} and \co{p->next}.
Yes, this does mean that Alpha can in effect fetch
the data pointed to {\em before} it fetches the pointer itself, strange
but true.
\end{fcvref}
See the documentation~\cite{Compaq01,WilliamPugh2000Gharachorloo}
called out earlier for more information,
or if you think that I am just making all this up.\footnote{
Of course, the astute reader will have already recognized that
Alpha is nowhere near as mean and nasty as it could be,
the (thankfully) mythical architecture in
\cref{sec:app:whymb:Ordering-Hostile Architecture}
being a case in point.}
The benefit of this unusual approach to ordering is that Alpha can use
simpler cache hardware, which in turn permitted higher clock frequencies
in Alpha's heyday.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/Alpha}}
\caption{Why \tco{smp_read_barrier_depends()} is Required in Pre-v4.15 Linux Kernels}
\label{fig:memorder:fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
\end{figure}
One could place an \co{smp_rmb()} primitive
between the pointer fetch and dereference in order to force Alpha
to order the pointer fetch with the later dependent load.
However, this imposes unneeded overhead on systems (such as \ARM,
Itanium, PPC, and SPARC) that respect
\IXalth{address dependencies}{address}{dependency} on the read side.
The \co{smp_read_barrier_depends()} primitive was therefore added to the
Linux kernel to eliminate this overhead on such systems, but it was removed
in v5.9 of the Linux kernel in favor of augmenting Alpha's definition
of \co{READ_ONCE()}.
Thus, as of v5.9, core kernel code no longer needs to concern itself
with this aspect of DEC Alpha.
\begin{fcvref}[ln:memorder:Safe Insert and Lock-Free Search]
However, it is better to use \co{rcu_dereference()}
as shown on \clnref{deref1,deref2} of
\cref{lst:memorder:Safe Insert and Lock-Free Search},
which works safely and efficiently for all recent kernel versions.
\end{fcvref}
It is also possible to implement a software mechanism
that could be used in place of \co{smp_store_release()} to force
all reading CPUs to see the writing CPU's writes in order.
This software barrier could be implemented by sending \IXacrfpl{ipi}
to all other CPUs.
Upon receipt of such an IPI, a CPU would execute a memory-barrier
instruction, implementing a system-wide memory barrier similar to that
provided by the Linux kernel's \co{sys_membarrier()} system call.
Additional logic is required to avoid deadlocks.
Of course, CPUs that respect address dependencies would define such a barrier
to simply be \co{smp_store_release()}.
However, this approach was deemed by the Linux community
to impose excessive overhead~\cite{McKenney01f}, and to their point would
be completely inappropriate for systems having
aggressive real-time response requirements.
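For reference, a userspace application can request a similar system-wide
memory barrier from recent Linux kernels via the \co{membarrier()}
system call, as in the following minimal sketch, which assumes a kernel
and headers providing \co{MEMBARRIER_CMD_GLOBAL} and which omits error
checking:
\begin{VerbatimN}
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sketch: system-wide memory barrier. */
static int global_memory_barrier(void)
{
	return syscall(__NR_membarrier,
		       MEMBARRIER_CMD_GLOBAL, 0);
}
\end{VerbatimN}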
\begin{listing}
\begin{fcvlabel}[ln:memorder:Safe Insert and Lock-Free Search]
\begin{VerbatimL}[commandchars=\\\[\]]
void insert(long key, long data)
{
struct el *p;
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_lock(&mutex);
p->next = head.next;
p->key = key;
p->data = data;
smp_store_release(&head.next, p);
spin_unlock(&mutex);
}
struct el *search(long searchkey)
{
struct el *p;
p = rcu_dereference(head.next); \lnlbl[deref1]
while (p != &head) {
if (p->key == searchkey) {
return (p);
}
p = rcu_dereference(p->next); \lnlbl[deref2]
};
return (NULL);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Safe Insert and Lock-Free Search}
\label{lst:memorder:Safe Insert and Lock-Free Search}
\end{listing}
The Linux memory-barrier primitives took their names from the Alpha
instructions, so \co{smp_mb()} is \co{mb}, \co{smp_rmb()} is \co{rmb},
and \co{smp_wmb()} is \co{wmb}.
Alpha is the only CPU whose \co{READ_ONCE()} includes an \co{smp_mb()}.
\QuickQuizSeries{%
\QuickQuizB{
Why does Alpha's \co{READ_ONCE()} include an
\co{mb()} rather than \co{rmb()}?
}\QuickQuizAnswerB{
Alpha has only \co{mb} and \co{wmb} instructions,
so \co{smp_rmb()} would be implemented by the Alpha \co{mb}
instruction in either case.
In addition, at the time that the Linux kernel started relying on
dependency ordering, it was not clear that Alpha ordered dependent
stores, and thus \co{smp_mb()} was the safe choice.
However, given the aforementioned v5.9 changes to \co{READ_ONCE()}
and a few of Alpha's atomic read-modify-write operations,
no Linux-kernel core code need concern itself with DEC Alpha,
thus greatly reducing Paul E.~McKenney's incentive to remove
Alpha support from the kernel.
}\QuickQuizEndB
%
\QuickQuizE{
Isn't DEC Alpha significant as having the weakest possible
memory ordering?
}\QuickQuizAnswerE{
Although DEC Alpha does take considerable flak, it does avoid
reordering reads from the same CPU to the same variable.
It also avoids the out-of-thin-air problem that plagues
the Java and C11 memory
models~\cite{Boehm:2014:OGA:2618128.2618134,conf/esop/BattyMNPS15,MarkBatty2013OOTA-WorkingNote,HansBoehm2020ConcurrentUB,DavidGoldblatt2019NoElegantOOTAfix,AlanJeffrey2014JavaDRF,PaulEMcKenney2020RelaxedGuideRelaxed,PaulEMcKenney2016OOTA,Sevcik:2011:SOS:1993316.1993534,Vafeiadis:2015:CCO:2775051.2676995}.
}\QuickQuizEndE
}
For more on Alpha, see its reference manual~\cite{ALPHA2002}.
\subsection{\ARMv7-A/R}
\label{sec:memorder:ARMv7-A/R}
The \ARM\ family of CPUs is popular in deep embedded applications,
particularly for power-constrained microcontrollers.
Its \IXalthmr{memory model}{Armv7}{memory model} is similar to that of \Power{}
(see \cref{sec:memorder:POWER / PowerPC}), but \ARM\ uses a
different set of memory-barrier instructions~\cite{ARMv7A:2010}:
\begin{description}
\item [\tco{DMB}] (data memory barrier) causes the specified type of
operations to \emph{appear} to have completed before any
subsequent operations of the same type.
The ``type'' of operations can be all operations or can be
restricted to only writes (similar to the Alpha \co{wmb}
and the \Power{} \co{eieio} instructions).
In addition, \ARM\ allows \IX{cache coherence} to have one of three
scopes:
single processor, a subset of the processors
(``inner''), and global (``outer'').
\item [\tco{DSB}] (data synchronization barrier) causes the specified
type of operations to actually complete before any subsequent
operations (of any type) are executed.
The ``type'' of operations is the same as that of \co{DMB}.
The \co{DSB} instruction was called \co{DWB} (drain write buffer
or data write barrier, your choice) in early versions of the
\ARM\ architecture.
\item [\tco{ISB}] (instruction synchronization barrier) flushes the CPU
pipeline, so that all instructions following the \co{ISB}
are fetched only after the \co{ISB} completes.
For example, if you are writing a self-modifying program
(such as a JIT), you should execute an \co{ISB} between
generating the code and executing it.
\end{description}
None of these instructions exactly match the semantics of Linux's
\co{rmb()} primitive, which must therefore be implemented as a full
\co{DMB}.
The \co{DMB} and \co{DSB} instructions have a recursive definition
of accesses ordered before and after the barrier, which has an effect
similar to that of \Power{}'s cumulativity, both of which are
stronger than \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}'s
cumulativity described in
\cref{sec:memorder:Cumulativity}.
\ARM\ also implements \IXalth{control dependencies}{control}{dependency},
so that if a conditional
branch depends on a load, then any store executed after that conditional
branch will be ordered after the load.
However, loads following the conditional branch will \emph{not}
be guaranteed to be ordered unless there is an \co{ISB}
instruction between the branch and the load.
Consider the following example:
\begin{fcvlabel}[ln:memorder:ARM:load-store control dependency]
\begin{VerbatimN}[commandchars=\\\[\]]
r1 = x; \lnlbl[x]
if (r1 == 0) \lnlbl[if]
nop(); \lnlbl[nop]
y = 1; \lnlbl[y]
r2 = z; \lnlbl[z1]
ISB(); \lnlbl[isb]
r3 = z; \lnlbl[z2]
\end{VerbatimN}
\end{fcvlabel}
\begin{fcvref}[ln:memorder:ARM:load-store control dependency]
In this example, load-store control dependency ordering causes
the load from \co{x} on \clnref{x} to be ordered before the store to
\co{y} on \clnref{y}.
However, \ARM\ does not respect load-load control dependencies, so that
the load on \clnref{x} might well happen \emph{after} the
load on \clnref{z1}.
On the other hand, the combination of the conditional branch on \clnref{if}
and the \co{ISB} instruction on \clnref{isb} ensures that
the load on \clnref{z2} happens after the load on \clnref{x}.
Note that inserting an additional \co{ISB} instruction somewhere between
\clnref{if,z1} would enforce ordering between \clnref{x,z1}.
\end{fcvref}
\subsection{\ARMv8}
\label{sec:memorder:ARMv8}
\begin{figure}
\centering
\resizebox{2in}{!}{\includegraphics{cartoons/r-2014-LDLAR}}
\caption{Half Memory Barrier}
\ContributedBy{fig:memorder:Half Memory Barrier}{Melissa Brossard}
\end{figure}
\ARM's \ARMv8 CPU family~\cite{ARMv8A:2017}
includes 64-bit capabilities,
in contrast to their 32-bit-only CPU described in
\cref{sec:memorder:ARMv7-A/R}.
\IXalthmr{\ARMv8's memory model}{Armv8}{memory model}
closely resembles its \ARMv7 counterpart,
but adds load-acquire (\co{LDLARB}, \co{LDLARH}, and \co{LDLAR})
and store-release (\co{STLLRB}, \co{STLLRH}, and \co{STLLR})
instructions.
These instructions act as ``half memory barriers'', so that
\ARMv8 CPUs can reorder previous accesses with a later \co{LDLAR}
instruction, but are prohibited from reordering an earlier \co{LDLAR}
instruction with later accesses, as fancifully depicted in
\cref{fig:memorder:Half Memory Barrier}.
Similarly, \ARMv8 CPUs can reorder an earlier \co{STLLR} instruction with
a subsequent access, but are prohibited from reordering
previous accesses with a later \co{STLLR} instruction.
As one might expect, this means that these instructions directly support
the C11 notion of load-acquire and store-release.
However, \ARMv8 goes well beyond the C11 memory model by mandating that
the combination of a store-release and load-acquire act as a full
barrier under certain circumstances.
For example, in \ARMv8, given a store followed by a store-release followed
by a load-acquire followed by a load, all to different variables and all from
a single CPU, all CPUs
would agree that the initial store preceded the final load.
Interestingly enough, most TSO architectures (including x86 and the
mainframe) do not make this guarantee, as the two loads could be
reordered before the two stores.
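Expressed in Linux-kernel C with hypothetical variables, the sequence
in question looks something like the following.
On \ARMv8, the release-acquire pair in the middle acts as a full
barrier for this sequence, but portable code must not rely on that
property:
\begin{VerbatimN}
WRITE_ONCE(a, 1);           /* Plain store.   */
smp_store_release(&b, 1);   /* Store-release. */
r1 = smp_load_acquire(&c);  /* Load-acquire.  */
r2 = READ_ONCE(d);          /* Plain load.    */
\end{VerbatimN}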
\ARMv8 is one of only two architectures that need the
\co{smp_mb__after_spinlock()} primitive to be a full barrier,
due to its relatively weak lock-acquisition implementation in
the Linux kernel.
\ARMv8 also has the distinction of being the first CPU whose vendor publicly
defined its memory ordering with an executable formal model~\cite{ARMv8A:2017}.
\subsection{Itanium}
\label{sec:memorder:Itanium}
Itanium offers a \IXalth{weak consistency}{weak}{memory consistency}
\IXalthmr{model}{Itanium}{memory model}, so that in absence of explicit
memory-barrier instructions or dependencies, Itanium is within its rights
to arbitrarily reorder memory references~\cite{IntelItanium02v2}.
Itanium has a memory-fence instruction named \co{mf}, but also has
``half-memory fence'' modifiers to loads, stores, and to some of its atomic
instructions~\cite{IntelItanium02v3}.
The \co{acq} modifier prevents subsequent memory-reference instructions
from being reordered before the \co{acq}, but permits
prior memory-reference instructions to be reordered after the \co{acq},
similar to the \ARMv8 load-acquire instructions.
Similarly, the \co{rel} modifier prevents prior memory-reference
instructions from being reordered after the \co{rel}, but allows
subsequent memory-reference instructions to be reordered before
the \co{rel}.
These half-memory fences are useful for critical sections, since
it is safe to push operations into a critical section, but can be
fatal to allow them to bleed out.
However, as one of the few CPUs with this property, Itanium at one
time defined Linux's semantics of memory ordering associated with lock
acquisition and release.\footnote{
PowerPC is now the architecture with this dubious privilege.}
Oddly enough, actual Itanium hardware is rumored to implement
both load-acquire and store-release instructions as full barriers.
Nevertheless, Itanium was the first mainstream CPU to introduce the concept
(if not the reality) of load-acquire and store-release into its
instruction set.
\QuickQuiz{
Given that hardware can have a half memory barrier, why don't
locking primitives allow the compiler to move memory-reference
instructions into lock-based critical sections?
}\QuickQuizAnswer{
In fact, as we saw in \cref{sec:memorder:ARMv8} and will
see in \cref{sec:memorder:POWER / PowerPC}, hardware really does
implement partial memory-ordering instructions and it also turns
out that these really are used to construct locking primitives.
However, these locking primitives use full compiler barriers,
thus preventing the compiler from reordering memory-reference
instructions both out of and into the corresponding critical
section.
\begin{listing}
\begin{fcvlabel}[ln:memorder:synchronize-rcu]
\begin{VerbatimL}[commandchars=\@\[\]]
static inline int rcu_gp_ongoing(unsigned long *ctr)
{
unsigned long v;
v = LOAD_SHARED(*ctr);@lnlbl[load]
return v && (v != rcu_gp_ctr);
}
static void update_counter_and_wait(void)
{
struct rcu_reader *index;
STORE_SHARED(rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
barrier();
list_for_each_entry(index, &registry, node) {@lnlbl[loop]
while (rcu_gp_ongoing(&index->ctr))@lnlbl[call2]
msleep(10);
}
}
void synchronize_rcu(void)
{
unsigned long was_online;
was_online = rcu_reader.ctr;
smp_mb();
if (was_online)@lnlbl[if]
STORE_SHARED(rcu_reader.ctr, 0);@lnlbl[store]
mutex_lock(&rcu_gp_lock);@lnlbl[acqmutex]
update_counter_and_wait();@lnlbl[call1]
mutex_unlock(&rcu_gp_lock);
if (was_online)
STORE_SHARED(rcu_reader.ctr, LOAD_SHARED(rcu_gp_ctr));
smp_mb();
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Userspace RCU Code Reordering}
\label{lst:memorder:Userspace RCU Code Reordering}
\end{listing}
To see why the compiler is forbidden from doing reordering that
is permitted by hardware, consider the following sample code
in \cref{lst:memorder:Userspace RCU Code Reordering}.
This code is based on the userspace RCU update-side
code~\cite[Supplementary Materials Figure 5]{MathieuDesnoyers2012URCU}.
\begin{fcvref}[ln:memorder:synchronize-rcu]
Suppose that the compiler reordered \clnref{if,store} into
the critical section starting at \clnref{acqmutex}.
Now suppose that two updaters start executing \co{synchronize_rcu()}
at about the same time.
Then consider the following sequence of events:
\begin{enumerate}
\item CPU~0 acquires the lock at \clnref{acqmutex}.
\item \Clnref{if} determines that CPU~0 was online, so it clears
its own counter at \clnref{store}.
(Recall that \clnref{if,store} have been reordered by the
compiler to follow \clnref{acqmutex}).
\item CPU~0 invokes \co{update_counter_and_wait()} from
\clnref{call1}.
\item CPU~0 invokes \co{rcu_gp_ongoing()} on itself at
\clnref{call2}, and \clnref{load} sees that CPU~0 is
in a quiescent state.
Control therefore returns to \co{update_counter_and_wait()},
and \clnref{loop} advances to CPU~1.
\item CPU~1 invokes \co{synchronize_rcu()}, but because CPU~0
already holds the lock, CPU~1 blocks waiting for this
lock to become available.
Because the compiler reordered \clnref{if,store} to follow
\clnref{acqmutex}, CPU~1 does not clear its own counter,
despite having been online.
\item CPU~0 invokes \co{rcu_gp_ongoing()} on CPU~1 at
\clnref{call2}, and \clnref{load} sees that CPU~1 is
not in a quiescent state.
The \co{while} loop at \clnref{call2} therefore never
exits.
\end{enumerate}
So the compiler's reordering results in a deadlock.
In contrast, hardware reordering is temporary, so that CPU~1
might undertake its first attempt to acquire the mutex on
\clnref{acqmutex} before executing \clnref{if,store}, but it
will eventually execute \clnref{if,store}.
Because hardware reordering only results in a short delay, it
can be tolerated.
On the other hand, because compiler reordering results in a
deadlock, it must be prohibited.
Some research efforts have used hardware transactional memory
to allow compilers to safely reorder more aggressively, but
the overhead of hardware transactions has thus far made
such optimizations unattractive.
% @@@ Citation for compilers use of HTM in this manner?
\end{fcvref}
}\QuickQuizEnd
The Itanium \co{mf} instruction is used for the \co{smp_rmb()},
\co{smp_mb()}, and \co{smp_wmb()} primitives in the Linux kernel.
Despite persistent rumors to the contrary, the \qco{mf} mnemonic stands
for ``memory fence''.
Itanium also offers a global total order for release operations,
including the \co{mf} instruction.
This provides the notion of transitivity, where if a given code fragment
sees a given access as having happened, any later code fragment will
also see that earlier access as having happened.
Assuming, that is, that all the code fragments involved correctly use
memory barriers.
Finally, Itanium is the only architecture supporting the Linux kernel
that can reorder normal loads to the same variable.
The Linux kernel avoids this issue because \co{READ_ONCE()} emits
a \co{volatile} load, which is compiled as a \co{ld,acq} instruction,
which forces ordering of all \co{READ_ONCE()} invocations by a given
CPU, including those to the same variable.
\subsection{MIPS}
The \IXhmr{MIPS}{memory model}~\cite[page~479]{MIPSvII-A-2016}
appears to resemble that of \ARM, Itanium, and \Power{},
being weakly ordered by default, but respecting dependencies.
MIPS has a wide variety of memory-barrier instructions, but ties them
not to hardware considerations, but rather to the use cases provided
by the Linux kernel and the C++11 standard~\cite{RichardSmith2019N4800}
in a manner similar to the \ARMv8 additions:
\begin{description}[style=nextline]
\item[\tco{SYNC}]
Full barrier for a number of hardware operations in addition
to memory references, which is used to implement the v4.13
Linux kernel's \co{smp_mb()} for OCTEON systems.
\item[\tco{SYNC_WMB}]
Write memory barrier, which can be used on OCTEON systems
to implement the
\co{smp_wmb()} primitive in the v4.13 Linux kernel via the
\co{syncw} mnemonic.
Other systems use plain \co{sync}.
\item[\tco{SYNC_MB}]
Full memory barrier, but only for memory operations.
This may be used to implement the
C++ \co{atomic_thread_fence(memory_order_seq_cst)}.
\item[\tco{SYNC_ACQUIRE}]
Acquire memory barrier, which could be used to implement
C++'s \co{atomic_thread_fence(memory_order_acquire)}.
In theory, it could also be used to implement the v4.13 Linux-kernel
\co{smp_load_acquire()} primitive, but in practice
\co{sync} is used instead.
\item[\tco{SYNC_RELEASE}]
Release memory barrier, which may be used to implement
C++'s \co{atomic_thread_fence(memory_order_release)}.
In theory, it could also be used to implement the v4.13 Linux-kernel
\co{smp_store_release()} primitive, but in practice
\co{sync} is used instead.
\item[\tco{SYNC_RMB}]
Read memory barrier, which could in theory be used to implement the
\co{smp_rmb()} primitive in the Linux kernel, except that current
MIPS implementations supported by the v4.13 Linux kernel do not
need an explicit instruction to force ordering.
Therefore, \co{smp_rmb()} instead simply constrains the compiler.
\item[\tco{SYNCI}]
Instruction-cache synchronization, which is used in conjunction with
other instructions to allow self-modifying code, such as that produced
by just-in-time (JIT) compilers.
\end{description}
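As a reminder of what these fence-based C++11 orderings provide, the
following hypothetical C11 fragment uses a release fence and an acquire
fence to pass a payload using relaxed atomic accesses, with the variable
and function names being purely illustrative:
\begin{VerbatimN}
#include <stdatomic.h>

int payload;
_Atomic int ready;

void producer(void)
{
	payload = 42;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1,
			      memory_order_relaxed);
}

int consumer(void)
{
	while (!atomic_load_explicit(&ready,
				     memory_order_relaxed))
		continue;
	atomic_thread_fence(memory_order_acquire);
	return payload;
}
\end{VerbatimN}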
Informal discussions with MIPS architects indicate that MIPS has a
definition of transitivity or cumulativity similar to that of
\ARM\ and \Power{}\@.
However, it appears that different MIPS implementations can have
different memory-ordering properties, so it is important to consult
the documentation for the specific MIPS implementation you are using.
\subsection{\Power{} / PowerPC}
\label{sec:memorder:POWER / PowerPC}
The \Power{} and PowerPC CPU families have a wide variety of memory-barrier
instructions~\cite{PowerPC94,MichaelLyons05a}:
\begin{description}
\item [\tco{sync}] causes all preceding operations to {\em appear to have}
completed before any subsequent operations are started.
This instruction is therefore quite expensive.
\item [\tco{lwsync}] (lightweight sync) orders loads with respect to
subsequent loads and stores, and also orders stores.
However, it does {\em not} order stores with respect to subsequent
loads.
The \co{lwsync} instruction may be used to implement
load-acquire and store-release operations.
		Interestingly enough, the \co{lwsync} instruction enforces
		the same within-CPU ordering as do x86, z~Systems, and,
		coincidentally, SPARC TSO\@.
However, placing the \co{lwsync} instruction between each
pair of memory-reference instructions will \emph{not}
result in x86, z~Systems, or SPARC TSO memory ordering.
On these other systems, if a pair of CPUs independently execute
stores to different variables, all other CPUs will agree on the
order of these stores.
Not so on PowerPC, even with an \co{lwsync} instruction between each
pair of memory-reference instructions, because PowerPC is
\IXalthy{non-multicopy atomic}{non-}{multi-copy atomic}.
\item [\tco{eieio}] (enforce in-order execution of I/O, in case you
were wondering) causes all preceding cacheable stores to appear
to have completed before all subsequent stores.
However, stores to cacheable memory are ordered separately from
stores to non-cacheable memory, which means that \co{eieio}
will not force an MMIO store to precede a spinlock release.
This instruction may well be unique in having a five-vowel mnemonic.
\item [\tco{isync}] forces all preceding instructions to appear to have
completed before any subsequent instructions start execution.
This means that the preceding instructions must have progressed
far enough that any traps they might generate have either happened
or are guaranteed not to happen, and that any side-effects of
these instructions (for example, page-table changes) are seen by the
subsequent instructions.
However, it does \emph{not} force all memory references to be
ordered, only the actual execution of the instruction itself.
Thus, the loads might return old still-cached values and the
\co{isync} instruction does not force values previously stored
to be flushed from the store buffers.
\end{description}
Unfortunately, none of these instructions line up exactly with Linux's
\co{wmb()} primitive, which requires \emph{all} stores to be ordered,
but does not require the other high-overhead actions of the \co{sync}
instruction.
The \co{rmb()} primitive doesn't have a matching light-weight instruction
either.
But there is no choice:
{ppc64} versions of \co{wmb()}, \co{rmb()}, and \co{mb()} are defined
to be the heavyweight \co{sync} instruction.
However, Linux's \co{smp_wmb()} primitive is never used for MMIO
(since a driver must carefully order MMIOs in UP as well as
SMP kernels, after all), so it is defined to be the lighter weight
\co{eieio} or \co{lwsync} instruction~\cite{PaulEMcKenney2016LinuxKernelMMIO}.
The \co{smp_mb()} primitive is also defined to be the \co{sync}
instruction, while \co{smp_rmb()} is defined to be the lighter-weight
\co{lwsync} instruction.
\Power{} features ``cumulativity'', which can be used to obtain
transitivity.
When used properly, any code seeing the results of an earlier
code fragment will also see the accesses that this earlier code
fragment itself saw.
Much more detail is available from
McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}.
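Cumulativity can be illustrated by the classic write-to-read causality
(WRC) pattern, sketched below in Linux-kernel C with hypothetical
variables.
If \co{r1} and \co{r2} both end up equal to 1, cumulativity guarantees
that \co{r3} also equals 1, because the release store on CPU~1
propagates CPU~0's store along with its own:
\begin{VerbatimN}
/* CPU 0 */
WRITE_ONCE(x, 1);

/* CPU 1 */
r1 = READ_ONCE(x);
smp_store_release(&y, 1);

/* CPU 2 */
r2 = smp_load_acquire(&y);
r3 = READ_ONCE(x);
\end{VerbatimN}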
\Power{} respects control dependencies in much the same way that \ARM\
does, with the exception that the \Power{} \co{isync} instruction
is substituted for the \ARM\ \co{ISB} instruction.
Like \ARMv8, \Power{} requires \co{smp_mb__after_spinlock()} to be
a full memory barrier.
In addition, \Power{} is the only architecture requiring
\co{smp_mb__after_unlock_lock()} to be a full memory barrier.
In both cases, this is because of the weak ordering properties
of \Power{}'s locking primitives, due to the use of the \co{lwsync}
instruction to provide ordering for both acquisition and release.
Many members of the \Power{} architecture have incoherent instruction
caches, so that a store to memory will not necessarily be reflected
in the instruction cache.
Thankfully, few people write self-modifying code these days, but JITs
and compilers do it all the time.
Furthermore, recompiling a recently run program looks just like
self-modifying code from the CPU's viewpoint.
The \co{icbi} instruction (instruction cache block invalidate)
invalidates a specified cache line from
the instruction cache, and may be used in these situations.
\subsection{SPARC TSO}
Although SPARC's TSO (total-store order) is used by both Linux and
Solaris, the architecture also defines PSO (partial store order) and RMO
(relaxed-memory order).
Any program that runs in RMO will also run in either PSO or TSO, and similarly,
a program that runs in PSO will also run in TSO\@.
Moving a shared-memory parallel program in the other direction may
require careful insertion of memory barriers.
Although SPARC's PSO and RMO modes are not used much these days, they
did give rise to a very flexible memory-barrier instruction~\cite{SPARC94}
that permits fine-grained control of ordering:
\begin{description}
\item [\tco{StoreStore}] orders preceding stores before subsequent stores.
(This option is used by the Linux \co{smp_wmb()} primitive.)
\item [\tco{LoadStore}] orders preceding loads before subsequent stores.
\item [\tco{StoreLoad}] orders preceding stores before subsequent loads.
\item [\tco{LoadLoad}] orders preceding loads before subsequent loads.
(This option is used by the Linux \co{smp_rmb()} primitive.)
\item [\tco{Sync}] fully completes all preceding operations before starting
any subsequent operations.
\item [\tco{MemIssue}] completes preceding memory operations before subsequent
memory operations, important for some instances of memory-mapped
I/O.
\item [\tco{Lookaside}] does the same as MemIssue,
but only applies to preceding stores
and subsequent loads, and even then only for stores and loads that
access the same memory location.
\end{description}
So, why is \qco{membar #MemIssue} needed?
Because a \qco{membar #StoreLoad} could permit a subsequent
load to get its value from a store buffer, which would be
disastrous if the write was to an MMIO register that induced side effects
on the value to be read.
In contrast, \qco{membar #MemIssue} would wait until the store buffers
were flushed before permitting the loads to execute,
thereby ensuring that the load actually gets its value from the MMIO register.
Drivers could instead use \qco{membar #Sync}, but the lighter-weight
\qco{membar #MemIssue} is preferred in cases where the additional functionality
of the more-expensive \qco{membar #Sync} is not required.
The \qco{membar #Lookaside} is a lighter-weight version of
\qco{membar #MemIssue}, which is useful when writing to a given MMIO register
affects the value that will next be read from that register.
However, the heavier-weight \qco{membar #MemIssue} must be used when
a write to a given MMIO register affects the value that will next be
read from {\em some other} MMIO register.
SPARC requires a \co{flush} instruction be used between the time that
the instruction stream is modified and the time that any of these
instructions are executed~\cite{SPARC94}.
This is needed to flush any prior value for that location from
the SPARC's instruction cache.
Note that \co{flush} takes an address, and will flush only that address
from the instruction cache.
On SMP systems, all CPUs' caches are flushed, but there is no
convenient way to determine when the off-CPU flushes complete,
though there is a reference to an implementation note.
But again, the Linux kernel runs SPARC in TSO mode, so
all of the above \co{membar} variants are strictly of historical
interest.
In particular, the \co{smp_mb()} primitive only needs to use \co{#StoreLoad}
because the other three reorderings are prohibited by TSO\@.
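For example, an \co{smp_mb()} for SPARC running in TSO mode might be
sketched as the following inline assembly.
This is an illustrative sketch rather than the Linux kernel's actual
definition, with the \qco{memory} clobber also acting as a compiler
barrier:
\begin{VerbatimN}
/* Illustrative sketch only. */
static inline void sketch_smp_mb(void)
{
	__asm__ __volatile__("membar #StoreLoad"
			     : : : "memory");
}
\end{VerbatimN}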
\subsection{x86}
Historically, the x86 CPUs provided ``process ordering'' so that all CPUs
agreed on the order of a given CPU's writes to memory.
This allowed the \co{smp_wmb()}
primitive to be a no-op for the CPU~\cite{IntelXeonV3-96a}.
Of course, a compiler directive was also required to prevent optimizations
that would reorder across the \co{smp_wmb()} primitive.
In ancient times, certain x86 CPUs gave no ordering guarantees for loads, so
the \co{smp_mb()} and \co{smp_rmb()} primitives expanded to \co{lock;addl}.
This atomic instruction acts as a barrier to both loads and stores.
But those were ancient times.
More recently, Intel has published a
\IXalthmr{memory model}{x86}{memory model} for
x86~\cite{Intelx86MemoryOrdering2007}.
It turns out that Intel's modern CPUs enforce tighter ordering than was
claimed in the previous specifications, so this model simply mandates
this modern behavior.
Even more recently, Intel published an updated memory model for
x86~\cite[Section 8.2]{Intel64IA32v3A2011}, which mandates a total global order
for stores, although individual CPUs are still permitted to see their
own stores as having happened earlier than this total global order
would indicate.
This exception to the total ordering is needed to allow important
hardware optimizations involving store buffers.
In addition, x86 provides
\IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic}, for example,
so that if CPU~0 sees a store by CPU~1, then CPU~0 is guaranteed to see
all stores that CPU~1 saw prior to its store.
Software may use atomic operations to override these hardware optimizations,
which is one reason that atomic operations tend to be more expensive
than their non-atomic counterparts.
It is also important to note that atomic instructions operating
on a given memory location should all be of the same
size~\cite[Section 8.1.2.2]{Intel64IA32v3A2016}.
For example, if you write a program where one CPU atomically increments
a byte while another CPU executes a 4-byte atomic increment on
that same location, you are on your own.
Some SSE instructions are weakly ordered (\co{clflush}
and non-temporal move instructions~\cite{IntelXeonV2b-96a}).
Code that uses these non-temporal move instructions
can use \co{mfence} for \co{mb()},
\co{lfence} for \co{rmb()}, and \co{sfence} for \co{wmb()}.\footnote{
\co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()} don't suffice
for ordering non-temporal move instructions since Linux v4.15.}
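For example, code publishing a buffer filled by non-temporal stores
might be sketched as follows using the SSE2 intrinsics, with the
function and variable names being purely hypothetical:
\begin{VerbatimN}
#include <emmintrin.h>

/* Hypothetical non-temporal buffer publication. */
void publish(int *buf, int n, volatile int *ready)
{
	int i;

	for (i = 0; i < n; i++)
		_mm_stream_si32(&buf[i], i);
	_mm_sfence();  /* Order stores above before flag. */
	*ready = 1;
}
\end{VerbatimN}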
A few older variants of the x86 CPU have a mode bit that enables out-of-order
stores, and for these CPUs, \co{smp_wmb()} must also be defined to
be \co{lock;addl}.
A 2017 kernel commit by Michael S.~Tsirkin replaced \co{mfence} with
\co{lock;addl} in \co{smp_mb()}, achieving a 60 percent performance
boost~\cite{Tsirkin2017}.
The change used a 4-byte negative offset from \co{SP} to avoid
slowness due to false data dependencies, instead of directly
accessing memory pointed to by \co{SP}.
\co{clflush} users still need to use \co{mfence} for ordering.
Therefore, they were converted to use \co{mb()}, which uses \co{mfence}
as before, instead of \co{smp_mb()}.
Although newer x86 implementations accommodate self-modifying code
without any special instructions, to be fully compatible with
past and potential future x86 implementations, a given CPU must
execute a jump instruction or a serializing instruction (e.g., \co{cpuid})
between modifying the code and executing
it~\cite[Section 8.1.3]{Intel64IA32v3A2011}.
\subsection{z Systems}
The z~Systems machines make up the IBM mainframe family, previously
known as the 360, 370, 390 and zSeries~\cite{IBMzSeries04a}.
Parallelism came late to z~Systems, but given that these mainframes first
shipped in the mid 1960s, this is not saying much.
The \qco{bcr 15,0} instruction is used for the Linux \co{smp_mb()} primitive,
but compiler constraints suffice for both the
\co{smp_rmb()} and \co{smp_wmb()} primitives.
The z~Systems architecture also has strong memory-ordering semantics, as shown in
\cref{tab:memorder:Summary of Memory Ordering}.
In particular, all CPUs will agree on the order of unrelated stores from
different CPUs, that is, the z~Systems CPU family is
\IXalt{fully multicopy atomic}{multi-copy atomic},
and is the only commercially available system with this property.
As with most CPUs, the z~Systems architecture does not guarantee a
cache-coherent instruction stream, hence,
self-modifying code must execute a serializing instruction between updating
the instructions and executing them.
That said, many actual z~Systems machines do in fact accommodate self-modifying
code without serializing instructions.
The z~Systems instruction set provides a large set of serializing instructions,
including compare-and-swap, some types of branches (for example, the
aforementioned \qco{bcr 15,0} instruction), and test-and-set.
\subsection{Hardware Specifics:
Discussion}
\label{sec:memorder:Hardware Specifics: Discussion}
There is considerable variation among these CPU families, and this section
only scratched the surface of a few families that are either heavily used
or historically significant.
Those wishing more detail are invited to consult the reference manuals.
However, perhaps the biggest benefit of the Linux-kernel memory model is
that you can ignore these details when writing architecture-independent
Linux-kernel code.
\QuickQuizAnswersChp{qqzmemorder}