% memorder/memorder.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{qqzmemorder}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{the Linux kernel}{Howells2009membartxt}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{Linux Weekly News}{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2}
\OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{ASPLOS '18}{Alglave:2018:FSC:3173162.3177156}
%
\Epigraph{The art of progress is to preserve order amid change and to preserve change amid order.}{Alfred North Whitehead}
Causality and sequencing are deeply intuitive, and hackers often
have a strong grasp of these concepts.
These intuitions can be quite helpful when writing, analyzing, and
debugging not only sequential code, but also parallel code that makes
use of standard mutual-exclusion mechanisms such as locking.
Unfortunately, these intuitions break down completely in complex
concurrent code, such as that in the Linux kernel, which often uses
weakly ordered atomic operations and memory barriers.
One example of such code implements the standard mutual-exclusion
mechanisms themselves, while another example implements fast
paths that use weaker synchronization.
Insults to intuition notwithstanding, some argue that weakness is a
virtue~\cite{JadeAlglave2013-WeaknessIsVirtue}.
Vice or virtue, this chapter will help you gain an understanding of
memory ordering that, with practice, will be sufficient to implement
synchronization primitives and performance-critical fast paths.
First, \cref{sec:memorder:Memory-Model Intuitions}
provides some reliable intuitions and useful rules of thumb, which can
often provide a useful alternative to learning the full Linux-kernel
memory model (LKMM)\@.
However, those working on fast paths or concurrency primitives will
need the additional detail provided by the following sections.
\Cref{sec:memorder:Ordering: Why and How?}
will demonstrate that real computer systems can reorder memory references,
give some reasons why they do so, and provide some information on how
to prevent undesired reordering.
\Cref{sec:memorder:Tricks and Traps,%
sec:memorder:Compile-Time Consternation}
will cover the types of pain that hardware and compilers, respectively,
can inflict on unwary parallel programmers.
\Cref{sec:memorder:Higher-Level Primitives}
gives an overview of the benefits of modeling memory ordering at
higher levels of abstraction.
Finally,
\cref{sec:memorder:Hardware Specifics}
follows up with more detail on a few representative hardware platforms.
\QuickQuiz{
This chapter has been rewritten since the first edition,
and heavily edited since the second edition.
	Did memory ordering change all \emph{that} much since 2014,
	let alone since 2021?
}\QuickQuizAnswer{
The earlier memory-ordering section had its roots in a pair of
Linux Journal articles~\cite{PaulMcKenney2005i,PaulMcKenney2005j}
dating back to 2005.
Since then, the C and C++ memory models~\cite{PeteBecker2011N3242}
have been formalized
(and critiqued~\cite{MarkBatty2013OOTA-WorkingNote,Boehm:2014:OGA:2618128.2618134,Vafeiadis:2015:CCO:2775051.2676995,conf/esop/BattyMNPS15,Lahav:2017:RSC:3140587.3062352,OlivierGiroux2017-P0668R1}),
executable formal memory models for computer systems have become the
norm~\cite{Maranget2012TutorialARMPower,PaulEMcKenney2011ppcmem,test6-pdf,JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306,JadeAlglave2013-cav,Alglave:2014:HCM:2594291.2594347,PaulEMcKenney2014weakaxiom,Flur:2017:MCA:3093333.3009839,ARMv8A:2017},
and there is even a memory model for the Linux
kernel~\cite{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2,Alglave:2018:FSC:3173162.3177156},
along with a paper describing differences between the C11 and
Linux memory models~\cite{PaulEMcKenney2016P0124R6-LKMM}.
The \IXacrf{kcsan}~\cite{MarcoElver2020FearDataRaceDetector1,MarcoElver2020FearDataRaceDetector2},
based in part on
RacerD~\cite{SamBlackshear2018RacerD}
and implementing \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model},
has also been added to the Linux kernel
and is now heavily used.
Finally, there are now better ways of describing the Linux-kernel
memory model (LKMM).
Given all this progress, substantial change was required.
}\QuickQuizEnd
\section{Memory-Model Intuitions}
\label{sec:memorder:Memory-Model Intuitions}
%
\epigraph{Almost all people are intelligent.
It is method that they lack.}
{F. W. Nichol}
This section is for people who would like to avoid learning all the
details of the \IXalthmr{Linux-kernel memory model (\acr{lkmm})}
{Linux kernel}{memory model}, the better
to get on with their concurrency lives.
The good news is that learning a very small fraction of LKMM can get
you most of its benefits.
% @@@ \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
% @@@ and \cref{sec:memorder:Basic Rules of Thumb},
% @@@ summarizing the intervening discussion with some appeals to transitive
% @@@ intuitions and with more sophisticated rules of thumb.
But first, it is necessary to understand that useful ordering is a
property not of any single component of the system, but of the system
as a whole.
For example, it does not help for the hardware to preserve ordering if
the compiler does not also do so.
Nor does it help for both the hardware and compiler to provide ways
to preserve ordering if developers fail to use them.
The Linux kernel adds architecture-specific assembly-language primitives
to the mix, which must also preserve the required orderings.
In short, all the components of the system must work together to provide
any needed ordering.
Next, it is necessary to understand the temporal and non-temporal
nature of communication from one thread to another when using memory
as the communications medium, as will be discussed in detail in
\cref{sec:memorder:Multicopy Atomicity}.
The key point is that although loads and stores are conceptually simple,
on real multicore hardware significant periods of time are required for
their effects to become visible to all other threads.
Strange though it might seem, this means that a given store's value
might not be returned by a load that happens later in wall-clock time,
and it also means that a given store's value might be overwritten by a
store that happens earlier in wall-clock time.
The simple and intuitive case occurs when one thread loads a value that
some other thread stored.
This straightforward cause-and-effect case exhibits temporal behavior, so
that the software can safely assume that the store instruction completed
before the load instruction started.
In real life, the load instruction might well have started quite some
time before the store instruction did, but all modern hardware must
carefully hide such cases from the software.
Software will thus see the expected temporal cause-and-effect
behavior when one thread loads a value that some other thread stores,
as will be discussed in \cref{sec:memorder:Happens-Before}.
This intuitive temporal behavior provides the basis for the next section's
transitive intuitions.
\subsection{Transitive Intuitions}
\label{sec:memorder:Transitive Intuitions}
This section lists intuitions regarding single threads or variables,
locking, release-acquire chains, RCU, and fully ordered code.
\subsubsection{Singular Intuitive Bliss}
\label{sec:memorder:Singular Intuitive Bliss}
A program that has only one variable or only one thread will see
all accesses in order.
There is quite a bit of code that can attain adequate performance
when running single-threaded on modern computer systems, but
this book is primarily about software that needs multiple CPUs.
On, then, to the next section.
\subsubsection{Locking Intuitions}
\label{sec:memorder:Locking Intuitions}
Another transitive intuition involves that much-maligned workhorse,
locking, as will be described in more detail in
\cref{sec:memorder:Locking},
to say nothing of
\cref{chp:Locking}.
This section contains a graphical description followed by a verbal
description.
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/locktrans}}
\caption{Locking Intuitions}
\label{fig:memorder:Locking Intuitions}
\end{figure*}
The graphical description is shown in
\cref{fig:memorder:Locking Intuitions},
which shows a lock being acquired and released by CPUs~0, 1, and~2
in that order.
The solid black arrows depict the unlock-lock ordering.
The dotted lines emanating from them to the wide green arrows
show the effects on ordering.
In particular:
\begin{enumerate}
\item The fact that CPU~0's unlock precedes CPU~1's lock ensures that
any access executed by CPU~0 within or before its critical
section will be seen by accesses executed by CPU~1 within
and after its critical section.
\item The fact that CPU~0's unlock precedes CPU~2's lock ensures that
any access executed by CPU~0 within or before its critical
section will be seen by accesses executed by CPU~2 within
and after its critical section.
\item The fact that CPU~1's unlock precedes CPU~2's lock ensures that
any access executed by CPU~1 within or before its critical
section will be seen by accesses executed by CPU~2 within and
after its critical section.
\end{enumerate}
In short, lock-based ordering is transitive through CPUs~0, 1, and~2.
A key point is that this ordering extends beyond the critical sections,
so that everything before an earlier lock release is seen by everything
after a later lock acquisition.
For those who prefer words to diagrams, code holding a given lock will
see the accesses in all prior critical sections for that same lock,
transitively.
And if such code sees the accesses in a given critical section, it will
also see the accesses in all of that CPU's code preceding
that critical section.
In other words, when a CPU releases a given lock, all
of that lock's subsequent critical sections will see the accesses in
all of that CPU's code preceding that lock release.
Inversely, code holding a given lock will be protected from seeing the
accesses in any subsequent critical sections for that same lock, again,
transitively.
And if such code is protected against seeing the accesses in a given
critical section, it will also be protected against seeing the accesses
in all of that CPU's code following that critical section.
In other words, when a CPU acquires a given lock, all of
that lock's previous critical sections will be protected from seeing
the accesses in all of that CPU's code following that lock
acquisition.
But what does it mean to ``see accesses'' and exactly what accesses
are seen?
To start, an access is either a load or a store, possibly occurring as part
of a read-modify-write operation.
If a CPU's code prior to its release of a given lock contains
an access A to a given variable, then for an access B to that same variable
contained in any CPU's code following a later acquisition
of that same lock:
\begin{enumerate}
\item If A and B are both loads, then B will return either the same
value that A did or some later value.
\item If A is a load and B is a store, then B will overwrite either the
value loaded by A or some later value.
\item If A is a store and B is a load, then B will return either the
value stored by A or some later value.
\item If A and B are both stores, then B will overwrite either the value
stored by A or some later value.
\end{enumerate}
Here, ``some later value'' is shorthand for ``the value stored by some
intervening access''.
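
To make this concrete, consider the following rough sketch, which is
not taken from the Linux kernel and whose lock, variable, and function
names are made up for illustration:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
DEFINE_SPINLOCK(mylock);
int myvar;

void cpu0(void)                 /* Releases mylock first. */
{
        WRITE_ONCE(myvar, 1);           /* Access A. */
        spin_lock(&mylock);
        /* Critical section. */
        spin_unlock(&mylock);
}

void cpu1(void)                 /* Acquires mylock later. */
{
        int r1;

        spin_lock(&mylock);
        r1 = READ_ONCE(myvar);          /* Access B. */
        spin_unlock(&mylock);
}
\end{verbatim}

If \co{cpu1()}'s acquisition of \co{mylock} follows \co{cpu0()}'s
release of that same lock, the third case above applies, so that
\co{r1} will be either the value~1 stored by access A or some later
value.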
\QuickQuiz{
But what about reader-writer locking?
}\QuickQuizAnswer{
Reader-writer locking works the same as does exclusive locking,
with the proviso that ordering is guaranteed only between a
pair of conflicting critical sections.
This means that a read-side critical section will be ordered
before a later write-side critical section and vice versa.
But this also means that there is no guarantee of ordering
between a pair of read-side critical sections unless there is
an intervening write-side critical section.
As of mid-2023, LKMM does not yet model reader-writer locking,
but this is on the to-do list.
}\QuickQuizEnd
Locking is strongly intuitive, which is one reason why it has survived
so many attempts to eliminate it.
This is also one reason why you should use it where it applies.
\subsubsection{Release-Acquire Intuitions}
\label{sec:memorder:Release-Acquire Intuitions}
Release-acquire chains also behave in a transitively intuitive manner
not unlike that of locking, except that release-acquire chains do not
provide mutual exclusion.
This section also contains a graphical description followed by a verbal
description.
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/relacqtrans}}
\caption{Release-Acquire Intuitions}
\label{fig:memorder:Release-Acquire Intuitions}
\end{figure*}
The graphical description is shown in
\cref{fig:memorder:Release-Acquire Intuitions},
which shows a release-acquire chain extending through CPUs~0, 1, and~2.
The solid black arrows depict the release-acquire ordering.
The dotted lines emanating from them to the wide green arrows show the
effects on ordering.
\begin{enumerate}
\item The fact that CPU~0's release of A is read by CPU~1's acquire of A
ensures that any accesses executed by CPU~0 prior to its release
will be seen by any accesses executed by CPU~1 after its acquire.
\item The fact that CPU~1's release of B is read by CPU~2's acquire of B
ensures that any accesses executed by CPU~1 prior to its release
will be seen by any accesses executed by CPU~2 after its acquire.
\item Note also that CPU~0's release of A is read by CPU~1's acquire of
	A, which precedes CPU~1's release of B, which is read by CPU~2's
acquire of B\@.
Taken together, all this ensures that any accesses executed by
CPU~0 prior to its release will be seen by any accesses executed
by CPU~2 after its acquire.
\end{enumerate}
This illustrates that properly constructed release-acquire ordering is
transitive through CPUs~0, 1, and~2, and in fact may be extended through
as many CPUs as needed.\footnote{
But please note that stray stores to either A or B can break
the release-acquire chain, as illustrated by
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}.}
For those who prefer words to diagrams, when an acquire loads the value
stored by a release, discussed in
\cref{sec:memorder:Release-Acquire Chains},
then the code following that acquire will see all accesses preceding
that release.
More precisely, if CPU~0 does an acquire that loads the value stored by
CPU~1's release, then all the subsequent accesses executed by CPU~0 will
see all of CPU~1's accesses prior to its release.
Similarly, the accesses preceding that release access will be protected
from seeing the accesses following the acquire access.
(More precision is left as an exercise to the reader.)
Releases and acquires can be chained, for example CPU~0's release
stores the value loaded by CPU~1's acquire, a later release by CPU~1
stores the value loaded by CPU~2's acquire, and so on.
The accesses following a given acquire will see the accesses preceding
each prior release in the chain, and, inversely, the accesses preceding a
given release will be protected from seeing the accesses following each
later acquire in the chain.
Some long-chain examples will be illustrated by
\cref{lst:memorder:Long LB Release-Acquire Chain,%
lst:memorder:Long ISA2 Release-Acquire Chain,%
lst:memorder:Long Z6.2 Release-Acquire Chain}
on
\cpagerefrange{lst:memorder:Long LB Release-Acquire Chain}{lst:memorder:Long Z6.2 Release-Acquire Chain},
respectively.
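
The following rough sketch, which uses made-up names and is in no way
taken from those listings, shows one form that such a chain might take
in Linux-kernel C:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x, a, b;

void cpu0(void)
{
        WRITE_ONCE(x, 1);          /* Ordered before the release. */
        smp_store_release(&a, 1);
}

void cpu1(void)
{
        while (!smp_load_acquire(&a))
                continue;          /* Reads cpu0()'s release. */
        smp_store_release(&b, 1);  /* Extends the chain. */
}

void cpu2(void)
{
        int r1;

        while (!smp_load_acquire(&b))
                continue;          /* Reads cpu1()'s release. */
        r1 = READ_ONCE(x);         /* Guaranteed to be 1. */
}
\end{verbatim}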
The seeing and not seeing of accesses works the same way as described in
\cref{sec:memorder:Locking Intuitions}.
However, as will be illustrated by
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
on
\cpageref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)},
the acquire access must load exactly what was stored by the release access.
Any intervening store that is not itself part of that same release-acquire
chain will break the chain.
And again, release-acquire chains do not provide mutual exclusion.
Nevertheless, properly constructed release-acquire chains are transitive,
intuitive, and useful.
\subsubsection{RCU Intuitions}
\label{sec:memorder:RCU Intuitions}
As noted in \cref{sec:defer:RCU Fundamentals} on
\cpageref{sec:defer:RCU Fundamentals}, RCU provides a number
of ordering guarantees.
The first is the publish-subscribe mechanism described in
\cref{sec:defer:Publish-Subscribe Mechanism}
on
\cpageref{sec:defer:Publish-Subscribe Mechanism}.
This resembles the acquire-release chains discussed in the previous
section, but substitutes a member of the \co{rcu_dereference()} family
of primitives for the \co{smp_load_acquire()}.
Unlike \co{smp_load_acquire()}, the ordering implied by
\co{rcu_dereference()} applies only to subsequent accesses that
dereference the pointer returned by that \co{rcu_dereference()}
as shown in
\cref{fig:defer:Publication/Subscription Constraints}
on
\cpageref{fig:defer:Publication/Subscription Constraints}.
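
As an informal illustration, with hypothetical structure, variable, and
function names and with error handling omitted, the publish-subscribe
pattern might look as follows:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
struct foo {
        int a;
};
struct foo *gp;                    /* RCU-protected pointer. */

void publish(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        p->a = 42;                 /* Initialize, then... */
        rcu_assign_pointer(gp, p); /* ...publish. */
}

void subscribe(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);   /* Subscribe. */
        if (p)
                do_something(p->a); /* Ordered: sees 42. */
        rcu_read_unlock();
}
\end{verbatim}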
The second guarantee says that if any part of an RCU read-side
critical section precedes the beginning of a grace period, then
the entirety of that critical section precedes the end of that
grace period, as shown in
\cref{fig:defer:RCU Reader and Later Grace Period}
on
\cpageref{fig:defer:RCU Reader and Later Grace Period}.
The third guarantee says that if any part of an RCU read-side critical
section follows the end of a grace period, then the entirety of that
critical section follows the beginning of that grace period, as shown in
\cref{fig:defer:RCU Reader and Earlier Grace Period}
on
\cpageref{fig:defer:RCU Reader and Earlier Grace Period}.
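
A rough sketch illustrating the combined effect of the second and third
guarantees, again with hypothetical variable and function names, follows:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x, y;

void updater(void)
{
        WRITE_ONCE(x, 1);
        synchronize_rcu();     /* Waits for pre-existing readers. */
        WRITE_ONCE(y, 1);
}

void reader(void)
{
        int r1, r2;

        rcu_read_lock();
        r1 = READ_ONCE(y);
        r2 = READ_ONCE(x);
        rcu_read_unlock();
        /* The outcome r1 == 1 && r2 == 0 is forbidden. */
}
\end{verbatim}

If the reader sees the updater's second store (\co{r1 == 1}), then part
of its critical section follows the end of the grace period, so the
entire critical section must follow that grace period's beginning,
which in turn means that \co{r2} must be~1.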
These two guarantees are discussed in
\cref{sec:defer:Wait For Pre-Existing RCU Readers}
on
\cpageref{sec:defer:Wait For Pre-Existing RCU Readers},
with more examples shown in
\cref{fig:defer:RCU Reader Within Grace Period,%
fig:defer:Summary of RCU Grace-Period Ordering Guarantees}
on
\cpageref{fig:defer:RCU Reader Within Grace Period,fig:defer:Summary of RCU Grace-Period Ordering Guarantees}.
These two guarantees have further version-maintenance consequences that
are discussed in
\cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}
on
\cpageref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}.
These guarantees will be discussed somewhat more formally in
\cref{sec:memorder:RCU}
on
\cpageref{sec:memorder:RCU}.
Much of the sophistication of RCU lies not in its guarantees, but in its
use cases, which are the subject of
\cref{sec:defer:RCU Usage}
starting on
\cpageref{sec:defer:RCU Usage}.
\subsubsection{Fully Ordered Intuitions}
\label{sec:memorder:Fully Ordered Intuitions}
A more extreme example of transitivity places at least one \co{smp_mb()}
between each pair of accesses.
All accesses seen by any given access will also be seen by all later
accesses.
The resulting program will be fully ordered, if somewhat slow.
Such programs will be sequentially consistent and much loved by
formal-verification experts who specialize in tried-and-true 1980s
proof techniques.
But slow or not, \co{smp_mb()} is always there when you need it!
Nevertheless, there are situations that cannot be addressed by these
intuitive approaches.
The next section therefore presents a more complete, if less transitive,
set of rules of thumb.
\subsection{Rules of Thumb}
\label{sec:memorder:Rules of Thumb}
The transitive intuitions presented in the previous section are
very appealing, at least as \IX{memory model}s go.
Unfortunately, hardware is under no obligation to provide temporal
cause-and-effect illusions when one thread's store overwrites a value either
loaded or stored by some other thread.
It is quite possible that, from the software's viewpoint, an earlier store
will overwrite a later store's value, but only if those two stores were
executed by different threads, as will be illustrated by
\cref{fig:memorder:Store-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Store-to-Store is Counter-Temporal}.
Similarly, a later load might well read a value overwritten by an
earlier store, but again only if that load and store were executed by
different threads, as will be illustrated by
\cref{fig:memorder:Load-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Load-to-Store is Counter-Temporal}.
This counter-intuitive behavior occurs due to the need to buffer
stores in order to achieve adequate performance, as will be discussed in
\cref{sec:memorder:Propagation}
on
\cpageref{sec:memorder:Propagation}.
\begin{figure}
\centering
\resizebox{.5\onecolumntextwidth}{!}{\includegraphics{memorder/ifthen}}
\caption{If-Then Nature of Memory Ordering}
\label{fig:memorder:If-Then Nature of Memory Ordering}
\end{figure}
As a result, situations where one thread reads a value written by some
other thread can make do with far weaker ordering than can situations
where one thread overwrites a value loaded or stored by some other thread.
These differences are captured by the following rules of thumb.
However, it is important to note that none of these rules involve
mutual exclusion.
There is instead an if-then nature to these rules, as illustrated by
\cref{fig:memorder:If-Then Nature of Memory Ordering}.
In the figure, if CPU~1's reference to Y follows that of CPU~0,
and if both CPUs provide sufficient ordering between their two
accesses, then CPU~0's access to X will also precede that of
CPU~1.
The first rule of thumb is that memory-ordering operations are only
required where there is a possibility of interaction between at least
two variables shared among at least two threads, which underlies the
singular intuitive bliss presented in
\cref{sec:memorder:Singular Intuitive Bliss}.
In light of the intervening material, this single sentence encapsulates
many of the basic rules of thumb that will be presented in
\cref{sec:memorder:Basic Rules of Thumb}
on
\cpageref{sec:memorder:Basic Rules of Thumb},
for example, keeping in mind that ``memory-barrier pairing'' is a
two-thread special case of ``cycle''.
And, as always, if a single-threaded program will provide sufficient
performance, why bother with parallelism?\footnote{
Hobbyists and researchers should of course feel free to ignore
this and many other cautions.}
After all, avoiding parallelism also avoids the added cost and complexity
of memory-ordering operations.
The second rule of thumb involves load-buffering situations:
If all thread-to-thread communication in a given cycle uses store-to-load
links (that is, the next thread's load returns the value stored by
the previous thread), minimal ordering suffices.
Minimal ordering includes dependencies and acquires as well as all stronger
ordering operations.
Because a lock acquisition must load the lock-word value stored by any
prior release of that lock, this rule of thumb underlies the locking
intuitions presented in
\cref{sec:memorder:Locking Intuitions}.
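
For example, in the following rough sketch (which uses hypothetical
names and is not one of the litmus tests presented later), both links
in the cycle are store-to-load links, so an acquire load on one side
and a data dependency on the other suffice to forbid the cycle:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int x0, x1;

void cpu0(void)
{
        int r2;

        r2 = smp_load_acquire(&x1); /* Acquire orders later store. */
        WRITE_ONCE(x0, 2);
}

void cpu1(void)
{
        int r2;

        r2 = READ_ONCE(x0);
        WRITE_ONCE(x1, r2);         /* Data dependency from the load. */
}

/* Both instances of r2 ending up equal to 2 is forbidden. */
\end{verbatim}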
The third rule of thumb involves release-acquire chains:
If all but one of the links in a given cycle is a store-to-load
link, it is sufficient to use release-acquire pairs for each of
those store-to-load links, as will be illustrated by
\cref{lst:memorder:Long ISA2 Release-Acquire Chain,%
lst:memorder:Long Z6.2 Release-Acquire Chain}
starting on
\cpageref{lst:memorder:Long ISA2 Release-Acquire Chain}.
This rule underlies the release-acquire intuitions presented in
\cref{sec:memorder:Release-Acquire Intuitions}.
You can replace a given acquire with a dependency in environments permitting
this, keeping in mind that the
\IXalthmr{C11 standard's memory model}{C11}{memory model} does \emph{not}
fully respect dependencies.
Therefore, a dependency leading to a load must be headed by
a \co{READ_ONCE()} or an \co{rcu_dereference()}:
A plain C-language load is not sufficient.
In addition, carefully review
\cref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities},
on
\cpageref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities},
because a dependency broken by your compiler will not order anything.
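
For example, in the following rough sketch (hypothetical names, error
handling omitted, and closely paralleling the RCU publish-subscribe
pattern), the consumer's \co{READ_ONCE()} heads an address dependency
that substitutes for an acquire load, while the producer still needs
its release store:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
struct foo {
        int a;
};
struct foo *gp;

void producer(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        p->a = 1;
        smp_store_release(&gp, p); /* Release is still required. */
}

void consumer(void)
{
        struct foo *p;
        int r1;

        p = READ_ONCE(gp);         /* Heads the address dependency. */
        if (p)
                r1 = p->a;         /* Dependent load: sees 1. */
}
\end{verbatim}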
The two threads sharing the sole non-store-to-load link can
sometimes substitute \co{WRITE_ONCE()} plus \co{smp_wmb()} for
\co{smp_store_release()} on the one hand,
and \co{READ_ONCE()} plus \co{smp_rmb()} for \co{smp_load_acquire()}
on the other.
However, the wise developer will check such substitutions carefully,
for example, using the \co{herd} tool as described in
\cref{sec:formal:Axiomatic Approaches}
on
\cpageref{sec:formal:Axiomatic Approaches}.
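
For example, the following rough sketch (hypothetical names) applies
this substitution to a simple message-passing pattern; as noted above,
any such substitution deserves a check with \co{herd}:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int data, flag;

void producer(void)
{
        WRITE_ONCE(data, 42);
        smp_wmb();          /* Instead of smp_store_release(). */
        WRITE_ONCE(flag, 1);
}

void consumer(void)
{
        int r1, r2;

        r1 = READ_ONCE(flag);
        smp_rmb();          /* Instead of smp_load_acquire(). */
        r2 = READ_ONCE(data);
        /* The outcome r1 == 1 && r2 == 0 is forbidden. */
}
\end{verbatim}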
\QuickQuiz{
Why is it necessary to use heavier-weight ordering for
load-to-store and store-to-store links, but not for
store-to-load links?
What on earth makes store-to-load links so special???
}\QuickQuizAnswer{
Recall that load-to-store and store-to-store links can be
counter-temporal, as illustrated by
\cref{fig:memorder:Load-to-Store is Counter-Temporal,%
fig:memorder:Store-to-Store is Counter-Temporal} in
\cref{sec:memorder:Propagation}.
This counter-temporal nature of load-to-store and store-to-store
links necessitates strong ordering.
	In contrast, store-to-load links are temporal, as illustrated by
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test,%
lst:memorder:Load-Buffering Control-Dependency Litmus Test}.
This temporal nature of store-to-load links permits use of
minimal ordering.
}\QuickQuizEnd
The fourth and final rule of thumb identifies where full memory barriers
(or stronger) are required:
If a given cycle contains two or more non-store-to-load links (that is, a
total of two or more links that are either load-to-store or store-to-store
links), you will need at least one full barrier between each pair of
non-store-to-load links in that cycle, as will be illustrated by
\cref{lst:memorder:W+WRC Litmus Test With More Barriers}
on
\cpageref{lst:memorder:W+WRC Litmus Test With More Barriers}
and most especially by the example presented in
\cref{sec:memorder:A Counter-Intuitive Case Study}
on
\cpageref{sec:memorder:A Counter-Intuitive Case Study}.
% @@@ \cref{lst:memorder:W+WRC Litmus Test With More Barriers}
% @@@ as well as in the answer to
% @@@ \QuickQuizARef{\MemorderQQLitmusTestR}.
Full barriers include \co{smp_mb()}, successful full-strength non-\co{void}
atomic RMW operations, and other atomic RMW operations in conjunction with
either \co{smp_mb__before_atomic()} or \co{smp_mb__after_atomic()}.
Any of RCU's grace-period-wait primitives (\co{synchronize_rcu()} and
friends) also act as full barriers, but at far greater expense than
\co{smp_mb()}.
With strength comes expense, though full barriers
usually hurt performance more than they hurt scalability.
The extreme logical endpoint of this rule of thumb underlies the
fully ordered intuitions presented in
\cref{sec:memorder:Fully Ordered Intuitions}.
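
For example, the following rough sketch (hypothetical names) upgrades
an otherwise-unordered \co{atomic_inc()} to act as a full barrier:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
atomic_t ctr = ATOMIC_INIT(0);
int x, y;

void cpu0(void)
{
        int r1;

        WRITE_ONCE(x, 1);
        smp_mb__before_atomic(); /* Combined with the following */
        atomic_inc(&ctr);        /* RMW, orders the prior store */
        r1 = READ_ONCE(y);       /* against the later load. */
}
\end{verbatim}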
Recapping the rules:
\begin{enumerate}
\item Memory-ordering operations are required only if at least
two variables are shared by at least two threads.
\item If all links in a cycle are store-to-load links, then
minimal ordering suffices.
\item If all but one of the links in a cycle are store-to-load links,
then each store-to-load link may use a release-acquire pair.
\item Otherwise, at least one full barrier is required between
each pair of non-store-to-load links.
\end{enumerate}
Note that an architecture is permitted to provide stronger guarantees, as
will be discussed in
\cref{sec:memorder:Hardware Specifics},
but these guarantees may only be relied upon in code that runs only on
that architecture.
In addition, more accurate
\IX{memory model}s~\cite{Alglave:2018:FSC:3173162.3177156}
may give stronger guarantees with lower-overhead operations than do
these rules of thumb, albeit at the expense of greater complexity.
In these more formal memory-ordering papers, a store-to-load link is an
example of a reads-from (rf) link, a load-to-store link is an example
of a from-reads (fr) link, and a store-to-store link is an example of
a coherence (co) link.
One final word of advice:
Use of raw memory-ordering primitives is a last resort.
It is almost always better to use existing primitives, such as locking
or RCU, thus letting those primitives do the memory ordering for you.
With that said, the next section describes why hardware misorders memory
references and some of the things that you can do about it.
\section{Ordering:
Why and How?}
\label{sec:memorder:Ordering: Why and How?}
%
\epigraph{Nothing is orderly till people take hold of it.
Everything in creation lies around loose.}
{Henry Ward Beecher, updated}
One motivation for memory ordering can be seen in the trivial-seeming
litmus test in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
(\path{C-SB+o-o+o-o.litmus}),
which at first glance might
appear to guarantee that the \co{exists} clause never triggers.\footnote{
Purists would instead insist that the \co{exists} clause is
never \emph{satisfied}, but we use ``trigger'' here by
analogy with assertions.}
After all, if \nbco{0:r2=0} as shown in the \co{exists} clause,\footnote{
That is, Thread~\co{P0()}'s instance of local variable \co{r2}
equals zero.
See \cref{sec:formal:Anatomy of a Litmus Test}
for documentation of litmus-test nomenclature.}
we might hope that Thread~\co{P0()}'s load from~\co{x1} into \co{r2}
must have happened before Thread~\co{P1()}'s store to~\co{x1}, which
might raise further hopes that Thread~\co{P1()}'s load from~\co{x0}
into \co{r2} must happen after Thread~\co{P0()}'s store to~\co{x0},
so that \nbco{1:r2=2}, thus never triggering the \co{exists} clause.
The example is symmetric, so similar reasoning might lead
us to hope that \nbco{1:r2=0} guarantees that \nbco{0:r2=2}.
Unfortunately, the lack of memory barriers dashes these hopes.
The CPU is within its rights to reorder
the statements within both Thread~\co{P0()} and Thread~\co{P1()},
even on relatively strongly ordered systems such as x86.
\begin{listing}
\input{CodeSamples/formal/litmus/C-SB+o-o+o-o@whole.fcv}
\caption{Memory Misordering:
Store-Buffering Litmus Test}
\label{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}
\end{listing}
\QuickQuiz{
The compiler can also reorder Thread~\co{P0()}'s and
Thread~\co{P1()}'s memory accesses in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test},
right?
}\QuickQuizAnswer{
In general, compiler optimizations carry out more extensive
and profound reorderings than CPUs can.
However, in this case, the volatile accesses in
\co{READ_ONCE()} and \co{WRITE_ONCE()}
prevent the compiler from reordering.
And also from doing much else as well, so the examples in this
chapter will be making heavy use of
\co{READ_ONCE()} and \co{WRITE_ONCE()}.
See \cref{sec:memorder:Compile-Time Consternation}
for more detail on the need for \co{READ_ONCE()} and \co{WRITE_ONCE()}.
}\QuickQuizEnd
This willingness to reorder can be confirmed using tools such as
\co{litmus7}~\cite{Alglave:2014:HCM:2594291.2594347},
which found that the counter-intuitive ordering happened
314 times out of 100,000,000 trials on an x86 laptop.
Oddly enough, the perfectly legal outcome where both loads return the
value 2 occurred less frequently, in this case, only 167 times.\footnote{
Please note that results are sensitive to the exact hardware
configuration,
how heavily the system is loaded, and much else besides.
So why not try it out on your own system?}
The lesson here is clear:
Increased counter-intuitivity does not necessarily imply decreased probability!
% Run on June 23, 2017:
% litmus7 -r 1000 -carch X86 C-SB+o-o+o-o.litmus
% Test C-SB+o-o+o-o Allowed
% Histogram (4 states)
% 314 *>0:r2=0; 1:r2=0;
% 49999625:>0:r2=2; 1:r2=0;
% 49999894:>0:r2=0; 1:r2=2;
% 167 :>0:r2=2; 1:r2=2;
The following sections show exactly how this intuition breaks down,
and then put forward some mental models of memory ordering that can help
you avoid these pitfalls.
\Cref{sec:memorder:Why Hardware Misordering?}
gives a brief overview of why hardware misorders memory accesses, and then
\cref{sec:memorder:How to Force Ordering?}
gives an equally brief overview of how you can thwart such misordering.
Finally, \cref{sec:memorder:Basic Rules of Thumb}
lists some basic rules of thumb, which will be further refined in
later sections.
These sections focus on hardware reordering, but rest assured that compilers
reorder much more aggressively than hardware ever dreamed of doing.
Compiler-induced reordering will be taken up in
\cref{sec:memorder:Compile-Time Consternation}.
\subsection{Why Hardware Misordering?}
\label{sec:memorder:Why Hardware Misordering?}
But why does memory misordering happen in the first place?
Can't CPUs keep track of ordering on their own?
Isn't that why we have computers in the first place, to keep track of things?
\begin{figure*}
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch}}
\caption{Intel Core 2 Architecture}
\ContributedBy{fig:memorder:Intel Core 2 Architecture}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted}
\end{figure*}
Many people do indeed expect their computers to keep track of things,
but many also insist that they keep track of things quickly.
In fact, so intense is the focus on performance that modern CPUs
are extremely complex, as can be seen in the simplified block diagram in
\cref{fig:memorder:Intel Core 2 Architecture}.
Those needing to squeeze the last few percent of performance from their
systems will in turn need to pay close attention to the fine details of this
figure when tuning their software.
Except that this close attention to detail means that when a given CPU
degrades with age, the software will no longer run quickly on it.
For example, if the leftmost ALU fails, software tuned to take full
advantage of all of the ALUs might well run more slowly than untuned
software.
One solution to this problem is to take systems out of service as soon
as any of their CPUs start degrading.
\begin{figure*}
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch-simplified}}
\caption{Intel Core 2 Architecture Simplified}
\ContributedBy{fig:memorder:Intel Core 2 Architecture Simplified}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted}
\end{figure*}
Another option is to recall the lessons of
\cref{chp:Hardware and its Habits},
especially the lesson that for many important workloads, main memory
cannot keep up with modern CPUs, which can execute hundreds of
instructions in the time required to fetch a single variable from memory.
For such workloads, the detailed internal structure of the CPU is
irrelevant, and the CPU can instead be approximated by the blue shapes in
\cref{fig:memorder:Intel Core 2 Architecture Simplified}
labeled CPU, store buffer, and cache.
Because of these data-intensive workloads, CPUs sport increasingly large
caches, as was seen back in
\cref{fig:cpu:System Hardware Architecture},
which means that although the first load by a given CPU from a given
variable will result in an expensive \emph{cache miss} as was discussed in
\cref{sec:cpu:Cache Misses}, subsequent
repeated loads from that variable by that CPU might execute
very quickly because the initial cache miss will have loaded that
variable into that CPU's cache.
However, it is also necessary to accommodate frequent concurrent stores
from multiple CPUs to a set of shared variables.
In cache-coherent systems, if the caches hold multiple copies of a given
variable, all the copies of that variable must have the same value.
This works extremely well for concurrent loads, but not so well for
concurrent stores:
Each store must do something about all copies of the old value
(another cache miss!), which, given the finite speed of light and
the atomic nature of matter, will be slower than impatient software
hackers would like.
And these strings of stores are the reason for the blue block
labelled store buffer in
\cref{fig:memorder:Intel Core 2 Architecture Simplified}.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{memorder/SystemArchSB}}
\caption{System Architecture With Store Buffers}
\label{fig:memorder:System Architecture With Store Buffers}
\end{figure}
Removing the internal CPU complexity from
\cref{fig:memorder:Intel Core 2 Architecture Simplified},
adding a second CPU, and showing main memory results in
\cref{fig:memorder:System Architecture With Store Buffers}.
When a given CPU stores to a variable
not present in that CPU's cache, then the new value
is instead placed in that CPU's store buffer.
The CPU can then proceed immediately, without having to wait for the
store to do something about all the old values of that variable
residing in other CPUs' caches.
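
Conceptually, and only conceptually (real hardware is far more
involved, and every name in the following sketch is made up), a store
might be handled as follows:

\begin{verbatim}
/* Conceptual sketch only: no real CPU works exactly this way. */
void cpu_store(struct cpu *cpu, long addr, long val)
{
        if (cacheline_writable(cpu, addr)) {
                cache_write(cpu, addr, val);       /* Fast path. */
        } else {
                store_buffer_add(cpu, addr, val);  /* Record store, */
                request_cacheline(cpu, addr);      /* fetch line,   */
        }                                          /* don't wait.   */
}
\end{verbatim}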
\begin{figure}
\centering
\resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}}
\caption{CPUs Can Do Things Out of Order}
\ContributedBy{fig:memorder:CPUs Can Do Things Out of Order}{Melissa Broussard}
\end{figure}
Although store buffers can greatly increase performance, they can cause
instructions and memory references to execute out of order, which can
in turn cause serious confusion, as fancifully illustrated in
\cref{fig:memorder:CPUs Can Do Things Out of Order}.
\begin{table*}
\rowcolors{6}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering\OneColumnHSpace{-0.1in}
\ebresizewidth{
\begin{tabular}{rllllll}
\toprule
& \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cmidrule(l){2-4} \cmidrule(l){5-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7}
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
3 & \tco{r2 = x1;} (0) & \tco{x0==2} & \tco{x1==0} &
\tco{r2 = x0;} (0) & \tco{x1==2} & \tco{x0==0} \\
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
\bottomrule
\end{tabular}
}
\caption{Memory Misordering:
Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
\end{table*}
In particular, store buffers cause the memory misordering
illustrated by
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\Cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
shows the steps leading to this misordering.
Row~1 shows the initial state, where CPU~0 has \co{x1} in its cache
and CPU~1 has \co{x0} in its cache, both variables having a value of zero.
\begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole]
Row~2 shows the state change due to each CPU's store (\clnref{st0,st1} of
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because neither CPU has the stored-to variable in its cache, both CPUs
record their stores in their respective store buffers.
\end{fcvref}
\QuickQuiz{
But wait!!!
On row~2 of
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
both \co{x0} and \co{x1} each have two values at the same time,
namely zero and two.
How can that possibly work???
}\QuickQuizAnswer{
	There is an underlying cache-coherence protocol that straightens
	things out, as discussed in
	\cref{sec:app:whymb:Cache-Coherence Protocols}.
But if you think that a given variable having two values at
the same time is surprising, just wait until you get to
\cref{sec:memorder:Variables With Multiple Values}!
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole]
Row~3 shows the two loads (\clnref{ld0,ld1} of
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}).
Because the variable being loaded by each CPU is in that CPU's cache,
each load immediately returns the cached value, which in both cases
is zero.
\end{fcvref}
But the CPUs are not done yet:
Sooner or later, they must empty their store buffers.
Because caches move data around in relatively large blocks called
\emph{cachelines}, and because each cacheline can hold several
variables, each CPU must get the cacheline into its own cache so
that it can update the portion of that cacheline corresponding
to the variable in its store buffer, but without disturbing any
other part of the cacheline.
Each CPU must also ensure that the cacheline is not present in any other
CPU's cache, for which a read-invalidate operation is used.
As shown on row~4, after both read-invalidate operations complete,
the two CPUs have traded cachelines, so that CPU~0's cache now contains
\co{x0} and CPU~1's cache now contains \co{x1}.
Once these two variables are in their new homes, each CPU can flush
its store buffer into the corresponding \IX{cache line}, leaving each
variable with its final value as shown on row~5.
\QuickQuiz{
But don't the values also need to be flushed from the cache
to main memory?
}\QuickQuizAnswer{
Perhaps surprisingly, not necessarily!
On some systems,
if the two variables are being used heavily, they might
be bounced back and forth between the CPUs' caches and never
land in main memory.
}\QuickQuizEnd
In summary, store buffers are needed to allow CPUs to handle
store instructions efficiently, but they can result in
counter-intuitive memory misordering.
But what do you do if your algorithm really needs its memory
references to be ordered?
For example, suppose that you are communicating with a driver using
a pair of flags, one that says whether or not the driver is running
and the other that says whether there is a request pending for that
driver.
The requester needs to set the request-pending flag, then check
the driver-running flag, and if false, wake the driver.
Once the driver has serviced all the pending requests that it knows about,
it needs to clear its driver-running flag, then check the request-pending
flag to see if it needs to restart.
This very reasonable approach cannot work unless there is some way
to make sure that the hardware processes the stores and loads in order.
This is the subject of the next section.
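
Anticipating that section, a rough sketch of this two-flag scheme
(hypothetical names, with the needed ordering supplied by
\co{smp_mb()}) might look like this:

\begin{verbatim}
/* Illustrative sketch only: names are hypothetical. */
int request_pending, driver_running;

void requester(void)
{
        WRITE_ONCE(request_pending, 1);
        smp_mb();                 /* Order store before load. */
        if (!READ_ONCE(driver_running))
                wake_driver();    /* Hypothetical wakeup. */
}

void driver(void)
{
        /* ... service all known pending requests ... */
        WRITE_ONCE(driver_running, 0);
        smp_mb();                 /* Order store before load. */
        if (READ_ONCE(request_pending))
                restart_driver(); /* Hypothetical restart. */
}
\end{verbatim}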
\subsection{How to Force Ordering?}
\label{sec:memorder:How to Force Ordering?}
It turns out that there are compiler directives and synchronization
primitives (such as locking and RCU) that are responsible for maintaining
the illusion of ordering through use of \emph{\IXBpl{memory barrier}} (for
example, \co{smp_mb()} in the Linux kernel).
These memory barriers can be explicit instructions, as they are on
\ARM, \Power{}, Itanium, and Alpha, or they can be implied by other instructions,
as they often are on x86.
Since these standard synchronization primitives preserve the illusion of
ordering, your path of least resistance is to simply use these primitives,
thus allowing you to stop reading this section.
\begin{listing}
\input{CodeSamples/formal/litmus/C-SB+o-mb-o+o-mb-o@whole.fcv}
\caption{Memory Ordering:
Store-Buffering Litmus Test}
\label{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
\end{listing}
However, if you need to implement the synchronization primitives
themselves, or if you are simply interested in understanding how memory
ordering works, read on!
The first stop on the journey is
\cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
(\path{C-SB+o-mb-o+o-mb-o.litmus}),
which places an \co{smp_mb()} Linux-kernel \IXh{full}{memory barrier} between
the store and load in both \co{P0()} and \co{P1()}, but is otherwise
identical to
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
% Test C-SB+o-mb-o+o-mb-o Allowed
% Histogram (3 states)
% 49553298:>0:r2=2; 1:r2=0;
% 49636449:>0:r2=0; 1:r2=2;
% 810253:>0:r2=2; 1:r2=2;
% No
These barriers prevent the counter-intuitive outcome from happening
on 100,000,000 trials on my x86 laptop.
Interestingly enough, the added overhead due to these barriers causes the
legal outcome where both loads return the value two to happen more
than 800,000 times, as opposed to only 167 times for the
barrier-free code in
\cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}.
\begin{table*}
\rowcolors{6}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering\OneColumnHSpace{-0.1in}
\ebresizewidth{
\begin{tabular}{rllllll}
\toprule
& \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\
\cmidrule(l){2-4} \cmidrule(l){5-7}
& Instruction & Store Buffer & Cache &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7}
1 & (Initial state) & & \tco{x1==0} &
(Initial state) & & \tco{x0==0} \\
2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} &
\tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\
3 & \tco{smp_mb();} & \tco{x0==2} & \tco{x1==0} &
\tco{smp_mb();} & \tco{x1==2} & \tco{x0==0} \\
4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} &
(Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\
5 & (Finish store) & & \tco{x0==2} &
(Finish store) & & \tco{x1==2} \\
6 & \tco{r2 = x1;} (2) & & \tco{x1==2} &
\tco{r2 = x0;} (2) & & \tco{x0==2} \\
\bottomrule
\end{tabular}
}
\caption{Memory Ordering:
Store-Buffering Sequence of Events}
\label{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}
\end{table*}
These barriers have a profound effect on ordering, as can be seen in
\cref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}.
Although the first two rows are the same as in
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}
and although the \co{smp_mb()} instructions on row~3
do not change state
in and of themselves, they do cause the stores to complete
(rows~4 and~5) before the
loads (row~6), which rules out the counter-intuitive outcome shown in
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}.
Note that variables \co{x0} and \co{x1} each still have more than one
value on row~2.
However, as promised earlier, the \co{smp_mb()}
invocations straighten things out in the end.
Although full barriers such as \co{smp_mb()} have extremely strong
ordering guarantees, their strength comes at a high price in terms
of foregone hardware and compiler optimizations.
A great many situations can be handled with much weaker ordering guarantees
that use much cheaper memory-ordering instructions, or, in some cases, no
memory-ordering instructions at all.
\begin{table*}
\small
\centering\OneColumnHSpace{-0.7in}
\renewcommand*{\arraystretch}{1.1}
\rowcolors{7}{lightgray}{}
\ebresizewidth{
\begin{tabular}{lcccccccccccc}\toprule
& & \multicolumn{4}{c}{Prior Ordered Operation} &
\multicolumn{7}{c}{Subsequent Ordered Operation} \\
\cmidrule(l){3-6} \cmidrule(l){7-13}
Operation Providing Ordering & C &
Self & R & W & RMW & Self & R & W & DR & DW & RMW & SV \\
\cmidrule(r){1-1} \cmidrule{2-2} \cmidrule(l){3-6} \cmidrule(l){7-13}
Store, for example, \tco{WRITE_ONCE()} & &
Y & & & & & & & & & & Y \\
Load, for example, \tco{READ_ONCE()} & &
Y & & & & & & & Y & Y & & Y \\
\tco{_relaxed()} RMW operation & &
Y & & & & & & & Y & Y & & Y \\
\tco{*_dereference()} & &
Y & & & & & & & Y & Y & & Y \\
Successful \tco{*_acquire()} & &
R & & & & & Y & Y & Y & Y & Y & Y \\
Successful \tco{*_release()} & C &
& Y & Y & Y & W & & & & & & Y \\
\tco{smp_rmb()} & &
& Y & & R & & Y & & Y & & R & \\
\tco{smp_wmb()} & &
& & Y & W & & & Y & & Y & W & \\
\tco{smp_mb()} and \tco{synchronize_rcu()} & CP &
& Y & Y & Y & & Y & Y & Y & Y & Y & \\
Successful full-strength non-\tco{void} RMW & CP &
Y & Y & Y & Y & Y & Y & Y & Y & Y & Y & Y \\
\tco{smp_mb__before_atomic()} & CP &
& Y & Y & Y & & a & a & a & a & Y & \\
\tco{smp_mb__after_atomic()} & CP &
& a & a & Y & & Y & Y & Y & Y & Y & \\
\bottomrule
\end{tabular}
}
\vspace{5pt}\hfill
\ebresizeverb{.8}{
\framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt}
\rowcolors{1}{}{}
\begin{tabular}{lrl}
Key: & C: & Ordering is cumulative \\
& P: & Ordering propagates \\
& R: & Read, for example, \tco{READ_ONCE()}, or read portion of RMW \\
& W: & Write, for example, \tco{WRITE_ONCE()}, or write portion of RMW \\
& Y: & Provides the specified ordering \\
& a: & Provides specified ordering given intervening RMW atomic operation \\
& DR: & Dependent read (address dependency, \cref{sec:memorder:Address Dependencies}) \\
& DW: & Dependent write (address, data, or control dependency, \crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}) \\
& RMW: & \IX{Atomic read-modify-write operation} \\
& Self: & Orders self, as opposed to accesses both before
and after \\
& SV: & Orders later accesses to the same variable \\
\multicolumn{3}{l}{\emph{Applies to Linux kernel v4.15 and later.}} \\
\end{tabular}
}\OneColumnHSpace{-0.9in}
}
\caption{Linux-Kernel Memory-Ordering Cheat Sheet}
\label{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
\end{table*}
\Cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
provides a cheatsheet of the Linux kernel's ordering primitives and their
guarantees.
Each row corresponds to a primitive or category of primitives that might
or might not provide ordering, with the columns labeled
``Prior Ordered Operation'' and ``Subsequent Ordered Operation''
being the operations that might (or might not) be ordered against.
Cells containing ``Y'' indicate that ordering is supplied unconditionally,
while other characters indicate that ordering is supplied only partially or
conditionally.
Blank cells indicate that no ordering is supplied.
The ``Store'' row also covers the store portion of an
\IXalt{atomic RMW operation}{atomic read-modify-write operation}.
In addition, the ``Load'' row covers the load
component of a successful value-returning \co{_relaxed()} RMW atomic
operation, although the combined ``\co{_relaxed()} RMW operation''
line provides a convenient combined reference in the value-returning case.
A CPU executing unsuccessful value-returning atomic RMW operations must
invalidate the corresponding variable from all other CPUs' caches.
Therefore, unsuccessful value-returning atomic RMW operations have many
of the properties of a store, which means that the ``\co{_relaxed()}
RMW operation'' line also applies to unsuccessful value-returning atomic
RMW operations.
The \co{*_acquire} row covers \co{smp_load_acquire()},
\co{cmpxchg_acquire()}, \co{xchg_acquire()}, and so on; the \co{*_release}
row covers \co{smp_store_release()}, \co{rcu_assign_pointer()},
\co{cmpxchg_release()}, \co{xchg_release()}, and so on; and
the ``Successful full-strength non-\co{void} RMW'' row covers
\co{atomic_add_return()}, \co{atomic_add_unless()}, \co{atomic_dec_and_test()},
\co{cmpxchg()}, \co{xchg()}, and so on.
The ``Successful'' qualifiers apply to primitives such as
\co{atomic_add_unless()}, \co{cmpxchg_acquire()}, and \co{cmpxchg_release()},
which have no effect on either memory or on ordering when they indicate
failure, as indicated by the earlier ``\co{_relaxed()} RMW operation'' row.
Column ``C'' indicates cumulativity and propagation, as explained in
\cref{sec:memorder:Cumulativity,%
sec:memorder:Propagation}.
In the meantime, this column can usually be ignored when there
are at most two threads involved.
\QuickQuizSeries{%
\QuickQuizB{
The rows in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
seem quite random and confused.
Whatever is the conceptual basis of this table???
}\QuickQuizAnswerB{
The rows correspond roughly to hardware mechanisms of increasing
power and overhead.
The \co{WRITE_ONCE()} row captures the fact that accesses to
a single variable are always fully ordered, as indicated by
the ``SV'' column.
Note that all other operations providing ordering against accesses
to multiple variables also provide this same-variable ordering.
The \co{READ_ONCE()} row captures the fact that (as of 2021) compilers
and CPUs do not indulge in user-visible speculative stores, so that
any store whose address, data, or execution depends on a prior load
is guaranteed to happen after that load completes.
However, this guarantee assumes that these dependencies have
been constructed carefully, as described in
\cref{sec:memorder:Address- and Data-Dependency Difficulties,%
sec:memorder:Control-Dependency Calamities}.
The ``\co{_relaxed()} RMW operation'' row captures the fact
that a value-returning \co{_relaxed()} RMW has done a load and a
store, which are every bit as good as a \co{READ_ONCE()} and a
\co{WRITE_ONCE()}, respectively.
The \co{*_dereference()} row captures the address and data
dependency ordering provided by \co{rcu_dereference()} and friends.
	Again, these dependencies must be constructed carefully,
as described in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
The ``Successful \co{*_acquire()}'' row captures the fact that many
CPUs have special ``acquire'' forms of loads and of atomic RMW
instructions,
and that many other CPUs have lightweight memory-barrier
instructions that order prior loads against subsequent loads
and stores.
The ``Successful \co{*_release()}'' row captures the fact that many
CPUs have special ``release'' forms of stores and of atomic RMW
instructions, and that many other CPUs have lightweight memory-barrier
instructions that order prior loads and stores against
subsequent stores.
The \co{smp_rmb()} row captures the fact that many CPUs have
lightweight memory-barrier instructions that order prior loads against
subsequent loads.
Similarly,
the \co{smp_wmb()} row captures the fact that many CPUs have
lightweight memory-barrier instructions that order prior stores against
subsequent stores.
None of the ordering operations thus far require prior stores to be
ordered against subsequent loads, which means that these operations
need not interfere with store buffers, whose main purpose in life
is in fact to reorder prior stores against subsequent loads.
The lightweight nature of these operations is precisely due to
their policy of store-buffer non-interference.
However, as noted earlier, it is sometimes necessary to interfere
with the store buffer in order to prevent prior stores from being
reordered against later stores, which brings us to the remaining
rows in this table.
The \co{smp_mb()} row corresponds to the \IXh{full}{memory barrier}
available on most platforms, with Itanium being the exception
that proves the rule.
However, even on Itanium, \co{smp_mb()} provides full ordering
with respect to \co{READ_ONCE()} and \co{WRITE_ONCE()},
as discussed in \cref{sec:memorder:Itanium}.
The ``Successful full-strength non-\co{void} RMW'' row captures
the fact that on some platforms (such as x86) atomic RMW instructions
provide full ordering both before and after.
The Linux kernel therefore requires that full-strength non-\co{void}
atomic RMW operations provide full ordering in cases where these
operations succeed.
	(Full-strength atomic RMW operations' names do not end in
	\co{_relaxed}, \co{_acquire}, or \co{_release}.)
As noted earlier, the case where these operations do not succeed
is covered by the ``\co{_relaxed()} RMW operation'' row.
However, the Linux kernel does not require that either \co{void}
or \co{_relaxed()} atomic RMW operations provide any ordering
whatsoever, with the canonical example being \co{atomic_inc()}.
	Therefore, these operations, along with failing non-\co{void}
	atomic RMW operations, may be preceded by \co{smp_mb__before_atomic()}
and followed by \co{smp_mb__after_atomic()} to provide full
ordering for any accesses preceding or following both.
No ordering need be provided for accesses between the
\co{smp_mb__before_atomic()} (or, similarly, the
\co{smp_mb__after_atomic()}) and the atomic RMW operation, as
indicated by the ``a'' entries on the \co{smp_mb__before_atomic()}
and \co{smp_mb__after_atomic()} rows of the table.
In short, the structure of this table is dictated by the
properties of the underlying hardware, which are constrained by
nothing other than the laws of physics, which were covered back in
\cref{chp:Hardware and its Habits}.
That is, the table is not random, although it is quite possible
that you are confused.
}\QuickQuizEndB
%
\QuickQuizE{
Why is
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
missing \co{smp_mb__after_unlock_lock()} and
\co{smp_mb__after_spinlock()}?
}\QuickQuizAnswerE{
These two primitives are rather specialized, and at present
seem difficult to fit into
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
The \co{smp_mb__after_unlock_lock()} primitive is intended to be placed
immediately after a lock acquisition, and ensures that all CPUs
see all accesses in prior critical sections as happening before
all accesses following the \co{smp_mb__after_unlock_lock()}
and also before all accesses in later critical sections.
Here ``all CPUs'' includes those CPUs not holding that lock,
and ``prior critical sections'' includes all prior critical sections
for the lock in question as well as all prior critical sections
for all other locks that were released by the same CPU that executed
the \co{smp_mb__after_unlock_lock()}.
The \co{smp_mb__after_spinlock()} provides the same guarantees
as does \co{smp_mb__after_unlock_lock()}, but also provides
additional visibility guarantees for other accesses performed
by the CPU that executed the \co{smp_mb__after_spinlock()}.
Given any store S performed prior to any earlier lock acquisition
and any load L performed after the \co{smp_mb__after_spinlock()},
all CPUs will see S as happening before~L\@.
In other words, if a CPU performs a store S, acquires a lock,
executes an \co{smp_mb__after_spinlock()}, then performs a
load L, all CPUs will see S as happening before~L\@.
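	In code, that sequence might look like the following sketch,
	in which \co{mylock}, \co{x}, and \co{y} are illustrative:
\begin{VerbatimL}
WRITE_ONCE(x, 1);          /* Store S.                            */
spin_lock(&mylock);
smp_mb__after_spinlock();  /* Immediately after lock acquisition. */
r1 = READ_ONCE(y);         /* Load L: all CPUs see S preceding L. */
spin_unlock(&mylock);
\end{VerbatimL}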
}\QuickQuizEndE
}
It is important to note that this table is just a cheat sheet,
and is therefore in no way a replacement for a good understanding
of memory ordering.
To begin building such an understanding, the next section will
present some basic rules of thumb.
\subsection{Basic Rules of Thumb}
\label{sec:memorder:Basic Rules of Thumb}
This section presents some basic rules of thumb that are ``good and
sufficient'' for a great many situations.
In fact, you could write a great deal of concurrent code having
excellent performance and scalability without needing anything more
than these rules of thumb.
More sophisticated rules of thumb will be presented in
\cref{sec:memorder:Memory-Model Intuitions}.
\QuickQuiz{
But how can I know that a given project can be designed
and coded within the confines of these rules of thumb?
}\QuickQuizAnswer{
Much of the purpose of the remainder of this chapter is
to answer exactly that question!
}\QuickQuizEnd
\paragraph{A given thread sees its own accesses in order.}
This rule assumes that loads and stores from/to shared variables use
\co{READ_ONCE()} and \co{WRITE_ONCE()}, respectively.
Otherwise, the compiler can profoundly scramble\footnote{
Many compiler writers prefer the word ``optimize''.}
your code, and sometimes the CPU can do a bit of scrambling as well,
as discussed in \cref{sec:memorder:Itanium}.
\paragraph{Interrupts and signal handlers are part of a thread.}
Both interrupt and signal handlers happen between a pair of adjacent
instructions in a thread.
This means that a given handler appears to execute atomically
from the viewpoint of the interrupted thread, at least at the
assembly-language level.
However, the C and C++ languages do not define the results of handlers
and interrupted threads sharing plain variables.
Instead, such shared variables must be \co{sig_atomic_t}, lock-free
atomics, or \co{volatile}.
On the other hand, because the handler executes within the interrupted
thread's context, the memory ordering used to synchronize communication
between the handler and the thread can be extremely lightweight.
For example, the counterpart of an acquire load is a \co{READ_ONCE()}
followed by a \co{barrier()} compiler directive and the counterpart
of a release store is a \co{barrier()} followed by a \co{WRITE_ONCE()}.
The counterpart of a full memory barrier is \co{barrier()}.
Finally, disabling interrupts or signals (as the case may be) within
the thread excludes handlers.
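For example, a flag-based handoff from a thread to its own signal
handler might look like the following sketch, in which \co{data},
\co{ready}, and \co{handle()} are illustrative:
\begin{VerbatimL}
/* Thread: the release-store counterpart. */
WRITE_ONCE(data, 42);
barrier();
WRITE_ONCE(ready, 1);

/* Handler: the acquire-load counterpart. */
if (READ_ONCE(ready)) {
	barrier();
	handle(READ_ONCE(data));
}
\end{VerbatimL}
Because the handler executes within the interrupted thread's context,
\co{barrier()}'s compiler-only ordering suffices here.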
% @@@ Using flags to avoid expensive interrupt-disabling instructions
% @@@ and signal-blocking system calls. Harder than it looks!
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/memorybarrier}}
\caption{Memory Barriers Provide Conditional If-Then Ordering}
\label{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
\end{figure}
\paragraph{Ordering has conditional if-then semantics.}
\Cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
illustrates this for memory barriers.
Assuming that both memory barriers are strong enough, if CPU~1's access
Y1 happens after CPU~0's access Y0, then CPU~1's access X1 is guaranteed
to happen after CPU~0's access X0.
\QuickQuiz{
How can you tell which memory barriers are strong enough for
a given use case?
}\QuickQuizAnswer{
Ah, that is a deep question whose answer requires most of the
rest of this chapter.
But the short answer is that \co{smp_mb()} is almost always
strong enough, albeit at some cost.
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-SB+o-mb-o+o-mb-o:whole]
\Cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test}
is a case in point.
The \co{smp_mb()} on \clnref{P0:mb,P1:mb} serve as the barriers,
the store to \co{x0} on \clnref{P0:st} as X0, the load from \co{x1}
on \clnref{P0:ld} as Y0, the store to \co{x1} on \clnref{P1:st} as Y1,
and the load from \co{x0} on \clnref{P1:ld} as X1.
Applying the if-then rule step by step, we know that the store to
\co{x1} on \clnref{P1:st} happens after the load from \co{x1} on \clnref{P0:ld} if
\co{P0()}'s local variable \co{r2} is set to the value zero.
The if-then rule would then state that the load from \co{x0} on
\clnref{P1:ld} happens after the store to \co{x0} on \clnref{P0:st}.
In other words,
\co{P1()}'s local variable \co{r2} is guaranteed
to end up with the value two \emph{only if}
\co{P0()}'s local variable \co{r2} ends up with the value zero.
This underscores the point that memory ordering guarantees are
conditional, not absolute.
\end{fcvref}
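For reference, the structure of that store-buffering test is roughly
as follows; this is a sketch rather than the verbatim listing, which
appears with numbered lines elsewhere in this book:
\begin{VerbatimL}
P0(int *x0, int *x1)
{
	int r2;

	WRITE_ONCE(*x0, 2);
	smp_mb();
	r2 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
	int r2;

	WRITE_ONCE(*x1, 2);
	smp_mb();
	r2 = READ_ONCE(*x0);
}

exists (0:r2=0 /\ 1:r2=0)
\end{VerbatimL}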
Although
\cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering}
specifically mentions memory barriers, this same if-then rule applies
to the rest of the Linux kernel's ordering operations.
\paragraph{Ordering operations must be paired.}
If you carefully order the operations in one thread, but then fail to do
so in another thread, then there is no ordering.
Both threads must provide ordering for the if-then rule to apply.\footnote{
In \cref{sec:memorder:Propagation}, pairing will be
generalized to cycles.}
\paragraph{Ordering operations almost never speed things up.}
If you find yourself tempted to add a memory barrier in an attempt
to force a prior store to be flushed to memory faster, resist!
Adding ordering usually slows things down.
Of course, there are situations where adding instructions speeds things
up, as was shown by
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR} on
\cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
but careful benchmarking is required in such cases.
And even then, it is quite possible that although you sped things up
a little bit on \emph{your} system, you might well have slowed things
down significantly on your users' systems.
Or on your future system.
\paragraph{Ordering operations are not magic.}
When your program is failing due to some \IX{race condition}, it is often
tempting to toss in a few memory-ordering operations in an attempt
to barrier your bugs out of existence.
A far better reaction is to use higher-level primitives in a carefully
designed manner.
With concurrent programming, it is almost always better to design your
bugs out of existence than to hack them down to lower probabilities.
\paragraph{These are only rough rules of thumb.}
Although these rules of thumb cover the vast majority of situations
seen in actual practice, as with any set of rules of thumb, they
do have their limits.
The next section will demonstrate some of these limits by introducing
trick-and-trap litmus tests that are intended to insult your
intuition while increasing your understanding.
These litmus tests will also illuminate many of the concepts
represented by the Linux-kernel memory-ordering cheat sheet shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet},
and can be automatically analyzed given proper
tooling~\cite{Alglave:2018:FSC:3173162.3177156}.
\Cref{sec:memorder:Memory-Model Intuitions} will
circle back to this cheat sheet, presenting a more sophisticated set of
rules of thumb in light of the lessons learned from all the intervening tricks
and traps.
\QuickQuiz{
Wait!!!
Where do I find this tooling that automatically analyzes
litmus tests???
}\QuickQuizAnswer{
Get version v4.17 (or later) of the Linux-kernel source code,
then follow the instructions in \path{tools/memory-model/README}
to install the needed tools.
Then follow the further instructions to run these tools on the
litmus test of your choice.
}\QuickQuizEnd
\section{Tricks and Traps}
\label{sec:memorder:Tricks and Traps}
%
\epigraph{Knowing where the trap is---that's the first step in evading it.}
{Duke Leto Atreides, \emph{Dune}, Frank Herbert}
Now that you know that hardware can reorder memory accesses and that you
can prevent it from doing so, the next step is to get you to admit
that your intuition has a problem.
This painful task is taken up by
\cref{sec:memorder:Variables With Multiple Values},
which presents some code demonstrating that scalar variables can
take on multiple values simultaneously,
and by
\crefthro{sec:memorder:Memory-Reference Reordering}
{sec:memorder:Multicopy Atomicity},
which show a series of intuitively correct code fragments that fail miserably
on real hardware.
Once your intuition has made it through the grieving process, later
sections will summarize the basic rules that memory ordering follows.
But first, let's take a quick look at just how many values a single
variable might have at a single point in time.
\subsection{Variables With Multiple Values}
\label{sec:memorder:Variables With Multiple Values}
It is natural to think of a variable as taking on a well-defined
sequence of values in a well-defined, global order.
Unfortunately, the next stop on the journey says ``goodbye'' to this comforting fiction.
Hopefully, you already started to say ``goodbye'' in response to row~2 of
\cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events,%
tab:memorder:Memory Ordering: Store-Buffering Sequence of Events},
and if so, the purpose of this section is to drive this point home.
\begin{fcvref}[ln:memorder:Software Logic Analyzer]
To this end, consider the program fragment shown in
\cref{lst:memorder:Software Logic Analyzer}.
This code fragment is executed in parallel by several CPUs.
\Clnref{setid} sets a shared variable to the current CPU's ID, \clnref{init}
initializes several variables from a \co{gettb()} function that
delivers the value of a fine-grained hardware ``timebase'' counter that is
synchronized among all CPUs (not available from all CPU architectures,
unfortunately!), and the loop from \clnrefrange{loop:b}{loop:e}
records the length of
time that the variable retains the value that this CPU assigned to it.
Of course, one of the CPUs will ``win'', and would thus never exit
the loop if not for the check on \clnrefrange{iftmout}{break}.
\end{fcvref}
\QuickQuiz{
What assumption is the code fragment
in \cref{lst:memorder:Software Logic Analyzer}
making that might not be valid on real hardware?
}\QuickQuizAnswer{
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
The actual code used to produce the data in the figures
discussed later in this section was therefore somewhat more
complex.
}\QuickQuizEnd
\begin{listing}
\begin{fcvlabel}[ln:memorder:Software Logic Analyzer]
\begin{VerbatimL}[commandchars=\\\[\]]
state.variable = mycpu; \lnlbl[setid]
lasttb = oldtb = firsttb = gettb(); \lnlbl[init]
while (state.variable == mycpu) { \lnlbl[loop:b]
lasttb = oldtb;
oldtb = gettb();
if (lasttb - firsttb > 1000) \lnlbl[iftmout]
break; \lnlbl[break]
} \lnlbl[loop:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Software Logic Analyzer}
\label{lst:memorder:Software Logic Analyzer}
\end{listing}
Upon exit from the loop, \co{firsttb} will hold a timestamp
taken shortly after the assignment and \co{lasttb} will hold
a timestamp taken before the last sampling of the shared variable
that still retained the assigned value, or a value equal to \co{firsttb}
if the shared variable had changed before entry into the loop.
This allows us to plot each CPU's view of the value of \co{state.variable}
over a 532-nanosecond time period, as shown in
\cref{fig:memorder:A Variable With Multiple Simultaneous Values}.
This data was collected in 2006 on a 1.5\,GHz \Power{5} system with 8 cores,
each containing a pair of hardware threads.
CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test.
The timebase counter period was about 5.32\,ns, sufficiently fine-grained
to allow observations of intermediate cache states.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/MoreThanOneValue}}
\caption{A Variable With Multiple Simultaneous Values}
\label{fig:memorder:A Variable With Multiple Simultaneous Values}
\end{figure}
Each horizontal bar represents the observations of a given CPU over time,
with the gray regions to the left indicating the time before the
corresponding CPU's first measurement.
During the first 5\,ns, only CPU~3 has an opinion about the value of the
variable.
During the next 10\,ns, CPUs~2 and~3 disagree on the value of the variable,
but thereafter agree that the value is~``2'', which is in fact
the final agreed-upon value.
However, CPU~1 believes that the value is~``1'' for almost 300\,ns, and
CPU~4 believes that the value is~``4'' for almost 500\,ns.
\QuickQuizSeries{%
\QuickQuizB{
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
}\QuickQuizAnswerB{
As discussed in
\cref{sec:memorder:Why Hardware Misordering?},
many CPUs have store buffers that record the values of
recent stores, which do not become globally visible until
the corresponding cache line makes its way to the CPU\@.
Therefore, it is quite possible for each CPU to see its own value
for a given variable (in its own store buffer) at a single point
in time---and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
Fortunately, software rarely cares about the fact that multiple
CPUs might see multiple values for the same variable.
}\QuickQuizEndB
%
\QuickQuizE{
Why do CPUs~2 and~3 come to agreement so quickly, when it
takes so long for CPUs~1 and~4 to come to the party?
}\QuickQuizAnswerE{
CPUs~2 and~3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a \IXacr{numa}, or, more accurately, a \IXacr{nuca} effect.
This leads to the question of why CPUs~2 and~3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory-ordering operations in the code fragment.
}\QuickQuizEndE
}
And if you think that the situation with four CPUs was intriguing, consider
\cref{fig:memorder:A Variable With More Simultaneous Values},
which shows the same situation, but with 15~CPUs each assigning their
number to a single shared variable at time~$t=0$. Both diagrams in the
figure are drawn in the same way as
\cref{fig:memorder:A Variable With Multiple Simultaneous Values}.
The only difference is that the unit of the horizontal axis is timebase ticks,
with each tick lasting about 5.3~nanoseconds.
The entire sequence therefore lasts a bit longer than the events recorded in
\cref{fig:memorder:A Variable With Multiple Simultaneous Values},
consistent with the increase in number of CPUs.
The upper diagram shows the overall picture, while the lower one zooms
in on the first 50~timebase ticks.
Again, CPU~0 coordinates the test, so does not record any values.
\begin{figure*}
\centering
\IfEbookSize{
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}}
}{
\resizebox{5in}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}}
}
\caption{A Variable With More Simultaneous Values}
\ContributedBy{fig:memorder:A Variable With More Simultaneous Values}{Akira Yokosawa}
\end{figure*}
All CPUs eventually agree on the final value of~9, but not before
the values~15 and~12 take early leads.
Note that there are fourteen different opinions on the variable's value
at time~21, as indicated by the vertical line in the lower diagram.
Note also that all CPUs see sequences whose orderings are consistent with
the directed graph shown in
\cref{fig:memorder:Possible Global Orders With More Simultaneous Values}.
Nevertheless, these figures underscore the importance of
proper use of memory-ordering operations.
\begin{figure}
\centering
\resizebox{2.0in}{!}{\includegraphics{memorder/store15tred}}
\caption{Possible Global Orders With More Simultaneous Values}
\label{fig:memorder:Possible Global Orders With More Simultaneous Values}
\end{figure}
How many values can a single variable take on at a single point in
time?
As many as one per store buffer in the system!
We have therefore entered a regime where we must bid a fond farewell to
comfortable intuitions about values of variables and the passage of time.
This is the regime where memory-ordering operations are needed.
But remember well the lessons from
\cref{chp:Hardware and its Habits,%
chp:Partitioning and Synchronization Design}.
Having all CPUs store concurrently to the same variable
is no way to design a parallel program, at least
not if performance and scalability are at all important to you.
Unfortunately, memory ordering has many other ways of insulting your
intuition, and not all of these ways conflict with performance and
scalability.
The next section gives an overview of the reordering of unrelated memory references.
\subsection{Memory-Reference Reordering}
\label{sec:memorder:Memory-Reference Reordering}
\Cref{sec:memorder:Why Hardware Misordering?}
showed that even relatively strongly ordered systems like x86
can reorder prior stores with later loads, at least when the
store and load are to different variables.
This section builds on that result, looking at the other combinations of
loads and stores.
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-o@whole.fcv}
\caption{Message-Passing Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Load}
\label{sec:memorder:Load Followed By Load}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-o:whole]
\Cref{lst:memorder:Message-Passing Litmus Test (No Ordering)}
(\path{C-MP+o-wmb-o+o-o.litmus})
shows the classic \emph{message-passing} litmus test, where \co{x0} is
the message and \co{x1} is a flag indicating whether or not a message
is available.
In this test, the \co{smp_wmb()} forces \co{P0()}'s stores to be ordered,
but no ordering is specified for the loads.
Relatively strongly ordered architectures, such as x86, do enforce ordering.
However, weakly ordered architectures often do
not~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on \clnref{exists} of the listing \emph{can}
trigger.
\end{fcvref}
One rationale for reordering loads from different locations is that doing
so allows execution to proceed when an earlier load misses the cache,
but the values for later loads are already present.
\QuickQuiz{
But why make load-load reordering visible to the user?
Why not just use speculative execution to allow execution to
proceed in the common case where there are no intervening
stores, in which case the reordering cannot be visible anyway?
}\QuickQuizAnswer{
	They can, and many do; otherwise, systems containing
	strongly ordered CPUs would be slow indeed.
However, speculative execution does have its downsides, especially
if speculation must be rolled back frequently, particularly
on battery-powered systems.
Speculative execution can also introduce side channels, which
might in turn be exploited to exfiltrate information.
But perhaps future systems will be able to overcome these
disadvantages.
Until then, we can expect vendors to continue producing
weakly ordered CPUs.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-rmb-o@whole.fcv}
\caption{Enforcing Order of Message-Passing Litmus Test}
\label{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-rmb-o:whole]
Thus, portable code relying on ordered loads must
add explicit ordering, for example, the \co{smp_rmb()} shown on
\clnref{rmb} of
\cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test}
(\path{C-MP+o-wmb-o+o-rmb-o.litmus}), which prevents
the \co{exists} clause from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-o+o-o@whole.fcv}
\caption{Load-Buffering Litmus Test (No Ordering)}
\label{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
\end{listing}
\subsubsection{Load Followed By Store}
\label{sec:memorder:Load Followed By Store}
\begin{fcvref}[ln:formal:C-LB+o-o+o-o:whole]
\Cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
(\path{C-LB+o-o+o-o.litmus})
shows the classic \emph{load-buffering} litmus test.
Although relatively strongly ordered systems such as x86
or the IBM Mainframe do not reorder prior loads with subsequent stores,
many weakly ordered architectures really do allow such
reordering~\cite{JadeAlglave2011ppcmem}.
Therefore, the \co{exists} clause on \clnref{exists} really can trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+a-o@whole.fcv}
\caption{Enforcing Ordering of Load-Buffering Litmus Test}
\label{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+a-o:whole]
Although it is rare for actual hardware to
exhibit this reordering~\cite{LucMaranget2017aarch64},
one situation where it might be desirable to do so is when a load
misses the cache, the store buffer is nearly full, and the cacheline for
a subsequent store is ready at hand.
Therefore, portable code must enforce any required ordering, for example,
as shown in
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
(\path{C-LB+o-r+a-o.litmus}).
The \co{smp_store_release()} and \co{smp_load_acquire()} guarantee that
the \co{exists} clause on \clnref{exists} never triggers.
\end{fcvref}
\QuickQuiz{
Why should it be rare for systems to reorder prior loads with
later stores?
}\QuickQuizAnswer{
Keep in mind that conventional computer systems are not permitted
to make a given CPU's speculative stores visible to other CPUs.
This constraint has consequences.
For example, suppose that address translation has not yet completed
for that prior load.
In that case, it is quite possible that this address translation
will result in a segmentation violation, which might in turn mean
that the later store would never execute.
Therefore, a given store cannot make its value visible to other
CPUs until address translation has completed for each and every
prior load.
Of course, memory latencies can be quite long compared to
address-translation delays, especially if all the information
required for a given address translation is already present
in the translation lookaside buffer (TLB)\@.
Although these long memory latencies should provide ample
	opportunity for load-to-store reordering, the fact is that such
reordering is demonstrably rare, especially on the large systems
that would be expected to have the longest memory latencies.
And the reason for that is that large systems are most likely
to have parity or error-correcting code (ECC) checking on memory.
An uncorrectable parity or ECC error would cause a load to take a
fault, which would prevent subsequent stores from ever executing
just as surely as would a segmentation violation.
Therefore, on systems featuring parity or ECC, a subsequent store
cannot expose its value to other CPUs until after parity or
ECC checking has completed for all prior loads.
A given load's checking clearly cannot be carried out until that
load's value has been returned (check bits and all), which has
the side effect of prohibiting reordering of prior loads with
later stores.
So the only systems that can exhibit load-to-store reordering are
those lacking both parity and ECC, for example, small embedded
systems and GPGPUs.
Given that load-to-store reordering is not possible on common-case
computer systems, why do hardware memory models continue to
permit it?
Only CPU architects know for sure, but here are some possible
reasons:
\begin{enumerate}
\item There might be increased demand for small systems that
do not require parity or ECC.
\item Memory might become more reliable, so that larger systems
no longer require parity or ECC.
\item Systems having multiple hardware threads per core might
permit cross-hardware-thread speculation, which would
permit speculative stores to make their values visible
to other threads in that core.
\item Operating systems and hardware might work together to
terminate any processes affected by an uncorrectable
parity or ECC error.
This might allow speculative stores to make their
values visible to other CPUs, give or take things
like debuggers and assertions.
\end{enumerate}
In all of these cases, there might well be compelling performance
advantages to once again permitting load-to-store reordering.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-o+o-rmb-o@whole.fcv}
\caption{Message-Passing Litmus Test, No Writer Ordering (No Ordering)}
\label{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)}
\end{listing}
\subsubsection{Store Followed By Store}
\label{sec:memorder:Store Followed By Store}
\Cref{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)}
(\path{C-MP+o-o+o-rmb-o.litmus})
once again shows the classic message-passing litmus test, with the
\co{smp_rmb()} providing ordering for \co{P1()}'s loads, but without
any ordering for \co{P0()}'s stores.
Again, the relatively strongly ordered architectures do enforce ordering,
but weakly ordered architectures do not necessarily do
so~\cite{JadeAlglave2011ppcmem}, which means that the
\co{exists} clause can trigger.
One situation in which such reordering could be beneficial is when
the store buffer is full, another store is ready to execute, but the
cacheline needed by the oldest store is not yet available.
In this situation, allowing stores to complete out of order would
allow execution to proceed.
Therefore, portable code must explicitly order the stores, for
example, as shown in
\cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test},
thus preventing the \co{exists} clause from triggering.
\QuickQuiz{
Why should strongly ordered systems pay the performance price
of unnecessary \co{smp_rmb()} and \co{smp_wmb()} invocations?
Shouldn't weakly ordered systems shoulder the full cost of
their misordering choices???
}\QuickQuizAnswer{
That is in fact exactly what happens.
On strongly ordered systems, \co{smp_rmb()} and \co{smp_wmb()}
emit no instructions, but instead just constrain the compiler.
Thus, in this case, weakly ordered systems do in fact shoulder
the full cost of their memory-ordering choices.
}\QuickQuizEnd
\subsection{Address Dependencies}
\label{sec:memorder:Address Dependencies}
An \emph{\IXBh{address}{dependency}} occurs when the value returned by a load
instruction is used to compute the address used by a later memory-reference
instruction.
This means that the exact same sequence of instructions used to traverse
a linked data structure in single-threaded code provides weak but extremely
useful ordering in concurrent code.
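For example, the following fragment carries an address dependency from
the pointer load to the dependent load; \co{gp} and the \co{->val}
field are illustrative:
\begin{VerbatimL}
p = READ_ONCE(gp);        /* Load the pointer.                         */
r1 = READ_ONCE(p->val);   /* Dependent load: its address comes from p. */
\end{VerbatimL}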
\QuickQuiz{
Must an address dependency begin with a load instruction?
Why not something like \co{xchg_relaxed()}, which also loads
a value from memory?
}\QuickQuizAnswer{
Yes, pretty much any instruction that loads a value from memory
can start an address dependency, including certain atomic operations
such as \co{xchg_relaxed()}.
The same is true of the control and data dependencies discussed in
\cref{sec:memorder:Control Dependencies,sec:memorder:Data Dependencies}.
	However, in all three cases, if you are going to start your
	dependency with an atomic operation, it is often more convenient
	and maintainable to instead use an atomic operation with acquire
	semantics, in this example, \co{xchg_acquire()} or maybe even
	just the fully ordered \co{xchg()}.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-addr-o@whole.fcv}
\caption{Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
\label{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
\Cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
(\path{C-MP+o-wmb-o+o-addr-o.litmus})
shows a linked variant of the message-passing pattern.
The head pointer is \co{x1}, which initially
references the \co{int} variable \co{y} (\clnref{init:x1}), which is in turn
initialized to the value $1$ (\clnref{init:y}).
\co{P0()} updates head pointer \co{x1} to reference \co{x0} (\clnref{P0:x1}),
but only after initializing it to $2$ (\clnref{P0:x0}) and forcing ordering
(\clnref{P0:wmb}).
\co{P1()} picks up the head pointer \co{x1} (\clnref{P1:x1}), and then loads
the referenced value (\clnref{P1:ref}).
There is thus an address dependency from the load on \clnref{P1:x1} to the
load on \clnref{P1:ref}.
In this case, the value returned by \clnref{P1:x1} is exactly the address
used by \clnref{P1:ref}, but many variations are possible,
\end{fcvref}
including field access using the C-language \co{->} operator,
addition, subtraction, and array indexing.\footnote{
But note that in the Linux kernel, the address dependency must
be carried through the pointer to the array, not through the
array index.}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
One might hope that \clnref{P1:x1}'s load from the head pointer would be ordered
before \clnref{P1:ref}'s dereference, which is in fact the case on Linux v4.15
and later.
However, prior to v4.15, this was not the case on DEC Alpha, which could
in effect use a speculated value for the dependent load, as described
in more detail in \cref{sec:memorder:Alpha}.
Therefore, on older versions of Linux,
\cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}'s
\co{exists} clause \emph{can} trigger.
\end{fcvref}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
\begin{VerbatimL}[commandchars=\@\[\]]
C C-MP+o-wmb-o+ld-addr-o
{
y=1;
x1=y;
}
P0(int* x0, int** x1) {
WRITE_ONCE(*x0, 2);
smp_wmb();
WRITE_ONCE(*x1, x0);
}
P1(int** x1) {
int *r2;
int r3;
r2 = lockless_dereference(*x1); // Obsolete @lnlbl[deref]
r3 = READ_ONCE(*r2); @lnlbl[read]
}
exists (1:r2=x0 /\ 1:r3=1)
\end{VerbatimL}
\end{fcvlabel}
\caption{Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
\label{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole]
\Cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
% \path{C-MP+o-wmb-o+ld-addr-o.litmus} available at commit bc4b1c3f3b35
% ("styleguide: Loosen restriction on comment in litmus test")
shows how to make this work reliably on pre-v4.15 Linux kernels running on
DEC Alpha, by replacing \co{READ_ONCE()} on \clnref{P1:x1} of
\cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}
with \apikh{lockless_dereference()},\footnote{
Note that \co{lockless_dereference()} is not needed on v4.15 and
later, and therefore is not available in these later Linux kernels.
Nor is it needed in versions of this book containing this footnote.}
which acts like \co{READ_ONCE()} on all platforms other than DEC Alpha,
where it acts like a \co{READ_ONCE()} followed by an \co{smp_mb()},
thereby forcing the required ordering on all platforms, in turn
preventing the \co{exists} clause from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-S+o-wmb-o+o-addr-o@whole.fcv}
\caption{S Address-Dependency Litmus Test}
\label{lst:memorder:S Address-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
But what happens if the dependent operation is a store rather than
a load, for example, in the \emph{S}
litmus test~\cite{JadeAlglave2011ppcmem} shown in
\cref{lst:memorder:S Address-Dependency Litmus Test}
(\path{C-S+o-wmb-o+o-addr-o.litmus})?
Because no production-quality platform speculates stores,
it is not possible for the \co{WRITE_ONCE()} on \clnref{P0:x0} to overwrite
the \co{WRITE_ONCE()} on \clnref{P1:r2}, meaning that the \co{exists}
clause on \clnref{exists} cannot trigger, even on DEC Alpha, even
in pre-v4.15 Linux kernels.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
But how do we know that \emph{all} platforms really avoid
triggering the \co{exists} clauses in
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
lst:memorder:S Address-Dependency Litmus Test}?
}\QuickQuizAnswerB{
Answering this requires identifying three major groups of platforms:
\begin{enumerate*}[(1)]
\item Total-store-order (TSO) platforms,
\item Weakly ordered platforms, and
\item DEC Alpha.
\end{enumerate*}
\begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
The TSO platforms order all pairs of memory references except for
prior stores against later loads.
Because the address dependency on \clnref{deref,read} of
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}
is instead a load followed by another load, TSO platforms preserve
this address dependency.
\end{fcvref}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
They also preserve the address dependency on \clnref{P1:x1,P1:r2} of
\cref{lst:memorder:S Address-Dependency Litmus Test}
because this is a load followed by a store.
Because address dependencies must start with a load, TSO platforms
implicitly but completely respect them, give or take compiler
optimizations, hence the need for \co{READ_ONCE()}.
\end{fcvref}
	Weakly ordered platforms don't necessarily maintain ordering of
	unrelated accesses.
	However, the accesses in
	\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
	lst:memorder:S Address-Dependency Litmus Test} are not unrelated:
	They are linked by address dependencies, which the hardware tracks
	in order to maintain the needed ordering.
\begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)]
There is one (famous) exception to this rule for weakly ordered
platforms, and that exception is DEC Alpha for load-to-load
address dependencies.
And this is why, in Linux kernels predating v4.15, DEC Alpha
requires the explicit memory barrier supplied for it by the
now-obsolete \apikh{lockless_dereference()} on \clnref{deref} of
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}.
\end{fcvref}
\begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole]
However, DEC Alpha does track load-to-store address dependencies,
which is why \clnref{P1:x1} of
\cref{lst:memorder:S Address-Dependency Litmus Test}
does not need a \co{lockless_dereference()}, even in Linux
kernels predating v4.15.
\end{fcvref}
To sum up, current platforms either respect address dependencies
implicitly, as is the case for TSO platforms (x86, mainframe,
SPARC,~\dots), have hardware tracking for address dependencies
(\ARM, PowerPC, MIPS,~\dots), have the required memory barriers
supplied by \co{READ_ONCE()} (DEC Alpha in Linux kernel v4.15 and
later), or supplied by
\co{rcu_dereference()} (DEC Alpha in Linux kernel v4.14 and earlier).
}\QuickQuizEndB
%
\QuickQuizM{
Why the use of \co{smp_wmb()} in
\cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),%
lst:memorder:S Address-Dependency Litmus Test}?
Wouldn't \co{smp_store_release()} be a better choice?
}\QuickQuizAnswerM{
In most cases, \co{smp_store_release()} is indeed a better choice.
However, \co{smp_wmb()} was there first in the Linux kernel,
so it is still good to understand how to use it.
}\QuickQuizEndM
%
\QuickQuizE{
	SB, MP, LB, and now~S\@.
Where do all these litmus-test abbreviations come from and
how can anyone keep track of them?
}\QuickQuizAnswerE{
The best scorecard is the infamous
\co{test6.pdf}~\cite{test6-pdf}.
Unfortunately, not all of the abbreviations have catchy
expansions like SB (store buffering), MP (message passing),
and LB (load buffering), but at least the list of abbreviations
is readily available.
}\QuickQuizEndE
}
However, it is important to note that address dependencies can
be fragile and easily broken by compiler optimizations, as discussed in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
\subsection{Data Dependencies}
\label{sec:memorder:Data Dependencies}
A \emph{\IXBh{data}{dependency}} occurs when the value returned by a load
instruction is used to compute the data stored by a later store
instruction.
Note well the ``data'' above:
If the value returned by a load was instead used to compute the address
used by a later store instruction, that would instead be an address
dependency, which was covered in
\cref{sec:memorder:Address Dependencies}.
However, the existence of data dependencies means that the exact same
sequence of instructions used to update a linked data structure in
single-threaded code provides weak but extremely useful ordering in
concurrent code.
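The distinction can be seen in the following fragment, in which
\co{pp}, \co{x}, and \co{y} are illustrative:
\begin{VerbatimL}
p = READ_ONCE(pp);
WRITE_ONCE(*p, 1);    /* Address dependency: p supplies the address.    */

r1 = READ_ONCE(*x);
WRITE_ONCE(*y, r1);   /* Data dependency: r1 supplies the stored value. */
\end{VerbatimL}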
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+o-data-o@whole.fcv}
\caption{Load-Buffering Data-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole]
\Cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
(\path{C-LB+o-r+o-data-o.litmus})
is similar to
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
except that \co{P1()}'s ordering between \clnref{ld,st} is
enforced not by an \IX{acquire load}, but instead by a data dependency:
The value loaded by \clnref{ld} is what \clnref{st} stores.
The ordering provided by this data dependency is sufficient to prevent
the \co{exists} clause from triggering.
\end{fcvref}
Just as with address dependencies, data dependencies are
fragile and can be easily broken by compiler optimizations, as discussed in
\cref{sec:memorder:Address- and Data-Dependency Difficulties}.
In fact, data dependencies can be even more fragile than are address
dependencies.
The reason for this is that address dependencies normally involve
pointer values.
In contrast, as shown in
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test},
it is tempting to carry data dependencies through integral values,
which the compiler has much more freedom to optimize into nonexistence.
For but one example, if the integer loaded was multiplied by the constant
zero, the compiler would know that the result was zero, and could therefore
substitute the constant zero for the value loaded, thus breaking
the dependency.
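In such a case, the compiler is within its rights to replace
\co{r1 * 0} in the following fragment with the constant zero, leaving
no data dependency for the hardware to preserve:
\begin{VerbatimL}
r1 = READ_ONCE(*x);
WRITE_ONCE(*y, r1 * 0);   /* May be emitted as WRITE_ONCE(*y, 0). */
\end{VerbatimL}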
\QuickQuiz{
\begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole]
But wait!!!
\Clnref{ld} of
\cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}
uses \co{READ_ONCE()}, which marks the load as volatile,
which means that the compiler absolutely must emit the load
instruction even if the value is later multiplied by zero.
So how can the compiler possibly break this data dependency?
\end{fcvref}
}\QuickQuizAnswer{
Yes, the compiler absolutely must emit a load instruction for
a volatile load.
But if you multiply the value loaded by zero, the compiler is
well within its rights to substitute a constant zero for the
result of that multiplication, which will break the data
dependency on many platforms.
Worse yet, if the dependent store does not use \co{WRITE_ONCE()},
the compiler could hoist it above the load, which would cause
even TSO platforms to fail to provide ordering.
}\QuickQuizEnd
In short, you can rely on data dependencies only if you prevent the
compiler from breaking them.
\subsection{Control Dependencies}
\label{sec:memorder:Control Dependencies}
A \emph{\IXBh{control}{dependency}} occurs when the value returned by a load
instruction is tested to determine whether or not a later store instruction
is executed.
In other words, a simple conditional branch or conditional-move
instruction can act as a weak but low-overhead memory-barrier instruction.
However, note well the ``later store instruction'':
Although all platforms respect load-to-store dependencies, many platforms
do \emph{not} respect load-to-load control dependencies.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-r+o-ctrl-o@whole.fcv}
\caption{Load-Buffering Control-Dependency Litmus Test}
\label{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-LB+o-r+o-ctrl-o:whole]
\Cref{lst:memorder:Load-Buffering Control-Dependency Litmus Test}
(\path{C-LB+o-r+o-ctrl-o.litmus})
shows another load-buffering example, this time using a control
dependency (\clnref{if}) to order the load on \clnref{ld} and the store on
\clnref{st}.
This ordering is sufficient to prevent the \co{exists} clause from triggering.
\end{fcvref}
However, control dependencies are even more susceptible to being optimized
out of existence than are data dependencies, and
\cref{sec:memorder:Control-Dependency Calamities}
describes some of the rules that must be followed in order to prevent
your compiler from breaking your control dependencies.
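For example, if both branches store the same value, as in the following
sketch, the compiler may eliminate the conditional branch entirely:
\begin{VerbatimL}
q = READ_ONCE(*x);
if (q)
	WRITE_ONCE(*y, 1);
else
	WRITE_ONCE(*y, 1);
/* Both branches store the same value, so the compiler may  */
/* eliminate the branch, leaving the CPU free to reorder    */
/* the store to y with the load from x.                     */
\end{VerbatimL}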
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-r+o-ctrl-o@whole.fcv}
\caption{Message-Passing Control-Dependency Litmus Test (No Ordering)}
\label{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-MP+o-r+o-ctrl-o:whole]
It is worth reiterating that control dependencies provide ordering only
from loads to stores.
Therefore, the load-to-load control dependency shown on \clnrefrange{ld1}{ld2} of
\cref{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)}
(\path{C-MP+o-r+o-ctrl-o.litmus})
does \emph{not} provide ordering, and therefore does \emph{not}
prevent the \co{exists} clause from triggering.
\end{fcvref}
In summary, control dependencies can be useful, but they are
high-maintenance items.
You should therefore use them only when performance considerations
permit no other solution.
\QuickQuiz{
Wouldn't control dependencies be more robust if they were
mandated by language standards???
}\QuickQuizAnswer{
But of course!
And perhaps in the fullness of time they will be so mandated.
}\QuickQuizEnd
\subsection{Cache Coherence}
\label{sec:memorder:Cache Coherence}
On cache-coherent platforms, all CPUs agree on the order of loads and
stores to a given variable.
Fortunately, when \co{READ_ONCE()} and \co{WRITE_ONCE()} are used,
almost all platforms are cache-coherent, as indicated by the ``SV''
column of the cheat sheet shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
Unfortunately, this property is so popular that it has been named
multiple times, with ``single-variable SC'',\footnote{
Recall that SC stands for sequentially consistent.}
``single-copy atomic''~\cite{Stone:1995:SP:623262.623912},
and just plain ``coherence''~\cite{JadeAlglave2011ppcmem}
having seen use.
Rather than further compound the confusion by inventing yet another term
for this concept, this book uses ``\IX{cache coherence}'' and ``coherence''
interchangeably.
\begin{listing}
\input{CodeSamples/formal/litmus/C-CCIRIW+o+o+o-o+o-o@whole.fcv}
\caption{Cache-Coherent IRIW Litmus Test}
\label{lst:memorder:Cache-Coherent IRIW Litmus Test}
\end{listing}
\begin{fcvref}[ln:formal:C-CCIRIW+o+o+o-o+o-o:whole]
\Cref{lst:memorder:Cache-Coherent IRIW Litmus Test}
(\path{C-CCIRIW+o+o+o-o+o-o.litmus})
shows a litmus test that tests for cache coherence,
where ``IRIW'' stands
for ``independent reads of independent writes''.
Because this litmus test uses only one variable,
\co{P2()} and \co{P3()} must agree
on the order of \co{P0()}'s and \co{P1()}'s stores.
In other words, if \co{P2()} believes that \co{P0()}'s store
came first, then \co{P3()} had better not believe that
\co{P1()}'s store came first.
And in fact the \co{exists} clause on \clnref{exists} will trigger if this
situation arises.
\end{fcvref}
\QuickQuiz{
But in
\cref{lst:memorder:Cache-Coherent IRIW Litmus Test},
	wouldn't it be just as bad if \co{P2()}'s \co{r1} and \co{r2}
obtained the values~2 and~1, respectively, while \co{P3()}'s
\co{r3} and \co{r4} obtained the values~1 and~2, respectively?
}\QuickQuizAnswer{
Yes, it would.
Feel free to modify the \co{exists} clause to
check for that outcome and see what happens.
}\QuickQuizEnd
It is tempting to speculate that different-sized overlapping loads
and stores to a single region of memory (as might be set up using
the C-language \co{union} keyword) would provide similar ordering
guarantees.
However, Flur et al.~\cite{Flur:2017:MCA:3093333.3009839} discovered some
surprisingly simple litmus tests that demonstrate that such guarantees
can be violated on real hardware.
It is therefore necessary to restrict code to non-overlapping
same-sized aligned accesses to a given variable, at least if portability
is a consideration.\footnote{
There is reason to believe that using atomic RMW operations
(for example, \co{xchg()}) for all the stores will
provide sequentially consistent ordering, but this has not
yet been proven either way.}
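For example, a region of memory admitting mixed-size accesses might be
set up as in the following sketch, in which the \co{union mixed} type
is illustrative; portable code should not expect coherence-style
guarantees when loads of \co{->full} race with stores to \co{->byte[]}:
\begin{VerbatimL}
union mixed {
	unsigned long full;
	unsigned char byte[sizeof(unsigned long)];
};
\end{VerbatimL}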
Adding more variables and threads increases the scope for reordering
and other counter-intuitive behavior, as discussed in the next section.
\subsection{Multicopy Atomicity}
\label{sec:memorder:Multicopy Atomicity}
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/SystemArchBus}}
\caption{Global System Bus And Multi-Copy Atomicity}
\label{fig:memorder:Global System Bus And Multi-Copy Atomicity}
\end{figure}
Threads running on a fully
\emph{multicopy atomic}~\cite{Stone:1995:SP:623262.623912}
platform are guaranteed
to agree on the order of stores, even to different variables.
A useful mental model of such a system is the single-bus architecture
shown in
\cref{fig:memorder:Global System Bus And Multi-Copy Atomicity}.
If each store resulted in a message on the bus, and if the bus could
accommodate only one store at a time, then any pair of CPUs would
agree on the order of all stores that they observed.
Unfortunately, building a computer system as shown in the figure, without
store buffers or even caches, would result in glacially slow computation.
Most CPU vendors interested in providing
\IXBalt{multicopy atomicity}{multi-copy atomic} therefore
instead provide the slightly weaker
\emph{other-multicopy atomicity}~\cite[Section B2.3]{ARMv8A:2017},
which excludes the CPU doing a given store from the requirement that all
CPUs agree on the order of all stores.\footnote{
As of early 2021, \ARMv8 and x86 provide other-multicopy atomicity,
IBM mainframe provides full multicopy atomicity, and PPC
provides no multicopy atomicity at all.
More detail is shown in
\cref{tab:memorder:Summary of Memory Ordering}
on
\cpageref{tab:memorder:Summary of Memory Ordering}.}
This means that if only a subset of CPUs are doing stores, the
other CPUs will agree on the order of stores, hence the ``other''
in ``\IXBalthy{other-multicopy atomicity}{other-}{multi-copy atomic}''.
In contrast with multicopy-atomic platforms, other-multicopy-atomic
platforms permit the CPU doing a given store to observe that
store early, which allows its later loads to obtain the newly stored
value directly from the store buffer, thus improving performance.
\QuickQuiz{
Can you give a specific example showing different behavior for
multicopy atomic on the one hand and other-multicopy atomic
on the other?
}\QuickQuizAnswer{
\Cref{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
(\path{C-MP-OMCA+o-o-o+o-rmb-o.litmus})
shows such a test.
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP-OMCA+o-o-o+o-rmb-o@whole.fcv}
\caption{Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\label{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic}
\end{listing}
\begin{fcvref}[ln:formal:C-MP-OMCA+o-o-o+o-rmb-o:whole]
On a multicopy-atomic platform, \co{P0()}'s store to \co{x} on
\clnref{P0:st} must become visible to both \co{P0()} and \co{P1()}
simultaneously.
Because this store becomes visible to \co{P0()} on \clnref{P0:ld}, before
\co{P0()}'s store to \co{y} on \clnref{P0:y}, \co{P0()}'s store to
\co{x} must become visible before its store to \co{y} everywhere,
including \co{P1()}.
Therefore, if \co{P1()}'s load from \co{y} on \clnref{P1:y} returns the
value 1, so must its load from \co{x} on \clnref{P1:x}, given that
the \co{smp_rmb()} on \clnref{P1:rmb} forces these two loads to execute
in order.
Therefore, the \co{exists} clause on \clnref{exists} cannot trigger on a
multicopy-atomic platform.
\end{fcvref}
	In contrast, on an other-multicopy-atomic platform, \co{P0()}
	could see its own store early, so that there would be no constraint
	on the order in which the two stores become visible to \co{P1()},
	which in turn allows the \co{exists} clause to trigger.
}\QuickQuizEnd
Perhaps there will come a day when all platforms provide some flavor
of multi-copy atomicity, but
in the meantime, \IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic}
platforms do exist, and so software must deal with them.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WRC+o+o-data-o+o-rmb-o@whole.fcv}
\caption{WRC Litmus Test With Dependencies (No Ordering)}
\label{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
\Cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
(\path{C-WRC+o+o-data-o+o-rmb-o.litmus})
demonstrates multicopy atomicity, that is, on a multicopy-atomic platform,
the \co{exists} clause on \clnref{exists} cannot trigger.
In contrast, on a non-multicopy-atomic
platform this \co{exists} clause can trigger, despite
\co{P1()}'s accesses being ordered by a data dependency and \co{P2()}'s
accesses being ordered by an \co{smp_rmb()}.
Recall that the definition of multicopy atomicity requires that all
threads agree on the order of stores, which can be thought of as
all stores reaching all threads at the same time.
Therefore, a non-multicopy-atomic platform can have a store reach
different threads at different times.
In particular, \co{P0()}'s store might reach \co{P1()} long before it
reaches \co{P2()}, which raises the possibility that \co{P1()}'s store
might reach \co{P2()} before \co{P0()}'s store does.
\end{fcvref}
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{memorder/NonMCAplatform}}
\caption{Shared Store Buffers And Multi-Copy Atomicity}
\label{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}
\end{figure}
This leads to the question of why a real system constrained by the
usual laws of physics would ever trigger the \co{exists} clause of
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}.
A cartoonish diagram of such a real system is shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
CPU~0 and CPU~1 share a store buffer, as do CPUs~2 and~3.
This means that CPU~1 can load a value out of the store buffer, thus
potentially immediately seeing a value stored by CPU~0.
In contrast, CPUs~2 and~3 will have to wait for the corresponding \IX{cache
line} to carry this new value to them.
\QuickQuiz{
Then who would even \emph{think} of designing a system with shared
store buffers???
}\QuickQuizAnswer{
This is in fact a very natural design for any system having
multiple hardware threads per core.
Natural from a hardware point of view, that is!
}\QuickQuizEnd
\begin{table*}
\small
\centering\OneColumnHSpace{-0.8in}
\renewcommand*{\arraystretch}{1.1}
\rowcolors{10}{}{lightgray}
\ebresizewidth{
\begin{tabular}{rlllllll}\toprule
& \multicolumn{1}{c}{\tco{P0()}} & \multicolumn{2}{c}{\tco{P0()} \& \tco{P1()}} &
\multicolumn{1}{c}{\tco{P1()}} & \multicolumn{3}{c}{\tco{P2()}} \\
\cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(lr){5-5} \cmidrule(l){6-8}
& Instruction & Store Buffer & Cache & Instruction &
Instruction & Store Buffer & Cache \\
\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4}
\cmidrule(lr){5-5} \cmidrule(l){6-8}
1 & (Initial state) & & \tco{y==0} &
(Initial state) &
(Initial state) & & \tco{x==0} \\
2 & \tco{x = 1;} & \tco{x==1} & \tco{y==0} &
& & & \tco{x==0} \\
3 & (Read-Invalidate \tco{x}) & \tco{x==1} & \tco{y==0} & \tco{r1 = x} (1)
& & & \tco{x==0} \\
4 & & \tco{x==1} \tco{y==1} & \tco{y==0} & \tco{y = r1}
& \tco{r2 = y} & & \tco{x==0} \\
5 & & \tco{x==1} & \tco{y==1} & (Finish store)
& (Read \tco{y}) & & \tco{x==0} \\
6 & (Respond \tco{y}) & \tco{x==1} & \tco{y==1} &
& (\tco{r2==1}) & & \tco{x==0} \tco{y==1} \\
7 & & \tco{x==1} & \tco{y==1} &
& \tco{smp_rmb()} & & \tco{x==0} \tco{y==1} \\
8 & & \tco{x==1} & \tco{y==1} &
& \tco{r3 = x} (0) & & \tco{x==0} \tco{y==1} \\
9 & & \tco{x==1} & \tco{x==0} \tco{y==1} &
& (Respond \tco{x}) & & \tco{y==1} \\
10 & (Finish store) & & \tco{x==1} \tco{y==1} &
& & & \tco{y==1} \\
\bottomrule
\end{tabular}
}
\caption{Memory Ordering:
WRC Sequence of Events}
\label{tab:memorder:Memory Ordering: WRC Sequence of Events}
\end{table*}
\Cref{tab:memorder:Memory Ordering: WRC Sequence of Events}
shows one sequence of events that can result in the \co{exists} clause in
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
triggering.
This sequence of events will depend critically on \co{P0()} and
\co{P1()} sharing both cache and a store buffer in the manner shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}.
\QuickQuiz{
But just how is it fair that \co{P0()} and \co{P1()} must share a store
buffer and a cache, but \co{P2()} gets one each of its very own???
}\QuickQuizAnswer{
Presumably there is a \co{P3()}, as is in fact shown in
\cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity},
that shares \co{P2()}'s store buffer and cache.
But not necessarily.
Some platforms allow different cores to disable different numbers
of threads, allowing the hardware to adjust to the needs of the
workload at hand.
For example, a single-threaded critical-path portion of the workload
might be assigned to a core with only one thread enabled, thus
allowing the single thread running that portion of the workload
to use the entire capabilities of that core.
Other more highly parallel but cache-miss-prone portions of the
workload might be assigned to cores with all hardware threads
enabled to provide improved throughput.
This improved throughput could be due to the fact that while one
hardware thread is stalled on a cache miss, the other hardware
threads can make forward progress.
In such cases, performance requirements override quaint human
notions of fairness.
}\QuickQuizEnd
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
Row~1 shows the initial state, with the initial value of \co{y} in
\co{P0()}'s and \co{P1()}'s shared cache, and the initial value of \co{x} in
\co{P2()}'s cache.
Row~2 shows the immediate effect of \co{P0()} executing its store on \clnref{P0:x}.
Because the cacheline containing \co{x} is not in \co{P0()}'s and \co{P1()}'s
shared cache, the new value (\co{1}) is stored in the shared store buffer.
Row~3 shows two transitions.
First, \co{P0()} issues a read-invalidate operation to fetch the cacheline
containing \co{x} so that it can flush the new value for \co{x} out of
the shared store buffer.
Second, \co{P1()} loads from \co{x} (\clnref{P1:x}), an operation that completes
immediately because the new value of \co{x} is immediately available
from the shared store buffer.
Row~4 also shows two transitions.
First, it shows the immediate effect of \co{P1()} executing its store to
\co{y} (\clnref{P1:y}), placing the new value into the shared store buffer.
Second, it shows the start of \co{P2()}'s load from \co{y} (\clnref{P2:y}).
Row~5 continues the tradition of showing two transitions.
First, it shows \co{P1()} complete its store to \co{y}, flushing
from the shared store buffer to the cache.
Second, it shows \co{P2()} request the cacheline containing \co{y}.
Row~6 shows \co{P2()} receive the cacheline containing \co{y}, allowing
it to finish its load into \co{r2}, which takes on the value \co{1}.
Row~7 shows \co{P2()} execute its \co{smp_rmb()} (\clnref{P2:rmb}), thus keeping
its two loads ordered.
Row~8 shows \co{P2()} execute its load from \co{x}, which immediately
returns with the value zero from \co{P2()}'s cache.
Row~9 shows \co{P2()} \emph{finally} responding to \co{P0()}'s request for
the cacheline containing \co{x}, which was made way back up on row~3.
Finally, row~10 shows \co{P0()} finish its store, flushing its value of
\co{x} from the shared store buffer to the shared cache.
Note well that the \co{exists} clause on \clnref{exists} has triggered.
The values of \co{r1} and \co{r2} are both the value one, and
the final value of \co{r3} is the value zero.
This strange result occurred because \co{P0()}'s new value of \co{x} was
communicated to \co{P1()} long before it was communicated to \co{P2()}.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole]
Referring to
\cref{tab:memorder:Memory Ordering: WRC Sequence of Events},
why on earth would \co{P0()}'s store take so long to complete when
	\co{P1()}'s store completes so quickly?
In other words, does the \co{exists} clause on \clnref{exists} of
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
really trigger on real systems?
\end{fcvref}
}\QuickQuizAnswer{
You need to face the fact that it really can trigger.
Akira Yokosawa used the \co{litmus7} tool to run this litmus test
on a \Power{8} system.
Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause.
	Thus, triggering the \co{exists} clause is not merely a one-in-a-million
	occurrence, but rather roughly a one-in-250-million occurrence.
But it nevertheless really does trigger on real systems.
}\QuickQuizEnd
This counter-intuitive result happens because although dependencies
do provide ordering, they provide it only within the confines of their
own thread.
This three-thread example requires stronger ordering, which
is the subject of
\crefthro{sec:memorder:Cumulativity}
{sec:memorder:Release-Acquire Chains}.
\subsubsection{Cumulativity}
\label{sec:memorder:Cumulativity}
The three-thread example shown in
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}
requires \emph{cumulative} ordering, or \emph{cumulativity}.
A cumulative memory-ordering operation orders not just any given
access preceding it, but also earlier accesses by any thread to that
same variable.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WRC+o+o-r+a-o@whole.fcv}
\caption{WRC Litmus Test With Release}
\label{lst:memorder:WRC Litmus Test With Release}
\end{listing}
Dependencies do not provide cumulativity,
which is why the ``C'' column is blank for the \co{READ_ONCE()} row
of \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
on
\cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
However, as indicated by the ``C'' in their ``C'' column,
release operations do provide cumulativity.
Therefore,
\cref{lst:memorder:WRC Litmus Test With Release}
(\path{C-WRC+o+o-r+a-o.litmus})
substitutes a release operation for
\cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}'s
data dependency.
\begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole]
Because the release operation is cumulative, its ordering applies not only to
\cref{lst:memorder:WRC Litmus Test With Release}'s
load from \co{x} by \co{P1()} on \clnref{P1:x}, but also to the store to \co{x}
by \co{P0()} on \clnref{P0:x}---but only if that load returns the value stored,
which matches the \co{1:r1=1} in the \co{exists} clause on \clnref{exists}.
This means that \co{P2()}'s load-acquire suffices to force the
load from \co{x} on \clnref{P2:x} to happen after the store on \clnref{P0:x}, so
the value returned is one, which does not match \co{2:r3=0}, which
in turn prevents the \co{exists} clause from triggering.
\end{fcvref}
\begin{figure*}
\centering
\includegraphics{memorder/memorybarriercum}
\caption{Cumulativity}
\label{fig:memorder:Cumulativity}
\end{figure*}
\begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole]
These ordering constraints are depicted graphically in
\cref{fig:memorder:Cumulativity}.
Note also that cumulativity is not limited to a single step back in time.
If there were another load from \co{x} or store to \co{x} from any thread
that came before the store on \clnref{P0:x}, that prior load or store would also
be ordered before the load on \clnref{P2:x}, though only if both \co{r1} and
\co{r2} end up containing the value \co{1}.
\end{fcvref}
In short, use of cumulative ordering operations can suppress
non-multicopy-atomic behaviors in some situations.
Cumulativity nevertheless has limits, which are examined in the next section.
\subsubsection{Propagation}
\label{sec:memorder:Propagation}
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
\Cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
(\path{C-W+RWC+o-r+a-o+o-mb-o.litmus})
shows the limitations of cumulativity and store-release,
even with a full memory barrier.
The problem is that although the \co{smp_store_release()} on
\clnref{P0:sr} has cumulativity, and although that cumulativity does
order \co{P2()}'s load on \clnref{P2:ld}, the \co{smp_store_release()}'s
ordering cannot propagate through the combination of \co{P1()}'s
load (\clnref{P1:ld}) and \co{P2()}'s store (\clnref{P2:st}).
This means that the \co{exists} clause on \clnref{exists} really can trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-W+RWC+o-r+a-o+o-mb-o@whole.fcv}
\caption{W+RWC Litmus Test With Release (No Ordering)}
\label{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
\end{listing}
\QuickQuiz{
But it is not necessary to worry about propagation unless
there are at least three threads in the litmus test, right?
}\QuickQuizAnswer{
Wrong.
\begin{listing}
\input{CodeSamples/formal/litmus/C-R+o-wmb-o+o-mb-o@whole.fcv}
\caption{R Litmus Test With Write Memory Barrier (No Ordering)}
\label{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
\end{listing}
\begin{fcvref}[ln:formal:C-R+o-wmb-o+o-mb-o:whole]
\Cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
(\path{C-R+o-wmb-o+o-mb-o.litmus})
shows a two-thread litmus test that requires propagation due to
the fact that it only has store-to-store and load-to-store
links between its pair of threads.
Even though \co{P0()} is fully ordered by the \co{smp_wmb()} and
\co{P1()} is fully ordered by the \co{smp_mb()}, the
counter-temporal nature of the links means that
the \co{exists} clause on \clnref{exists} really can trigger.
To prevent this triggering, the \co{smp_wmb()} on \clnref{wmb}
must become an \co{smp_mb()}, bringing propagation into play
twice, once for each non-temporal link.
\end{fcvref}
}\QuickQuizEnd
\QuickQuizLabel{\MemorderQQLitmusTestR}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/fr}}
\caption{Load-to-Store is Counter-Temporal}
\label{fig:memorder:Load-to-Store is Counter-Temporal}
\end{figure}
This situation might seem completely counter-intuitive, but keep
in mind that the speed of light is finite and computers are of
non-zero size.
It therefore takes time for the effect of \co{P2()}'s store to
\co{x} to propagate to \co{P1()}, which in turn means that it is possible
that \co{P1()}'s read from \co{x} happens much later in time, but
nevertheless still sees the old value of zero.
This situation is depicted in
\cref{fig:memorder:Load-to-Store is Counter-Temporal}:
Just because a load sees the old value does \emph{not} mean that
this load executed at an earlier time than did the store of the
new value.
Note that
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
also shows the limitations of memory-barrier pairing, given that
there are not two but three processes.
These more complex litmus tests can instead be said to have \emph{cycles},
where memory-barrier pairing is the special case of a two-thread cycle.
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
The cycle in
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
goes through \co{P0()} (\clnref{P0:st,P0:sr}), \co{P1()} (\clnref{P1:la,P1:ld}),
\co{P2()} (\clnref{P2:st,P2:mb,P2:ld}), and back to \co{P0()} (\clnref{P0:st}).
The \co{exists} clause delineates this cycle:
The \co{1:r1=1} indicates that the \co{smp_load_acquire()} on \clnref{P1:la}
returned the value stored by the \co{smp_store_release()} on \clnref{P0:sr},
the \co{1:r2=0} indicates that the \co{WRITE_ONCE()} on \clnref{P2:st} came
too late to affect the value returned by the \co{READ_ONCE()} on \clnref{P1:ld},
and finally the \co{2:r3=0} indicates that the
\co{WRITE_ONCE()} on \clnref{P0:st} came too late to affect the value returned
by the \co{READ_ONCE()} on \clnref{P2:ld}.
In this case, the fact that the \co{exists} clause can trigger means that
the cycle is said to be \emph{allowed}.
In contrast, in cases where the \co{exists} clause cannot trigger,
the cycle is said to be \emph{prohibited}.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-W+RWC+o-mb-o+a-o+o-mb-o@whole.fcv}
\caption{W+RWC Litmus Test With More Barriers}
\label{lst:memorder:W+WRC Litmus Test With More Barriers}
\end{listing}
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
But what if we need to prohibit the cycle corresponding to the \co{exists}
clause on \clnref{exists} of
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}?
One solution is to replace \co{P0()}'s \co{smp_store_release()}
with an \co{smp_mb()}, which
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
shows to have not only cumulativity, but also propagation.
\end{fcvref}
The result is shown in
\cref{lst:memorder:W+WRC Litmus Test With More Barriers}
(\path{C-W+RWC+o-mb-o+a-o+o-mb-o.litmus}).
\QuickQuiz{
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
But given that \co{smp_mb()} has the propagation property,
why doesn't the \co{smp_mb()} on \clnref{P2:mb} of
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
prevent the \co{exists} clause from triggering?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole]
As a rough rule of thumb, the \co{smp_mb()} barrier's
propagation property is sufficient to maintain ordering
through only one load-to-store link between
processes.
Unfortunately,
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
has not one but two load-to-store links, with the
first being from the \co{READ_ONCE()} on \clnref{P1:ld} to the
\co{WRITE_ONCE()} on \clnref{P2:st} and the second being from
the \co{READ_ONCE()} on \clnref{P2:ld} to the \co{WRITE_ONCE()}
on \clnref{P0:st}.
Therefore, preventing the \co{exists} clause from triggering
should be expected to require not one but two
instances of \co{smp_mb()}.
\end{fcvref}
As a special exception to this rule of thumb, a release-acquire
chain can have one load-to-store link between processes
and still prohibit the cycle.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/co}}
\caption{Store-to-Store is Counter-Temporal}
\label{fig:memorder:Store-to-Store is Counter-Temporal}
\end{figure}
For completeness,
\cref{fig:memorder:Store-to-Store is Counter-Temporal}
shows that the ``winning'' store among a group of stores to the
same variable is not necessarily the store that started last.
This should not come as a surprise to anyone who carefully examined
\cref{fig:memorder:A Variable With More Simultaneous Values}
on
\cpageref{fig:memorder:A Variable With More Simultaneous Values}.
One way to rationalize the counter-temporal properties of both
load-to-store and store-to-store ordering is to clearly distinguish
between the temporal order in which the store instructions executed on
the one hand, and the order in which the corresponding cacheline visited
the CPUs that executed those instructions on the other.
It is the cacheline-visitation order that defines the externally
visible ordering of the actual stores.
This cacheline-visitation order is not directly visible to the code
executing the store instructions, which results in the counter-intuitive
counter-temporal nature of load-to-store and store-to-store ordering.\footnote{
In some hardware-multithreaded systems, the store would become
visible to other CPUs in that same core as soon as the store
reached the shared store buffer.
As a result, such systems are non-multicopy atomic.}
\begin{listing}
\input{CodeSamples/formal/litmus/C-2+2W+o-wmb-o+o-wmb-o@whole.fcv}
\caption{2+2W Litmus Test With Write Barriers}
\label{lst:memorder:2+2W Litmus Test With Write Barriers}
\end{listing}
\QuickQuiz{
But for litmus tests having only ordered stores, as shown in
\cref{lst:memorder:2+2W Litmus Test With Write Barriers}
(\path{C-2+2W+o-wmb-o+o-wmb-o.litmus}),
research shows that the cycle is prohibited, even in weakly
ordered systems such as \ARM\ and Power~\cite{test6-pdf}.
Given that, is store-to-store ordering really \emph{always}
counter-temporal???
}\QuickQuizAnswer{
This litmus test is indeed a very interesting curiosity.
Its ordering apparently occurs naturally given typical
weakly ordered hardware design, which would normally be
considered a great gift from the relevant laws of physics
and cache-coherency-protocol mathematics.
\begin{fcvref}[ln:formal:C-2+2W+o-wmb-o+o-wmb-o:whole]
Unfortunately, no one has been able to come up with a software use
case for this gift that does not have a much better alternative
implementation.
	Therefore, neither the C11 nor the Linux-kernel memory model
	provides any guarantee corresponding to
\cref{lst:memorder:2+2W Litmus Test With Write Barriers}.
This means that the \co{exists} clause on \clnref{exists} can
trigger.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-2+2W+o-o+o-o@whole.fcv}
\caption{2+2W Litmus Test (No Ordering)}
\label{lst:memorder:2+2W Litmus Test (No Ordering)}
\end{listing}
Of course, without the barrier, there are no ordering
guarantees, even on real weakly ordered hardware, as shown in
\cref{lst:memorder:2+2W Litmus Test (No Ordering)}
(\path{C-2+2W+o-o+o-o.litmus}).
}\QuickQuizEnd
But sometimes time really is on our side.
Read on!
\subsubsection{Happens-Before}
\label{sec:memorder:Happens-Before}
As shown in
\cref{fig:memorder:Store-to-Load is Temporal},
on platforms without user-visible speculation, if a load returns the value
from a particular store, then, courtesy of the finite speed of light and
the non-zero size of modern computing systems, the store absolutely has
to have executed at an earlier time than did the load.
This means that carefully constructed programs can rely on the
passage of time itself as a memory-ordering operation.
\QuickQuiz{
Why don't we just stick to sanely ordered CPU families like x86,
so that time will \emph{always} be on our side???
}\QuickQuizAnswer{
Sorry to be the one to break it to you, but x86 CPUs have
store buffers and they are not afraid to use them.
The data shown in
\crefrange{fig:memorder:x86 CPUs Can Disagree}{fig:memorder:Store-to-Load is Temporal on x86}
was obtained from a series of torture tests
(\path{perftemporal.sh}) running on a two-socket x86 system
having a total of 80 hardware
threads.\footnote{
\co{Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz},
for those wanting more details.}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe-nvals}}
\caption{x86 CPUs Can Disagree}
\label{fig:memorder:x86 CPUs Can Disagree}
\end{figure}
\Cref{fig:memorder:x86 CPUs Can Disagree}
is from a torture-test run similar to the one that generated
\cref{fig:memorder:A Variable With More Simultaneous Values}
on
\cpageref{fig:memorder:A Variable With More Simultaneous Values},
but showing the number of distinct opinions per unit time, where
each timestamp period is 0.5~nanoseconds.\footnote{
Recall that each CPU writes its own number to a single
shared variable, which each CPU repeatedly polls,
recording time and value at each change in value.}
As you can see, there is a significant period of time during
which there are more than 40 distinct opinions as to the value
of a single shared variable.
Even on x86.
\begin{figure}
\centering
\resizebox{.85\twocolumnwidth}{!}{\includegraphics{memorder/co-hopes}}
\caption{Is Store-to-Store Counter-Temporal on x86?}
\label{fig:memorder:Is Store-to-Store Counter-Temporal on x86?}
\end{figure}
But perhaps we might hope that on x86, the last CPU to execute its
store in global time order would ``win'', that is, the final value
of the shared variable would be that of that last store.
If so, any store starting after the ``winning'' store finished would
overwrite the winning store.
In contrast, we know that on weakly ordered systems, a
counter-intuitive store such as that shown at the bottom of
\cref{fig:memorder:Is Store-to-Store Counter-Temporal on x86?}
could be overwritten by the old value, despite the fact that it
started $t$ timestamp periods after the winning store completed,
as depicted in the figure by the time interval that is labelled
$-t$.\footnote{
That is completed from the viewpoint of the instruction
stream containing that store.
The value stored might well remain in the store buffer
long afterwards.}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe}}
\caption{Store-to-Store is Counter-Temporal on x86}
\label{fig:memorder:Store-to-Store is Counter-Temporal on x86}
\end{figure}
However,
\cref{fig:memorder:Store-to-Store is Counter-Temporal on x86}
dashes any fond hope of x86 refusing to indulge in
counter-intuitive counter-temporal stores.
The data in this figure summarizes the results from 79,000
torture-test runs of the type that generated
\cref{fig:memorder:x86 CPUs Can Disagree},
and is a histogram of the minimum time elapsed between the start of
a non-winning CPU's store and the end of the winning CPU's store.
Of course, if this value is negative, the winning store completed
(though its value might not have propagated past the store buffer)
before the other store even started.
And the negative-time data points in that figure show that this
counter-temporal behavior really can be made to happen most of
the time, even on x86.
\begin{figure}
\centering
\resizebox{.6\twocolumnwidth}{!}{\includegraphics{memorder/fr-hopes}}
\caption{Is Load-to-Store Counter-Temporal on x86?}
\label{fig:memorder:Is Load-to-Store Counter-Temporal on x86?}
\end{figure}
But perhaps we could instead hope that once a store had
	executed, any future load from that same variable would
be guaranteed to return the new value.
In contrast, we know that on weakly ordered systems, a
counter-intuitive load such as that shown at the bottom
of
\cref{fig:memorder:Is Load-to-Store Counter-Temporal on x86?}
could return the old value, despite having started $t$
timestamp periods after the end of the store, again, as
depicted in the figure by the $-t$.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/fre}}
\caption{Load-to-Store is Counter-Temporal on x86}
\label{fig:memorder:Load-to-Store is Counter-Temporal on x86}
\end{figure}
However,
\cref{fig:memorder:Load-to-Store is Counter-Temporal on x86}
dashes any fond hopes that loads executing after a given store
would see that store's value (or some later value).
This data is generated by 1,000 runs of a program in which the
parent thread spawns the children, each of which polls a shared
variable.
After all the children are running, the parent writes a new
value to the shared variable.
The quantity histogrammed in the figure is the time from just
after the store until just before the last load of the old value.
And most of these data points lie on the negative x~axis,
clearly demonstrating that a torture test can easily provoke
the counter-temporal nature of cross-thread load-to-store.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf-hopes}}
\caption{Is Store-to-Load Counter-Temporal on x86?}
\label{fig:memorder:Is Store-to-Load Counter-Temporal on x86?}
\end{figure}
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/rfe}}
\caption{Store-to-Load is Temporal on x86}
\label{fig:memorder:Store-to-Load is Temporal on x86}
\end{figure}
It is only reasonable to assume that a load that ends before
a store starts will be unable to load that store's value,
as shown in
\cref{fig:memorder:Is Store-to-Load Counter-Temporal on x86?},
even on weakly ordered systems.
And the data points from all 1,000 runs shown in
	\cref{fig:memorder:Store-to-Load is Temporal on x86}
lie on the positive x~axis, so in at least this case, our temporal
intuitions, hopes, and dreams have been fulfilled.
These results should be no surprise.
Even on x86, the cache line shuttles between cores and sockets, and
\cref{fig:memorder:x86 CPUs Can Disagree}
indicates that a given store can remain in its CPU's store buffer
for more than a microsecond, which is quite enough time for
today's multi-GHz CPUs to detect the resulting counter-temporal
behavior.
However, for the store-to-load case, the laws of physics
guarantee temporal ordering because the finite speed of light
and the non-zero size of atoms work together to guarantee that
some time will pass between a store on one CPU and a load of
that store's value on some other CPU.
Alert readers may have noticed that the distribution shown in
\cref{fig:memorder:Store-to-Store is Counter-Temporal on x86}
	is nearly monomodal, while those in
\cref{fig:memorder:Load-to-Store is Counter-Temporal on x86}
and
\cref{fig:memorder:Store-to-Load is Temporal on x86}
are decidedly trimodal.
Such readers are encouraged to consider why that might be,
perhaps referring to \path{CodeSamples/cpu/perftemporal.sh} and
the scripts and programs that it invokes.
And all readers are encouraged to note that even on relatively
strongly ordered CPUs such as x86, store-to-store and
load-to-store IPC links can be strongly counter-temporal.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf}}
\caption{Store-to-Load is Temporal}
\label{fig:memorder:Store-to-Load is Temporal}
\end{figure}
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+a-o+o-data-o+o-data-o@whole.fcv}
\caption{LB Litmus Test With One Acquire}
\label{lst:memorder:LB Litmus Test With One Acquire}
\end{listing}
Of course, just the passage of time by itself is not enough, as
was seen in
\cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}
on
\cpageref{lst:memorder:Load-Buffering Litmus Test (No Ordering)},
which has nothing but store-to-load links and, because it provides
absolutely no ordering, still can trigger its \co{exists} clause.
However, as long as each thread provides even the weakest possible
ordering, the \co{exists} clause would not be able to trigger.
For example,
\cref{lst:memorder:LB Litmus Test With One Acquire}
(\path{C-LB+a-o+o-data-o+o-data-o.litmus})
shows \co{P0()} ordered with an \co{smp_load_acquire()} and
both \co{P1()} and \co{P2()} ordered with data dependencies.
These orderings, which are close to the top of
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet},
suffice to prevent the \co{exists} clause from triggering.
\QuickQuiz{
Can you construct a litmus test like that in
\cref{lst:memorder:LB Litmus Test With One Acquire}
that uses \emph{only} dependencies?
}\QuickQuizAnswer{
\Cref{lst:memorder:LB Litmus Test With No Acquires}
shows a somewhat nonsensical but very real example.
Creating a more useful (but still real) litmus test is left
as an exercise for the reader.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-data-o+o-data-o+o-data-o@whole.fcv}
\caption{LB Litmus Test With No Acquires}
\label{lst:memorder:LB Litmus Test With No Acquires}
\end{listing}
}\QuickQuizEnd
An important use of time for ordering memory accesses is covered in the
next section.
\subsubsection{Release-Acquire Chains}
\label{sec:memorder:Release-Acquire Chains}
A minimal release-acquire chain was shown in
\cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}
on
\cpageref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
but these chains can be much longer, as shown in
\cref{lst:memorder:Long LB Release-Acquire Chain}
(\path{C-LB+a-r+a-r+a-r+a-r.litmus}).
The longer the release-acquire chain, the more ordering is gained
from the passage of time, so that no matter how many threads are
involved, the corresponding \co{exists} clause cannot trigger.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+a-r+a-r+a-r+a-r@whole.fcv}
\caption{Long LB Release-Acquire Chain}
\label{lst:memorder:Long LB Release-Acquire Chain}
\end{listing}
Although release-acquire chains are inherently store-to-load creatures,
it turns out that they can tolerate one load-to-store step, despite
such steps being counter-temporal, as shown in
\cref{fig:memorder:Load-to-Store is Counter-Temporal}
on
\cpageref{fig:memorder:Load-to-Store is Counter-Temporal}.
For example,
\cref{lst:memorder:Long ISA2 Release-Acquire Chain}
(\path{C-ISA2+o-r+a-r+a-r+a-o.litmus})
shows a three-step release-acquire chain, but where \co{P3()}'s
final access is a \co{READ_ONCE()} from \co{x0}, which is
accessed via \co{WRITE_ONCE()} by \co{P0()}, forming a non-temporal
load-to-store link between these two processes.
\begin{fcvref}[ln:formal:litmus:C-ISA2+o-r+a-r+a-r+a-o:whole]
However, because \co{P0()}'s \co{smp_store_release()} (\clnref{P0:rel})
is cumulative, if \co{P3()}'s \co{READ_ONCE()} returns zero,
this cumulativity will force the \co{READ_ONCE()} to be ordered
before \co{P0()}'s \co{smp_store_release()}.
In addition, the release-acquire chain
(\clnref{P0:rel,P1:acq,P1:rel,P2:acq,P2:rel,P3:acq})
forces \co{P3()}'s \co{READ_ONCE()} to be ordered after \co{P0()}'s
\co{smp_store_release()}.
Because \co{P3()}'s \co{READ_ONCE()} cannot be both before and after
\co{P0()}'s \co{smp_store_release()}, either or both of two things must
be true:
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-ISA2+o-r+a-r+a-r+a-o@whole.fcv}
\caption{Long ISA2 Release-Acquire Chain}
\label{lst:memorder:Long ISA2 Release-Acquire Chain}
\end{listing}
\begin{enumerate}
\item \co{P3()}'s \co{READ_ONCE()} came after \co{P0()}'s
\co{WRITE_ONCE()}, so that the \co{READ_ONCE()} returned
the value two, so that the \co{exists} clause's \co{3:r2=0}
is false.
\item The release-acquire chain did not form, that is, one or more
of the \co{exists} clause's \co{1:r2=2}, \co{2:r2=2}, or \co{3:r1=2}
is false.
\end{enumerate}
Either way, the \co{exists} clause cannot trigger, despite this litmus
test containing a notorious load-to-store link between
\co{P3()} and \co{P0()}.
But never forget that release-acquire chains can tolerate only one
load-to-store link, as was seen in
\cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}.
\begin{listing}
\input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-r+a-r+a-o@whole.fcv}
\caption{Long Z6.2 Release-Acquire Chain}
\label{lst:memorder:Long Z6.2 Release-Acquire Chain}
\end{listing}
Release-acquire chains can also tolerate a single store-to-store step,
as shown in
\cref{lst:memorder:Long Z6.2 Release-Acquire Chain}
(\path{C-Z6.2+o-r+a-r+a-r+a-o.litmus}).
\begin{fcvref}[ln:formal:C-Z6.2+o-r+a-r+a-r+a-o:whole]
As with the previous example, \co{smp_store_release()}'s cumulativity
combined with the temporal nature of the release-acquire chain
prevents the \co{exists} clause on \clnref{exists} from triggering.
\end{fcvref}
\begin{listing}
\input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-o+o-mb-o@whole.fcv}
\caption{Z6.2 Release-Acquire Chain (Ordering?)}
\label{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}
\end{listing}
\QuickQuiz{
Suppose we have a short release-acquire chain along with one
load-to-store link and one store-to-store link, like that shown in
\cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}.
Given that there is only one of each type of non-store-to-load
	link, the \co{exists} clause cannot trigger, right?
}\QuickQuizAnswer{
Wrong.
It is the number of non-store-to-load links that matters.
If there is only one non-store-to-load link, a release-acquire
chain can prevent the \co{exists} clause from triggering.
However, if there is more than one non-store-to-load link,
be they store-to-store, load-to-store, or any combination
thereof, it is necessary to have at least one full barrier
(\co{smp_mb()} or better) between each non-store-to-load link.
In
\cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)},
preventing the \co{exists} clause from triggering therefore requires
an additional full barrier between either \co{P0()}'s or
\co{P1()}'s accesses.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/litmus/C-MP+o-r+a-o@whole.fcv}
\caption{A Release-Acquire Chain Ordering Multiple Accesses}
\label{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/litmus/C-MPO+o-r+a-o+o@whole.fcv}
\caption{A Release-Acquire Chain With Added Store (Ordering?)}
\label{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
\end{listing}
But beware:
Adding a second store-to-store link allows the correspondingly updated
\co{exists} clause to trigger.
To see this, review
\cref{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses,%
lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)},
which have identical \co{P0()} and \co{P1()} processes.
The only code difference is that
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
has an additional \co{P2()} that does an \co{smp_store_release()} to
the \co{x2} variable that \co{P0()} releases and \co{P1()} acquires.
The \co{exists} clause is also adjusted to exclude executions in which
\co{P2()}'s \co{smp_store_release()} precedes that of \co{P0()}.
Running the litmus test in
\cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}
shows that the addition of \co{P2()} can totally destroy the
ordering from the release-acquire chain.
Therefore, when constructing release-acquire chains, please take care
to construct them properly.
\QuickQuiz{
There are store-to-load links, load-to-store links, and
store-to-store links.
But what about load-to-load links?
}\QuickQuizAnswer{
The problem with the concept of load-to-load links is that
if the two loads from the same variable return the same
value, there is no way to determine their ordering.
The only way to determine their ordering is if they return
different values, in which case there had to have been an
intervening store.
And that intervening store means that there is no load-to-load
link, but rather a load-to-store link followed by a
store-to-load link.
}\QuickQuizEnd
In short, properly constructed release-acquire chains form a peaceful
(if rather tightly constrained)
island of intuitive bliss surrounded by a strongly counter-intuitive
sea of more complex memory-ordering constraints.
\subsection{A Counter-Intuitive Case Study}
\label{sec:memorder:A Counter-Intuitive Case Study}
This section will revisit
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
on \cpageref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)},
which was presented in the answer to
\QuickQuizARef{\MemorderQQLitmusTestR}.
This litmus test has only two threads, with the stores in \co{P0()}
being ordered by \co{smp_wmb()} and the accesses in \co{P1()} being
ordered by \co{smp_mb()}.
Despite this litmus test's small size and heavy ordering, the
counter-intuitive outcome shown in the \co{exists} clause is in fact
allowed.
One way to look at this was presented in the answer to
\QuickQuizARef{\MemorderQQLitmusTestR}, namely that the link from
\co{P0()} to \co{P1()} is a store-to-store link, and that back
from \co{P1()} to \co{P0()} is a load-to-store link.
Both links are counter-temporal, thus requiring full memory barriers
in both processes.
Revisiting
\cref{fig:memorder:Store-to-Store is Counter-Temporal,%
fig:memorder:Store-to-Load is Temporal}
shows that these counter-temporal links give the hardware considerable
latitude.
But that raises the question of exactly how hardware would go about using
this latitude to satisfy the \co{exists} clause in
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}.
There is no known ``toy'' hardware implementation that can do this, so
let us instead study the sequence of steps that the PowerPC architecture
goes through to make this happen.
The first step in this study is to translate
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}
to a PowerPC assembly language litmus test
(\cref{sec:formal:Anatomy of a Litmus Test} on
\cpageref{sec:formal:Anatomy of a Litmus Test}):
\begin{fcvlabel}[ln:memorder:inline:ppcasmr]
\begin{VerbatimN}[samepage=true,commandchars=\@\[\]]
PPC R+lwsync+sync
{
0:r1=1; 0:r2=x; 0:r4=y; @lnlbl[init0]
1:r1=2; 1:r2=y; 1:r4=x; @lnlbl[init1]
}
P0 | P1 ; @lnlbl[procs]
stw r1,0(r2) | stw r1,0(r2) ; @lnlbl[stores]
lwsync | sync ; @lnlbl[barriers]
stw r1,0(r4) | lwz r3,0(r4) ; @lnlbl[storeload]
exists (y=2 /\ 1:r3=0) @lnlbl[exists]
\end{VerbatimN}
\end{fcvlabel}
\begin{fcvref}[ln:memorder:inline:ppcasmr]
The first line identifies the type of test (\co{PPC}) and gives
the test's name.
\Clnref{init0,init1} initialize \co{P0()}'s and \co{P1()}'s registers,
respectively.
\Clnrefrange{procs}{storeload} show the PowerPC assembly
statements corresponding to the C code from
\cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)},
with the first column being the code for \co{P0()} and the second column
being the code for \co{P1()}.
\Clnref{stores} shows the initial \co{WRITE_ONCE()} calls in both columns;
the columns of \clnref{barriers} show the \co{smp_wmb()} and \co{smp_mb()}
for \co{P0()} and \co{P1()}, respectively;
the columns of \clnref{storeload} show \co{P0()}'s \co{WRITE_ONCE()} and
\co{P1()}'s \co{READ_ONCE()}, respectively;
and finally \clnref{exists} shows the \co{exists} clause.
\end{fcvref}
In order for this \co{exists} clause to be satisfied, \co{P0()}'s
\co{stw} to \co{y} must precede that of \co{P1()}, but \co{P1()}'s
later \co{lwz} from \co{x} must precede \co{P0()}'s \co{stw} to \co{x}.
Seeing how this can happen requires a rough understanding of the
following PowerPC terminology.
\begin{description}[style=nextline]
\item[Instruction commit:]
This can be thought of as the execution of that instruction as opposed
to the memory-system consequences of having executed that instruction.
\item[Write reaching coherence point:]
This can be thought of as the value written being deposited into the
corresponding cache line.
\item[Partial coherence commit:]
This can be thought of as the system having worked out the order in which
a pair of values written will be deposited into the corresponding cache
line, but potentially well before that cache line arrives.
Some might argue that the data in
\cref{fig:memorder:A Variable With More Simultaneous Values}
suggests that real PowerPC hardware does in fact use partial coherence
commits to handle concurrent stores by multiple hardware threads within
a single core.
\item[Write propagate to thread:]
This occurs when a second hardware thread becomes aware of the first
hardware thread's write.
The time at which a write propagates to a given thread might not have
any relation to cache-line movement.
For example, if a pair of threads share a store buffer, they might see
each others' writes long before the cache line gets involved.
On the other hand, if a pair of hardware threads are widely separated,
the first thread's write's value might have been deposited into the
corresponding cache line long before the second thread learns of that
write.
\item[Barrier propagate to thread:]
Hardware threads make each other aware of memory-barrier instructions
as needed by propagating them to each other.
\item[Acknowledge \tco{sync}:]
The PowerPC \co{sync} instruction implements the Linux kernel's
\co{smp_mb()} full barrier.
And one reason that the \co{sync} instruction provides such strong
ordering is that each \co{sync} is not only propagated to other hardware
threads, but these other threads must also acknowledge each \co{sync}.
This two-way communication allows the hardware threads to cooperate
to produce the required strong global ordering.
\end{description}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM0.png}}
\caption{PPCMEM Initial R State}
\label{fig:memorder:PPCMEM Initial R State}
\end{figure*}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM1.png}}
\caption{PPCMEM First R Step}
\label{fig:memorder:PPCMEM First R Step}
\end{figure*}
We are now ready to step through the PowerPC sequence of events that
satisfies the above \co{exists} clause.
To best understand this, please follow along at
\url{https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html},
carefully copying the above assembly-language litmus test into the pane.
The result should look as shown in
\cref{fig:memorder:PPCMEM Initial R State}, give or take space characters.
Click on the ``Interactive'' button in the lower left, which, after a
short delay, should produce a display as shown in
\cref{fig:memorder:PPCMEM First R Step}.
If the ``Interactive'' button refuses to do anything, this usually means
that there is a syntax error, for example, a spurious newline character
might have been introduced during the copy-paste operation.
This display has one clickable link in each section displaying thread
state, and as the ``Commit'' in each link suggests, these links commit
each thread's first \co{stw} instruction.
If you prefer, you can instead click on the corresponding links listed
under ``Enabled transitions'' near the bottom of the screen.
Note well that some of the later memory-system transitions will appear
in the upper ``Storage subsystem state'' section of this display.
The following sequence of clicks demonstrates how the \co{exists} clause
can be satisfied:
\begin{enumerate}
\item Commit \co{P0()}'s first \co{stw} instruction (to \co{x}).
\item Commit \co{P1()}'s \co{stw} instruction.
\item Commit \co{P0()}'s \co{lwsync} instruction.
\item Commit \co{P0()}'s second \co{stw} instruction (to \co{y}).
\item Commit \co{P1()}'s \co{sync} instruction.
\item At this point, there should be no clickable links in either of
the two sections displaying thread state, but there should be
quite a few of them up in the ``Storage subsystem state''.
The following steps tell you which of them to click on.
\item \co{Partial coherence commit: c:W y=1 -> d:W y=2}.
This commits the system to processing \co{P0()}'s store to
\co{y} before \co{P1()}'s store even though neither store
has reached either the coherence point or any other thread.
One might imagine partial coherence commits happening within a
store buffer that is shared by multiple hardware threads
that are writing to the same variable.
\item \co{Write propagate to thread: d:W y=2 to Thread 0}.
This is necessary to allow \co{P1()}'s \co{sync} instruction
to propagate to \co{P0()}.
\item \co{Barrier propagate to thread: e:Sync to Thread 0}.
\item \co{Write reaching coherence point: a:W x=1}.
\item \co{Write reaching coherence point: c:W y=1}.
\item \co{Write reaching coherence point: d:W y=2}.
These three operations were required in order to allow \co{P0()}
to acknowledge \co{P1()}'s \co{sync} instruction.
\item \co{Acknowledge sync: Sync e:Sync}.
\item Back down in thread \co{P1()}'s state, click on \co{Read i:W
x=0}, which loads the value zero, thus satisfying the \co{exists}
clause.
All that remains is cleanup, which can be carried out in any order.
\item Commit \co{P1()}'s \co{lwz} instruction.
\item \co{Write propagate to thread: a:W x=1 to Thread 1}.
\item \co{Barrier propagate to thread: b:Lwsync to Thread 1}.
\end{enumerate}
\begin{figure*}[tbp]
\centering
\resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEMfinal.png}}
\caption{PPCMEM Final R State}
\label{fig:memorder:PPCMEM Final R State}
\end{figure*}
At this point, you should see something like
\cref{fig:memorder:PPCMEM Final R State}.
Note that the satisfied \co{exists} clause is shown in blue near the
bottom, confirming that this counter-intuitive outcome really can happen.
If you wish, you can click on ``Undo'' to explore other options or
click on ``Reset'' to start over.
It can be very helpful to carry out these steps in different orders
to better understand how a
\IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic}
architecture operates.
\QuickQuiz{
What happens if that \co{lwsync} instruction is instead a
\co{sync} instruction?
}\QuickQuizAnswer{
The counter-intuitive outcome cannot happen.
(Try it!)
}\QuickQuizEnd
Although a full understanding of how this counter-intuitive outcome
happens would require hardware details that are beyond the scope of
this book, this exercise should provide some helpful intuitions.
Or perhaps more accurately, destroy some counter-productive intuitions.
% @@@ Exercises?
% @@@ Hardware details from Appendix?
\section{Compile-Time Consternation}
\label{sec:memorder:Compile-Time Consternation}
%
\epigraph{Science increases our power in proportion as it lowers our pride.}
{Claude Bernard}
Most languages, including C, were developed on uniprocessor systems
by people with little or no parallel-programming experience.
As a result, unless explicitly told otherwise, these languages assume
that the current CPU is the only thing that is reading or writing memory.
This in turn means that these languages' compilers' optimizers
are ready, willing, and oh so able to make dramatic changes to the
order, number, and sizes of memory references that your program
executes.
In fact, the reordering carried out by hardware can seem quite tame
by comparison.
This section will help you tame your compiler, thus avoiding a great
deal of compile-time consternation.
\Cref{sec:memorder:Memory-Reference Restrictions}
describes how to keep the compiler from destructively optimizing
your code's memory references,
\cref{sec:memorder:Address- and Data-Dependency Difficulties}
describes how to protect address and data dependencies,
and finally,
\cref{sec:memorder:Control-Dependency Calamities}
describes how to protect those delicate control dependencies.
\subsection{Memory-Reference Restrictions}
\label{sec:memorder:Memory-Reference Restrictions}
As noted in \cref{sec:toolsoftrade:Accessing Shared Variables},
unless told otherwise, compilers assume that nothing else
is affecting the variables that the code is accessing.
Furthermore, this assumption is not simply some design error, but is
instead enshrined in various standards.\footnote{
Or perhaps it is a standardized design error.}
It is worth summarizing this material in preparation for the following
sections.
\IXplx{Plain access}{es}, as in plain-access C-language assignment statements such
as \qco{r1 = a} or \qco{b = 1}, are subject to the
shared-variable shenanigans described in
\cref{sec:toolsoftrade:Shared-Variable Shenanigans}.
Ways of avoiding these shenanigans are described in
\crefrange{sec:toolsoftrade:A Volatile Solution}{sec:toolsoftrade:Avoiding Data Races}
starting on
\cpageref{sec:toolsoftrade:A Volatile Solution}:
\begin{enumerate}
\item Plain accesses can tear, for example, the compiler could choose
to access an eight-byte pointer one byte at a time.
Tearing of aligned machine-sized accesses can be prevented by
using \co{READ_ONCE()} and \co{WRITE_ONCE()}.
\item Plain loads can fuse, for example, if the results of an earlier
load from that same object are still in a machine register,
the compiler might opt to reuse the value in that register
instead of reloading from memory.
Load fusing can be prevented by using \co{READ_ONCE()} or by
enforcing ordering between the two loads using \co{barrier()},
\co{smp_rmb()}, and other means shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
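	A brief sketch of this load-fusing hazard and of the
	\co{READ_ONCE()} fix appears just after this list.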
\item Plain stores can fuse, so that a store can be omitted entirely
if there is a later store to that same variable.
Store fusing can be prevented by using \co{WRITE_ONCE()} or by
enforcing ordering between the two stores using \co{barrier()},
\co{smp_wmb()}, and other means shown in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
\item Plain accesses can be reordered in surprising ways by modern
optimizing compilers.
This reordering can be prevented by enforcing ordering as
called out above.
\item Plain loads can be invented, for example, register pressure might
cause the compiler to discard a previously loaded value from
its register, and then reload it later on.
Invented loads can be prevented by using \co{READ_ONCE()} or by
enforcing ordering as called out above between the load and a
later use of its value using \co{barrier()}.
\item Stores can be invented before a plain store, for example, by
using the stored-to location as temporary storage.
This can be prevented by use of \co{WRITE_ONCE()}.
\item Stores can be transformed into a load-check-store sequence,
which can defeat control dependencies.
This can be prevented by use of \co{smp_load_acquire()}.
\end{enumerate}
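
To make the load-fusing hazard concrete, consider the following sketch,
which is not taken from any real code base and which assumes a flag
named \co{need_to_stop} that some other thread sets to a nonzero value.
The compiler is permitted to fuse the plain loads in \co{wait_plain()}
into a single load hoisted out of the loop, so that the loop spins
forever on a stale register value, while the \co{READ_ONCE()} in
\co{wait_marked()} forces a fresh load from memory on each pass through
the loop.

\begin{VerbatimN}[samepage=true]
int need_to_stop;   /* Assumed flag, set by some other thread. */

void wait_plain(void)
{
        /* Plain loads may be fused into one load hoisted out of
         * the loop, which then spins on a stale register value. */
        while (!need_to_stop)
                continue;
}

void wait_marked(void)
{
        /* READ_ONCE() forces a fresh load on every iteration. */
        while (!READ_ONCE(need_to_stop))
                continue;
}
\end{VerbatimN}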
\QuickQuiz{
Why not place a \co{barrier()} call immediately before
a plain store to prevent the compiler from inventing stores?
}\QuickQuizAnswer{
Because it would not work.
Although the compiler would be prevented from inventing a
store prior to the \co{barrier()}, nothing would prevent
it from inventing a store between that \co{barrier()} and
the plain store.
}\QuickQuizEnd
Please note that all of these shared-memory shenanigans can instead be
avoided by avoiding \IXpl{data race} on plain accesses, as described in
\cref{sec:toolsoftrade:Avoiding Data Races}.
After all, if there are no data races, then each and every one of the
compiler optimizations mentioned above is perfectly safe.
But for code containing data races, this list is subject to change
without notice as compiler optimizations continue becoming increasingly
aggressive.
In short, use of \co{READ_ONCE()}, \co{WRITE_ONCE()}, \co{barrier()},
\co{volatile}, and other primitives called out in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
on
\cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}
are valuable tools in preventing the compiler from
optimizing your parallel algorithm out of existence.
Compilers are starting to provide other mechanisms for avoiding
load and store tearing, for example, \co{memory_order_relaxed}
atomic loads and stores; however, work is still
needed~\cite{JonathanCorbet2016C11atomics}.
In addition, compiler issues aside, \co{volatile} is still needed
to avoid fusing and invention of accesses, including C11 atomic accesses.
Please note that it is possible to overdo the use of \co{READ_ONCE()} and
\co{WRITE_ONCE()}.
For example, if you have prevented a given variable from changing
(perhaps by holding the lock guarding all updates to that
variable), there is no point in using \co{READ_ONCE()}.
Similarly, if you have prevented any other CPUs or threads from
reading a given variable (perhaps because you are initializing
that variable before any other CPU or thread has access to it),
there is no point in using \co{WRITE_ONCE()}.
However, in my experience, developers need to use things like
\co{READ_ONCE()} and \co{WRITE_ONCE()} more often than they think that
they do, and the overhead of unnecessary uses is quite low.
In contrast, the penalty for failing to use them when needed can be quite high.
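
For example, the following sketch, which uses made-up names and
kernel-style primitives, shows one case of each:
\co{read_while_locked()} holds the lock that guards all updates to
\co{myvalue}, so its plain load suffices, and \co{init_myobj()} runs
before its object has been made visible to any other thread, so its
plain store suffices, provided that the object is later published with
appropriate ordering, for example, via \co{rcu_assign_pointer()} or
\co{smp_store_release()}.

\begin{VerbatimN}[samepage=true]
struct myobj {
        int myfield;
};

DEFINE_SPINLOCK(mylock);
int myvalue;   /* Updated only while holding mylock. */

int read_while_locked(void)
{
        int ret;

        spin_lock(&mylock);
        ret = myvalue;   /* No concurrent updates, so no READ_ONCE(). */
        spin_unlock(&mylock);
        return ret;
}

void init_myobj(struct myobj *p)   /* p not yet visible elsewhere. */
{
        p->myfield = 42;   /* So no WRITE_ONCE() needed here. */
}
\end{VerbatimN}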
\subsection{Address- and Data-Dependency Difficulties}
\label{sec:memorder:Address- and Data-Dependency Difficulties}
\OriginallyPublished{sec:memorder:Address- and Data-Dependency Difficulties}{Address- and Data-Dependency Difficulties}{the Linux kernel}{PaulEMcKenney2014rcu-dereference}
The low overheads of the \IXalth{address}{address}{dependency}
and \IXalth{data dependencies}{data}{dependency} discussed in
\cref{sec:memorder:Address Dependencies,%
sec:memorder:Data Dependencies},
respectively, make their use extremely attractive.
Unfortunately, compilers do not understand either address or data
dependencies, although there are efforts underway to teach them, or at
the very least, standardize the process of teaching
them~\cite{PaulEMcKennneyConsumeP0190R4,PaulEMcKenney2017markconsumeP0462R1}.
In the meantime, it is necessary to be very careful in order to prevent
your compiler from breaking your dependencies.
\subsubsection{Give your dependency chain a good start}
The load that heads your dependency chain must use proper
ordering, for example \co{rcu_dereference()} or \co{READ_ONCE()}.
Failure to follow this rule can have serious side effects:
\begin{enumerate}
\item On DEC Alpha, a dependent load might not be ordered with
the load heading the dependency chain, as described in
\cref{sec:memorder:Alpha}.
\item If the load heading the dependency chain is a
C11 non-volatile \co{memory_order_relaxed} load,
the compiler could omit the load, for example, by using a value
that it loaded in the past.
\item If the load heading the dependency chain is a plain load,
the compiler can omit the load, again by using a value
that it loaded in the past.
Worse yet, it could load twice instead of once, so that
different parts of your code use different values---and
compilers really do this, especially when under register
pressure.
\item The value loaded by the head of the dependency chain must
be a pointer.
In theory, yes, you could load an integer, perhaps to use
it as an array index.
In practice, the compiler knows too much about integers,
and thus has way too many opportunities to break your
dependency chain~\cite{PaulEMcKennneyConsumeP0190R4}.
\end{enumerate}
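
Putting this rule into practice, the following sketch (with an assumed
\co{struct foo}, an assumed global pointer \co{gp} that is published
elsewhere via \co{rcu_assign_pointer()}, and a field \co{->a} that is
assumed not to change after publication) heads its dependency chain
with \co{rcu_dereference()}.
The resulting address dependency orders the access to \co{p->a} after
the load from \co{gp} without any explicit reader-side memory barrier.

\begin{VerbatimN}[samepage=true]
struct foo {
        int a;
};
struct foo *gp;   /* Published via rcu_assign_pointer(). */

int read_foo_a(void)
{
        struct foo *p;
        int ret = -1;   /* Assumed "not present" return value. */

        rcu_read_lock();
        p = rcu_dereference(gp);   /* Marked load heads the chain. */
        if (p)
                ret = p->a;        /* Dependent access, ordered    */
        rcu_read_unlock();         /* after the load from gp.      */
        return ret;
}
\end{VerbatimN}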
\subsubsection{Avoid arithmetic dependency breakage}
Although it is just fine to do some arithmetic operations on a pointer in
your dependency chain, you need to be careful to avoid giving the
compiler too much information.
After all, if the compiler learns enough to determine the exact value
of the pointer, it can use that exact value instead of the pointer itself.
As soon as the compiler does that, the dependency is broken and all
ordering is lost.
\begin{listing}
\begin{fcvlabel}[ln:memorder:Breakable Dependencies With Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int reserve_int;
int *gp;
int *p;
p = rcu_dereference(gp);
if (p == &reserve_int) \lnlbl[cmp]
handle_reserve(p); \lnlbl[handle]
do_something_with(*p); /* buggy! */
\end{VerbatimL}
\end{fcvlabel}
\caption{Breakable Dependencies With Comparisons}
\label{lst:memorder:Breakable Dependencies With Comparisons}
\end{listing}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int reserve_int;
int *gp;
int *p;
p = rcu_dereference(gp); \lnlbl[deref1]
if (p == &reserve_int) {
handle_reserve(&reserve_int);
do_something_with(reserve_int); /* buggy! */ \lnlbl[deref2]
} else {
do_something_with(*p); /* OK! */
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Comparisons}
\label{lst:memorder:Broken Dependencies With Comparisons}
\end{listing}
\begin{enumerate}
\item Although it is permissible to compute offsets from a
pointer, these offsets must not result in total cancellation.
For example, given a \co{char} pointer \co{cp},
\co{cp-(uintptr_t)cp} will cancel and can allow the compiler
to break your dependency chain.
On the other hand, canceling offset values with each other
is perfectly safe and legal.
For example, if \co{a} and \co{b} are equal, \co{cp+a-b}
is an identity function, including preserving the dependency.
\item Comparisons can break dependencies.
\Cref{lst:memorder:Breakable Dependencies With Comparisons}
shows how this can happen.
Here global pointer \co{gp} points to a dynamically allocated
integer, but if memory is low, it might instead point to
the \co{reserve_int} variable.
\begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons]
This \co{reserve_int} case might need special handling, as
shown on \clnref{cmp,handle} of the listing.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Comparisons]
But the compiler could reasonably transform this code into
the form shown in
\cref{lst:memorder:Broken Dependencies With Comparisons},
especially on systems where instructions with absolute
addresses run faster than instructions using addresses
supplied in registers.
However, there is clearly no ordering between the pointer
load on \clnref{deref1} and the dereference on \clnref{deref2}.
Please note that this is simply an example:
There are a great many other ways to break dependency chains
with comparisons.
\end{fcvref}
\end{enumerate}
\QuickQuizSeries{%
\QuickQuizB{
\begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons]
Why can't you simply dereference the pointer before comparing it
to \co{&reserve_int} on \clnref{cmp} of
\cref{lst:memorder:Breakable Dependencies With Comparisons}?
\end{fcvref}
}\QuickQuizAnswerB{
	First, it might be necessary to invoke
\co{handle_reserve()} before \co{do_something_with()}.
But more relevant to memory ordering, the compiler is often within
its rights to hoist the comparison ahead of the dereferences,
which would allow the compiler to use \co{&reserve_int} instead
of the variable \co{p} that the hardware has tagged with
a dependency.
}\QuickQuizEndB
%
\QuickQuizE{
But it should be safe to compare two pointer variables, right?
After all, the compiler doesn't know the value
of either, so how can it possibly learn anything from the
comparison?
}\QuickQuizAnswerE{
%
\begin{listing}
\begin{fcvlabel}[ln:memorder:Breakable Dependencies With Non-Constant Comparisons]
\begin{VerbatimL}
int *gp1;
int *p;
int *q;
p = rcu_dereference(gp1);
q = get_a_pointer();
if (p == q)
handle_equality(p);
do_something_with(*p);
\end{VerbatimL}
\end{fcvlabel}
\caption{Breakable Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}
\end{listing}%
%
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Non-Constant Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
int *gp1;
int *p;
int *q;
p = rcu_dereference(gp1); \lnlbl[p]
q = get_a_pointer();
if (p == q) {
handle_equality(q);
do_something_with(*q); \lnlbl[q]
} else {
do_something_with(*p);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Non-Constant Comparisons}
\label{lst:memorder:Broken Dependencies With Non-Constant Comparisons}
\end{listing}%
%
Unfortunately, the compiler really can learn enough to
break your dependency chain, for example, as shown in
\cref{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}.
The compiler is within its rights to transform this code
into that shown in
\cref{lst:memorder:Broken Dependencies With Non-Constant Comparisons},
and might well make this transformation due to register pressure
if \co{handle_equality()} was inlined and needed a lot of registers.
\begin{fcvref}[ln:memorder:Broken Dependencies With Non-Constant Comparisons]
\Clnref{q} of this transformed code uses \co{q}, which although
equal to \co{p}, is not necessarily tagged by the hardware as
carrying a dependency.
Therefore, this transformed code does not necessarily guarantee
that \clnref{q} is ordered after \clnref{p}.\footnote{
Kudos to \ppl{Linus}{Torvalds} for providing this example.}
\end{fcvref}
}\QuickQuizEndE
}
Note that a series of inequality comparisons might, when taken together,
give the compiler enough information to determine the exact value of
the pointer, at which point the dependency is broken.
Furthermore, the compiler might be able to combine information from
even a single inequality comparison with other information to learn
the exact value, again breaking the dependency.
Pointers to elements in arrays are especially susceptible to this latter
form of dependency breakage.
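
For example, consider the following sketch, which assumes a two-element
array \co{arr[]} and further assumes that the compiler can somehow see
(perhaps via whole-program or link-time analysis) that \co{gp} always
references one of \co{arr[]}'s elements:

\begin{VerbatimN}[samepage=true]
struct foo arr[2];   /* Assumed two-element array. */
struct foo *gp;      /* Assumed to always point into arr[]. */
struct foo *p;

p = rcu_dereference(gp);
if (p != &arr[0])                  /* Compiler may deduce that p    */
        do_something_with(p->a);   /* equals &arr[1], use that      */
                                   /* constant address, and thereby */
                                   /* break the dependency chain.   */
\end{VerbatimN}

Any dereference that the compiler carries out through the deduced
constant address is no longer ordered after the \co{rcu_dereference()}.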
\subsubsection{Safe comparison of dependent pointers}
It turns out that there are several safe ways to compare dependent
pointers:
\begin{enumerate}
\item Comparisons against the \co{NULL} pointer.
In this case, all the compiler can learn is that the pointer
is \co{NULL}, in which case you are not allowed to
dereference it anyway.
\item The dependent pointer is never dereferenced, whether before or
after the comparison.
\item The dependent pointer is compared to a pointer that references
objects that were last modified a very long time ago, where
the only unconditionally safe value of ``a very long time ago'' is
``at compile time''.
The key point is that something other than the address or data
dependency guarantees ordering.
\item Comparisons between two pointers, each of which carries
an appropriate dependency.
For example, you have a pair of pointers, each carrying a
dependency, to data structures each containing a lock, and you
want to avoid \IX{deadlock} by acquiring the locks in address order.
\item The comparison is not-equal, and the compiler does not have
enough other information to deduce the value of the
pointer carrying the dependency.
\end{enumerate}
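For example, the \co{NULL}-pointer rule permits the familiar
check-then-dereference idiom, as in the following minimal sketch, which
assumes an RCU-protected global pointer \co{gp}:
\begin{VerbatimN}
struct foo *p;

p = rcu_dereference(gp);
if (p == NULL)            /* Compiler learns only that p is NULL here, */
	return;           /* and NULL may not be dereferenced anyway.  */
do_something_with(p->a);  /* So the address dependency stays intact.   */
\end{VerbatimN}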
\begin{listing}
\begin{fcvlabel}[ln:memorder:Broken Dependencies With Pointer Comparisons]
\begin{VerbatimL}[commandchars=\\\[\]]
struct foo { \lnlbl[foo:b]
int a;
int b;
int c;
}; \lnlbl[foo:e]
struct foo *gp1; \lnlbl[gp1]
struct foo *gp2; \lnlbl[gp2]
void updater(void) \lnlbl[upd:b]
{
struct foo *p;
p = malloc(sizeof(*p)); \lnlbl[upd:alloc]
BUG_ON(!p); \lnlbl[upd:bug]
p->a = 42; \lnlbl[upd:init:a]
p->b = 43;
p->c = 44; \lnlbl[upd:init:c]
rcu_assign_pointer(gp1, p); \lnlbl[upd:assign1]
WRITE_ONCE(p->b, 143); \lnlbl[upd:upd:b]
WRITE_ONCE(p->c, 144); \lnlbl[upd:upd:c]
rcu_assign_pointer(gp2, p); \lnlbl[upd:assign2]
} \lnlbl[upd:e]
void reader(void) \lnlbl[read:b]
{
struct foo *p;
struct foo *q;
int r1, r2 = 0;
p = rcu_dereference(gp2); \lnlbl[read:gp2]
if (p == NULL) \lnlbl[read:nulchk]
return; \lnlbl[read:nulret]
r1 = READ_ONCE(p->b); \lnlbl[read:pb]
q = rcu_dereference(gp1); \lnlbl[read:gp1]
if (p == q) { \lnlbl[read:equ]
r2 = READ_ONCE(p->c); \lnlbl[read:pc]
}
do_something_with(r1, r2);
} \lnlbl[read:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Broken Dependencies With Pointer Comparisons}
\label{lst:memorder:Broken Dependencies With Pointer Comparisons}
\end{listing}
Pointer comparisons can be quite tricky, and so it is well worth working
through the example shown in
\cref{lst:memorder:Broken Dependencies With Pointer Comparisons}.
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons]
This example uses a simple \co{struct foo} shown on \clnrefrange{foo:b}{foo:e}
and two global pointers, \co{gp1} and \co{gp2}, shown on \clnref{gp1,gp2},
respectively.
This example uses two threads, namely \co{updater()} on
\clnrefrange{upd:b}{upd:e} and \co{reader()} on \clnrefrange{read:b}{read:e}.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:upd]
The \co{updater()} thread allocates memory on \clnref{alloc}, and complains
bitterly on \clnref{bug} if none is available.
\Clnrefrange{init:a}{init:c} initialize the newly allocated structure,
and then \clnref{assign1} assigns the pointer to \co{gp1}.
\Clnref{upd:b,upd:c} then update two of the structure's fields, and do
so \emph{after} \clnref{assign1} has made those fields visible to readers.
Please note that unsynchronized update of reader-visible fields
often constitutes a bug.
Although there are legitimate use cases doing just this, such use cases
require more care than is exercised in this example.
Finally, \clnref{assign2} assigns the pointer to \co{gp2}.
\end{fcvref}
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
The \co{reader()} thread first fetches \co{gp2} on \clnref{gp2}, with
\clnref{nulchk,nulret} checking for \co{NULL} and returning if so.
\Clnref{pb} fetches field \co{->b} and
\clnref{gp1} fetches \co{gp1}.
If \clnref{equ} sees that the pointers fetched on \clnref{gp2,gp1}
are equal, \clnref{pc} fetches \co{p->c}.
Note that \clnref{pc} uses pointer \co{p} fetched on \clnref{gp2}, not
pointer \co{q} fetched on \clnref{gp1}.
But this difference might not matter.
An equals comparison on \clnref{equ} might lead the compiler to (incorrectly)
conclude that both pointers are equivalent, when in fact they carry
different dependencies.
This means that the compiler might well transform \clnref{pc} to instead
be \co{r2 = READ_ONCE(q->c)}, which might well cause the value 44 to be loaded
instead of the expected value 144.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
But doesn't the condition in \clnref{equ} supply a control dependency
that would keep \clnref{pc} ordered after \clnref{gp1}?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read]
Yes, but no.
Yes, there is a control dependency, but control dependencies do
not order later loads, only later stores.
If you really need ordering, you could place an \co{smp_rmb()}
between \clnref{equ,pc}.
Or better yet, have \co{updater()}
allocate two structures instead of reusing the structure.
For more information, see
\cref{sec:memorder:Control-Dependency Calamities}.
\end{fcvref}
}\QuickQuizEnd
In short, great care is required to ensure that dependency
chains in your source code are still dependency chains in the
compiler-generated assembly code.
\subsection{Control-Dependency Calamities}
\label{sec:memorder:Control-Dependency Calamities}
The \IXalth{control dependencies}{control}{dependency} described in
\cref{sec:memorder:Control Dependencies}
are attractive due to their low overhead, but are also especially
tricky because current compilers do not understand them and can easily
break them.
The rules and examples in this section are intended to help you
prevent your compiler's ignorance from breaking your code.
A load-load control dependency requires a full \IXh{read}{memory barrier},
not simply a data dependency barrier.
Consider the following bit of code:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
<data dependency barrier>
q = READ_ONCE(y);
}
\end{VerbatimN}
This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from~\co{y} as having happened before the load from~\co{x}.
In such a case what's actually required is:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
<read barrier>
q = READ_ONCE(y);
}
\end{VerbatimN}
However, stores are not speculated.
This means that ordering \emph{is} provided for load-store control
dependencies, as in the following example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
WRITE_ONCE(y, 1);
\end{VerbatimN}
Control dependencies pair normally with other types of ordering operations.
That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()}
are optional!
Without the \co{READ_ONCE()}, the compiler might fuse the load
from~\co{x} with other loads from~\co{x}.
Without the \co{WRITE_ONCE()}, the compiler might fuse the store
to~\co{y} with other stores to~\co{y}, or, worse yet, read the
value, compare it, and only conditionally do the store.
Any of these can result in highly counter-intuitive effects on ordering.
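For example, had the load-store example above used plain C-language
accesses, the compiler would have been within its rights to rewrite it
along the following lines (a purely hypothetical transformation, shown
only to illustrate the hazard):
\begin{VerbatimN}
q = x;           /* Plain load, which might also be fused    */
                 /* with other loads from x.                 */
if (q) {
	if (y != 1)  /* Compiler-invented load and check of y,   */
		y = 1;   /* so the store might never execute at all, */
}                /* with counter-intuitive effects on the    */
                 /* intended load-to-store ordering.         */
\end{VerbatimN}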
Worse yet, if the compiler is able to prove (say) that the value of
variable~\co{x} is always non-zero, it would be well within its rights
to optimize the original example by eliminating the \qco{if} statement
as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
\end{VerbatimN}
\QuickQuiz{
But there is a \co{READ_ONCE()}, so how can the compiler
prove anything about the value of \co{q}?
}\QuickQuizAnswer{
Given the simple \co{if} statement comparing against zero,
it is hard to imagine the compiler proving anything.
But suppose that later code executed a division by \co{q}.
Because division by zero is undefined behavior, as of 2023,
many compilers will assume that the value of \co{q} must
be non-zero, and will thus remove that \co{if} statement,
thus unconditionally executing the \co{WRITE_ONCE()}, in
turn destroying the control dependency.
There are some who argue (correctly, in Paul's view) that
back-propagating undefined behavior across volatile accesses
constitutes a compiler bug, but many compiler writers insist
that this is not a bug, but rather a valuable optimization.
}\QuickQuizEnd
It is tempting to try to enforce ordering on identical stores on both
branches of the \qco{if} statement as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
barrier();
WRITE_ONCE(y, 1);
do_something();
} else {
barrier();
WRITE_ONCE(y, 1);
do_something_else();
}
\end{VerbatimN}
Unfortunately, current compilers will transform this as follows at high
optimization levels:
\begin{VerbatimN}
q = READ_ONCE(x);
barrier();
WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
if (q)
do_something();
else
do_something_else();
\end{VerbatimN}
Now there is no conditional between the load from~\co{x} and the store
to~\co{y}, which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory-ordering operations, for example, a \IX{release store}:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q) {
smp_store_release(&y, 1);
do_something();
} else {
smp_store_release(&y, 1);
do_something_else();
}
\end{VerbatimN}
The initial \co{READ_ONCE()} is still required to prevent the compiler from
guessing the value of~\co{x}.
In addition, you need to be careful what you do with the local variable~%
\co{q},
otherwise the compiler might be able to guess its value and again remove
the needed conditional.
For example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q % MAX) {
WRITE_ONCE(y, 1);
do_something();
} else {
WRITE_ONCE(y, 2);
do_something_else();
}
\end{VerbatimN}
If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 2);
do_something_else();
\end{VerbatimN}
Given this transformation, the CPU is not required to respect the ordering
between the load from variable~\co{x} and the store to variable~\co{y}.
It is tempting to add a \co{barrier()} to constrain the compiler,
but this does not help.
The conditional is gone, and the \co{barrier()} won't bring it back.
Therefore, if you are relying on this ordering, you should make sure
that \co{MAX} is greater than one, perhaps as follows:
\begin{VerbatimN}
q = READ_ONCE(x);
BUILD_BUG_ON(MAX <= 1);
if (q % MAX) {
WRITE_ONCE(y, 1);
do_something();
} else {
WRITE_ONCE(y, 2);
do_something_else();
}
\end{VerbatimN}
Please note once again that the stores to~\co{y} differ.
If they were identical, as noted earlier, the compiler could pull this
store outside of the \qco{if} statement.
You must also avoid excessive reliance on boolean short-circuit evaluation.
Consider this example:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q || 1 > 0)
WRITE_ONCE(y, 1);
\end{VerbatimN}
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
\begin{VerbatimN}
q = READ_ONCE(x);
WRITE_ONCE(y, 1);
\end{VerbatimN}
This example underscores the need to ensure that the compiler cannot
out-guess your code.
Never forget that, although \co{READ_ONCE()} does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the value loaded.
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do
not necessarily apply to code following the if-statement:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
WRITE_ONCE(y, 1);
else
WRITE_ONCE(y, 2);
WRITE_ONCE(z, 1); /* BUG: No ordering. */
\end{VerbatimN}
It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to~\co{y} with the condition.
Unfortunately for this line
of reasoning, the compiler might compile the two writes to~\co{y} as
conditional-move instructions, as in this fanciful pseudo-assembly
language:
\begin{VerbatimN}
ld r1,x
cmp r1,$0
cmov,ne r4,$1
cmov,eq r4,$2
st r4,y
st $1,z
\end{VerbatimN}
A weakly ordered CPU would have no dependency of any sort between the load
from~\co{x} and the store to~\co{z}.
The control dependencies would extend
only to the pair of \co{cmov} instructions and the store depending on them.
In short, control dependencies apply only to the stores in the \qco{then}
and \qco{else} of the \qco{if} in question (including functions invoked by
those two clauses), and not necessarily to code following that \qco{if}.
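For example, in the following minimal sketch, the hypothetical
\co{do_store()} helper is assumed to do nothing but execute
\co{WRITE_ONCE(y, 1)}:
\begin{VerbatimN}
q = READ_ONCE(x);
if (q)
	do_store();        /* The WRITE_ONCE(y, 1) within do_store()    */
                       /* is ordered after the load from x.         */
WRITE_ONCE(z, 1);      /* This store after the "if" is not ordered. */
\end{VerbatimN}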
Finally, control dependencies do \emph{not} provide cumulativity.\footnote{
Refer to \cref{sec:memorder:Cumulativity} for
the meaning of cumulativity.}
This is demonstrated by two related litmus tests, namely
\cref{lst:memorder:LB Litmus Test With Control Dependency,%
lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
with the initial values
of~\co{x} and~\co{y} both being zero.
\begin{listing}
\input{CodeSamples/formal/litmus/C-LB+o-cgt-o+o-cgt-o@whole.fcv}
\caption{LB Litmus Test With Control Dependency}
\label{lst:memorder:LB Litmus Test With Control Dependency}
\end{listing}
The \co{exists} clause in the two-thread example of
\cref{lst:memorder:LB Litmus Test With Control Dependency}
(\path{C-LB+o-cgt-o+o-cgt-o.litmus})
will never trigger.
If control dependencies guaranteed cumulativity (which they do
not), then adding a thread to the example as in
\cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
(\path{C-WWC+o-cgt-o+o-cgt-o+o.litmus})
would guarantee the related \co{exists} clause never to trigger.
\begin{listing}
\input{CodeSamples/formal/litmus/C-WWC+o-cgt-o+o-cgt-o+o@whole.fcv}
\caption{WWC Litmus Test With Control Dependency (Cumulativity?)}
\label{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}
\end{listing}
But because control dependencies do \emph{not} provide cumulativity, the
\co{exists} clause in the three-thread litmus test can trigger.
If you need the three-thread example to provide ordering, you will need
\co{smp_mb()} between the load and store in \co{P0()},
that is, just before or just after the \qco{if} statements.
Furthermore, the original two-thread example is very fragile and should be avoided.
\QuickQuiz{
Can't you instead add an \co{smp_mb()} to \co{P1()} in
\cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}?
}\QuickQuizAnswer{
Not given the Linux kernel memory model.
(Try it!)
However, you can instead replace \co{P0()}'s
\co{WRITE_ONCE()} with \co{smp_store_release()},
which usually has less overhead than does adding an \co{smp_mb()}.
}\QuickQuizEnd
The following list of rules summarizes the lessons of this section:
\begin{enumerate}
\item Compilers do not understand control dependencies, so it is
your job to make sure that the compiler cannot break your code.
\item Control dependencies can order prior loads against later stores.
However, they do \emph{not} guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
later anything.
If you need these other forms of ordering, use \co{smp_rmb()},
\co{smp_wmb()}, or, in the case of prior stores and later loads,
\co{smp_mb()}.
\item If both legs of the \qco{if} statement begin with identical stores
to the same variable, then the control dependency will not order
those stores.
If ordering is needed, precede both of them with \co{smp_mb()} or
use \co{smp_store_release()}.
Please note that it is \emph{not} sufficient to use \co{barrier()}
at the beginning of each leg of the \qco{if} statement because, as shown
by the example above, optimizing compilers can destroy the control
dependency while respecting the letter of the \co{barrier()} law.
\item Control dependencies require at least one run-time conditional
between the prior load and the subsequent store, and this
conditional must involve the prior load.
If the compiler is able to optimize the conditional away, it
will have also optimized away the ordering.
Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help
to preserve the needed conditional.
\item Control dependencies require that the compiler avoid reordering
the dependency into nonexistence.
Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or
\co{atomic64_read()} can help to preserve your control
dependency.
\item Control dependencies apply only to the \qco{then} and
\qco{else} of the \qco{if} containing the control
dependency, including any functions that these two clauses call.
Control dependencies do \emph{not} apply to code following the
end of the \qco{if} statement containing the control dependency.
\item Control dependencies pair normally with other types of
memory-ordering operations.
\item Control dependencies do \emph{not} provide cumulativity.
If you need cumulativity, use something that provides it,
such as \co{smp_store_release()} or \co{smp_mb()}.
\end{enumerate}
Again, many popular languages were designed with single-threaded use
in mind.
Successful multithreaded use of these languages requires you to pay
special attention to your memory references and dependencies.
\section{Higher-Level Primitives}
\label{sec:memorder:Higher-Level Primitives}
%
\epigraph{Method will teach you to win time.}
{Johann Wolfgang von Goethe}
The answer to one of the quick quizzes in
\cref{sec:formal:Axiomatic Approaches and Locking}
demonstrated exponential speedups due to verifying programs
modeled at higher levels of abstraction.
This section will look into how higher levels of abstraction can
also provide a deeper understanding of the synchronization primitives
themselves.
\Cref{sec:memorder:Memory Allocation}
takes a look at memory allocation,
\cref{sec:memorder:Locking}
examines the surprisingly varied semantics of locking, and
\cref{sec:memorder:RCU}
digs more deeply into RCU\@.
\subsection{Memory Allocation}
\label{sec:memorder:Memory Allocation}
\Cref{sec:SMPdesign:Parallel Fastpath for Resource Allocation}
touched upon memory allocation, and this section expands upon the relevant
memory-ordering issues.
The key requirement is that any access executed on a given block of
memory before freeing that block must be ordered before any access
executed after that same block is reallocated.
It would after all be a cruel and unusual memory-allocator bug if a store
preceding the free were to be reordered after another store following
the reallocation!
However, it would also be cruel and unusual to require developers to use
\co{READ_ONCE()} and \co{WRITE_ONCE()} to access dynamically allocated
memory.
Full ordering must therefore be provided for plain accesses, in spite of
all the shared-variable shenanigans called out in
\cref{sec:toolsoftrade:Shared-Variable Shenanigans}.
Of course, each CPU sees its own accesses in order, and the compiler
has always fully accounted for intra-CPU shenanigans, give or take
the occasional compiler bug.
These facts are what enables the lockless fastpaths in
\co{memblock_alloc()} and \co{memblock_free()}, which are shown in
\cref{lst:SMPdesign:Allocator-Cache Allocator Function,%
lst:SMPdesign:Allocator-Cache Free Function},
respectively.
However, this is also why the developer is responsible for providing
appropriate ordering (for example, by using \co{smp_store_release()})
when publishing a pointer to a newly allocated block of memory.
After all, in the CPU-local case, the allocator has not necessarily
provided any cross-CPU ordering.
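For example, publication might look something like the following
sketch, in which the global pointer \co{gp}, the \co{initialize_block()}
helper, and the \co{struct memblock} element type are illustrative
assumptions rather than part of the allocator of
\cref{sec:SMPdesign:Parallel Fastpath for Resource Allocation}:
\begin{VerbatimN}
struct memblock *p;

p = memblock_alloc();          /* Lockless fastpath: no cross-CPU  */
if (p) {                       /* ordering is provided here.       */
	initialize_block(p);       /* Plain accesses suffice ...       */
	smp_store_release(&gp, p); /* ... but publication must supply  */
}                              /* the ordering that readers need.  */
\end{VerbatimN}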
For its part, the allocator must provide ordering when rebalancing
its per-thread pools.
This ordering is provided by the calls to \co{spin_lock()} and
\co{spin_unlock()} from \co{memblock_alloc()} and \co{memblock_free()}.
For any block that has migrated from one thread to another, the old
thread will have executed \co{spin_unlock(&globalmem.mutex)} after
placing the block in the \co{globalmem} pool, and the new thread will
have executed \co{spin_lock(&globalmem.mutex)} before moving that
block to its per-thread pool.
This \co{spin_unlock()}/\co{spin_lock()} pair ensures that both the
old and new threads see the old thread's accesses as having happened
before those of the new thread.
\QuickQuiz{
But doesn't PowerPC have weak unlock-lock ordering properties
within the Linux kernel, allowing a write before the unlock to
be reordered with a read after the lock?
}\QuickQuizAnswer{
Yes, but only from the perspective of a third thread not holding
that lock.
In contrast, memory allocators need only concern themselves with
the two threads migrating the memory.
It is after all the developer's responsibility to properly
synchronize with any other threads that need access to the newly
migrated block of memory.
}\QuickQuizEnd
Therefore, the ordering required by conventional uses of memory allocation
can be provided solely by non-fastpath locking, allowing the fastpath to
remain synchronization-free.
\subsection{Locking}
\label{sec:memorder:Locking}
Locking is a well-known synchronization primitive with which the
parallel-programming community has had decades of experience.
As such, locking's semantics are quite simple.
That is, they are quite simple until you start trying to mathematically
model them.
The simple part is that any CPU or thread holding a given lock is
guaranteed to see any accesses executed by CPUs or threads while they
were previously holding that same lock.
Similarly, any CPU or thread holding a given lock is guaranteed not
to see accesses that will be executed by other CPUs or threads while
subsequently holding that same lock.
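For example, in the following sketch (which assumes a spinlock
\co{mylock} and a shared variable \co{x}), if \co{t1()} acquires
\co{mylock} after \co{t0()} has released it, then \co{t1()} is
guaranteed to observe \co{t0()}'s store:
\begin{VerbatimN}
void t0(void)
{
	spin_lock(&mylock);
	WRITE_ONCE(x, 1);
	spin_unlock(&mylock);
}

void t1(void)
{
	int r1;

	spin_lock(&mylock);    /* If this acquisition follows t0()'s */
	r1 = READ_ONCE(x);     /* release, r1 is guaranteed to be 1. */
	spin_unlock(&mylock);
}
\end{VerbatimN}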
And what else is there?
As it turns out, quite a bit:
\begin{enumerate}
\item Are CPUs, threads, or compilers allowed to pull memory accesses
into a given lock-based critical section?
\item Will a CPU or thread holding a given lock also be guaranteed
to see accesses executed by CPUs and threads before they last
acquired that same lock, and vice versa?
\item Suppose that a given CPU or thread executes one access
(call it ``A''), releases a lock, reacquires that same lock,
then executes another access (call it ``B'')\@.
Is some other CPU or thread not holding that lock guaranteed to
see A and B in order?
\item As above, but with the lock reacquisition carried out by some
other CPU or thread?
\item As above, but with the second acquisition being of some other lock?
\item What ordering guarantees are provided by \co{spin_is_locked()}?
\end{enumerate}
The reaction to some or even all of these questions might well be ``Why
would anyone do \emph{that}?''
However, any complete mathematical definition of locking must have
answers to all of these questions.
Therefore, the following sections address these questions in the context
of the Linux kernel.
\subsubsection{Accesses Into Critical Sections?}
\label{sec:memorder:Accesses Into Critical Sections?}
Can memory accesses be reordered into lock-based critical sections?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-before-into@whole.fcv}
\caption{Prior Accesses Into Critical Section (Ordering?)}
\label{lst:memorder:Prior Accesses Into Critical Section (Ordering?)}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-after-into@whole.fcv}
\caption{Subsequent Accesses Into Critical Section (Ordering?)}
\label{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)}
\end{listing}
Within the context of the Linux-kernel memory model, the simple answer
is ``yes''.
This may be verified by running the litmus tests shown in
\cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?),%
lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)}
(\path{C-Lock-before-into.litmus} and \path{C-Lock-after-into.litmus},
respectively), both of which will yield the \co{Sometimes} result.
This result indicates that the \co{exists} clause can be satisfied, that
is, that the final value of both \co{P0()}'s and \co{P1()}'s \co{r1} variable
can be zero.
This means that neither \co{spin_lock()} nor \co{spin_unlock()}
are required to act as a \IXh{full}{memory barrier}.
However, other environments might make other choices.
For example, locking implementations that run only on the x86 CPU
family will have lock-acquisition primitives that fully order the lock
acquisition with any prior and any subsequent accesses.
Therefore, on such systems the ordering shown in
\cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?)}
comes for free.
There are x86 lock-release implementations that are weakly ordered,
thus failing to provide the ordering shown in
\cref{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)},
but an implementation could nevertheless choose to guarantee this ordering.
For their part, weakly ordered systems might well choose to execute
the memory-barrier instructions required to guarantee both orderings,
possibly simplifying code making advanced use of combinations of locked
and lockless accesses.
However, as noted earlier, \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}
chooses not to provide these additional
orderings, in part to avoid imposing performance penalties on the simpler
and more prevalent locking use cases.
Instead, the \co{smp_mb__after_spinlock()} and \co{smp_mb__after_unlock_lock()}
primitives are provided for those more complex use cases, as discussed
in \cref{sec:memorder:Hardware Specifics}.
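For example, one common pattern (shown here as a minimal sketch with
assumed variables \co{a} and \co{b}) places
\co{smp_mb__after_spinlock()} immediately after the acquisition in
order to obtain full ordering:
\begin{VerbatimN}
WRITE_ONCE(a, 1);          /* Access preceding the critical section. */
spin_lock(&mylock);
smp_mb__after_spinlock();  /* Upgrade the acquisition to provide     */
r1 = READ_ONCE(b);         /* full ordering, so that this load is    */
spin_unlock(&mylock);      /* ordered after the store to a.          */
\end{VerbatimN}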
Thus far, this section has discussed only hardware reordering.
Can the compiler also reorder memory references into lock-based
critical sections?
The answer to this question in the context of the Linux kernel is a
resounding ``No!''
One reason for this otherwise inexplicable favoring of hardware reordering
over compiler optimizations is that the hardware will avoid reordering
a page-faulting access into a lock-based critical section.
In contrast, compilers have no clue about page faults, and would
therefore happily reorder a page fault into a critical section, which
could crash the kernel.
The compiler is also unable to reliably determine which accesses
will result in cache misses, so that compiler reordering into critical
sections could also result in excessive lock contention.
Therefore, the Linux kernel prohibits the compiler (but not the CPU)
from moving accesses into lock-based critical sections.
\subsubsection{Accesses Outside of Critical Section?}
\label{sec:memorder:Accesses Outside of Critical Section?}
If a given CPU or thread holds a given lock, it is guaranteed to see
accesses executed during all prior critical sections for that same
lock.
Similarly, such a CPU or thread is guaranteed not to see accesses
that will be executed during all subsequent critical sections for
that same lock.
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-outside-across@whole.fcv}
\caption{Accesses Outside of Critical Sections}
\label{lst:memorder:Accesses Outside of Critical Sections}
\end{listing}
But what about accesses preceding prior critical sections and
following subsequent critical sections?
This question can be answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Outside of Critical Sections}
(\path{C-Lock-outside-across.litmus}).
Running this litmus test yields the \co{Never} result,
which means that accesses in code leading up to a prior critical section
are also visible to the current CPU or thread holding that same lock.
Similarly, accesses in code placed after a subsequent critical section
are never visible to the current CPU or thread holding that same lock.
As a result, the Linux kernel cannot allow accesses to be moved
across the entirety of a given critical section.
Other environments might well wish to allow such code motion, but please
be advised that doing so is likely to yield profoundly counter-intuitive
results.
In short, the ordering provided by \co{spin_lock()} extends not only
throughout the critical section, but also indefinitely beyond the end
of that critical section.
Similarly, the ordering provided by \co{spin_unlock()} extends not
only throughout the critical section, but also indefinitely beyond the
beginning of that critical section.
\subsubsection{Ordering for Non-Lock Holders?}
\label{sec:memorder:Ordering for Non-Lock Holders?}
Does a CPU or thread that is not holding a given lock see that lock's
critical sections as being ordered?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-1@whole.fcv}
\caption{Accesses Between Same-CPU Critical Sections (Ordering?)}
\label{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)}
\end{listing}
This question can be answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)}
(\path{C-Lock-across-unlock-lock-1.litmus}), which
shows an example where \co{P(0)} places its write and read in two
different critical sections for the same lock.
Running this litmus test shows that the \co{exists} clause can be satisfied,
which means that the answer is ``no'', and that CPUs can reorder accesses
across consecutive critical sections.
In other words, not only are \co{spin_lock()} and \co{spin_unlock()}
weaker than a full barrier when considered separately, they are also
weaker than a full barrier when taken together.
If the ordering of a given lock's critical sections is to be observed,
then either the observer must hold that lock, or either
\co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()}
must be executed just after the second lock acquisition.
But what if the two critical sections run on different CPUs or threads?
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-2@whole.fcv}
\caption{Accesses Between Different-CPU Critical Sections (Ordering?)}
\label{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)}
\end{listing}
This question is answered for the Linux kernel by referring to
\cref{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)}
(\path{C-Lock-across-unlock-lock-2.litmus}),
in which the first lock acquisition is executed by \co{P0()} and the
second lock acquisition is executed by \co{P1()}.
Note that \co{P1()} must read \co{x} to reject executions in which
\co{P1()} executes before \co{P0()} does.
Running this litmus test shows that the \co{exists} clause can be satisfied,
which means that the answer is ``no'', and that CPUs can reorder accesses
across consecutive critical sections, even if each of those critical
sections runs on a different CPU or thread.
\QuickQuiz{
But if there are three critical sections, isn't it true that
CPUs not holding the lock will observe the accesses from the
first and the third critical section as being ordered?
}\QuickQuizAnswer{
No.
\begin{listing}
\input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-3@whole.fcv}
\caption{Accesses Between Multiple Different-CPU Critical Sections}
\label{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections}
\end{listing}
\Cref{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections}
shows an example three-critical-section chain
(\path{Lock-across-unlock-lock-3.litmus}).
Running this litmus test shows that the \co{exists} clause can
still be satisfied, so this additional critical section is still
not sufficient to force ordering.
However, as the reader can verify, placing an
\co{smp_mb__after_spinlock()} after either \co{P1()}'s or
\co{P2()}'s lock acquisition does suffice to force ordering.
}\QuickQuizEnd
As before, if the ordering of a given lock's critical sections is to
be observed, then either the observer must hold that lock or either
\co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()} must
be executed just after \co{P1()}'s lock acquisition.
Given that ordering is not guaranteed when both critical sections are
protected by the same lock, there is no hope of any ordering guarantee
when different locks are used.
However, readers are encouraged to construct the corresponding litmus
test and see this for themselves.
This situation can seem counter-intuitive, but it is rare for code to
care.
This approach also allows certain weakly ordered systems to implement
locks more efficiently.
\subsubsection{Ordering for \tco{spin_is_locked()}?}
\label{sec:memorder:Ordering for spin-is-locked()?}
The Linux kernel's \co{spin_is_locked()} primitive returns
\co{true} if the specified lock is held and \co{false} otherwise.
Note that \co{spin_is_locked()} returns \co{true} when some other
CPU or thread holds the lock, not just when the current CPU or thread
holds that lock.
This raises the question of what ordering guarantees \co{spin_is_locked()}
might provide.
In the Linux kernel, the answer has varied over time.
Initially, \co{spin_is_locked()} was unordered, but a few interesting
use cases motivated strong ordering.
Later discussions surrounding the Linux-kernel memory model concluded
that \co{spin_is_locked()} should be used only for debugging.
Part of the reason for this is that even a fully ordered
\co{spin_is_locked()} might return \co{true} because some other CPU or
thread was just about to release the lock in question.
In this case, there is little that can be learned from that return value
of \co{true}, which means that reliable use of \co{spin_is_locked()}
is surprisingly complex.
Other approaches almost always work better, for example, use of explicit
shared variables or the \co{spin_trylock()} primitive.
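For example, code tempted to check \co{spin_is_locked()} before doing
lock-protected work can often instead be structured around
\co{spin_trylock()}, as in this minimal sketch with hypothetical
\co{do_locked_work()} and \co{do_unlocked_fallback()} helpers:
\begin{VerbatimN}
if (spin_trylock(&mylock)) {
	do_locked_work();       /* Lock held: these accesses are ordered   */
	spin_unlock(&mylock);   /* against other critical sections.        */
} else {
	do_unlocked_fallback(); /* Some other CPU or thread holds (or just */
}                           /* held) the lock.                         */
\end{VerbatimN}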
This situation resulted in the current state, namely that
\co{spin_is_locked()} provides no ordering guarantees, except that if
it returns \co{false}, the current CPU or thread cannot be holding the
corresponding lock.
\QuickQuiz{
But if \co{spin_is_locked()} returns \co{false}, don't we also
know that no other CPU or thread is holding the corresponding
lock?
}\QuickQuizAnswer{
No.
By the time that the code inspects the return value from
\co{spin_is_locked()}, some other CPU or thread might well have
acquired the corresponding lock.
}\QuickQuizEnd
\subsubsection{Why Mathematically Model Locking?}
\label{sec:memorder:Why Mathematically Model Locking?}
Given all these possible choices, why model locking in general?
Why not simply model a simple implementation?
One reason is modeling performance, as shown in
\cref{tab:formal:Locking: Modeling vs. Emulation Time (s)}
on
\cpageref{tab:formal:Locking: Modeling vs. Emulation Time (s)}.
Directly modeling locking in general is orders of magnitude faster
than emulating even a trivial implementation.
This should be no surprise, given the combinatorial explosion experienced
by present-day formal-verification tools with increases in the number of
memory accesses executed by the code being modeled.
Splitting the modeling at API boundaries can therefore result in
combinatorial implosion.
Another reason is that a trivial implementation might needlessly constrain
either real implementations or real use cases.
In contrast, modeling a platonic lock allows the widest variety of
implementations while providing specific guidance to locks' users.
\subsection{RCU}
\label{sec:memorder:RCU}
As described in
\cref{sec:defer:RCU Fundamentals},
the fundamental property of RCU grace periods is this straightforward
two-part guarantee:
\begin{enumerate*}[(1)]
\item If any part of a given RCU read-side critical section precedes
the beginning of a given \IX{grace period}, then the entirety of that
critical section precedes the end of that grace period.
\item If any part of a given RCU read-side critical section follows
the end of a given grace period, then the entirety of that
critical section follows the beginning of that grace period.
\end{enumerate*}
These guarantees are summarized in
\cref{fig:memorder:RCU Grace-Period Ordering Guarantees},
where the grace period is denoted by the dashed arrow between the
\co{call_rcu()} invocation in the upper right and the corresponding
RCU callback invocation in the lower left.\footnote{
For more detail, please see
\crefrange{fig:defer:RCU Reader and Later Grace Period}{fig:defer:RCU Reader Within Grace Period}
starting on
\cpageref{fig:defer:RCU Reader and Later Grace Period}.}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{memorder/RCUGPordering}}
\caption{RCU Grace-Period Ordering Guarantees}
\label{fig:memorder:RCU Grace-Period Ordering Guarantees}
\end{figure}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul@whole.fcv}
\caption{RCU Fundamental Property}
\label{lst:memorder:RCU Fundamental Property}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+i-rl-o-o-rul@whole.fcv}
\caption{RCU Fundamental Property and Reordering}
\label{lst:memorder:RCU Fundamental Property and Reordering}
\end{listing}
In short, an RCU read-side critical section is guaranteed never to
completely overlap an RCU grace period, as demonstrated by
\cref{lst:memorder:RCU Fundamental Property}
(\path{C-SB+o-rcusync-o+rl-o-o-rul.litmus}).
Either or neither of the \co{r2} registers can have the final value of zero,
but at least one of them must be non-zero (that is, the cycle identified
by the \co{exists} clause is prohibited), courtesy of RCU's fundamental
grace-period guarantee, as can be seen by running \co{herd} on this litmus test.
Note that this guarantee is insensitive to the ordering of the accesses
within \co{P1()}'s critical section, so the litmus test shown in
\cref{lst:memorder:RCU Fundamental Property and Reordering}\footnote{
Dependencies can of course limit the ability to reorder accesses
within RCU read-side critical sections.}
also forbids this same cycle.
However, this definition is incomplete, as can be seen from the following
list of questions:\footnote{
Several of which were introduced to Paul by \ppl{Jade}{Alglave} during
early work on LKMM, and a few more of which came from other
LKMM participants~\cite{Alglave:2018:FSC:3173162.3177156}.}
\begin{enumerate}
\item What ordering is provided by \co{rcu_read_lock()}
and \co{rcu_read_unlock()}, independent of RCU grace periods?
\item What ordering is provided by \co{synchronize_rcu()}
and \co{synchronize_rcu_expedited()}, independent of RCU read-side
critical sections?
\item If the entirety of a given RCU read-side critical section
precedes the end of a given RCU grace period, what about
accesses preceding that critical section?
\item If the entirety of a given RCU read-side critical section
follows the beginning of a given RCU grace period, what about
accesses following that critical section?
\item What happens in situations involving more than one RCU read-side
critical section and/or more than one RCU grace period?
\item What happens when RCU is combined with other memory-ordering
mechanisms?
\end{enumerate}
These questions are addressed in the following sections.
\subsubsection{RCU Read-Side Ordering}
\label{sec:memorder:RCU Read-Side Ordering}
On their own, RCU's read-side primitives \co{rcu_read_lock()} and
\co{rcu_read_unlock()} provide no ordering whatsoever.
In particular, despite their names, they do not act like locks, as can
be seen in
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}
(\path{C-LB+rl-o-o-rul+rl-o-o-rul.litmus}).
This litmus test's cycle is allowed:
Both instances of the \co{r1} register can have final values of 1.
\begin{listing}
\input{CodeSamples/formal/herd/C-LB+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{RCU Readers Provide No Lock-Like Ordering}
\label{lst:memorder:RCU Readers Provide No Lock-Like Ordering}
\end{listing}
Nor do these primitives have barrier-like ordering properties,
at least not unless there is a grace period in the mix, as can be seen in
\cref{lst:memorder:RCU Readers Provide No Barrier-Like Ordering}
(\path{C-LB+o-rl-rul-o+o-rl-rul-o.litmus}).
This litmus test's cycle is also allowed.
(Try it!)
\begin{listing}
\input{CodeSamples/formal/herd/C-LB+o-rl-rul-o+o-rl-rul-o@whole.fcv}
\caption{RCU Readers Provide No Barrier-Like Ordering}
\label{lst:memorder:RCU Readers Provide No Barrier-Like Ordering}
\end{listing}
Of course, lack of ordering in both these litmus tests should be absolutely
no surprise, given that both \co{rcu_read_lock()} and \co{rcu_read_unlock()}
are no-ops in the \IXacr{qsbr} implementation of RCU\@.
\subsubsection{RCU Update-Side Ordering}
\label{sec:memorder:RCU Update-Side Ordering}
In contrast with RCU readers, the RCU update-side functions
\co{synchronize_rcu()} and \co{synchronize_rcu_expedited()}
provide memory ordering at least as strong as \co{smp_mb()},\footnote{
And also way more expensive!}
as can be seen by running \co{herd} on the litmus test shown in
\cref{lst:memorder:RCU Updaters Provide Full Ordering}.
This test's cycle is prohibited, just as it would with \co{smp_mb()}.
This should be no surprise given the information presented in
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o@whole.fcv}
\caption{RCU Updaters Provide Full Ordering}
\label{lst:memorder:RCU Updaters Provide Full Ordering}
\end{listing}
\subsubsection{RCU Readers:
Before and After}
\label{sec:memorder:RCU Readers: Before and After}
Before reading this section, it would be well to reflect on the distinction
between guarantees that are available and guarantees that maintainable
software should rely on.
Keeping that firmly in mind, this section presents a few of the
more exotic RCU guarantees.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-o-rul@whole.fcv}
\caption{What Happens Before RCU Readers?}
\label{lst:memorder:What Happens Before RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens Before RCU Readers?}
(\path{C-SB+o-rcusync-o+o-rl-o-rul.litmus})
shows a litmus test similar to that in
\cref{lst:memorder:RCU Fundamental Property},
but with the RCU reader's first access preceding the RCU read-side critical
section, rather than the more conventional (and maintainable!\@) approach of
being contained within it.
Perhaps surprisingly, running \co{herd} on this litmus test gives the
same result as for that in
\cref{lst:memorder:RCU Fundamental Property}:
The cycle is forbidden.
Why would this be the case?
Because both of \co{P1()}'s accesses are volatile,
as discussed in
\cref{sec:toolsoftrade:A Volatile Solution},
the compiler is not permitted to reorder them.
This means that the code emitted for \co{P1()}'s \co{WRITE_ONCE()} will
precede that of \co{P1()}'s \co{READ_ONCE()}.
Therefore, RCU implementations that place memory-barrier instructions in
\co{rcu_read_lock()} and \co{rcu_read_unlock()} will preserve the ordering
of \co{P1()}'s two accesses all the way down to the hardware level.
On the other hand, RCU implementations that rely on interrupt-based
state machines will also fully preserve this ordering
\emph{relative to the grace period} due to the fact that interrupts take
place at a precise location in the execution of the interrupted code.
This in turn means that if the \co{WRITE_ONCE()} follows the end of a
given RCU grace period, then the accesses within \emph{and following}
that RCU read-side critical section must follow the beginning of that
same grace period.
Similarly, if the \co{READ_ONCE()} precedes the beginning of the grace
period, everything within \emph{and preceding} that critical section
must precede the end of that same grace period.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-rul-o@whole.fcv}
\caption{What Happens After RCU Readers?}
\label{lst:memorder:What Happens After RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens After RCU Readers?}
(\path{C-SB+o-rcusync-o+rl-o-rul-o.litmus})
is similar, but instead looks at accesses after the RCU read-side
critical section.
This test's cycle is also forbidden, as can be checked with the \co{herd}
tool.
The reasoning is similar to that for
\cref{lst:memorder:What Happens Before RCU Readers?},
and is left as an exercise for the reader.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-rul-o@whole.fcv}
\caption{What Happens With Empty RCU Readers?}
\label{lst:memorder:What Happens With Empty RCU Readers?}
\end{listing}
\Cref{lst:memorder:What Happens With Empty RCU Readers?}
(\path{C-SB+o-rcusync-o+o-rl-rul-o.litmus})
takes things one step farther, moving \co{P1()}'s \co{WRITE_ONCE()}
to precede the RCU read-side critical section and moving
\co{P1()}'s \co{READ_ONCE()} to follow it, resulting in an
empty RCU read-side critical section.
Perhaps surprisingly, despite the empty critical section, RCU nevertheless
still manages to forbid the cycle.
This can again be checked using the \co{herd} tool.
Furthermore, the reasoning is once again similar to that for
\cref{lst:memorder:What Happens Before RCU Readers?}.
Recapping, if \co{P1()}'s \co{WRITE_ONCE()} follows the end of a given
grace period, then \co{P1()}'s RCU read-side critical section---and
everything following it---must follow the beginning of that same grace
period.
Similarly, if \co{P1()}'s \co{READ_ONCE()} precedes the beginning of a
given grace period, then \co{P1()}'s RCU read-side critical section---and
everything preceding it---must precede the end of that same grace period.
In both cases, the critical section's emptiness is irrelevant.
\QuickQuiz{
Wait a minute!
In QSBR implementations of RCU, no code is emitted for
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
This means that the RCU read-side critical section in
\cref{lst:memorder:What Happens With Empty RCU Readers?}
isn't just empty, it is completely nonexistent!!!
So how can something that doesn't exist at all possibly have
any effect whatsoever on ordering???
}\QuickQuizAnswer{
Because in QSBR, RCU read-side critical sections don't
actually disappear.
Instead, they are extended in both directions until a quiescent
state is encountered.
For example, in the Linux kernel, the critical section might
be extended back to the most recent \co{schedule()} call and
ahead to the next \co{schedule()} call.
Of course, in non-QSBR implementations, \co{rcu_read_lock()}
and \co{rcu_read_unlock()} really do emit code, which can clearly
provide ordering.
And within the Linux kernel, even the QSBR implementation
has a compiler \co{barrier()} in \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, which is necessary to prevent
the compiler from moving memory accesses that might result
in page faults into the RCU read-side critical section.
Therefore, strange though it might seem, empty RCU read-side
critical sections really can and do provide some degree of
ordering.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-o@whole.fcv}
\caption{What Happens With No RCU Readers?}
\label{lst:memorder:What Happens With No RCU Readers?}
\end{listing}
This situation leads to the question of what happens if
\co{rcu_read_lock()} and \co{rcu_read_unlock()} are omitted
entirely, as shown in
\cref{lst:memorder:What Happens With No RCU Readers?}
(\path{C-SB+o-rcusync-o+o-o.litmus}).
As can be checked with \co{herd}, this litmus test's cycle is allowed,
that is, both instances of \co{r2} can have final values of zero.
This might seem strange in light of the fact that empty RCU
read-side critical sections can provide ordering.
And it is true that QSBR implementations of RCU would in fact forbid
this outcome, due to the fact that there is no quiescent state anywhere
in \co{P1()}'s function body, so that \co{P1()} would run
within an implicit RCU read-side critical section.
However, RCU also has non-QSBR implementations, which have no implied
RCU read-side critical section, and in turn no way for RCU to enforce
ordering.
Therefore, this litmus test's cycle is allowed.
\QuickQuiz{
Can \co{P1()}'s accesses be reordered in the litmus tests shown in
\cref{lst:memorder:What Happens Before RCU Readers?,%
lst:memorder:What Happens After RCU Readers?,%
lst:memorder:What Happens With Empty RCU Readers?}
in the same way that they were reordered going from
\cref{lst:memorder:RCU Fundamental Property}
to
\cref{lst:memorder:RCU Fundamental Property and Reordering}?
}\QuickQuizAnswer{
No, because none of these later litmus tests have more than one
access within their RCU read-side critical sections.
But what about swapping the accesses, for example, in
\cref{lst:memorder:What Happens Before RCU Readers?},
placing \co{P1()}'s \co{WRITE_ONCE()} within its critical
section and the \co{READ_ONCE()} before its critical section?
Swapping the accesses allows both instances of \co{r2} to
have a final value of zero, in other words, although RCU read-side
critical sections' ordering properties can extend outside of
those critical sections, the same is not true of their
reordering properties.
Checking this with \co{herd} and explaining why is left as an
exercise for the reader.
}\QuickQuizEnd
\subsubsection{Multiple RCU Readers and Updaters}
\label{sec:memorder:Multiple RCU Readers and Updaters}
Because \co{synchronize_rcu()} has ordering semantics that are at least
as strong as \co{smp_mb()}, no matter how many processes there are in
an SB litmus test
(such as \cref{lst:memorder:RCU Updaters Provide Full Ordering}),
placing \co{synchronize_rcu()} between each process's
accesses prohibits the cycle.
In addition, the cycle is prohibited in an SB test where one process
uses \co{synchronize_rcu()} and the other uses \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, as shown by
\cref{lst:memorder:RCU Fundamental Property}.
However, if both processes use \co{rcu_read_lock()} and
\co{rcu_read_unlock()}, the cycle will be allowed, as shown by
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}.
Is it possible to say anything general about which RCU-protected
litmus tests will be prohibited and which will be allowed?
This section takes up that question.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{One RCU Grace Period and Two Readers}
\label{lst:memorder:One RCU Grace Period and Two Readers}
\end{listing}
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv}
\caption{Two RCU Grace Periods and Two Readers}
\label{lst:memorder:Two RCU Grace Periods and Two Readers}
\end{listing}
More specifically, what if the litmus test has one RCU grace
period and two RCU readers, as shown in
\cref{lst:memorder:One RCU Grace Period and Two Readers}?
The \co{herd} tool says that this cycle is allowed, but it would be
good to know \emph{why}.\footnote{
Especially given that Paul changed his mind several times about
this particular litmus test when working with \ppl{Jade}{Alglave} to
generalize RCU ordering semantics.}
\begin{figure*}
\centering
\resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2R}}
\caption{Cycle for One RCU Grace Period and Two RCU Readers}
\label{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers}
\end{figure*}
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/RCU2G2R}}
\caption{No Cycle for Two RCU Grace Periods and Two RCU Readers}
\label{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}
\end{figure*}
The key point is that even strongly ordered CPUs such as x86 can
and will reorder \co{P1()}'s and \co{P2()}'s \co{WRITE_ONCE()} and
\co{READ_ONCE()}.
With that reordering,
\cref{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers}
shows how the cycle forms:
\begin{enumerate}
\item \co{P0()}'s read from \co{x1} precedes \co{P1()}'s write, as
depicted by the dashed arrow near the bottom of the diagram.
\item Because \co{P1()}'s write follows the end of \co{P0()}'s grace period,
\co{P1()}'s read from \co{x2} cannot precede the beginning of
\co{P0()}'s grace period.
\item \co{P1()}'s read from \co{x2} precedes \co{P2()}'s write.
\item Because \co{P2()}'s write to \co{x2} precedes the end of
\co{P0()}'s grace period, it is completely legal for \co{P2()}'s
read from \co{x0} to precede the beginning of \co{P0()}'s grace period.
\item Therefore, \co{P2()}'s read from \co{x0} can precede \co{P0()}'s
write, thus allowing the cycle to form.
\end{enumerate}
But what happens when another grace period is added?
This situation is shown in
\cref{lst:memorder:Two RCU Grace Periods and Two Readers},
an SB litmus test in which \co{P0()} and \co{P1()} have RCU grace periods
and \co{P2()} and \co{P3()} have RCU readers.
Again, the CPUs can reorder the accesses within RCU read-side critical
sections, as shown in
\cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}.
For this cycle to form, \co{P2()}'s critical section must
end after \co{P1()}'s grace period and \co{P3()}'s must end after the
beginning of that same grace period, which happens to also be after the
end of \co{P0()}'s grace period.
Therefore, \co{P3()}'s critical section must start after the beginning
of \co{P0()}'s grace period, which in turn means that \co{P3()}'s
read from \co{x0} cannot possibly precede \co{P0()}'s write.
Therefore, the cycle is forbidden because RCU read-side critical sections
cannot span full RCU grace periods.
However, a closer look at
\cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}
makes it clear that adding a third reader would allow the cycle.
This is because this third reader could end before the end of \co{P0()}'s
grace period, and thus start before the beginning of that same grace
period.
This in turn suggests the general rule, which is:
In these sorts of RCU-only litmus tests, if there are at least as many
RCU grace periods as there are RCU read-side critical sections,
the cycle is forbidden.\footnote{
Interestingly enough, Alan Stern proved that within the context
of LKMM, the two-part fundamental property of RCU expressed
in \cref{sec:defer:RCU Fundamentals} actually implies
this seemingly more general result, which is called the RCU
axiom~\cite{Alglave:2018:FSC:3173162.3177156}.}
\subsubsection{RCU and Other Ordering Mechanisms}
\label{sec:memorder:RCU and Other Ordering Mechanisms}
But what about litmus tests that combine RCU with other ordering
mechanisms?
The general rule is that it takes only one mechanism to forbid a cycle.
For example, refer back to
\cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}.
Applying the general rule from the previous section, because this litmus
test has two RCU read-side critical sections and no RCU grace periods,
the cycle is allowed.
But what if \co{P0()}'s \co{WRITE_ONCE()} is replaced by an
\co{smp_store_release()} and \co{P1()}'s \co{READ_ONCE()} is replaced
by an \co{smp_load_acquire()}?
RCU would still allow the cycle, but the release-acquire pair would
forbid it.
Because it only takes one mechanism to forbid a cycle, the release-acquire
pair would prevail, thus forbidding the cycle.
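In other words, the modified test would look something like the
following hand-written sketch, which uses assumed variable names
\co{x0} and \co{x1} and approximates, but is not, one of the litmus
tests distributed with this book's code samples:
\begin{VerbatimN}
void P0(void)
{
	int r1;

	rcu_read_lock();
	r1 = READ_ONCE(x0);
	smp_store_release(&x1, 1);  /* This release ...               */
	rcu_read_unlock();
}

void P1(void)
{
	int r1;

	rcu_read_lock();
	r1 = smp_load_acquire(&x1); /* ... pairs with this acquire,   */
	WRITE_ONCE(x0, 1);          /* forbidding the cycle in which  */
	rcu_read_unlock();          /* both instances of r1 equal 1.  */
}
\end{VerbatimN}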
\begin{figure*}
\centering
\resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2Rmb}}
\caption{Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier}
\label{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier}
\end{figure*}
For another example, refer back to
\cref{lst:memorder:One RCU Grace Period and Two Readers}.
Because this litmus test has two RCU readers but only one grace period,
its cycle is allowed.
But suppose that an \co{smp_mb()} was placed between \co{P1()}'s
pair of accesses.
In this new litmus test, because of the addition of the \co{smp_mb()},
\co{P2()}'s as well as \co{P1()}'s critical sections would extend beyond the
end of \co{P0()}'s grace period, which in turn would prevent \co{P2()}'s
read from \co{x0} from preceding \co{P0()}'s write, as depicted by the
red dashed arrow in
\cref{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier}.
In this case, RCU and the \IXh{full}{memory barrier} work together to forbid
the cycle, with RCU preserving ordering between \co{P0()} and both
\co{P1()} and \co{P2()}, and with the \co{smp_mb()} preserving
ordering between \co{P1()} and \co{P2()}.
\QuickQuiz{
What would happen if the \co{smp_mb()} was instead added between
\co{P2()}'s accesses in
\cref{lst:memorder:One RCU Grace Period and Two Readers}?
}\QuickQuizAnswer{
The cycle would again be forbidden.
Further analysis is left as an exercise for the reader.
}\QuickQuizEnd
In short, where RCU's semantics were once purely pragmatic, they are
now fully
formalized~\cite{PaulMcKenney2005RCUSemantics,MathieuDesnoyers2012URCU,AlexeyGotsman2013ESOPRCU,Alglave:2018:FSC:3173162.3177156}.
% \subsection{SRCU}
% \label{sec:memorder:SRCU}
% @@@ After LWN article
% Nesting vs. value passed from \co{srcu_read_lock()} to
% \co{srcu_read_unlock()}.
% When augmented by \co{smp_mb__after_srcu_read_unlock()}.
\subsection{Higher-Level Primitives:
Discussion}
\label{sec:memorder:Higher-Level Primitives: Discussion}
It is quite beneficial to verify code in terms of a higher-level primitive
instead of in terms of the low-level memory accesses used in a particular
implementation of that primitive.
First, this allows code using those primitives to be
verified against an abstract representation of those primitives,
thus making that code less vulnerable to implementation changes.
Second, partitioning the verification at API boundaries results in
combinatorial implosion, greatly reducing the overhead of formal
verification.
It is hoped that verifying against detailed semantics for higher-level
primitives will greatly increase the effectiveness of static analysis
and model checking.
\section{Hardware Specifics}
\label{sec:memorder:Hardware Specifics}
\OriginallyPublished{sec:memorder:Hardware Specifics}{Memory-Barrier Instructions For Specific CPUs}{Linux Journal}{PaulMcKenney2005i,PaulMcKenney2005j}
%
\epigraph{Rock beats paper!}{Derek Williams}
Each CPU family has its own peculiar approach to memory ordering, which
can make portability a challenge, as you can see in
\cref{tab:memorder:Summary of Memory Ordering}.
\begin{table*}[tb] % @@@ Omitting 'p' prevents unordered floats in 2c builds
\rowcolors{4}{}{lightgray}
\small
\centering
\newcommand{\cpufml}[1]{\begin{picture}(6,50)(0,0)\rotatebox{90}{#1}\end{picture}}
\renewcommand*{\arraystretch}{1.2}\OneColumnHSpace{-.35in}
\ebresizewidth{
\begin{tabular}{llccccccccc}
\toprule
\multicolumn{2}{l}{~} & \multicolumn{9}{c}{CPU Family} \\
\cmidrule{3-11}
\multicolumn{2}{c}{\raisebox{.5ex}{Property}}
& \cpufml{Alpha}
& \cpufml{\ARMv7-A/R}
& \cpufml{\ARMv8}
& \cpufml{Itanium}
& \cpufml{MIPS}
& \cpufml{\Power{}}
& \cpufml{SPARC TSO}
& \cpufml{x86}
& \cpufml{z~Systems}
\\
\cmidrule(r){1-2} \cmidrule{3-11}
% Alpha ARMv8 ARMv7 Itanium MIPS PPC SPARC x86 z Systems
\cellcolor{white}
Memory Ordering
& Loads Reordered After Loads or Stores?
& Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\
& Stores Reordered After Stores?
& Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\
\cellcolor{white}
& Stores Reordered After Loads?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
& \parbox[c][6ex]{2in}{\raggedright Atomic Instructions Reordered With\par Loads or Stores?}
& Y & Y & Y & ~ & Y & Y & ~ & ~ & ~ \\
\cellcolor{white}
& Dependent Loads Reordered?
& Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
& Dependent Stores Reordered?
& ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\
\cellcolor{white}
& Non-Sequentially Consistent?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
& Non-Multicopy Atomic?
& Y & Y & Y & Y & Y & Y & Y & Y & ~ \\
\cellcolor{white}
& Non-Other-Multicopy Atomic?
& Y & Y & ~ & Y & Y & Y & ~ & ~ & ~ \\
& Non-Cache Coherent?
& ~ & ~ & ~ & Y & ~ & ~ & ~ & ~ & ~ \\
\cmidrule(r){1-2} \cmidrule{3-11}
\cellcolor{white}
Instructions
& Load-Acquire/Store-Release?
& F & F & i & I & F & b & ~ & ~ & ~ \\
& Atomic RMW Instruction Type?
& L & L & L & C & L & L & C & C & C \\
\cellcolor{white}
& Incoherent Instruction Cache/Pipeline?
& Y & Y & Y & Y & Y & Y & Y & Y & Y \\
\bottomrule
\end{tabular}
}
\vspace{5pt}\hfill
\ebresizeverb{.6}{
\framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt}
\rowcolors{1}{}{}
\renewcommand*{\arraystretch}{1}
\begin{tabular}{llcl}
{ \bf Key: }
& \multicolumn{3}{l}{Load-Acquire/Store-Release?} \\
~ & ~ & b: & Lightweight memory barrier \\
~ & ~ & F: & Full memory barrier \\
~ & ~ & i: & Instruction with lightweight ordering \\
~ & ~ & I: & Instruction with heavyweight ordering \\
~ &\multicolumn{3}{l}{Atomic RMW Instruction Type?} \\
~ & ~ & C: & Compare-and-exchange instruction \\
~ & ~ & L: & Load-linked/store-conditional instruction \\
\end{tabular}
}
}\OneColumnHSpace{-0.4in}
\caption{Summary of Memory Ordering}
\label{tab:memorder:Summary of Memory Ordering}
\end{table*}
In fact, some software environments simply prohibit
direct use of memory-ordering operations, restricting the programmer
to mutual-exclusion primitives that incorporate them to the extent that
they are required.
Please note that this section is not intended to be a reference manual
covering all (or even most) aspects of each CPU family, but rather
a high-level overview providing a rough comparison.
For full details, see the reference manual for the CPU of interest.
Getting back to
\cref{tab:memorder:Summary of Memory Ordering},
the first group of rows looks at memory-ordering
properties and the second group looks at instruction properties.
Please note that these properties hold at the machine-instruction
level.
Compilers can and do reorder far more aggressively than does hardware.
Use marked accesses such as \co{READ_ONCE()} and \co{WRITE_ONCE()}
to constrain the compiler's optimizations and prevent undesirable
reordering.
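For example, consider the following busy-wait sketch, in which the
flag name and the \co{do_work()} function are purely illustrative.
The \co{READ_ONCE()} forces the compiler to reload \co{stop_flag} on
each pass through the loop, and the \co{WRITE_ONCE()} prevents the
compiler from tearing or omitting the store:
\begin{VerbatimN}
int stop_flag;

void worker(void)
{
	while (!READ_ONCE(stop_flag))
		do_work();  /* Illustrative placeholder. */
}

void stop_worker(void)
{
	WRITE_ONCE(stop_flag, 1);
}
\end{VerbatimN}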
The first three rows indicate whether a given CPU allows the four
possible combinations of loads and stores to be reordered, as discussed
in
\cref{sec:memorder:Ordering: Why and How?} and
\crefrange{sec:memorder:Load Followed By Load}{sec:memorder:Store Followed By Store}.
The next row (``Atomic Instructions Reordered With Loads or Stores?\@'')
indicates whether a given CPU allows loads and stores
to be reordered with atomic instructions.
The fifth and sixth rows cover reordering of dependent loads and stores,
which were discussed in
\crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}
and which are explained in more detail in
\cref{sec:memorder:Alpha}.
The short version is that Alpha requires memory barriers for readers
as well as updaters of linked data structures.
However, these memory barriers are provided by the Alpha
architecture-specific code in v4.15 and later Linux kernels.
The next row, ``Non-Sequentially Consistent'', indicates whether
the CPU's normal load and store instructions are constrained by
sequential consistency.
Performance considerations have dictated that no modern mainstream
system is sequentially consistent.
The next three rows cover \IXalt{multicopy atomicity}{multi-copy atomic},
which was defined in
\cref{sec:memorder:Multicopy Atomicity}.
The first is full-up (and rare)
\IXalt{multicopy atomicity}{multi-copy atomic}, the second is the
weaker \IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic},
and the third is the weakest
\IXalthy{non-multicopy atomicity}{non-}{multi-copy atomic}.
The next row, ``Non-Cache Coherent'', covers accesses from multiple
threads to a single variable, which was discussed in
\cref{sec:memorder:Cache Coherence}.
The final three rows cover instruction-level choices and issues.
The first row indicates how each CPU implements load-acquire
and store-release, the second row classifies CPUs by atomic-instruction
type, and the third and final row
indicates whether a given CPU has an incoherent
instruction cache and pipeline.
Such CPUs require special instructions be executed for self-modifying
code.
%Parenthesized CPU names indicate modes that are architecturally allowed,
%but rarely used in practice.
The common ``just say no'' approach to memory-ordering operations
can be eminently reasonable where it applies,
but there are environments, such as the Linux kernel, where direct
use of memory-ordering operations is required.
Therefore,
Linux provides a carefully chosen least-common-denominator
set of memory-ordering primitives, which are as follows:
\begin{description}
\item [\tco{smp_mb()}] (\IXh{full}{memory barrier}) that orders both loads and
stores.
This means that loads and stores preceding the memory barrier
will be committed to memory before any loads and stores
following the memory barrier.
\item [\tco{smp_rmb()}] (\IXh{read}{memory barrier}) that orders only loads.
\item [\tco{smp_wmb()}] (\IXh{write}{memory barrier}) that orders only stores.
\item [\tco{smp_mb__before_atomic()}] that forces ordering of accesses
preceding the \co{smp_mb__before_atomic()} against accesses following
a later RMW atomic operation.
This is a noop on systems that fully order atomic RMW operations.
\item [\tco{smp_mb__after_atomic()}] that forces ordering of accesses
preceding an earlier RMW atomic operation against accesses
following the \co{smp_mb__after_atomic()}.
This is also a noop on systems that fully order atomic RMW operations.
\item [\tco{smp_mb__after_spinlock()}] that forces ordering of accesses
preceding a lock acquisition against accesses
following the \co{smp_mb__after_spinlock()}.
This is also a noop on systems that fully order lock acquisitions.
\item [\tco{mmiowb()}] that forces ordering on MMIO writes that are guarded
by global spinlocks, and is more thoroughly described
in a 2016 LWN article on MMIO~\cite{PaulEMcKenney2016LinuxKernelMMIO}.
\end{description}
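For example, the following sketch uses \co{smp_wmb()} and
\co{smp_rmb()} to pass a message from one CPU to another, with
\co{struct message} and the function names being purely illustrative:
\begin{VerbatimN}
struct message {
	int data;
	int ready;
};

void produce(struct message *m)
{
	WRITE_ONCE(m->data, 42);
	smp_wmb();  /* Order data store before flag store. */
	WRITE_ONCE(m->ready, 1);
}

int consume(struct message *m)
{
	while (!READ_ONCE(m->ready))
		cpu_relax();
	smp_rmb();  /* Order flag load before data load. */
	return READ_ONCE(m->data);
}
\end{VerbatimN}
In modern Linux-kernel code, this pattern is more often expressed using
\co{smp_store_release()} and \co{smp_load_acquire()}, which combine the
marked access with the needed ordering.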
The \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()}
primitives also force
the compiler to eschew any optimizations that would have the effect
of reordering memory accesses across the barriers.
\QuickQuiz{
What happens to code between an atomic operation and an
\co{smp_mb__after_atomic()}?
}\QuickQuizAnswer{
First, please don't do this!
But if you do, this intervening code will either be ordered
after the atomic operation or before the
\co{smp_mb__after_atomic()}, depending on the architecture,
but not both.
This also applies to \co{smp_mb__before_atomic()} and
\co{smp_mb__after_spinlock()}, that is, both the uncertain
ordering of the intervening code and the plea to avoid such code.
}\QuickQuizEnd
These primitives generate code only in SMP kernels; however, several
have counterparts (\co{mb()}, \co{rmb()}, and \co{wmb()},
respectively) that generate a memory barrier even in UP kernels.
The \co{smp_} versions should be used in most cases.
However, these latter primitives are useful when writing drivers,
because MMIO accesses must remain ordered even in UP kernels.
In the absence of memory-ordering operations, both CPUs and compilers would
happily rearrange these accesses, which at best would make the device
act strangely, and could crash your kernel or even damage your hardware.
So most kernel programmers need not worry about the memory-ordering
peculiarities of each and every CPU, as long as they stick to these
interfaces and to the fully ordered atomic operations.\footnote{
For a full list, expand the patterns in
\path{Documentation/atomic_t.txt}.}
If you are working deep in a given CPU's architecture-specific code,
of course, all bets are off.
Furthermore,
all of Linux's locking primitives (spinlocks, reader-writer locks,
semaphores, RCU, \ldots) include any needed ordering primitives.
So if you are working with code that uses these primitives properly,
you need not worry about Linux's memory-ordering primitives.
That said, deep knowledge of each CPU's
\IXalt{memory-consistency model}{memory model}
can be very helpful when debugging, to say nothing of when writing
architecture-specific code or synchronization primitives.
Besides, they say that a little knowledge is a very dangerous thing.
Just imagine the damage you could do with a lot of knowledge!
For those who wish to understand more about individual CPUs'
memory consistency models, the next sections describe those of a few
popular and prominent CPUs.
Although there is no substitute for actually reading a given CPU's
documentation, these sections do give a good overview.
\subsection{Alpha}
\label{sec:memorder:Alpha}
It may seem strange to say much of anything about a CPU whose end of life
has long since passed, but Alpha is interesting because it is the only
mainstream CPU that reorders dependent loads, and has thus had outsized
influence on concurrency APIs, including within the Linux kernel.
The need for core Linux-kernel code to accommodate Alpha ended
with version v4.15 of the Linux kernel, and all traces of this
accommodation were removed in v5.9 with the removal of the
\apikh{smp_read_barrier_depends()} and \co{read_barrier_depends()} APIs.
This section is nevertheless retained in the Third Edition
because here in early 2023 a few Linux kernel hackers are still
working on pre-v4.15 versions of the Linux kernel.
In addition, the modifications to \co{READ_ONCE()} that permitted
these APIs to be removed have not necessarily propagated to all
userspace projects that might still support Alpha.
\begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
The dependent-load difference between Alpha and the other CPUs is
illustrated by the code shown in
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}.
The \co{smp_store_release()} on \clnref{add}
guarantees that the element initialization
in \clnrefrange{init:b}{init:e} is executed before the element is added to the
list, so that the lock-free search will work correctly.
That is, it makes this guarantee on all CPUs {\em except} Alpha.
\end{fcvref}
\begin{listing}
\begin{fcvlabel}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
\begin{VerbatimL}[commandchars=\\\[\]]
void insert(long key, long data)
{
struct el *p;
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_lock(&mutex);
p->next = head.next; \lnlbl[init:b]
p->key = key;
p->data = data; \lnlbl[init:e]
smp_store_release(&head.next, p); \lnlbl[add]
spin_unlock(&mutex);
}
struct el *search(long searchkey)
{
struct el *p;
p = READ_ONCE_OLD(head.next); \lnlbl[h:next]
while (p != &head) {
/* Prior to v4.15, BUG ON ALPHA!!! */ \lnlbl[BUG]
if (p->key == searchkey) { \lnlbl[key]
return (p);
}
p = READ_ONCE_OLD(p->next); \lnlbl[next]
};
return (NULL);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Insert and Lock-Free Search (No Ordering)}
\label{lst:memorder:Insert and Lock-Free Search (No Ordering)}
\end{listing}
\begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)]
Given the pre-v4.15 implementation of \co{READ_ONCE()}, indicated by
\co{READ_ONCE_OLD()} in the listing, Alpha actually allows the code on
\clnref{key} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
to see the old garbage values that were present before the initialization
on \clnrefrange{init:b}{init:e}.
\Cref{fig:memorder:fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
shows how this can happen on
an aggressively parallel machine with partitioned caches, so that
alternating \IXpl{cache line} are processed by the different partitions
of the caches.
For example, the load of \co{head.next} on \clnref{h:next} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
might access cache bank~0,
and the load of \co{p->key} on \clnref{key} and of \co{p->next} on \clnref{next}
might access cache bank~1.
On Alpha, the \co{smp_store_release()} will guarantee that the cache
invalidations performed by \clnrefrange{init:b}{init:e} of
\cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}
(for \co{p->next}, \co{p->key}, and \co{p->data}) will reach
the interconnect before that of \clnref{add} (for \co{head.next}), but
makes absolutely no guarantee about the order of
propagation through the reading CPU's cache banks.
For example, it is possible that the reading CPU's cache bank~1 is very
busy, but cache bank~0 is idle.
This could result in the cache invalidations for the new element
(\co{p->next}, \co{p->key}, and \co{p->data}) being
delayed, so that the reading CPU loads the new value for \co{head.next},
but loads the old cached values for \co{p->key} and \co{p->next}.
Yes, this does mean that Alpha can in effect fetch
the data pointed to {\em before} it fetches the pointer itself, strange
but true.
\end{fcvref}
See the documentation~\cite{Compaq01,WilliamPugh2000Gharachorloo}
called out earlier for more information,
or if you think that I am just making all this up.\footnote{
Of course, the astute reader will have already recognized that
Alpha is nowhere near as mean and nasty as it could be,
the (thankfully) mythical architecture in
\cref{sec:app:whymb:Ordering-Hostile Architecture}
being a case in point.}
The benefit of this unusual approach to ordering is that Alpha can use
simpler cache hardware, which in turn permitted higher clock frequencies
in Alpha's heyday.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/Alpha}}
\caption{Why \tco{smp_read_barrier_depends()} is Required in Pre-v4.15 Linux Kernels}
\label{fig:memorder:fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
\end{figure}
One could place an \co{smp_rmb()} primitive
between the pointer fetch and dereference in order to force Alpha
to order the pointer fetch with the later dependent load.
However, this imposes unneeded overhead on systems (such as \ARM,
Itanium, PPC, and SPARC) that respect
\IXalth{address dependencies}{address}{dependency} on the read side.
The \co{smp_read_barrier_depends()} primitive was therefore added to the
Linux kernel to eliminate this overhead on such systems, but it was removed
in v5.9 of the Linux kernel in favor of augmenting Alpha's definition
of \co{READ_ONCE()}.
Thus, as of v5.9, core kernel code no longer needs to concern itself
with this aspect of DEC Alpha.
\begin{fcvref}[ln:memorder:Safe Insert and Lock-Free Search]
However, it is better to use \co{rcu_dereference()}
as shown on \clnref{deref1,deref2} of
\cref{lst:memorder:Safe Insert and Lock-Free Search},
which works safely and efficiently for all recent kernel versions.
\end{fcvref}
It is also possible to implement a software mechanism
that could be used in place of \co{smp_store_release()} to force
all reading CPUs to see the writing CPU's writes in order.
This software barrier could be implemented by sending \IXacrfpl{ipi}
to all other CPUs.
Upon receipt of such an IPI, a CPU would execute a memory-barrier
instruction, implementing a system-wide memory barrier similar to that
provided by the Linux kernel's \co{sys_membarrier()} system call.
Additional logic is required to avoid deadlocks.
Of course, CPUs that respect address dependencies would define such a barrier
to simply be \co{smp_store_release()}.
However, this approach was deemed by the Linux community
to impose excessive overhead~\cite{McKenney01f}, and to their point would
be completely inappropriate for systems having
aggressive real-time response requirements.
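For reference, a userspace application can request a similar system-wide
memory barrier from recent Linux kernels via the \co{membarrier()}
system call, as in the following minimal sketch, which assumes a kernel
and headers providing \co{MEMBARRIER_CMD_GLOBAL} and which omits error
checking:
\begin{VerbatimN}
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal sketch: system-wide memory barrier. */
static int global_memory_barrier(void)
{
	return syscall(__NR_membarrier,
		       MEMBARRIER_CMD_GLOBAL, 0);
}
\end{VerbatimN}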
\begin{listing}
\begin{fcvlabel}[ln:memorder:Safe Insert and Lock-Free Search]
\begin{VerbatimL}[commandchars=\\\[\]]
void insert(long key, long data)
{
struct el *p;
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_lock(&mutex);
p->next = head.next;
p->key = key;
p->data = data;
smp_store_release(&head.next, p);
spin_unlock(&mutex);
}
struct el *search(long searchkey)
{
struct el *p;
p = rcu_dereference(head.next); \lnlbl[deref1]
while (p != &head) {
if (p->key == searchkey) {
return (p);
}
p = rcu_dereference(p->next); \lnlbl[deref2]
};
return (NULL);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Safe Insert and Lock-Free Search}
\label{lst:memorder:Safe Insert and Lock-Free Search}
\end{listing}
The Linux memory-barrier primitives took their names from the Alpha
instructions, so \co{smp_mb()} is \co{mb}, \co{smp_rmb()} is \co{rmb},
and \co{smp_wmb()} is \co{wmb}.
Alpha is the only CPU whose \co{READ_ONCE()} includes an \co{smp_mb()}.
\QuickQuizSeries{%
\QuickQuizB{
Why does Alpha's \co{READ_ONCE()} include an
\co{mb()} rather than \co{rmb()}?
}\QuickQuizAnswerB{
Alpha has only \co{mb} and \co{wmb} instructions,
so \co{smp_rmb()} would be implemented by the Alpha \co{mb}
instruction in either case.
In addition, at the time that the Linux kernel started relying on
dependency ordering, it was not clear that Alpha ordered dependent
stores, and thus \co{smp_mb()} was the safe choice.
However, given the aforementioned v5.9 changes to \co{READ_ONCE()}
and a few of Alpha's atomic read-modify-write operations,
no Linux-kernel core code need concern itself with DEC Alpha,
thus greatly reducing Paul E.~McKenney's incentive to remove
Alpha support from the kernel.
}\QuickQuizEndB
%
\QuickQuizE{
Isn't DEC Alpha significant as having the weakest possible
memory ordering?
}\QuickQuizAnswerE{
Although DEC Alpha does take considerable flak, it does avoid
reordering reads from the same CPU to the same variable.
It also avoids the out-of-thin-air problem that plagues
the Java and C11 memory
models~\cite{Boehm:2014:OGA:2618128.2618134,conf/esop/BattyMNPS15,MarkBatty2013OOTA-WorkingNote,HansBoehm2020ConcurrentUB,DavidGoldblatt2019NoElegantOOTAfix,AlanJeffrey2014JavaDRF,PaulEMcKenney2020RelaxedGuideRelaxed,PaulEMcKenney2016OOTA,Sevcik:2011:SOS:1993316.1993534,Vafeiadis:2015:CCO:2775051.2676995}.
}\QuickQuizEndE
}
For more on Alpha, see its reference manual~\cite{ALPHA2002}.
\subsection{\ARMv7-A/R}
\label{sec:memorder:ARMv7-A/R}
The \ARM\ family of CPUs is popular in deep embedded applications,
particularly for power-constrained microcontrollers.
Its \IXalthmr{memory model}{Armv7}{memory model} is similar to that of \Power{}
(see \cref{sec:memorder:POWER / PowerPC}), but \ARM\ uses a
different set of memory-barrier instructions~\cite{ARMv7A:2010}:
\begin{description}
\item [\tco{DMB}] (data memory barrier) causes the specified type of
operations to \emph{appear} to have completed before any
subsequent operations of the same type.
The ``type'' of operations can be all operations or can be
restricted to only writes (similar to the Alpha \co{wmb}
and the \Power{} \co{eieio} instructions).
In addition, \ARM\ allows \IX{cache coherence} to have one of three
scopes:
single processor, a subset of the processors
(``inner''), and global (``outer'').
\item [\tco{DSB}] (data synchronization barrier) causes the specified
type of operations to actually complete before any subsequent
operations (of any type) are executed.
The ``type'' of operations is the same as that of \co{DMB}.
The \co{DSB} instruction was called \co{DWB} (drain write buffer
or data write barrier, your choice) in early versions of the
\ARM\ architecture.
\item [\tco{ISB}] (instruction synchronization barrier) flushes the CPU
pipeline, so that all instructions following the \co{ISB}
are fetched only after the \co{ISB} completes.
For example, if you are writing a self-modifying program
(such as a JIT), you should execute an \co{ISB} between
generating the code and executing it.
\end{description}
None of these instructions exactly match the semantics of Linux's
\co{rmb()} primitive, which must therefore be implemented as a full
\co{DMB}.
The \co{DMB} and \co{DSB} instructions have a recursive definition
of accesses ordered before and after the barrier, which has an effect
similar to that of \Power{}'s cumulativity, both of which are
stronger than \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}'s
cumulativity described in
\cref{sec:memorder:Cumulativity}.
\ARM\ also implements \IXalth{control dependencies}{control}{dependency},
so that if a conditional
branch depends on a load, then any store executed after that conditional
branch will be ordered after the load.
However, loads following the conditional branch will \emph{not}
be guaranteed to be ordered unless there is an \co{ISB}
instruction between the branch and the load.
Consider the following example:
\begin{fcvlabel}[ln:memorder:ARM:load-store control dependency]
\begin{VerbatimN}[commandchars=\\\[\]]
r1 = x; \lnlbl[x]
if (r1 == 0) \lnlbl[if]
nop(); \lnlbl[nop]
y = 1; \lnlbl[y]
r2 = z; \lnlbl[z1]
ISB(); \lnlbl[isb]
r3 = z; \lnlbl[z2]
\end{VerbatimN}
\end{fcvlabel}
\begin{fcvref}[ln:memorder:ARM:load-store control dependency]
In this example, load-store control dependency ordering causes
the load from \co{x} on \clnref{x} to be ordered before the store to
\co{y} on \clnref{y}.
However, \ARM\ does not respect load-load control dependencies, so that
the load on \clnref{x} might well happen \emph{after} the
load on \clnref{z1}.
On the other hand, the combination of the conditional branch on \clnref{if}
and the \co{ISB} instruction on \clnref{isb} ensures that
the load on \clnref{z2} happens after the load on \clnref{x}.
Note that inserting an additional \co{ISB} instruction somewhere between
\clnref{if,z1} would enforce ordering between \clnref{x,z1}.
\end{fcvref}
\subsection{\ARMv8}
\label{sec:memorder:ARMv8}
\begin{figure}
\centering
\resizebox{2in}{!}{\includegraphics{cartoons/r-2014-LDLAR}}
\caption{Half Memory Barrier}
\ContributedBy{fig:memorder:Half Memory Barrier}{Melissa Brossard}
\end{figure}
\ARM's \ARMv8 CPU family~\cite{ARMv8A:2017}
includes 64-bit capabilities,
in contrast to their 32-bit-only CPU described in
\cref{sec:memorder:ARMv7-A/R}.
\IXalthmr{\ARMv8's memory model}{Armv8}{memory model}
closely resembles its \ARMv7 counterpart,
but adds load-acquire (\co{LDLARB}, \co{LDLARH}, and \co{LDLAR})
and store-release (\co{STLLRB}, \co{STLLRH}, and \co{STLLR})
instructions.
These instructions act as ``half memory barriers'', so that
\ARMv8 CPUs can reorder previous accesses with a later \co{LDLAR}
instruction, but are prohibited from reordering an earlier \co{LDLAR}
instruction with later accesses, as fancifully depicted in
\cref{fig:memorder:Half Memory Barrier}.
Similarly, \ARMv8 CPUs can reorder an earlier \co{STLLR} instruction with
a subsequent access, but are prohibited from reordering
previous accesses with a later \co{STLLR} instruction.
As one might expect, this means that these instructions directly support
the C11 notion of load-acquire and store-release.
However, \ARMv8 goes well beyond the C11 memory model by mandating that
the combination of a store-release and load-acquire act as a full
barrier under certain circumstances.
For example, in \ARMv8, given a store followed by a store-release followed
by a load-acquire followed by a load, all to different variables and all from
a single CPU, all CPUs
would agree that the initial store preceded the final load.
Interestingly enough, most TSO architectures (including x86 and the
mainframe) do not make this guarantee, as the two loads could be
reordered before the two stores.
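Expressed in Linux-kernel C with hypothetical variables, the sequence
in question looks something like the following.
On \ARMv8, the release-acquire pair in the middle acts as a full
barrier for this sequence, but portable code must not rely on that
property:
\begin{VerbatimN}
WRITE_ONCE(a, 1);           /* Plain store.   */
smp_store_release(&b, 1);   /* Store-release. */
r1 = smp_load_acquire(&c);  /* Load-acquire.  */
r2 = READ_ONCE(d);          /* Plain load.    */
\end{VerbatimN}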
\ARMv8 is one of only two architectures that need the
\co{smp_mb__after_spinlock()} primitive to be a full barrier,
due to its relatively weak lock-acquisition implementation in
the Linux kernel.
\ARMv8 also has the distinction of being the first CPU whose vendor publicly
defined its memory ordering with an executable formal model~\cite{ARMv8A:2017}.
\subsection{Itanium}
\label{sec:memorder:Itanium}
Itanium offers a \IXalth{weak consistency}{weak}{memory consistency}
\IXalthmr{model}{Itanium}{memory model}, so that in absence of explicit
memory-barrier instructions or dependencies, Itanium is within its rights
to arbitrarily reorder memory references~\cite{IntelItanium02v2}.
Itanium has a memory-fence instruction named \co{mf}, but also has
``half-memory fence'' modifiers to loads, stores, and to some of its atomic
instructions~\cite{IntelItanium02v3}.
The \co{acq} modifier prevents subsequent memory-reference instructions
from being reordered before the \co{acq}, but permits
prior memory-reference instructions to be reordered after the \co{acq},
similar to the \ARMv8 load-acquire instructions.
Similarly, the \co{rel} modifier prevents prior memory-reference
instructions from being reordered after the \co{rel}, but allows
subsequent memory-reference instructions to be reordered before
the \co{rel}.
These half-memory fences are useful for critical sections, since
it is safe to push operations into a critical section, but can be
fatal to allow them to bleed out.
However, as one of the few CPUs with this property, Itanium at one
time defined Linux's semantics of memory ordering associated with lock
acquisition and release.\footnote{
PowerPC is now the architecture with this dubious privilege.}
Oddly enough, actual Itanium hardware is rumored to implement
both load-acquire and store-release instructions as full barriers.
Nevertheless, Itanium was the first mainstream CPU to introduce the concept
(if not the reality) of load-acquire and store-release into its
instruction set.
\QuickQuiz{
Given that hardware can have a half memory barrier, why don't
locking primitives allow the compiler to move memory-reference
instructions into lock-based critical sections?
}\QuickQuizAnswer{
In fact, as we saw in \cref{sec:memorder:ARMv8} and will
see in \cref{sec:memorder:POWER / PowerPC}, hardware really does
implement partial memory-ordering instructions and it also turns
out that these really are used to construct locking primitives.
However, these locking primitives use full compiler barriers,
thus preventing the compiler from reordering memory-reference
instructions both out of and into the corresponding critical
section.
\begin{listing}
\begin{fcvlabel}[ln:memorder:synchronize-rcu]
\begin{VerbatimL}[commandchars=\@\[\]]
static inline int rcu_gp_ongoing(unsigned long *ctr)
{
unsigned long v;
v = LOAD_SHARED(*ctr);@lnlbl[load]
return v && (v != rcu_gp_ctr);
}
static void update_counter_and_wait(void)
{
struct rcu_reader *index;
STORE_SHARED(rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR);
barrier();
list_for_each_entry(index, &registry, node) {@lnlbl[loop]
while (rcu_gp_ongoing(&index->ctr))@lnlbl[call2]
msleep(10);
}
}
void synchronize_rcu(void)
{
unsigned long was_online;
was_online = rcu_reader.ctr;
smp_mb();
if (was_online)@lnlbl[if]
STORE_SHARED(rcu_reader.ctr, 0);@lnlbl[store]
mutex_lock(&rcu_gp_lock);@lnlbl[acqmutex]
update_counter_and_wait();@lnlbl[call1]
mutex_unlock(&rcu_gp_lock);
if (was_online)
STORE_SHARED(rcu_reader.ctr, LOAD_SHARED(rcu_gp_ctr));
smp_mb();
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Userspace RCU Code Reordering}
\label{lst:memorder:Userspace RCU Code Reordering}
\end{listing}
To see why the compiler is forbidden from doing reordering that
is permitted by hardware, consider the following sample code
in \cref{lst:memorder:Userspace RCU Code Reordering}.
This code is based on the userspace RCU update-side
code~\cite[Supplementary Materials Figure 5]{MathieuDesnoyers2012URCU}.
\begin{fcvref}[ln:memorder:synchronize-rcu]
Suppose that the compiler reordered \clnref{if,store} into
the critical section starting at \clnref{acqmutex}.
Now suppose that two updaters start executing \co{synchronize_rcu()}
at about the same time.
Then consider the following sequence of events:
\begin{enumerate}
\item CPU~0 acquires the lock at \clnref{acqmutex}.
\item \Clnref{if} determines that CPU~0 was online, so it clears
its own counter at \clnref{store}.
(Recall that \clnref{if,store} have been reordered by the
compiler to follow \clnref{acqmutex}).
\item CPU~0 invokes \co{update_counter_and_wait()} from
\clnref{call1}.
\item CPU~0 invokes \co{rcu_gp_ongoing()} on itself at
\clnref{call2}, and \clnref{load} sees that CPU~0 is
in a quiescent state.
Control therefore returns to \co{update_counter_and_wait()},
and \clnref{loop} advances to CPU~1.
\item CPU~1 invokes \co{synchronize_rcu()}, but because CPU~0
already holds the lock, CPU~1 blocks waiting for this
lock to become available.
Because the compiler reordered \clnref{if,store} to follow
\clnref{acqmutex}, CPU~1 does not clear its own counter,
despite having been online.
\item CPU~0 invokes \co{rcu_gp_ongoing()} on CPU~1 at
\clnref{call2}, and \clnref{load} sees that CPU~1 is
not in a quiescent state.
The \co{while} loop at \clnref{call2} therefore never
exits.
\end{enumerate}
So the compiler's reordering results in a deadlock.
In contrast, hardware reordering is temporary, so that CPU~1
might undertake its first attempt to acquire the mutex on
\clnref{acqmutex} before executing \clnref{if,store}, but it
will eventually execute \clnref{if,store}.
Because hardware reordering only results in a short delay, it
can be tolerated.
On the other hand, because compiler reordering results in a
deadlock, it must be prohibited.
Some research efforts have used hardware transactional memory
to allow compilers to safely reorder more aggressively, but
the overhead of hardware transactions has thus far made
such optimizations unattractive.
% @@@ Citation for compilers use of HTM in this manner?
\end{fcvref}
}\QuickQuizEnd
The Itanium \co{mf} instruction is used for the \co{smp_rmb()},
\co{smp_mb()}, and \co{smp_wmb()} primitives in the Linux kernel.
Despite persistent rumors to the contrary, the \qco{mf} mnemonic stands
for ``memory fence''.
Itanium also offers a global total order for release operations,
including the \co{mf} instruction.
This provides the notion of transitivity, where if a given code fragment
sees a given access as having happened, any later code fragment will
also see that earlier access as having happened.
Assuming, that is, that all the code fragments involved correctly use
memory barriers.
Finally, Itanium is the only architecture supporting the Linux kernel
that can reorder normal loads to the same variable.
The Linux kernel avoids this issue because \co{READ_ONCE()} emits
a \co{volatile} load, which is compiled as a \co{ld,acq} instruction,
which forces ordering of all \co{READ_ONCE()} invocations by a given
CPU, including those to the same variable.
\subsection{MIPS}
The \IXhmr{MIPS}{memory model}~\cite[page~479]{MIPSvII-A-2016}
appears to resemble that of \ARM, Itanium, and \Power{},
being weakly ordered by default, but respecting dependencies.
MIPS has a wide variety of memory-barrier instructions, but ties them
not to hardware considerations, but rather to the use cases provided
by the Linux kernel and the C++11 standard~\cite{RichardSmith2019N4800}
in a manner similar to the \ARMv8 additions:
\begin{description}[style=nextline]
\item[\tco{SYNC}]
Full barrier for a number of hardware operations in addition
to memory references, which is used to implement the v4.13
Linux kernel's \co{smp_mb()} for OCTEON systems.
\item[\tco{SYNC_WMB}]
Write memory barrier, which can be used on OCTEON systems
to implement the
\co{smp_wmb()} primitive in the v4.13 Linux kernel via the
\co{syncw} mnemonic.
Other systems use plain \co{sync}.
\item[\tco{SYNC_MB}]
Full memory barrier, but only for memory operations.
This may be used to implement the
C++ \co{atomic_thread_fence(memory_order_seq_cst)}.
\item[\tco{SYNC_ACQUIRE}]
Acquire memory barrier, which could be used to implement
C++'s \co{atomic_thread_fence(memory_order_acquire)}.
In theory, it could also be used to implement the v4.13 Linux-kernel
\co{smp_load_acquire()} primitive, but in practice
\co{sync} is used instead.
\item[\tco{SYNC_RELEASE}]
Release memory barrier, which may be used to implement
C++'s \co{atomic_thread_fence(memory_order_release)}.
In theory, it could also be used to implement the v4.13 Linux-kernel
\co{smp_store_release()} primitive, but in practice
\co{sync} is used instead.
\item[\tco{SYNC_RMB}]
Read memory barrier, which could in theory be used to implement the
\co{smp_rmb()} primitive in the Linux kernel, except that current
MIPS implementations supported by the v4.13 Linux kernel do not
need an explicit instruction to force ordering.
Therefore, \co{smp_rmb()} instead simply constrains the compiler.
\item[\tco{SYNCI}]
Instruction-cache synchronization, which is used in conjunction with
other instructions to allow self-modifying code, such as that produced
by just-in-time (JIT) compilers.
\end{description}
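As a reminder of what these fence-based C++11 orderings provide, the
following hypothetical C11 fragment uses a release fence and an acquire
fence to pass a payload using relaxed atomic accesses, with the variable
and function names being purely illustrative:
\begin{VerbatimN}
#include <stdatomic.h>

int payload;
_Atomic int ready;

void producer(void)
{
	payload = 42;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1,
			      memory_order_relaxed);
}

int consumer(void)
{
	while (!atomic_load_explicit(&ready,
				     memory_order_relaxed))
		continue;
	atomic_thread_fence(memory_order_acquire);
	return payload;
}
\end{VerbatimN}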
Informal discussions with MIPS architects indicate that MIPS has a
definition of transitivity or cumulativity similar to that of
\ARM\ and \Power{}\@.
However, it appears that different MIPS implementations can have
different memory-ordering properties, so it is important to consult
the documentation for the specific MIPS implementation you are using.
\subsection{\Power{} / PowerPC}
\label{sec:memorder:POWER / PowerPC}
The \Power{} and PowerPC CPU families have a wide variety of memory-barrier
instructions~\cite{PowerPC94,MichaelLyons05a}:
\begin{description}
\item [\tco{sync}] causes all preceding operations to {\em appear to have}
completed before any subsequent operations are started.
This instruction is therefore quite expensive.
\item [\tco{lwsync}] (lightweight sync) orders loads with respect to
subsequent loads and stores, and also orders stores.
However, it does {\em not} order stores with respect to subsequent
loads.
The \co{lwsync} instruction may be used to implement
load-acquire and store-release operations.
		Interestingly enough, the \co{lwsync} instruction enforces
		the same within-CPU ordering as do x86, z~Systems, and,
		coincidentally, SPARC TSO\@.
However, placing the \co{lwsync} instruction between each
pair of memory-reference instructions will \emph{not}
result in x86, z~Systems, or SPARC TSO memory ordering.
On these other systems, if a pair of CPUs independently execute
stores to different variables, all other CPUs will agree on the
order of these stores.
Not so on PowerPC, even with an \co{lwsync} instruction between each
pair of memory-reference instructions, because PowerPC is
\IXalthy{non-multicopy atomic}{non-}{multi-copy atomic}.
\item [\tco{eieio}] (enforce in-order execution of I/O, in case you
were wondering) causes all preceding cacheable stores to appear
to have completed before all subsequent stores.
However, stores to cacheable memory are ordered separately from
stores to non-cacheable memory, which means that \co{eieio}
will not force an MMIO store to precede a spinlock release.
This instruction may well be unique in having a five-vowel mnemonic.
\item [\tco{isync}] forces all preceding instructions to appear to have
completed before any subsequent instructions start execution.
This means that the preceding instructions must have progressed
far enough that any traps they might generate have either happened
or are guaranteed not to happen, and that any side-effects of
these instructions (for example, page-table changes) are seen by the
subsequent instructions.
However, it does \emph{not} force all memory references to be
ordered, only the actual execution of the instruction itself.
Thus, the loads might return old still-cached values and the
\co{isync} instruction does not force values previously stored
to be flushed from the store buffers.
\end{description}
Unfortunately, none of these instructions line up exactly with Linux's
\co{wmb()} primitive, which requires \emph{all} stores to be ordered,
but does not require the other high-overhead actions of the \co{sync}
instruction.
The \co{rmb()} primitive doesn't have a matching light-weight instruction
either.
But there is no choice:
{ppc64} versions of \co{wmb()}, \co{rmb()}, and \co{mb()} are defined
to be the heavyweight \co{sync} instruction.
However, Linux's \co{smp_wmb()} primitive is never used for MMIO
(since a driver must carefully order MMIOs in UP as well as
SMP kernels, after all), so it is defined to be the lighter weight
\co{eieio} or \co{lwsync} instruction~\cite{PaulEMcKenney2016LinuxKernelMMIO}.
The \co{smp_mb()} primitive is also defined to be the \co{sync}
instruction, while \co{smp_rmb()} is defined to be the lighter-weight
\co{lwsync} instruction.
\Power{} features ``cumulativity'', which can be used to obtain
transitivity.
When used properly, any code seeing the results of an earlier
code fragment will also see the accesses that this earlier code
fragment itself saw.
Much more detail is available from
McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}.
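Cumulativity can be illustrated by the classic write-to-read causality
(WRC) pattern, sketched below in Linux-kernel C with hypothetical
variables.
If \co{r1} and \co{r2} both end up equal to 1, cumulativity guarantees
that \co{r3} also equals 1, because the release store on CPU~1
propagates CPU~0's store along with its own:
\begin{VerbatimN}
/* CPU 0 */
WRITE_ONCE(x, 1);

/* CPU 1 */
r1 = READ_ONCE(x);
smp_store_release(&y, 1);

/* CPU 2 */
r2 = smp_load_acquire(&y);
r3 = READ_ONCE(x);
\end{VerbatimN}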
\Power{} respects control dependencies in much the same way that \ARM\
does, with the exception that the \Power{} \co{isync} instruction
is substituted for the \ARM\ \co{ISB} instruction.
Like \ARMv8, \Power{} requires \co{smp_mb__after_spinlock()} to be
a full memory barrier.
In addition, \Power{} is the only architecture requiring
\co{smp_mb__after_unlock_lock()} to be a full memory barrier.
In both cases, this is because of the weak ordering properties
of \Power{}'s locking primitives, due to the use of the \co{lwsync}
instruction to provide ordering for both acquisition and release.
Many members of the \Power{} architecture have incoherent instruction
caches, so that a store to memory will not necessarily be reflected
in the instruction cache.
Thankfully, few people write self-modifying code these days, but JITs
and compilers do it all the time.
Furthermore, recompiling a recently run program looks just like
self-modifying code from the CPU's viewpoint.
The \co{icbi} instruction (instruction cache block invalidate)
invalidates a specified cache line from
the instruction cache, and may be used in these situations.
\subsection{SPARC TSO}
Although SPARC's TSO (total-store order) is used by both Linux and
Solaris, the architecture also defines PSO (partial store order) and RMO
(relaxed-memory order).
Any program that runs in RMO will also run in either PSO or TSO, and similarly,
a program that runs in PSO will also run in TSO\@.
Moving a shared-memory parallel program in the other direction may
require careful insertion of memory barriers.
Although SPARC's PSO and RMO modes are not used much these days, they
did give rise to a very flexible memory-barrier instruction~\cite{SPARC94}
that permits fine-grained control of ordering:
\begin{description}
\item [\tco{StoreStore}] orders preceding stores before subsequent stores.
(This option is used by the Linux \co{smp_wmb()} primitive.)
\item [\tco{LoadStore}] orders preceding loads before subsequent stores.
\item [\tco{StoreLoad}] orders preceding stores before subsequent loads.
\item [\tco{LoadLoad}] orders preceding loads before subsequent loads.
(This option is used by the Linux \co{smp_rmb()} primitive.)
\item [\tco{Sync}] fully completes all preceding operations before starting
any subsequent operations.
\item [\tco{MemIssue}] completes preceding memory operations before subsequent
memory operations, important for some instances of memory-mapped
I/O.
\item [\tco{Lookaside}] does the same as MemIssue,
but only applies to preceding stores
and subsequent loads, and even then only for stores and loads that
access the same memory location.
\end{description}
So, why is \qco{membar #MemIssue} needed?
Because a \qco{membar #StoreLoad} could permit a subsequent
load to get its value from a store buffer, which would be
disastrous if the write was to an MMIO register that induced side effects
on the value to be read.
In contrast, \qco{membar #MemIssue} would wait until the store buffers
were flushed before permitting the loads to execute,
thereby ensuring that the load actually gets its value from the MMIO register.
Drivers could instead use \qco{membar #Sync}, but the lighter-weight
\qco{membar #MemIssue} is preferred in cases where the additional functionality
of the more-expensive \qco{membar #Sync} is not required.
The \qco{membar #Lookaside} is a lighter-weight version of
\qco{membar #MemIssue}, which is useful when writing to a given MMIO register
affects the value that will next be read from that register.
However, the heavier-weight \qco{membar #MemIssue} must be used when
a write to a given MMIO register affects the value that will next be
read from {\em some other} MMIO register.
SPARC requires a \co{flush} instruction be used between the time that
the instruction stream is modified and the time that any of these
instructions are executed~\cite{SPARC94}.
This is needed to flush any prior value for that location from
the SPARC's instruction cache.
Note that \co{flush} takes an address, and will flush only that address
from the instruction cache.
On SMP systems, all CPUs' caches are flushed, but there is no
convenient way to determine when the off-CPU flushes complete,
though there is a reference to an implementation note.
But again, the Linux kernel runs SPARC in TSO mode, so
all of the above \co{membar} variants are strictly of historical
interest.
In particular, the \co{smp_mb()} primitive only needs to use \co{#StoreLoad}
because the other three reorderings are prohibited by TSO\@.
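For example, an \co{smp_mb()} for SPARC running in TSO mode might be
sketched as the following inline assembly.
This is an illustrative sketch rather than the Linux kernel's actual
definition, with the \qco{memory} clobber also acting as a compiler
barrier:
\begin{VerbatimN}
/* Illustrative sketch only. */
static inline void sketch_smp_mb(void)
{
	__asm__ __volatile__("membar #StoreLoad"
			     : : : "memory");
}
\end{VerbatimN}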
\subsection{x86}
Historically, the x86 CPUs provided ``process ordering'' so that all CPUs
agreed on the order of a given CPU's writes to memory.
This allowed the \co{smp_wmb()}
primitive to be a no-op for the CPU~\cite{IntelXeonV3-96a}.
Of course, a compiler directive was also required to prevent optimizations
that would reorder across the \co{smp_wmb()} primitive.
In ancient times, certain x86 CPUs gave no ordering guarantees for loads, so
the \co{smp_mb()} and \co{smp_rmb()} primitives expanded to \co{lock;addl}.
This atomic instruction acts as a barrier to both loads and stores.
But those were ancient times.
More recently, Intel has published a
\IXalthmr{memory model}{x86}{memory model} for
x86~\cite{Intelx86MemoryOrdering2007}.
It turns out that Intel's modern CPUs enforce tighter ordering than was
claimed in the previous specifications, so this model simply mandates
this modern behavior.
Even more recently, Intel published an updated memory model for
x86~\cite[Section 8.2]{Intel64IA32v3A2011}, which mandates a total global order
for stores, although individual CPUs are still permitted to see their
own stores as having happened earlier than this total global order
would indicate.
This exception to the total ordering is needed to allow important
hardware optimizations involving store buffers.
In addition, x86 provides
\IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic}, for example,
so that if CPU~0 sees a store by CPU~1, then CPU~0 is guaranteed to see
all stores that CPU~1 saw prior to its store.
Software may use atomic operations to override these hardware optimizations,
which is one reason that atomic operations tend to be more expensive
than their non-atomic counterparts.
It is also important to note that atomic instructions operating
on a given memory location should all be of the same
size~\cite[Section 8.1.2.2]{Intel64IA32v3A2016}.
For example, if you write a program where one CPU atomically increments
a byte while another CPU executes a 4-byte atomic increment on
that same location, you are on your own.
Some SSE instructions are weakly ordered (\co{clflush}
and non-temporal move instructions~\cite{IntelXeonV2b-96a}).
Code that uses these non-temporal move instructions
can use \co{mfence} for \co{mb()},
\co{lfence} for \co{rmb()}, and \co{sfence} for \co{wmb()}.\footnote{
\co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()} don't suffice
for ordering non-temporal move instructions since Linux v4.15.}
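For example, code publishing a buffer filled by non-temporal stores
might be sketched as follows using the SSE2 intrinsics, with the
function and variable names being purely hypothetical:
\begin{VerbatimN}
#include <emmintrin.h>

/* Hypothetical non-temporal buffer publication. */
void publish(int *buf, int n, volatile int *ready)
{
	int i;

	for (i = 0; i < n; i++)
		_mm_stream_si32(&buf[i], i);
	_mm_sfence();  /* Order stores above before flag. */
	*ready = 1;
}
\end{VerbatimN}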
A few older variants of the x86 CPU have a mode bit that enables out-of-order
stores, and for these CPUs, \co{smp_wmb()} must also be defined to
be \co{lock;addl}.
A 2017 kernel commit by Michael S.~Tsirkin replaced \co{mfence} with
\co{lock;addl} in \co{smp_mb()}, achieving a 60 percent performance
boost~\cite{Tsirkin2017}.
The change used a 4-byte negative offset from \co{SP} to avoid
slowness due to false data dependencies, instead of directly
accessing memory pointed to by \co{SP}.
\co{clflush} users still need to use \co{mfence} for ordering.
Therefore, they were converted to use \co{mb()}, which uses \co{mfence}
as before, instead of \co{smp_mb()}.
Although newer x86 implementations accommodate self-modifying code
without any special instructions, to be fully compatible with
past and potential future x86 implementations, a given CPU must
execute a jump instruction or a serializing instruction (e.g., \co{cpuid})
between modifying the code and executing
it~\cite[Section 8.1.3]{Intel64IA32v3A2011}.
\subsection{z Systems}
The z~Systems machines make up the IBM mainframe family, previously
known as the 360, 370, 390 and zSeries~\cite{IBMzSeries04a}.
Parallelism came late to z~Systems, but given that these mainframes first
shipped in the mid 1960s, this is not saying much.
The \qco{bcr 15,0} instruction is used for the Linux \co{smp_mb()} primitive,
but compiler constraints suffice for both the
\co{smp_rmb()} and \co{smp_wmb()} primitives.
The z~Systems architecture also has strong memory-ordering semantics, as shown in
\cref{tab:memorder:Summary of Memory Ordering}.
In particular, all CPUs will agree on the order of unrelated stores from
different CPUs, that is, the z~Systems CPU family is
\IXalt{fully multicopy atomic}{multi-copy atomic},
and is the only commercially available system with this property.
As with most CPUs, the z~Systems architecture does not guarantee a
cache-coherent instruction stream, hence,
self-modifying code must execute a serializing instruction between updating
the instructions and executing them.
That said, many actual z~Systems machines do in fact accommodate self-modifying
code without serializing instructions.
The z~Systems instruction set provides a large set of serializing instructions,
including compare-and-swap, some types of branches (for example, the
aforementioned \qco{bcr 15,0} instruction), and test-and-set.
\subsection{Hardware Specifics:
Discussion}
\label{sec:memorder:Hardware Specifics: Discussion}
There is considerable variation among these CPU families, and this section
only scratched the surface of a few families that are either heavily used
or historically significant.
Those wishing more detail are invited to consult the reference manuals.
However, perhaps the biggest benefit of the Linux-kernel memory model is
that you can ignore these details when writing architecture-independent
Linux-kernel code.
\QuickQuizAnswersChp{qqzmemorder}