% advsync/memorybarriers.tex
\section{Memory Barriers}
\label{sec:advsync:Memory Barriers}
\OriginallyPublished{Section}{sec:advsync:Memory Barriers}{Linux Kernel Memory Barriers}{kernel.org}{Howells2009membartxt}
Causality and sequencing are deeply intuitive, and hackers often
tend to have a much stronger grasp of these concepts than does
the general population.
These intuitions can be extremely powerful tools when writing, analyzing,
and debugging both sequential code and parallel code that makes
use of standard mutual-exclusion mechanisms, such as locking and
RCU.
\begin{table}
\centering{\tt
\begin{tabular}{l|l}
\nf{Thread 1} & \nf{Thread 2} \\
\hline
x = 1; & y = 1; \\
r1 = y; & r2 = x; \\
\hline
\multicolumn{2}{l}{assert(r1 == 1 || r2 == 1);} \\
\end{tabular}}
\caption{Memory Misordering: Dekker}
\label{tab:advsync:Memory Misordering: Dekker}
\end{table}
Unfortunately, these intuitions break down completely in the face of
code that makes direct use of explicit memory barriers for data structures
in shared memory.
For example, the litmus test in
Table~\ref{tab:advsync:Memory Misordering: Dekker}
appears to guarantee that the assertion never fires.
After all, if \nbco{r1 != 1}, we might hope that Thread~1's load from~\co{y}
must have happened before Thread~2's store to~\co{y}, which might raise
further hopes that Thread~2's load from~\co{x} must happen after
Thread~1's store to~\co{x}, so that \nbco{r2 == 1}, as required by the
assertion.
The example is symmetric, so similar hopeful reasoning might lead
us to hope that \nbco{r2 != 1} guarantees that \nbco{r1 == 1}.
Unfortunately, the lack of memory barriers in
Table~\ref{tab:advsync:Memory Misordering: Dekker}
dashes these hopes.
Both the compiler and the CPU are within their rights to reorder
the statements within both Thread~1 and Thread~2, even on relatively
strongly ordered systems such as x86.
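To make the repair concrete, the following sketch recasts the litmus
test of Table~\ref{tab:advsync:Memory Misordering: Dekker} in
Linux-kernel style with an \co{smp_mb()} separating each thread's
store from its subsequent load; the function and variable declarations
are illustrative rather than taken from any particular test harness.
With both barriers in place, at least one of the two loads must see
the other thread's store, so the assertion can no longer fire.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
int x, y, r1, r2;  /* all initially zero */

void thread1(void)
{
  x = 1;
  smp_mb();  /* order the store to x before the load from y */
  r1 = y;
}

void thread2(void)
{
  y = 1;
  smp_mb();  /* order the store to y before the load from x */
  r2 = x;
}

/* After both threads complete, r1 == 1 || r2 == 1 holds. */
\end{verbatim}
\end{minipage}
\vspace{5pt}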
The following sections show exactly where this intuition breaks down,
and then put forward a mental model of memory barriers that can help
you avoid these pitfalls.
Section~\ref{sec:advsync:Memory Ordering and Memory Barriers}
gives a brief overview of memory ordering and memory barriers.
Once this background is in place, the next step is to get you to admit
that your intuition has a problem.
This painful task is taken up by
Section~\ref{sec:advsync:If B Follows A, and C Follows B, Why Doesn't C Follow A?},
which shows an intuitively correct code fragment that fails miserably
on real hardware, and by
Section~\ref{sec:advsync:Variables Can Have More Than One Value},
which presents some code demonstrating that scalar variables can
take on multiple values simultaneously.
Once your intuition has made it through the grieving process,
Section~\ref{sec:advsync:What Can You Trust?}
provides the basic rules that memory barriers follow, rules that we
will build upon.
These rules are further refined in
Sections~\ref{sec:advsync:Review of Locking Implementations}
through~\ref{sec:advsync:Where Are Memory Barriers Needed?}.
\subsection{Memory Ordering and Memory Barriers}
\label{sec:advsync:Memory Ordering and Memory Barriers}
But why are memory barriers needed in the first place?
Can't CPUs keep track of ordering on their own?
Isn't that why we have computers in the first place, to keep track of things?
Many people do indeed expect their computers to keep track of things,
but many also insist that they keep track of things quickly.
One difficulty that modern computer-system vendors face is that
the main memory cannot keep up with the CPU---modern CPUs can execute
hundreds of instructions in the time required to fetch a single variable
from memory.
CPUs therefore sport increasingly large caches, as shown in
Figure~\ref{fig:advsync:Modern Computer System Cache Structure}.
Variables that are heavily used by a given CPU will tend to remain
in that CPU's cache, allowing high-speed access to the corresponding
data.
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{appendix/whymb/cacheSC}}
\caption{Modern Computer System Cache Structure}
\label{fig:advsync:Modern Computer System Cache Structure}
\end{figure}
Unfortunately, when a CPU accesses data that is not yet in its cache,
the result is an expensive ``cache miss'' that requires the data to
be fetched from main memory.
Doubly unfortunately, running typical code results in a significant
number of cache misses.
To limit the resulting performance degradation, CPUs have been designed to
execute other instructions and memory references while waiting for
a cache miss to fetch data from memory.
This clearly causes instructions and memory references to execute out
of order, which could cause serious confusion, as illustrated in
Figure~\ref{fig:advsync:CPUs Can Do Things Out of Order}.
Compilers and synchronization primitives (such as locking and RCU)
are responsible for maintaining the illusion of ordering through use of
``memory barriers'' (for example, \co{smp_mb()} in the Linux kernel).
These memory barriers can be explicit instructions, as they are on
ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions,
as they are on x86.
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}}
\caption{CPUs Can Do Things Out of Order}
\ContributedBy{Figure}{fig:advsync:CPUs Can Do Things Out of Order}{Melissa Broussard}
\end{figure}
Since the standard synchronization primitives preserve the illusion of
ordering, your path of least resistance is to stop reading
this section and simply use these primitives.
However, if you need to implement the synchronization primitives
themselves, or if you are simply interested in understanding how memory
ordering and memory barriers work, read on!
The next sections present counter-intuitive scenarios that you might
encounter when using explicit memory barriers.
\subsection{If B Follows A, and C Follows B, Why Doesn't C Follow A?}
\label{sec:advsync:If B Follows A, and C Follows B, Why Doesn't C Follow A?}
Memory ordering and memory barriers can be extremely counter-intuitive.
For example, consider the functions shown in
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal}
executing in parallel
where variables~A, B, and~C are initially zero:
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 thread0(void)
2 {
3 A = 1;
4 smp_wmb();
5 B = 1;
6 }
7
8 thread1(void)
9 {
10 while (B != 1)
11 continue;
12 barrier();
13 C = 1;
14 }
15
16 thread2(void)
17 {
18 while (C != 1)
19 continue;
20 barrier();
21 assert(A != 0);
22 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Parallel Hardware is Non-Causal}
\label{fig:advsync:Parallel Hardware is Non-Causal}
\end{figure}
Intuitively, \co{thread0()} assigns to~B after it assigns to~A,
\co{thread1()} waits until \co{thread0()} has assigned to~B before
assigning to~C, and \co{thread2()} waits until \co{thread1()} has
assigned to~C before referencing~A.
Therefore, again intuitively, the assertion on line~21 cannot possibly
fire.
This line of reasoning, intuitively obvious though it may be, is completely
and utterly incorrect.
Please note that this is \emph{not} a theoretical assertion:
actually running this code on real-world weakly-ordered hardware
(a 1.5GHz 16-CPU POWER5 system) resulted in the assertion firing
16~times out of 10~million runs.
Clearly, anyone who produces code with explicit memory barriers
should do some extreme testing---although a proof of correctness might
be helpful, the strongly counter-intuitive nature of the behavior of
memory barriers should in turn strongly limit one's trust in such proofs.
The requirement for extreme testing should not be taken lightly, given
that a number of dirty hardware-dependent tricks were used to
greatly \emph{increase} the probability of failure in this run.
\QuickQuiz{}
How on earth could the assertion on line~21 of the code in
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on
page~\pageref{fig:advsync:Parallel Hardware is Non-Causal}
\emph{possibly} fail?
\QuickQuizAnswer{
	The key point that the intuitive analysis missed is that
	there is nothing preventing the assignment to~C from overtaking
	the assignment to~A as both race to reach \co{thread2()}.
This is explained in the remainder of this section.
} \QuickQuizEnd
\QuickQuiz{}
Great \ldots So how do I fix it?
\QuickQuizAnswer{
The easiest fix is to replace each of the \co{barrier()}s on
line~12 and line~20 with an \co{smp_mb()}.
Of course, some hardware is more forgiving than other hardware.
For example, on x86 the assertion on line~21 of
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on
page~\pageref{fig:advsync:Parallel Hardware is Non-Causal}
cannot trigger.
On PowerPC, only the \co{barrier()} on line~20 need be
replaced with \co{smp_mb()} to prevent the assertion from
triggering.
} \QuickQuizEnd
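To make this fix concrete, the following sketch shows \co{thread1()}
and \co{thread2()} from
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal}
with each \co{barrier()} replaced by \co{smp_mb()}, as suggested
by the Quick Quiz's answer; \co{thread0()} is unchanged.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
thread1(void)
{
  while (B != 1)
    continue;
  smp_mb();       /* was barrier() */
  C = 1;
}

thread2(void)
{
  while (C != 1)
    continue;
  smp_mb();       /* was barrier() */
  assert(A != 0); /* can no longer fire */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}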
So what should you do?
Your best strategy, if possible, is to use existing primitives that
incorporate any needed memory barriers, so that you can simply ignore
the rest of this chapter.
Of course, if you are implementing synchronization primitives,
you don't have this luxury.
The following discussion of memory ordering and memory barriers
is for you.
\subsection{Variables Can Have More Than One Value}
\label{sec:advsync:Variables Can Have More Than One Value}
It is natural to think of a variable as taking on a well-defined
sequence of values in a well-defined, global order.
Unfortunately, it is time to say ``goodbye'' to this sort of comforting
fiction.
To see this, consider the program fragment shown in
Figure~\ref{fig:advsync:Software Logic Analyzer}.
This code fragment is executed in parallel by several CPUs.
Line~1 sets a shared variable to the current CPU's ID, line~2
initializes several variables from a \co{gettb()} function that
delivers the value of a fine-grained hardware ``timebase'' counter that is
synchronized among all CPUs (not available from all CPU architectures,
unfortunately!), and the loop from lines~3-8 records the length of
time that the variable retains the value that this CPU assigned to it.
Of course, one of the CPUs will ``win'', and would thus never exit
the loop if not for the check on lines~6-7.
\QuickQuiz{}
What assumption is the code fragment
in Figure~\ref{fig:advsync:Software Logic Analyzer}
making that might not be valid on real hardware?
\QuickQuizAnswer{
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
} \QuickQuizEnd
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 state.variable = mycpu;
2 lasttb = oldtb = firsttb = gettb();
3 while (state.variable == mycpu) {
4 lasttb = oldtb;
5 oldtb = gettb();
6 if (lasttb - firsttb > 1000)
7 break;
8 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Software Logic Analyzer}
\label{fig:advsync:Software Logic Analyzer}
\end{figure}
Upon exit from the loop, \co{firsttb} will hold a timestamp
taken shortly after the assignment and \co{lasttb} will hold
a timestamp taken before the last sampling of the shared variable
that still retained the assigned value, or a value equal to \co{firsttb}
if the shared variable had changed before entry into the loop.
This allows us to plot each CPU's view of the value of \co{state.variable}
over a 532-nanosecond time period, as shown in
Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}.
This data was collected in 2006 on a 1.5GHz POWER5 system with 8 cores,
each containing a pair of hardware threads.
CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test.
The timebase counter period was about 5.32ns, sufficiently fine-grained
to allow observations of intermediate cache states.
\begin{figure}[htb]
\centering
\resizebox{3in}{!}{\includegraphics{advsync/MoreThanOneValue}}
\caption{A Variable With Multiple Simultaneous Values}
\label{fig:advsync:A Variable With Multiple Simultaneous Values}
\end{figure}
Each horizontal bar represents the observations of a given CPU over time,
with the black regions to the left indicating the time before the
corresponding CPU's first measurement.
During the first 5ns, only CPU~3 has an opinion about the value of the
variable.
During the next 10ns, CPUs~2 and~3 disagree on the value of the variable,
but thereafter agree that the value is~``2'', which is in fact
the final agreed-upon value.
However, CPU~1 believes that the value is~``1'' for almost 300ns, and
CPU~4 believes that the value is~``4'' for almost 500ns.
\QuickQuiz{}
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
\QuickQuizAnswer{
Many CPUs have write buffers that record the values of
recent writes, which are applied once the corresponding
cache line makes its way to the CPU.
Therefore, it is quite possible for each CPU to see a
different value for a given variable at a single point
in time---and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
} \QuickQuizEnd
\QuickQuiz{}
Why do CPUs~2 and~3 come to agreement so quickly, when it
takes so long for CPUs~1 and~4 to come to the party?
\QuickQuizAnswer{
CPUs~2 and~3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a NUMA, or, more accurately, a NUCA effect.
This leads to the question of why CPUs~2 and~3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory barriers in the code fragment.
} \QuickQuizEnd
And if you think that the situation with four CPUs was intriguing, consider
Figure~\ref{fig:advsync:A Variable With More Simultaneous Values},
which shows the same situation, but with 15~CPUs each assigning their
number to a single shared variable at time~$t=0$. Both diagrams in the
figure are drawn in the same way as
Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}.
The only difference is that the unit of the horizontal axis is timebase ticks,
with each tick lasting about 5.3~nanoseconds.
The entire sequence therefore lasts a bit longer than the events recorded in
Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values},
consistent with the larger number of CPUs.
The upper diagram shows the overall picture, while the lower one shows
a magnified view of the first 50~timebase ticks.
Again, CPU~0 coordinates the test, so does not record any values.
\begin{figure*}
\centering
\resizebox{5in}{!}{\includegraphics{advsync/MoreThanOneValue-15CPU}}
\caption{A Variable With More Simultaneous Values}
\ContributedBy{Figure}{fig:advsync:A Variable With More Simultaneous Values}{Akira Yokosawa}
\end{figure*}
All CPUs eventually agree on the final value of~9, but not before
the values~15 and~12 take early leads.
Note that there are fourteen different opinions on the variable's value
at time~21, as indicated by the vertical line in the lower diagram.
Note also that all CPUs see sequences whose orderings are consistent with
the directed graph shown in
Figure~\ref{fig:advsync:Possible Global Orders With More Simultaneous Values}.
Nevertheless, both figures underscore the importance of
proper use of memory barriers for code that cares about memory ordering.
\begin{figure}[htb]
\centering
\resizebox{2.5in}{!}{\includegraphics{advsync/store15tred}}
\caption{Possible Global Orders With More Simultaneous Values}
\label{fig:advsync:Possible Global Orders With More Simultaneous Values}
\end{figure}
We have entered a regime where we must bid a fond farewell to
comfortable intuitions about values of variables and the passage of time.
This is the regime where memory barriers are needed.
All that aside, it is important to remember the lessons from
Chapters~\ref{chp:Hardware and its Habits}
and~\ref{cha:Partitioning and Synchronization Design}.
Having all CPUs write concurrently to the same variable
is absolutely no way to design a parallel program, at least
not if performance and scalability are at all important to you.
\subsection{What Can You Trust?}
\label{sec:advsync:What Can You Trust?}
You most definitely cannot trust your intuition.
What \emph{can} you trust?
It turns out that there are a few reasonably simple rules that
allow you to make good use of memory barriers.
This section derives those rules, for those who wish to get
to the bottom of the memory-barrier story, at least from the viewpoint
of portable code.
If you just want to be told what the rules are rather than suffering
through the actual derivation,
please feel free to skip to Section~\ref{sec:advsync:A Few Simple Rules}.
The exact semantics of memory barriers vary wildly from one CPU family to
another, so portable code must rely only on the least-common-denominator
semantics of memory barriers.
Fortunately, all CPU families impose the following rules:
\begin{enumerate}
\item All accesses by a given CPU will appear to that CPU to have
occurred in program order.
\item All CPUs' accesses to a single variable will be consistent with
some global ordering of stores to that variable.
\item Memory barriers will operate in a pair-wise fashion.
\item Operations will be provided from which exclusive locking
primitives may be constructed.
\end{enumerate}
Therefore, if you need to use memory barriers in portable code,
you can rely on all of these properties.\footnote{
Or, better yet, you can avoid explicit use of memory barriers
entirely.
But that would be the subject of other sections.}
Each of these properties is described in the following sections.
\subsubsection{Self-References Are Ordered}
A given CPU will see its own accesses as occurring in ``program order'',
as if the CPU was executing only one instruction at a time with no
reordering or speculation.
For older CPU families, this restriction is necessary for binary compatibility,
and only secondarily for the sanity of us software types.
There have been a few CPU families that violate this rule to a limited extent,
but in those cases, the compiler has been responsible
for ensuring that ordering is explicitly enforced as needed.
Either way, from the programmer's viewpoint, the CPU sees its own accesses
in program order.
\subsubsection{Single-Variable Memory Consistency}
\label{sec:advsync:Single-Variable Memory Consistency}
Because current commercially available computer systems provide
\emph{cache coherence},
if a group of CPUs all do concurrent non-atomic stores to a single variable,
the series of values seen by all CPUs will be consistent with at
least one global ordering.
For example, in the series of accesses shown in
Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values},
CPU~1 sees the sequence {\tt \{1,2\}},
CPU~2 sees the sequence {\tt \{2\}},
CPU~3 sees the sequence {\tt \{3,2\}},
and
CPU~4 sees the sequence {\tt \{4,2\}}.
This is consistent with the global sequence {\tt \{3,1,4,2\}},
but also with all five of the other sequences of these four numbers that end
in~``2''.
Thus, there will be agreement on the sequence of values taken on
by a single variable, but there might be ambiguity.
In contrast, had the CPUs used atomic operations (such as the Linux kernel's
\co{atomic_inc_return()} primitive) rather than simple stores of
unique values, their observations would
be guaranteed to determine a single globally consistent sequence of values.
One of the \co{atomic_inc_return()} invocations would happen first,
and would change the value from~0 to~1, the second from~1 to~2, and
so on.
The CPUs could compare notes afterwards and come to agreement on the
exact ordering of the sequence of \co{atomic_inc_return()} invocations.
This does not work for the non-atomic stores described earlier because
the non-atomic stores do not return any indication of the earlier value,
hence the possibility of ambiguity.
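As a hedged illustration of this point (the variable and function
names below are illustrative, not taken from the earlier experiment),
each CPU might record its position in the single global sequence of
updates as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
atomic_t shared_var = ATOMIC_INIT(0);

void contribute(int *my_position)
{
  /* Atomically add one and return the new value:
   * the first caller sees 1, the next sees 2, and
   * so on, so the CPUs can later compare notes and
   * agree on the exact order of their updates. */
  *my_position = atomic_inc_return(&shared_var);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}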
Please note well that this section applies \emph{only} when all
CPUs' accesses are to one single variable.
In this single-variable case, cache coherence guarantees the
global ordering, at least assuming that some of the more aggressive
compiler optimizations are disabled via the Linux kernel's
\co{READ_ONCE()} and \co{WRITE_ONCE()} directives or C++11's relaxed
atomics~\cite{PeteBecker2011N3242}.
In contrast, if there are multiple variables, then on current
commercially available computer systems, memory barriers are
required for the CPUs to consistently agree on the order of accesses.
\subsubsection{Pair-Wise Memory Barriers}
Pair-wise memory barriers provide conditional ordering semantics.
For example, in the following set of operations, CPU~1's access to
A does not unconditionally precede its access to~B from the viewpoint
of an external logic analyzer
\IfInBook{(see Appendix~\ref{chp:app:whymb:Why Memory Barriers?}
for examples).
}
{(the system is only required to act \emph{as if} the accesses
are in order; it is not necessarily required to actually
force them to be in order).}
However, if CPU~2's access to~B sees the result of CPU~1's access
to~B, then CPU~2's access to~A is guaranteed to see the result of
CPU~1's access to~A.
Although some CPU families' memory barriers do in fact provide stronger,
unconditional ordering guarantees, portable code may rely only
on this weaker if-then conditional ordering guarantee.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
<access> A & <access> B \\
<memory barrier>& <memory barrier> \\
<access> B & <access> A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
\QuickQuiz{}
But if the memory barriers do not unconditionally force
ordering, how the heck can a device driver reliably execute
sequences of loads and stores to MMIO registers?
\QuickQuizAnswer{
	MMIO registers are a special case because they appear
	in uncached regions of physical memory.
Memory barriers \emph{do} unconditionally force ordering
of loads and stores to uncached memory, as discussed in
Section~\ref{sec:advsync:Device Operations}.
} \QuickQuizEnd
Of course, accesses must be either loads or stores, and these
do have different properties.
Table~\ref{tab:advsync:Memory-Barrier Combinations}
shows all possible combinations of loads and stores from a pair
of CPUs.
Of course, to enforce conditional ordering, there must be
a memory barrier between each CPU's pair of operations.
\begin{table*}
\small
\centering
\begin{tabular}{r||c|c||c|c||l}
\multicolumn{1}{c||}{} & \multicolumn{2}{c||}{CPU 1} &
\multicolumn{2}{c||}{CPU 2} & Description \\
\hline
\hline
0 & load(A) & load(B) & load(B) & load(A) &
Ears to ears. \\
1 & load(A) & load(B) & load(B) & store(A) &
Only one store. \\
2 & load(A) & load(B) & store(B) & load(A) &
Only one store. \\
3 & load(A) & load(B) & store(B) & store(A) &
Pairing 1. \\
\hline
4 & load(A) & store(B) & load(B) & load(A) &
Only one store. \\
5 & load(A) & store(B) & load(B) & store(A) &
Pairing 2. \\
6 & load(A) & store(B) & store(B) & load(A) &
Mouth to mouth, ear to ear. \\
7 & load(A) & store(B) & store(B) & store(A) &
Pairing 3. \\
\hline
8 & store(A) & load(B) & load(B) & load(A) &
Only one store. \\
9 & store(A) & load(B) & load(B) & store(A) &
Mouth to mouth, ear to ear. \\
A & store(A) & load(B) & store(B) & load(A) &
Ears to mouths. \\
B & store(A) & load(B) & store(B) & store(A) &
Stores ``pass in the night''. \\
\hline
C & store(A) & store(B) & load(B) & load(A) &
Pairing 1. \\
D & store(A) & store(B) & load(B) & store(A) &
Pairing 3. \\
E & store(A) & store(B) & store(B) & load(A) &
Stores ``pass in the night''. \\
F & store(A) & store(B) & store(B) & store(A) &
Stores ``pass in the night''. \\
\end{tabular}
\caption{Memory-Barrier Combinations}
\label{tab:advsync:Memory-Barrier Combinations}
\end{table*}
\subsubsection{Pair-Wise Memory Barriers: Portable Combinations}
The following pairings from
Table~\ref{tab:advsync:Memory-Barrier Combinations}
enumerate all the combinations of memory-barrier
pairings that portable software may depend on.
\paragraph{Pairing 1.}
In this pairing, one CPU executes a pair of loads separated
by a memory barrier, while a second CPU executes a pair
of stores also separated by a memory barrier, as follows
(both~A and~B are initially equal to zero):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
STORE A = 1 & Y = LOAD B \\
<memory barrier>& <memory barrier> \\
STORE B = 1 & X = LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
After both CPUs have completed executing these code sequences,
if \co{Y==1}, then we must also have \co{X==1}.
In this case, the fact that \co{Y==1} means that
CPU~2's load prior to its memory barrier has
seen the store following CPU~1's memory barrier.
Due to the pairwise nature of memory barriers, CPU~2's
load following its memory barrier must therefore see
the store that precedes CPU~1's memory barrier, so that
\co{X==1}.
On the other hand, if \co{Y==0}, the memory-barrier condition
does not hold, and so in this case, X~could be either~0 or~1.
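In Linux-kernel C, Pairing~1 might be sketched as follows, with
\co{smp_mb()} standing in for the memory barriers; the function names
and the result variables X and~Y are illustrative:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
int A, B, X, Y;  /* A and B initially zero */

void cpu1(void)
{
  A = 1;
  smp_mb();  /* pairs with cpu2()'s barrier */
  B = 1;
}

void cpu2(void)
{
  Y = B;
  smp_mb();  /* pairs with cpu1()'s barrier */
  X = A;
  /* If Y == 1 here, then X must also be 1;
   * if Y == 0, X may be either 0 or 1. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}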
\paragraph{Pairing 2.}
In this pairing, each CPU executes a load followed by a
memory barrier followed by a store, as follows
(both~A and~B are initially equal to zero):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
X = LOAD A & Y = LOAD B \\
<memory barrier>& <memory barrier> \\
STORE B = 1 & STORE A = 1 \\
\end{tabular}
\end{minipage}
\vspace{5pt}
After both CPUs have completed executing these code sequences,
if \co{X==1}, then we must also have \co{Y==0}.
In this case, the fact that \co{X==1} means that
CPU~1's load prior to its memory barrier has
seen the store following CPU~2's memory barrier.
Due to the pairwise nature of memory barriers, CPU~1's
store following its memory barrier must therefore see
the results of CPU~2's load preceding its memory barrier,
so that \co{Y==0}.
On the other hand, if \co{X==0}, the memory-barrier condition
does not hold, and so in this case, Y~could be either~0 or~1.
The two CPUs' code sequences are symmetric, so if \co{Y==1}
after both CPUs have finished executing these code sequences,
then we must have \co{X==0}.
\paragraph{Pairing 3.}
In this pairing, one CPU executes a load followed by a
memory barrier followed by a store, while the other CPU
executes a pair of stores separated by a memory barrier,
as follows (both~A and~B are initially equal to zero):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
X = LOAD A & STORE B = 2 \\
<memory barrier>& <memory barrier> \\
STORE B = 1 & STORE A = 1 \\
\end{tabular}
\end{minipage}
\vspace{5pt}
After both CPUs have completed executing these code sequences,
if \co{X==1}, then we must also have \co{B==1}.
In this case, the fact that \co{X==1} means that
CPU~1's load prior to its memory barrier has
seen the store following CPU~2's memory barrier.
Due to the pairwise nature of memory barriers, CPU~1's
store following its memory barrier must therefore see
the results of CPU~2's store preceding its memory barrier.
This means that CPU~1's store to~B will overwrite CPU~2's
store to~B, resulting in \co{B==1}.
On the other hand, if \co{X==0}, the memory-barrier condition
does not hold, and so in this case, B~could be either~1 or~2.
\subsubsection{Pair-Wise Memory Barriers: Semi-Portable Combinations}
The following pairings from
Table~\ref{tab:advsync:Memory-Barrier Combinations}
can be used on modern hardware, but might fail on some systems
that were produced in the 1900s.
However, these \emph{can} safely be used on all mainstream hardware
introduced since the year 2000.
So if you think that memory barriers are difficult to deal with, please
keep in mind that they used to be a \emph{lot} harder on some systems!
\paragraph{Ears to Mouths.}
Since the stores cannot see the results of the loads
(again, ignoring MMIO registers for the moment),
it is not always possible to determine whether the memory-barrier
condition has been met.
However, 21\textsuperscript{st}-century
hardware \emph{would} guarantee that at least one
of the loads saw the value stored by the corresponding store
(or some later value for that same variable).
\QuickQuiz{}
How do we know that modern hardware guarantees that at least
one of the loads will see the value stored by the other thread
in the ears-to-mouths scenario?
\QuickQuizAnswer{
The scenario is as follows, with~A and~B both initially zero:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 0} & \nf{CPU 1} \\
\hline
STORE A = 1 & STORE B = 1 \\
<memory barrier>& <memory barrier> \\
r1 = LOAD B & r2 = LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
	If neither of the loads sees the corresponding store, then when both
CPUs finish, both \co{r1} and \co{r2} will be equal to zero.
Let's suppose that \co{r1} is equal to zero.
Then we know that CPU~0's load from~B happened before CPU~1's
store to~B: After all, we would have had \co{r1} equal to one
otherwise.
But given that CPU~0's load from~B happened before CPU~1's store
to~B, memory-barrier pairing guarantees that CPU~0's store to~A
happens before CPU~1's load from~A, which in turn guarantees that
\co{r2} will be equal to one, not zero.
Therefore, at least one of \co{r1} and \co{r2} must be nonzero,
which means that at least one of the loads saw the value from
the corresponding store, as claimed.
} \QuickQuizEnd
\paragraph{Stores ``Pass in the Night''.}
In the following example, after both CPUs have finished
executing their code sequences, it is quite tempting to
conclude that the result {\tt \{A==1,B==2\}} cannot happen.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
STORE A = 1 & STORE B = 2 \\
<memory barrier>& <memory barrier> \\
STORE B = 1 & STORE A = 2 \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Unfortunately, although this conclusion is correct on
21\textsuperscript{st}-century systems, it does not necessarily hold
on all antique 20\textsuperscript{th}-century systems.
Suppose that the cache line containing~A is initially owned
by CPU~2, and that containing~B is initially owned by CPU~1.
Then, in systems that have invalidation queues and store
buffers, it is possible for the first assignments to ``pass
in the night'', so that the second assignments actually
happen first.
\IfInBook{This strange effect is explained in
Appendix~\ref{chp:app:whymb:Why Memory Barriers?}.
}
{}
This same effect can happen in any memory-barrier pairing
where each CPU's memory barrier is preceded by a store,
including the ``ears to mouths'' pairing.
However, 21\textsuperscript{st}-century hardware \emph{does}
accommodate these ordering intuitions, and \emph{does} permit
this combination to be used safely.
\subsubsection{Pair-Wise Memory Barriers: Dubious Combinations}
In the following combinations from
Table~\ref{tab:advsync:Memory-Barrier Combinations},
the memory barriers have very limited use in portable code, even
on 21\textsuperscript{st}-century hardware.
However, ``limited use'' is different than ``no use'', so let's see
what can be done!
Avid readers will want to write toy programs that rely on each of
these combinations in order to fully understand how this works.
\paragraph{Ears to Ears.}
Since loads do not change the state of memory
(ignoring MMIO registers for the moment),
it is not possible for one of the loads to see the
results of the other load.
However, if we know that CPU~2's load from~B returned a
newer value than CPU~1's load from~B, then we also know
that CPU~2's load from~A returned either the same value
as CPU~1's load from~A or some later value.
\paragraph{Mouth to Mouth, Ear to Ear.}
One of the variables is only loaded from, and the other
is only stored to.
Because (once again, ignoring MMIO registers) it is not
possible for one load to see the results of the other,
it is not possible to detect the conditional ordering
provided by the memory barrier.
However, it is possible to determine which store happened
last, but this requires an additional load from~B.
If this additional load from~B is executed after both
CPUs~1 and~2 complete, and if it turns out that CPU~2's
store to~B happened last, then we know
that CPU~2's load from~A returned either the same value
as CPU~1's load from~A or some later value.
\paragraph{Only One Store.}
Because there is only one store, only one of the variables
permits one CPU to see the results of the other CPU's
access.
Therefore, there is no way to detect the
conditional ordering provided by the memory barriers.
At least not straightforwardly.
But suppose that in combination~1 from
Table~\ref{tab:advsync:Memory-Barrier Combinations},
CPU~1's load from~A returns the value that CPU~2 stored
to~A. Then we know that CPU~1's load from~B returned
either the same value as CPU~2's load from~B or some later value.
\QuickQuiz{}
How can the other ``Only one store'' entries in
Table~\ref{tab:advsync:Memory-Barrier Combinations}
be used?
\QuickQuizAnswer{
For combination~2, if CPU~1's load from~B sees a value prior
to CPU~2's store to~B, then we know that CPU~2's load from~A
will return the same value as CPU~1's load from~A, or some later
value.
For combination~4, if CPU~2's load from~B sees the value from
CPU~1's store to~B, then we know that CPU~2's load from~A
will return the same value as CPU~1's load from~A, or some later
value.
	For combination~8, if CPU~2's load from~A sees a value prior to
	CPU~1's store to~A, then we know that CPU~1's load from~B will
	return the same value as CPU~2's load from~B, or some later value.
} \QuickQuizEnd
\subsubsection{Semantics Sufficient to Implement Locking}
Suppose we have an exclusive lock (\co{spinlock_t} in the Linux
kernel, \co{pthread_mutex_t} in pthreads code) that guards a number
of variables (in other words, these variables are not accessed except
from the lock's critical sections).
The following properties must then hold true:
\begin{enumerate}
\item A given CPU or thread must see all of its own loads and stores
as if they had occurred in program order.
\item The lock acquisitions and releases must appear to have executed
in a single global order.\footnote{
Of course, this order might be different from one run
to the next.
On any given run, however, all CPUs and threads must
have a consistent view of the order of critical sections
for a given exclusive lock.}
\item Suppose a given variable has not yet been stored in a
critical section that is currently executing.
Then any load from a given variable performed in that critical section
must see the last store to that variable from the last previous
critical section that stored to it.
\end{enumerate}
The difference between the last two properties is a bit subtle:
the second requires that the lock acquisitions and releases occur
in a well-defined order, while the third requires that the critical
sections not ``bleed out'' far enough to cause difficulties for
other critical sections.
Why are these properties necessary?
Suppose the first property did not hold.
Then the assertion in the following code might well fail!
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
a = 1;
b = 1 + a;
assert(b == 2);
\end{verbatim}
\end{minipage}
\label{codesample:advsync:What Can You Count On? 1}
\vspace{5pt}
\QuickQuiz{}
How could the assertion {\tt b==2} on
page~\pageref{codesample:advsync:What Can You Count On? 1}
possibly fail?
\QuickQuizAnswer{
If the CPU is not required to see all of its loads and
stores in order, then the {\tt b=1+a} might well see an
old version of the variable~``a''.
This is why it is so very important that each CPU or thread
see all of its own loads and stores in program order.
} \QuickQuizEnd
Suppose that the second property did not hold.
Then the following code might leak memory!
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin_lock(&mylock);
if (p == NULL)
p = kmalloc(sizeof(*p), GFP_ATOMIC);
spin_unlock(&mylock);
\end{verbatim}
\end{minipage}
\label{codesample:advsync:What Can You Count On? 2}
\vspace{5pt}
\QuickQuiz{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 2}
possibly leak memory?
\QuickQuizAnswer{
Only the first execution of the critical section should
see {\tt p==NULL}.
However, if there is no global ordering of critical sections for
{\tt mylock}, then how can you say that a particular one was
first?
If several different executions of that critical section thought
that they were first, they would all see {\tt p==NULL}, and
they would all allocate memory.
All but one of those allocations would be leaked.
This is why it is so very important that all the critical sections
for a given exclusive lock appear to execute in some well-defined
order.
} \QuickQuizEnd
Suppose that the third property did not hold.
Then the counter shown in the following code might well count backwards.
This third property is crucial, as it cannot be strictly true with
pairwise memory barriers.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin_lock(&mylock);
ctr = ctr + 1;
spin_unlock(&mylock);
\end{verbatim}
\end{minipage}
\label{codesample:advsync:What Can You Count On? 3}
\vspace{5pt}
\QuickQuiz{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 3}
possibly count backwards?
\QuickQuizAnswer{
Suppose that the counter started out with the value zero,
and that three executions of the critical section had therefore
brought its value to three.
If the fourth execution of the critical section is not constrained
to see the most recent store to this variable, it might well see
the original value of zero, and therefore set the counter to
one, which would be going backwards.
This is why it is so very important that loads from a given variable
in a given critical
section see the last store from the last prior critical section to
store to that variable.
} \QuickQuizEnd
If you are convinced that these rules are necessary, let's look at how
they interact with a typical locking implementation.
\subsection{Review of Locking Implementations}
\label{sec:advsync:Review of Locking Implementations}
Naive pseudocode for simple lock and unlock operations is shown below.
Note that the \co{atomic_xchg()} primitive implies a memory barrier
both before and after the atomic exchange operation, and that the
implicit barrier after the atomic exchange operation eliminates
the need for an explicit memory barrier in \co{spin_lock()}.
Note also that, despite their names, \co{atomic_read()} and
\co{atomic_set()} do \emph{not}
execute any atomic instructions; instead, they merely execute a
simple load and store, respectively.
This pseudocode follows a number of Linux implementations for
the unlock operation, which is a simple non-atomic store following a
memory barrier.
These minimal implementations must possess all the locking properties
laid out in Section~\ref{sec:advsync:What Can You Trust?}.
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 void spin_lock(spinlock_t *lck)
2 {
3 while (atomic_xchg(&lck->a, 1) != 0)
4 while (atomic_read(&lck->a) != 0)
5 continue;
6 }
7
8 void spin_unlock(spinlock_t *lck)
9 {
10 smp_mb();
11 atomic_set(&lck->a, 0);
12 }
\end{verbatim}
\end{minipage}
\label{codesample:advsync:Naive Lock and Unlock Pseudocode}
\vspace{5pt}
The \co{spin_lock()} primitive cannot proceed until
the preceding \co{spin_unlock()} primitive completes.
If CPU~1 is releasing a lock that CPU~2 is attempting to acquire,
the sequence of operations might be as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt \scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
(critical section) & \tco{atomic_xchg(&lck->a, 1)} $\rightarrow$1 \\
<memory barrier> & LOAD lck->a $\rightarrow$1 \\
STORE lck->a = 0 & LOAD lck->a $\rightarrow$1 \\
& LOAD lck->a $\rightarrow$0 \\
& (implicit memory barrier \#1) \\
& \tco{atomic_xchg(&lck->a, 1)} $\rightarrow$0 \\
& (implicit memory barrier \#2) \\
& (critical section) \\
\end{tabular}
\end{minipage}
\vspace{5pt}
In this particular case, pairwise memory barriers suffice to keep
the two critical sections in place.
CPU~2's \co{atomic_xchg(&lck->a, 1)} has seen CPU~1's \co{lck->a=0},
and therefore everything in CPU~2's following critical section must see
everything that CPU~1's preceding critical section did.
Conversely, CPU~1's critical section cannot see anything that CPU~2's
critical section will do.
\subsection{A Few Simple Rules}
\label{sec:advsync:A Few Simple Rules}
Probably the easiest way to understand memory barriers is to understand
a few simple rules:
\begin{enumerate}
\item Each CPU sees its own accesses in order.
\item If a single shared variable is loaded and stored by multiple
CPUs, then the series of values seen by a given CPU will be
consistent with the series seen by the other CPUs, and there
will be at least one sequence consisting of all values stored
	to that variable with which each CPU's series will be
consistent.\footnote{
A given CPU's series may of course be incomplete,
for example, if a given CPU never loaded or stored
the shared variable, then it can have no opinion about
that variable's value.}
\item If one CPU does ordered stores to variables~A and~B,\footnote{
For example, by executing the store to~A, a
memory barrier, and then the store to~B.}
and if a second CPU does ordered loads from~B and~A,\footnote{
For example, by executing the load from~B, a
memory barrier, and then the load from~A.}
then if the second CPU's load from~B gives the value stored
by the first CPU, then the second CPU's load from~A must
give the value stored by the first CPU.
\item If one CPU does a load from~A ordered before a store to~B,
and if a second CPU does a load from~B ordered before a store to~A,
and if the second CPU's load from~B gives the value stored by
the first CPU, then the first CPU's load from~A must \emph{not}
give the value stored by the second CPU.
\item If one CPU does a load from~A ordered before a store to~B,
and if a second CPU does a store to~B ordered before a
store to~A, and if the first CPU's load from~A gives
the value stored by the second CPU, then the first CPU's
store to~B must happen after the second CPU's store to~B,
hence the value stored by the first CPU persists.\footnote{
Or, for the more competitively oriented, the first
CPU's store to~B ``wins''.}
\end{enumerate}
The next section takes a more operational view of these rules.
\subsection{Abstract Memory Access Model}
Consider the abstract model of the system shown in
Figure~\ref{fig:advsync:Abstract Memory Access Model}.
\begin{figure}[htb]
\centering
\includegraphics{advsync/AbstractMemoryAccessModel}
\caption{Abstract Memory Access Model}
\ContributedBy{Figure}{fig:advsync:Abstract Memory Access Model}{David Howells}
\end{figure}
Each CPU executes a program that generates memory access operations. In the
abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
perform the memory operations in any order it likes, provided program causality
appears to be maintained. Similarly, the compiler may also arrange the
instructions it emits in any order it likes, provided it doesn't affect the
apparent operation of the program.
So in Figure~\ref{fig:advsync:Abstract Memory Access Model},
the effects of the memory operations performed by a
CPU are perceived by the rest of the system as the operations cross the
interface between the CPU and the rest of the system (the dotted lines).
For example, consider the following sequence of events given the
initial values of shared variables {\tt \{A~=~1, B~=~2\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
STORE A = 3 & x = LOAD B \\
STORE B = 4 & y = LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
The set of accesses as seen by the memory system in the middle can be arranged
in 24 different combinations, with loads denoted by ``ld'' and stores
denoted by ``st'':
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{llll}
st A=3, & st B=4, & x=ld B$\rightarrow$4, & y=ld A$\rightarrow$3 \\
st A=3, & st B=4, & y=ld A$\rightarrow$3, & x=ld B$\rightarrow$4 \\
st A=3, & x=ld B$\rightarrow$2, & st B=4, & y=ld A$\rightarrow$3 \\
st A=3, & x=ld B$\rightarrow$2, & y=ld A$\rightarrow$3, & st B=4 \\
st A=3, & y=ld A$\rightarrow$3, & st B=4, & x=ld B$\rightarrow$4 \\
st A=3, & y=ld A$\rightarrow$3, & x=ld B$\rightarrow$2, & st B=4 \\
st B=4, & st A=3, & x=ld B$\rightarrow$4, & y=ld A$\rightarrow$3 \\
st B=4, & ... & & \\
... & & & \\
\end{tabular}
\vspace{3pt}
\end{minipage}
%
and can thus result in four different combinations of local values:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{ll}
x == 2, & y == 1 \\
x == 2, & y == 3 \\
x == 4, & y == 1 \\
x == 4, & y == 3 \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Furthermore, the stores committed by a CPU to the memory system may not be
perceived by the loads made by another CPU in the same order as the stores were
committed.
As a further example, consider this sequence of events given the
initial values of shared variables
{\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
STORE B = 4 & Q = LOAD P \\
STORE P = \&B & D = LOAD *Q \\
\end{tabular}
\end{minipage}
\vspace{5pt}
There is an obvious data dependency here,
as the value loaded into~\co{D} depends on
the address retrieved from~\co{P} by CPU~2.
At the end of the sequence, any of the
following results are possible:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{c@{ and }c}
(Q == \&A) & (D == 1) \\
(Q == \&B) & (D == 2) \\
(Q == \&B) & (D == 4) \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Note that CPU~2 will never try and load~C into~D because the CPU will load~P
into~Q before issuing the load of~*Q.\footnote{Although it might sound
counterintuitive, one of the results,
{\tt\scriptsize (Q~==~\&B) and (D~==~2)},
does not necessarily mean that CPU~1's
{\tt\scriptsize STORE B~=~4} was issued
{\em after} {\tt\scriptsize STORE P~=~\&B}.
To reach a consensus in ordering, we need a pair of memory barriers
as is described in
Section~\ref{sec:advsync:Data Dependency Barriers}.}
\subsection{Device Operations}
\label{sec:advsync:Device Operations}
Some devices present their control interfaces as collections of memory
locations, but the order in which the control registers are accessed is very
important. For instance, imagine an Ethernet card with a set of internal
registers that are accessed through an address port register~(A) and a data
port register~(D). To read internal register~5, the following code might then
be used:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
*A = 5;
x = *D;
\end{verbatim}
\vspace{1pt}
\end{minipage}
%
but this might show up as either of the following two sequences:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
STORE *A = 5, x = LOAD *D
x = LOAD *D, STORE *A = 5
\end{verbatim}
\vspace{1pt}
\end{minipage}
%
the second of which will almost certainly result in a malfunction, since it sets
the address \emph{after} attempting to read the register.
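One plausible remedy, sketched below on the assumption that \co{A}
and \co{D} refer to uncached MMIO space, is to place a mandatory
memory barrier between the two accesses, which (as noted in the
earlier Quick Quiz on MMIO registers) unconditionally orders loads
and stores to uncached memory:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
*A = 5;   /* select internal register 5 */
mb();     /* order the store to the address port
           * before the load from the data port */
x = *D;   /* read the selected register */
\end{verbatim}
\end{minipage}
\vspace{5pt}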
\subsection{Guarantees}
\label{sec:advsync:Guarantees}
There are some minimal guarantees that may be expected of a CPU:
\begin{enumerate}
\item On any given CPU, dependent memory accesses will be issued in order,
with respect to itself. This means that for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Q = P; D = *Q;
\end{verbatim}
\end{minipage}
the CPU will issue the following memory operations:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Q = LOAD P, D = LOAD *Q
\end{verbatim}
\end{minipage}
and always in that order.
\item Overlapping loads and stores within a particular CPU will appear to be
ordered within that CPU. This means that for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
a = *X; *X = b;
\end{verbatim}
\end{minipage}
the CPU will only issue the following sequence of memory operations:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
a = LOAD *X, STORE *X = b
\end{verbatim}
\vspace{1pt}
\end{minipage}
And for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
*X = c; d = *X;
\end{verbatim}
\end{minipage}
the CPU will only issue:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
STORE *X = c, d = LOAD *X
\end{verbatim}
\end{minipage}
(Loads and stores overlap if they are targeted at overlapping pieces of
memory).
\item A series of stores to a single variable will appear to all
CPUs to have occurred in a single order, though this order
might not be predictable from the code, and in fact the
order might vary from one run to another.
\end{enumerate}
And there are a number of things that \emph{must} or \emph{must not} be assumed:
\begin{enumerate}
\item It \emph{must not} be assumed that independent loads and stores will
be issued in the order given. This means that for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
X = *A; Y = *B; *D = Z;
\end{verbatim}
\end{minipage}
we may get any of the following sequences:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
X = LOAD *A, Y = LOAD *B, STORE *D = Z
X = LOAD *A, STORE *D = Z, Y = LOAD *B
Y = LOAD *B, X = LOAD *A, STORE *D = Z
Y = LOAD *B, STORE *D = Z, X = LOAD *A
STORE *D = Z, X = LOAD *A, Y = LOAD *B
STORE *D = Z, Y = LOAD *B, X = LOAD *A
\end{verbatim}
\end{minipage}
\item It \emph{must} be assumed that overlapping memory accesses may
be merged or discarded. This means that for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
X = *A; Y = *(A + 4);
\end{verbatim}
\end{minipage}
we may get any one of the following sequences:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
X = LOAD *A; Y = LOAD *(A + 4);
Y = LOAD *(A + 4); X = LOAD *A;
{X, Y} = LOAD {*A, *(A + 4) };
\end{verbatim}
\vspace{1pt}
\end{minipage}
And for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
*A = X; *(A + 4) = Y;
\end{verbatim}
\end{minipage}
we may get any of:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
STORE *A = X; STORE *(A + 4) = Y;
STORE *(A + 4) = Y; STORE *A = X;
STORE {*A, *(A + 4) } = {X, Y};
\end{verbatim}
\vspace{1pt}
\end{minipage}
Finally, for:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
*A = X; *A = Y;
\end{verbatim}
\end{minipage}
we may get either of:
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
STORE *A = X; STORE *A = Y;
STORE *A = Y;
\end{verbatim}
\end{minipage}
\end{enumerate}
\subsection{What Are Memory Barriers?}
\label{sec:advsync:What Are Memory Barriers?}
As can be seen above, independent memory operations are effectively performed
in random order, but this can be a problem for CPU-CPU interaction and for I/O.
What is required is some way of intervening to instruct the compiler and the
CPU to restrict the order.
Memory barriers are such interventions. They impose a perceived partial
ordering over the memory operations on either side of the barrier.
Such enforcement is important because the CPUs and other devices in a system
can use a variety of tricks to improve performance---including reordering,
deferral and combination of memory operations; speculative loads; speculative
branch prediction and various types of caching. Memory barriers are used to
override or suppress these tricks, allowing the code to sanely control the
interaction of multiple CPUs and/or devices.
\subsubsection{Explicit Memory Barriers}
\label{sec:advsync:Explicit Memory Barriers}
Memory barriers come in four basic varieties:
\begin{enumerate}
\item Write (or store) memory barriers,
\item Data dependency barriers,
\item Read (or load) memory barriers, and
\item General memory barriers.
\end{enumerate}
Each variety is described below.
\paragraph{Write Memory Barriers}
A write memory barrier gives a guarantee that all the STORE operations
specified before the barrier will appear to happen before all the STORE
operations specified after the barrier with respect to the other
components of the system.
A write barrier is a partial ordering on stores only; it is not required
to have any effect on loads.
A CPU can be viewed as committing a sequence of store operations to the
memory system as time progresses. All stores before a write barrier will
occur in the sequence \emph{before} all the stores after the write barrier.
Note that write barriers should normally be paired with read
or data dependency barriers; see
Section~\ref{sec:advsync:SMP Barrier Pairing}.
\paragraph{Data Dependency Barriers}
A data dependency barrier is a weaker form of read barrier. In the case
where two loads are performed such that the second depends on the result
of the first (e.g., the first load retrieves the address to which the second
load will be directed), a data dependency barrier would be required to
make sure that the target of the second load is updated before the address
obtained by the first load is accessed.
A data dependency barrier is a partial ordering on interdependent loads
only; it is not required to have any effect on stores, independent loads
or overlapping loads.
As mentioned for write memory barriers,
the other CPUs in the system can be viewed as
committing sequences of stores to the memory system that the CPU being
considered can then perceive. A data dependency barrier issued by the CPU
under consideration guarantees that for any load preceding it, if that
load touches one of a sequence of stores from another CPU, then by the
time the barrier completes, the effects of all the stores prior to that
touched by the load will be perceptible to any loads issued after the data
dependency barrier.
See Section~\ref{sec:advsync:Examples of Memory Barrier Pairings} for
diagrams showing the ordering constraints.
Note that the first load really has to have a
\emph{data} dependency and
not a control dependency. If the address for the second load is dependent
on the first load, but the dependency is through a conditional rather than
actually loading the address itself, then it's a \emph{control} dependency and
a full read barrier or better is required. See
Section~\ref{sec:advsync:Control Dependencies} for more information.
Note that data dependency barriers should normally be paired with
write barriers; see Section~\ref{sec:advsync:SMP Barrier Pairing}.
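As a hedged sketch of this pairing (the structure and variable names
are illustrative, and in the Linux kernel the explicit
\co{smp_read_barrier_depends()} shown below is normally supplied by
\co{rcu_dereference()}), the pointer-publication pattern might be
written as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct foo {
  int a;
};
struct foo *gp;  /* initially NULL */

void publish(struct foo *p)
{
  p->a = 42;
  smp_wmb();     /* initialize before publication */
  gp = p;
}

int consume(void)
{
  struct foo *q = gp;

  if (q == NULL)
    return -1;   /* not yet published */
  smp_read_barrier_depends();  /* data dependency barrier */
  return q->a;   /* guaranteed to see 42 */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}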
\paragraph{Read Memory Barriers}
A read barrier is a data dependency barrier plus a guarantee that all the
LOAD operations specified before the barrier will appear to happen before
all the LOAD operations specified after the barrier with respect to the
other components of the system.
A read barrier is a partial ordering on loads only; it is not required to
have any effect on stores.
Read memory barriers imply data dependency barriers, and so can substitute
for them.
Note that read barriers should normally be paired with write barriers;
see Section~\ref{sec:advsync:SMP Barrier Pairing}.
\paragraph{General Memory Barriers}
A general memory barrier gives a guarantee that all the LOAD and STORE
operations specified before the barrier will appear to happen before all
the LOAD and STORE operations specified after the barrier with respect to
the other components of the system.
A general memory barrier is a partial ordering over both loads and stores.
General memory barriers imply both read and write memory barriers, and so
can substitute for either.
\subsubsection{Implicit Memory Barriers}
There are a couple of types of implicit memory barriers, so called
because they are embedded into locking primitives:
\begin{enumerate}
\item ACQUIRE operations and
\item RELEASE operations.\footnote{
Note that there were times when ACQUIRE/RELEASE operations
	were called ``LOCK/UNLOCK'' operations in the Linux
kernel community. ``ACQUIRE/RELEASE'' is preferred in
discussing lockless techniques.}
\end{enumerate}
\paragraph{ACQUIRE Operations}
An ACQUIRE operation acts as a one-way permeable barrier.
It guarantees that all memory
operations after the ACQUIRE operation will appear to happen after the ACQUIRE
operation with respect to the other components of the system.
Memory operations that occur before an ACQUIRE operation may appear to happen
after it completes.
An ACQUIRE operation should almost always be paired with a RELEASE operation.
\paragraph{RELEASE Operations}
A RELEASE operation also acts as a one-way permeable barrier.
It guarantees that all
memory operations before the RELEASE operation will appear to happen before
the RELEASE operation with respect to the other components of the system.
Memory operations that occur after a RELEASE operation may appear to
happen before it completes.
ACQUIRE and RELEASE operations are guaranteed to appear with respect to each
other strictly in the order specified.
The use of ACQUIRE and RELEASE operations generally precludes the need for
other sorts of memory barrier (but note the exceptions mentioned in
Section~\ref{sec:advsync:Device Operations}).
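
To illustrate the one-way nature of these barriers, here is a minimal
sketch assuming the Linux kernel's \co{spin_lock()} and \co{spin_unlock()}
primitives, which provide ACQUIRE and RELEASE semantics, respectively
(the lock and variable names are illustrative only):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 DEFINE_SPINLOCK(mylock);
  2 int x, y, z;
  3
  4 void example(void)
  5 {
  6   x = 1;                /* Can bleed into critical section. */
  7   spin_lock(&mylock);   /* ACQUIRE (one-way barrier). */
  8   y = 1;                /* Stays in critical section. */
  9   spin_unlock(&mylock); /* RELEASE (one-way barrier). */
 10   z = 1;                /* Can bleed into critical section. */
 11 }
\end{verbatim}
\end{minipage}
\vspace{5pt}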
\QuickQuiz{}
What effect does the following sequence have on the
order of stores to variables~``a'' and~``b''?
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
a = 1;
b = 1;
<write barrier>
\end{verbatim}
\end{minipage}
\QuickQuizAnswer{
Absolutely none. This barrier {\em would} ensure that the
assignments to~``a'' and~``b'' happened before any subsequent
	assignments, but it does nothing to enforce any ordering between
	the assignments to~``a'' and~``b'' themselves.
} \QuickQuizEnd
\subsubsection{What May Not Be Assumed About Memory Barriers?}
\label{sec:advsync:What May Not Be Assumed About Memory Barriers?}
There are certain things that memory barriers cannot guarantee outside
of the confines of a given architecture:
\begin{enumerate}
\item There is no guarantee that any of the memory accesses specified
before a memory barrier will be \emph{complete} by the completion
of a memory barrier instruction; the barrier can be considered
to draw a line in that CPU's access queue that accesses of the
appropriate type may not cross.
\item There is no guarantee that issuing a memory barrier on one CPU
will have any direct effect on another CPU or any other hardware
in the system. The indirect effect will be the order in which
the second CPU sees the effects of the first CPU's accesses occur,
but see the next point.
\item There is no guarantee that a CPU will see the correct order
of effects from a second CPU's accesses, even \emph{if} the second CPU
uses a memory barrier, unless the first CPU \emph{also} uses a matching
memory barrier (see
Section~\ref{sec:advsync:SMP Barrier Pairing}).
\item There is no guarantee that some intervening piece of off-the-CPU
hardware\footnote{
This is of concern primarily in operating-system kernels.
For more information on hardware operations and memory
ordering, see the files \path{pci.txt}, \path{DMA-API-HOWTO.txt},
and \path{DMA-API.txt} in the \path{Documentation} directory in
the Linux source tree~\cite{Torvalds2.6kernel}.}
will not reorder the memory accesses. CPU cache
coherency mechanisms should propagate the indirect effects of
a memory barrier between CPUs, but might not do so in order.
\end{enumerate}
\subsubsection{Data Dependency Barriers}
\label{sec:advsync:Data Dependency Barriers}
The usage requirements of data dependency barriers are a little subtle, and
it's not always obvious that they're needed. To illustrate, consider the
following sequence of events, with initial values
{\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
B = 4; & \\
<write barrier> & \\
P = \&B; & \\
& Q = P; \\
& D = *Q; \\
\end{tabular}
\end{minipage}
\vspace{5pt}
There's a clear data dependency here, and it would seem intuitively
obvious that by the end of the sequence, \co{Q}~must be either~\co{&A}
or~\co{&B}, and that:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{c@{ implies }c}
(Q == \&A) & (D == 1) \\
(Q == \&B) & (D == 4) \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Counter-intuitive though it might be, it is quite possible that
CPU~2's perception of~\co{P} might be updated \emph{before} its perception
of~\co{B}, thus leading to the following situation:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{c@{ and }c}
(Q == \&B) & (D == 2) ???? \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Whilst this may seem like a failure of coherency or causality maintenance, it
isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
Alpha).
To deal with this, a data dependency barrier must be inserted between the
address load and the data load (again with initial values of
{\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
B = 4; & \\
<write barrier> & \\
P = \&B; & \\
& Q = P; \\
& <data dependency barrier> \\
& D = *Q; \\
\end{tabular}
\end{minipage}
\vspace{5pt}
This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.
Note that this extremely counterintuitive situation arises most easily on
machines with split caches, so that, for example, one cache bank processes
even-numbered cache lines and the other bank processes odd-numbered cache
lines.
The pointer~\co{P} might be stored in an odd-numbered cache line, and the
variable~\co{B} might be stored in an even-numbered cache line. Then, if the
even-numbered bank of the reading CPU's cache is extremely busy while the
odd-numbered bank is idle, one can see the new value of the
pointer~\co{P} (which is~\co{&B}),
but the old value of the variable~\co{B} (which is~2).
Another example of where data dependency barriers might be required is where a
number is read from memory and then used to calculate the index for an array
access, with initial values
{\tt \{M[0]~=~1, M[1]~=~2, M[3]~=~3, P~=~0, Q~=~3\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
M[1] = 4; & \\
<write barrier> & \\
P = 1; & \\
& Q = P; \\
& <data dependency barrier> \\
& D = M[Q]; \\
\end{tabular}
\end{minipage}
\vspace{5pt}
The data dependency barrier is very important to the Linux kernel's
RCU system; see, for example,
\co{rcu_dereference()} in \path{include/linux/rcupdate.h}.
This permits the current
target of an RCU'd pointer to be replaced with a new modified target, without
the replacement target appearing to be incompletely initialized.
See also
Section~\ref{sec:advsync:Cache Coherency}
for a larger example.
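
The following is a minimal sketch of this publication pattern, assuming
the Linux kernel's \co{rcu_assign_pointer()} and \co{rcu_dereference()}
primitives (the structure, field, and function names are illustrative,
and error handling is omitted):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 struct foo {
  2   int a;
  3 };
  4 struct foo *gp;
  5
  6 void updater(void)
  7 {
  8   struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);
  9
 10   p->a = 1;                  /* Initialize before publication. */
 11   rcu_assign_pointer(gp, p); /* Supplies the write barrier. */
 12 }
 13
 14 void reader(void)
 15 {
 16   struct foo *p;
 17
 18   rcu_read_lock();
 19   p = rcu_dereference(gp);   /* Supplies any needed data */
 20   if (p)                     /*  dependency barrier. */
 21     do_something(p->a);
 22   rcu_read_unlock();
 23 }
\end{verbatim}
\end{minipage}
\vspace{5pt}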
\subsubsection{Control Dependencies}
\label{sec:advsync:Control Dependencies}
Control dependencies are especially tricky because current compilers
do not understand them.
The rules and examples in this section are intended to help you
prevent your compiler's ignorance from breaking your code.
A load-load control dependency requires a full read memory barrier,
not simply a data dependency barrier.
Consider the following bit of code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 <data dependency barrier>
4 q = READ_ONCE(y);
5 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This will not have the desired effect because there is no actual data
dependency, but rather a control dependency that the CPU may short-circuit
by attempting to predict the outcome in advance, so that other CPUs see
the load from~\co{y} as having happened before the load from~\co{x}.
In such a case what's actually required is:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 <read barrier>
4 q = READ_ONCE(y);
5 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
However, stores are not speculated.
This means that ordering \emph{is} provided for load-store control
dependencies, as in the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q)
3 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Control dependencies pair normally with other types of barriers.
That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()}
is optional!
Without the \co{READ_ONCE()}, the compiler might combine the load
from~\co{x} with other loads from~\co{x}.
Without the \co{WRITE_ONCE()}, the compiler might combine the store
to~\co{y} with other stores to~\co{y}.
Either can result in highly counterintuitive effects on ordering.
Worse yet, if the compiler is able to prove (say) that the value of
variable~\co{x} is always non-zero, it would be well within its rights
to optimize the original example by eliminating the ``\co{if}'' statement
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is tempting to try to enforce ordering on identical stores on both
branches of the ``\co{if}'' statement as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 barrier();
4 WRITE_ONCE(y, 1);
5 do_something();
6 } else {
7 barrier();
8 WRITE_ONCE(y, 1);
9 do_something_else();
10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Unfortunately, current compilers will transform this as follows at high
optimization levels:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 barrier();
3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */
4 if (q) {
5 do_something();
6 } else {
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Now there is no conditional between the load from~\co{x} and the store
to~\co{y}, which means that the CPU is within its rights to reorder them:
The conditional is absolutely required, and must be present in the
assembly code even after all compiler optimizations have been applied.
Therefore, if you need ordering in this example, you need explicit
memory barriers, for example, a release store:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 smp_store_release(&y, 1);
4 do_something();
5 } else {
6 smp_store_release(&y, 1);
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
The initial \co{READ_ONCE()} is still required to prevent the compiler from
proving the value of~\co{x}.
In addition, you need to be careful what you do with the local variable~%
\co{q},
otherwise the compiler might be able to guess the value and again remove
the needed conditional.
For example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q % MAX) {
3 WRITE_ONCE(y, 1);
4 do_something();
5 } else {
6 WRITE_ONCE(y, 2);
7 do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 2);
3 do_something_else();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Given this transformation, the CPU is not required to respect the ordering
between the load from variable~\co{x} and the store to variable~\co{y}.
It is tempting to add a \co{barrier()} to constrain the compiler,
but this does not help.
The conditional is gone, and the barrier won't bring it back.
Therefore, if you are relying on this ordering, you should make sure
that \co{MAX} is greater than one, perhaps as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 BUILD_BUG_ON(MAX <= 1);
3 if (q % MAX) {
4 WRITE_ONCE(y, 1);
5 do_something();
6 } else {
7 WRITE_ONCE(y, 2);
8 do_something_else();
9 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Please note once again that the stores to~\co{y} differ.
If they were identical, as noted earlier, the compiler could pull this
store outside of the ``\co{if}'' statement.
You must also avoid excessive reliance on boolean short-circuit evaluation.
Consider this example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q || 1 > 0)
3 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}
This example underscores the need to ensure that the compiler cannot
out-guess your code.
More generally, although \co{READ_ONCE()} does force
the compiler to actually emit code for a given load, it does not force
the compiler to use the results.
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do
not necessarily apply to code following the if-statement:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(x);
2 if (q) {
3 WRITE_ONCE(y, 1);
4 } else {
5 WRITE_ONCE(y, 2);
6 }
7 WRITE_ONCE(z, 1); /* BUG: No ordering. */
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is tempting to argue that there in fact is ordering because the
compiler cannot reorder volatile accesses and also cannot reorder
the writes to~\co{y} with the condition.
Unfortunately for this line
of reasoning, the compiler might compile the two writes to~\co{y} as
conditional-move instructions, as in this fanciful pseudo-assembly
language:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 ld r1,x
2 cmp r1,$0
3 cmov,ne r4,$1
4 cmov,eq r4,$2
5 st r4,y
6 st $1,z
\end{verbatim}
\end{minipage}
\vspace{5pt}
A weakly ordered CPU would have no dependency of any sort between the load
from~\co{x} and the store to~\co{z}.
The control dependencies would extend
only to the pair of cmov instructions and the store depending on them.
In short, control dependencies apply only to the stores in the ``\co{then}''
and ``\co{else}'' of the ``\co{if}'' in question (including functions
invoked by those two clauses), not to code following that ``\co{if}''.
Finally, control dependencies do \emph{not} provide transitivity.\footnote{
Refer to Section~\ref{sec:advsync:Transitivity} for
the meaning of transitivity.}
This is demonstrated by two related examples, with the initial values
of~\co{x} and~\co{y} both being zero:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 0} & \nf{CPU 1} \\
\hline
\tco{r1 = READ_ONCE(x);} &
\tco{r2 = READ_ONCE(y);} \\
if (r1 > 0) &
if (r2 > 0) \\
~~~\tco{WRITE_ONCE(y, 1);} &
~~~\tco{WRITE_ONCE(x, 1);} \\
\multicolumn{2}{l}{~} \\
\multicolumn{2}{l}{\tco{assert(!(r1 == 1 && r2 == 1));}} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
The above two-CPU example will never trigger the \co{assert()}.
However, if control dependencies guaranteed transitivity (which they do
not), then adding the following CPU would guarantee a related assertion:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l}
\nf{CPU 2} \\
\hline
\tco{WRITE_ONCE(y, 1);} \\
\multicolumn{1}{l}{~} \\
\multicolumn{1}{l}{\tco{assert(!(r1 == 1 && r2 == 1 && x == 1));}} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
But because control dependencies do \emph{not} provide transitivity, the above
assertion can fail after the combined three-CPU example completes.
If you need the three-CPU example to provide ordering, you will need
\co{smp_mb()} between the loads and stores in the CPU~0 and CPU~1 code
fragments, that is, just before or just after the ``\co{if}'' statements.
Furthermore, the original two-CPU example is very fragile and should be avoided.
The two-CPU example is known as LB (load buffering) and the three-CPU
example as WWC~\cite{Maranget2012TutorialARMPower}.
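
For example, here is a sketch of the CPU~0 fragment from the two-CPU
example with the needed \co{smp_mb()} added between the load and the
store (CPU~1 would be modified in the same way):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 r1 = READ_ONCE(x);
  2 smp_mb();    /* Full barrier: provides the ordering */
  3 if (r1 > 0)  /*  that the control dependency does not. */
  4   WRITE_ONCE(y, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}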
The following list of rules summarizes the lessons of this section:
\begin{enumerate}
\item Compilers do not understand control dependencies, so it is
your job to make sure that the compiler cannot break your code.
\item Control dependencies can order prior loads against later stores.
However, they do \emph{not} guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
later anything.
If you need these other forms of ordering, use \co{smp_rmb()},
\co{smp_wmb()}, or, in the case of prior stores and later loads,
\co{smp_mb()}.
\item If both legs of the ``\co{if}'' statement begin with identical stores
to the same variable, then those stores must be ordered,
either by preceding both of them with \co{smp_mb()} or by using
\co{smp_store_release()} to carry out the stores.
Please note that it is \emph{not} sufficient to use \co{barrier()}
at beginning of each leg of the ``\co{if}'' statement because, as shown
by the example above, optimizing compilers can destroy the control
dependency while respecting the letter of the \co{barrier()} law.
\item Control dependencies require at least one run-time conditional
between the prior load and the subsequent store, and this
conditional must involve the prior load.
If the compiler is able to optimize the conditional away, it
will have also optimized away the ordering.
Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help
to preserve the needed conditional.
\item Control dependencies require that the compiler avoid reordering
the dependency into nonexistence.
Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or
\co{atomic64_read()} can help to preserve your control
dependency.
\item Control dependencies apply only to the ``\co{then}'' and
``\co{else}'' of the ``\co{if}'' containing the control
dependency, including any functions that these two clauses call.
Control dependencies do \emph{not} apply to code following the
``\co{if}'' containing the control dependency.
\item Control dependencies pair normally with other types of barriers.
\item Control dependencies do \emph{not} provide transitivity.
If you need transitivity, use \co{smp_mb()}.
\end{enumerate}
\subsubsection{SMP Barrier Pairing}
\label{sec:advsync:SMP Barrier Pairing}
When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired. A lack of appropriate pairing is almost certainly an error.
A write barrier should always be paired with a data dependency barrier or read
barrier, though a general barrier would also be viable. Similarly a read
barrier or a data dependency barrier should always be paired with at least a
write barrier, though, again, a general barrier is viable:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
<write barrier> & \\
B = 2; & \\
& X = B; \\
& <read barrier> \\
& Y = A; \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Or:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
<write barrier> & \\
B = \&A; & \\
& X = B; \\
& <data dependency barrier> \\
& Y = *X; \\
\end{tabular}
\end{minipage}
\vspace{5pt}
One way or another, the read barrier must always be present, even though
it might be of a weaker type.\footnote{
By ``weaker'', we mean ``makes fewer ordering guarantees''.
A weaker barrier is usually also lower-overhead than is a
stronger barrier.}
Note that the stores before the write barrier would normally be expected to
match the loads after the read barrier or data dependency barrier, and vice
versa:
\begin{center}
\resizebox{3in}{!}{\includegraphics{advsync/MemoryBarrierPairing}}
\end{center}
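
For reference, here is a sketch of the first of the above pairings in C,
assuming the Linux kernel's \co{smp_wmb()} and \co{smp_rmb()} primitives
(\co{READ_ONCE()} and \co{WRITE_ONCE()} are added to keep the compiler
from interfering):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 int A, B, X, Y;
  2
  3 void cpu1(void)
  4 {
  5   WRITE_ONCE(A, 1);
  6   smp_wmb();  /* Pairs with the smp_rmb() in cpu2(). */
  7   WRITE_ONCE(B, 2);
  8 }
  9
 10 void cpu2(void)
 11 {
 12   X = READ_ONCE(B);
 13   smp_rmb();  /* Pairs with the smp_wmb() in cpu1(). */
 14   Y = READ_ONCE(A);
 15 }
\end{verbatim}
\end{minipage}
\vspace{5pt}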
\subsubsection{Examples of Memory Barrier Pairings}
\label{sec:advsync:Examples of Memory Barrier Pairings}
Firstly, write barriers act as partial orderings on store operations.
Consider the following sequence of events:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l}
STORE A = 1 \\
STORE B = 2 \\
STORE C = 3 \\
<write barrier> \\
STORE D = 4 \\
STORE E = 5 \\
\end{tabular}
\end{minipage}
\vspace{5pt}
This sequence of events is committed to the memory coherence system in an order
that the rest of the system might perceive as the unordered set of
{\tt \{A=1,B=2,C=3\}}
all occurring before the unordered set of
{\tt \{D=4,E=5\}}, as shown in
Figure~\ref{fig:advsync:Write Barrier Ordering Semantics}.
\begin{figure*}[htb]
\centering
\includegraphics{advsync/WriteBarrierOrdering}
\caption{Write Barrier Ordering Semantics}
\ContributedBy{Figure}{fig:advsync:Write Barrier Ordering Semantics}{David Howells}
\end{figure*}
Secondly, data dependency barriers act as partial orderings on data-dependent
loads. Consider the following sequence of events with initial values
{\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
B = 2; & \\
<write barrier> & \\
C = \&B; & LOAD X\\
D = 4; & LOAD C \nf{(gets \tco{&B})} \\
& LOAD *C \nf{(reads \tco{B})} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Without intervention, CPU~2 may perceive the events on CPU~1 in some
effectively random order, despite the write barrier issued by CPU~1, as
shown in Figure~\ref{fig:advsync:Data Dependency Barrier Omitted}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/DataDependencyNeeded}
\caption{Data Dependency Barrier Omitted}
\ContributedBy{Figure}{fig:advsync:Data Dependency Barrier Omitted}{David Howells}
\end{figure*}
In the above example, CPU~2 perceives that \co{B} is~7,
despite the load of~\co{*C}
(which would be~\co{B}) coming after the \co{LOAD} of~\co{C}.
If, however, a data dependency barrier were to be placed between the load
of~\co{C} and the load of~\co{*C} (i.e.:~\co{B}) on CPU~2, again with initial
values of {\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
B = 2; & \\
<write barrier> & \\
C = \&B; & LOAD X\\
D = 4; & LOAD C \nf{(gets \tco{&B})} \\
& <data dependency barrier> \\
& LOAD *C \nf{(reads \tco{B})} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
then ordering will be as intuitively expected, as shown in
Figure~\ref{fig:advsync:Data Dependency Barrier Supplied}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/DataDependencySupplied}
\caption{Data Dependency Barrier Supplied}
\ContributedBy{Figure}{fig:advsync:Data Dependency Barrier Supplied}{David Howells}
\end{figure*}
And thirdly, a read barrier acts as a partial order on loads. Consider the
following sequence of events, with initial values
{\tt \{A~=~0, B~=~9\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
<write barrier> & \\
B = 2; & \\
& LOAD B \\
& LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Without intervention, CPU~2 may then choose to perceive the events on CPU~1 in
some effectively random order, despite the write barrier issued by CPU~1, as
shown in Figure~\ref{fig:advsync:Read Barrier Needed}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/ReadBarrierNeeded}
\caption{Read Barrier Needed}
\ContributedBy{Figure}{fig:advsync:Read Barrier Needed}{David Howells}
\end{figure*}
If, however, a read barrier were to be placed between the load of~\co{B}
and the load of~\co{A} on CPU~2, again with initial values of
{\tt \{A~=~0, B~=~9\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
<write barrier> & \\
B = 2; & \\
& LOAD B \\
& <read barrier> \\
& LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
then the partial ordering imposed by CPU~1's write barrier will be
perceived correctly by CPU~2, as shown in
Figure~\ref{fig:advsync:Read Barrier Supplied}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/ReadBarrierSupplied}
\caption{Read Barrier Supplied}
\ContributedBy{Figure}{fig:advsync:Read Barrier Supplied}{David Howells}
\end{figure*}
To illustrate this more completely, consider what could happen if the code
contained a load of~\co{A} either side of the read barrier, once again
with the same initial values of
{\tt \{A~=~0, B~=~9\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = 1; & \\
<write barrier> & \\
B = 2; & \\
& LOAD B \\
& LOAD A \nf{(1\textsuperscript{st})} \\
& <read barrier> \\
& LOAD A \nf{(2\textsuperscript{nd})} \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Even though the two loads of~\co{A}
both occur after the load of~\co{B}, they may both
come up with different values, as shown in
Figure~\ref{fig:advsync:Read Barrier Supplied, Double Load}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/ReadBarrierSupplied1}
\caption{Read Barrier Supplied, Double Load}
\ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Double Load}{David Howells}
\end{figure*}
Of course, it may well be that CPU~1's update to~\co{A} becomes perceptible
to CPU~2 before the read barrier completes, as shown in
Figure~\ref{fig:advsync:Read Barrier Supplied, Take Two}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/ReadBarrierSupplied2}
\caption{Read Barrier Supplied, Take Two}
\ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Take Two}{David Howells}
\end{figure*}
The guarantee is that the second load will always come up with \nbco{A == 1}
if the
load of~\co{B} came up with \nbco{B == 2}.
No such guarantee exists for the first load
of~\co{A}; that may come up with either \nbco{A == 0} or \nbco{A == 1}.
\subsubsection{Read Memory Barriers vs. Load Speculation}
\label{sec:advsync:Read Memory Barriers vs. Load Speculation}
Many CPUs speculate with loads: that is, they see that they will need to
load an item from memory, and they find a time when they're not using
the bus for any other loads, and then do the load in advance---even though
they haven't actually got to that point in the instruction execution
flow yet.
Later on, this potentially permits the actual load instruction to
complete immediately because the CPU already has the value on hand.
It may turn out that the CPU didn't actually need the value (perhaps because a
branch circumvented the load) in which case it can discard the value or just
cache it for later use.
For example, consider the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
& LOAD B \\
& DIVIDE \\
& DIVIDE \\
& LOAD A \\
\end{tabular}
\end{minipage}
\vspace{5pt}
On some CPUs, divide instructions can take a long time to complete,
which means that CPU~2's bus might go idle during that time.
CPU~2 might therefore speculatively load~\co{A} before the divides
complete.
In the (hopefully) unlikely event of an exception from one of the divide instructions,
this speculative load will have been wasted, but in the (again, hopefully)
common case, overlapping the load with the divides will permit the load
to complete more quickly, as illustrated by
Figure~\ref{fig:advsync:Speculative Load}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/SpeculativeLoad}
\caption{Speculative Load}
\ContributedBy{Figure}{fig:advsync:Speculative Load}{David Howells}
\end{figure*}
Placing a read barrier or a data dependency barrier just before the second
load:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|p{1.5in}}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
& LOAD B \\
& DIVIDE \\
& DIVIDE \\
& <read barrier> \\
& LOAD A \\
\end{tabular}
\vspace{5pt}
\end{minipage}
%
will force any speculatively obtained value to be reconsidered, to an extent
that depends on the type of barrier used. If no change was made to the
speculated memory location, then the speculated value will simply be used,
as shown in
Figure~\ref{fig:advsync:Speculative Loads and Barrier}.
On the other hand, if there was an update or invalidation to~\co{A}
from some other CPU, then the speculation will be cancelled and the
value of~\co{A} will be reloaded,
as shown in Figure~\ref{fig:advsync:Speculative Loads Cancelled by Barrier}.
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/SpeculativeLoadBarrier}
\caption{Speculative Load and Barrier}
\ContributedBy{Figure}{fig:advsync:Speculative Loads and Barrier}{David Howells}
\end{figure*}
\begin{figure*}[htbp]
\centering
\includegraphics{advsync/SpeculativeLoadBarrierCancel}
\caption{Speculative Load Cancelled by Barrier}
\ContributedBy{Figure}{fig:advsync:Speculative Loads Cancelled by Barrier}{David Howells}
\end{figure*}
\subsubsection{Transitivity}
\label{sec:advsync:Transitivity}
``Transitivity'' is a deeply intuitive notion about ordering that is not
always provided by real computer systems. The following example
demonstrates transitivity with initial values of
{\tt \{X~=~0, Y~=~0\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l|l}
\nf{CPU 1} & \nf{CPU 2} & \nf{CPU 3} \\
\hline
STORE X = 1 & LOAD X & STORE Y = 1 \\
& <general barrier> & <general barrier> \\
& LOAD Y & LOAD X \\
\end{tabular}
\end{minipage}
\vspace{5pt}
Suppose that CPU~2's load from~\co{X} returns~1 and its load from~\co{Y} returns~0.
This indicates that CPU~2's load from~\co{X} in some sense follows CPU~1's
store to~\co{X} and that CPU~2's load from~\co{Y} in some sense preceded CPU~3's
store to~\co{Y}. The question is then ``Can CPU~3's load from~\co{X} return~0?''
Because CPU~2's load from~\co{X} in some sense came after CPU~1's store, it
is natural to expect that CPU~3's load from~\co{X} must therefore return~1.
This expectation is an example of transitivity: if a load executing on
CPU~A follows a load from the same variable executing on CPU~B, then
CPU~A's load must either return the same value that CPU~B's load did,
or must return some later value.
In the Linux kernel, use of general memory barriers guarantees
transitivity. Therefore, in the above example, if CPU~2's load from~\co{X}
returns~1 and its load from~\co{Y} returns~0, then CPU~3's load from~\co{X} must
also return~1.
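
For reference, here is a sketch of this three-CPU example in C, using
\co{smp_mb()} for the general barriers (\co{READ_ONCE()} and
\co{WRITE_ONCE()} are added to constrain the compiler, and the
\co{r1}--\co{r3} variables are illustrative):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 int x, y;
  2 int r1, r2, r3;
  3
  4 void cpu1(void)
  5 {
  6   WRITE_ONCE(x, 1);
  7 }
  8
  9 void cpu2(void)
 10 {
 11   r1 = READ_ONCE(x);
 12   smp_mb();  /* General barrier. */
 13   r2 = READ_ONCE(y);
 14 }
 15
 16 void cpu3(void)
 17 {
 18   WRITE_ONCE(y, 1);
 19   smp_mb();  /* General barrier. */
 20   r3 = READ_ONCE(x);
 21 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
With these general barriers in place, the following outcome is prohibited:
{\scriptsize
\begin{verbatim}
r1 == 1 && r2 == 0 && r3 == 0
\end{verbatim}
}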
However, transitivity is {\em not} guaranteed for read or write barriers.
For example, suppose that CPU~2's general barrier in the above example
is changed to a read barrier as shown below with the same initial values of
{\tt \{X~=~0, Y~=~0\}}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{tabular}{l|l|l}
\nf{CPU 1} & \nf{CPU 2} & \nf{CPU 3} \\
\hline
STORE X = 1 & LOAD X & STORE Y = 1 \\
& <read barrier> & <general barrier> \\
& LOAD Y & LOAD X \\
\end{tabular}
\end{minipage}
\vspace{5pt}
This substitution destroys transitivity: in this example, it is perfectly
legal for CPU~2's load from~\co{X} to return~1, its load from~\co{Y} to return~0,
and CPU~3's load from~\co{X} to return~0.
The key point is that although CPU~2's read barrier orders its pair
of loads, it does not guarantee to order CPU~1's store. Therefore, if
this example runs on a system where CPUs~1 and~2 share a store buffer
or a level of cache, CPU~2 might have early access to CPU~1's writes.
General barriers are therefore required to ensure that all CPUs agree
on the combined order of CPU~1's and CPU~2's accesses.
General barriers provide ``global transitivity'', so that all CPUs will
agree on the order of operations. In contrast, a chain of release-acquire
pairs provides only ``local transitivity'', so that only those CPUs on
the chain are guaranteed to agree on the combined order of the accesses.
For example, switching (in deference to Herman Hollerith) to the C~code shown in
Figure~\ref{fig:advsync:Example of Local Transitivity (Release-Acquire Pair)}:
\begin{figure}[htbp]
\scriptsize
\centering
\begin{verbbox}
1 int u, v, x, y, z;
2
3 void cpu0(void)
4 {
5 r0 = smp_load_acquire(&x);
6 WRITE_ONCE(u, 1);
7 smp_store_release(&y, 1);
8 }
9
10 void cpu1(void)
11 {
12 r1 = smp_load_acquire(&y);
13 r4 = READ_ONCE(v);
14 r5 = READ_ONCE(u);
15 smp_store_release(&z, 1);
16 }
17
18 void cpu2(void)
19 {
20 r2 = smp_load_acquire(&z);
21 smp_store_release(&x, 1);
22 }
23
24 void cpu3(void)
25 {
26 WRITE_ONCE(v, 1);
27 smp_mb();
28 r3 = READ_ONCE(u);
29 }
\end{verbbox}
\theverbbox
\caption{Example of Local Transitivity (Release-Acquire Pair)}
\label{fig:advsync:Example of Local Transitivity (Release-Acquire Pair)}
\end{figure}
Because \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} participate in a local transitive
chain of \co{smp_store_release()}-\co{smp_load_acquire()} pairs, the following
outcome is prohibited:
{\scriptsize
\begin{verbatim}
r0 == 1 && r1 == 1 && r2 == 1
\end{verbatim}
}
Furthermore, because of the release-acquire relationship between \co{cpu0()}
and \co{cpu1()}, \co{cpu1()} must see \co{cpu0()}'s writes, so that the following
outcome is prohibited:
{\scriptsize
\begin{verbatim}
r1 == 1 && r5 == 0
\end{verbatim}
}
However, the transitivity of release-acquire is local to the participating
CPUs and does not apply to \co{cpu3()}. Therefore, the following outcome
is possible:
{\scriptsize
\begin{verbatim}
r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
\end{verbatim}
}
As an aside, the following outcome is also possible:
{\scriptsize
\begin{verbatim}
r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
&& r5 == 1
\end{verbatim}
}
Although \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} will see their respective reads and
writes in order, CPUs not involved in the release-acquire chain might
well disagree on the order. This disagreement stems from the fact that
the weak memory-barrier instructions used to implement \co{smp_load_acquire()}
and \co{smp_store_release()} are not required to order prior stores against
subsequent loads in all cases. This means that \co{cpu3()} can see \co{cpu0()}'s
store to~\co{u} as happening {\em after} \co{cpu1()}'s load from~\co{v}, even though
both \co{cpu0()} and \co{cpu1()} agree that these two operations occurred in the
intended order.
However, please keep in mind that \co{smp_load_acquire()} is not magic.
In particular, it simply reads from its argument with ordering. It does
{\em not} ensure that any particular value will be read. Therefore, the
following outcome is possible:
{\scriptsize
\begin{verbatim}
r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
\end{verbatim}
}
Note that this outcome can happen even on a mythical sequentially
consistent system where nothing is ever reordered.
To reiterate, if your code requires global transitivity, use general
barriers throughout.
\subsection{Locking Constraints}
\label{sec:advsync:Locking Constraints}
As noted earlier, locking primitives contain implicit memory barriers.
These primitives imply the following:
\begin{enumerate}
\item ACQUIRE operation implication:
\begin{itemize}
\item Memory operations issued after an ACQUIRE will be completed
after the ACQUIRE operation has completed.
\item Memory operations issued before an ACQUIRE may be completed
after the ACQUIRE operation has completed.
\end{itemize}
\item RELEASE operation implication:
\begin{itemize}
\item Memory operations issued before a RELEASE will be
completed before the RELEASE operation has completed.
\item Memory operations issued after a RELEASE may be completed
before the RELEASE operation has completed.
\end{itemize}
\item ACQUIRE vs ACQUIRE implication:
\begin{itemize}
\item All ACQUIRE operations issued before another ACQUIRE operation
will be completed before that ACQUIRE operation.
\end{itemize}
\item ACQUIRE vs RELEASE implication:
\begin{itemize}
\item All ACQUIRE operations issued before a RELEASE operation
will be completed before the RELEASE operation.
\item All RELEASE operations issued before an ACQUIRE operation
will be completed before the ACQUIRE operation.
\end{itemize}
\item Failed conditional ACQUIRE implication:
\begin{itemize}
\item Certain variants of ACQUIRE operation may fail, either
due to being unable to get the lock immediately, or due
to receiving an unblocked signal or exception
whilst asleep waiting
		for the lock to become available. Failed ACQUIREs do not
		imply any sort of barrier, as illustrated by the sketch
		following this list.
\end{itemize}
\end{enumerate}
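
To illustrate the last of these implications, here is a sketch assuming
the Linux kernel's \co{spin_trylock()}, which is a conditional ACQUIRE
operation (the lock and variable names are illustrative):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 DEFINE_SPINLOCK(mylock);
  2 int flag;
  3
  4 void example(void)
  5 {
  6   int r1;
  7
  8   if (!spin_trylock(&mylock)) {
  9     r1 = READ_ONCE(flag); /* Failed ACQUIRE: no barrier, */
 10     return;               /*  so this load is unordered. */
 11   }
 12   r1 = READ_ONCE(flag);   /* Successful ACQUIRE: cannot */
 13   spin_unlock(&mylock);   /*  appear to precede the lock. */
 14 }
\end{verbatim}
\end{minipage}
\vspace{5pt}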
\subsection{Memory-Barrier Examples}
\label{sec:advsync:Memory-Barrier Examples}
\subsubsection{Locking Examples}
\paragraph{ACQUIRE Followed by RELEASE:}
An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier
because it is possible for an access preceding the ACQUIRE to happen after the
ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and the
two accesses can themselves then cross.
For example, the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 *A = a;
2 ACQUIRE
3 RELEASE
4 *B = b;
\end{verbatim}
\vspace{1pt}
\end{minipage}
%
might well execute in the following order:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
2 ACQUIRE
4 *B = b;
1 *A = a;
3 RELEASE
\end{verbatim}
\end{minipage}
\vspace{5pt}
Again, always remember that ACQUIRE and RELEASE are permitted to let preceding
operations and following operations ``bleed in'' to the critical section
respectively.
\QuickQuiz{}
What sequence of ACQUIRE-RELEASE operations \emph{would}
act as a full memory barrier?
\QuickQuizAnswer{
A series of two back-to-back ACQUIRE-RELEASE operations, or, somewhat
less conventionally, a RELEASE operation followed by an ACQUIRE
operation.
} \QuickQuizEnd
\QuickQuiz{}
What (if any) CPUs have memory-barrier instructions
from which these semi-permeable locking primitives might
be constructed?
\QuickQuizAnswer{
Itanium is one example.
The identification of any others is left as an
exercise for the reader.
} \QuickQuizEnd
\paragraph{Lock-Based Critical Sections:}
Although an ACQUIRE\-/RELEASE pair does not act as a full memory barrier,
these operations \emph{do} affect memory ordering.
Consider the following code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 *A = a;
2 *B = b;
3 ACQUIRE
4 *C = c;
5 *D = d;
6 RELEASE
7 *E = e;
8 *F = f;
\end{verbatim}
\end{minipage}
\vspace{5pt}
This could legitimately execute in the following order, where pairs
of operations on the same line indicate that the CPU executed those
operations concurrently:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
3 ACQUIRE
1 *A = a; *F = f;
7 *E = e;
4 *C = c; *D = d;
2 *B = b;
6 RELEASE
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{table}[htbp]
\scriptsize\centering
\begin{tabular}{r|l}
\# & Ordering: legitimate or not? \\
\hline
\hline
1 & \verb|*A; *B; ACQUIRE; *C; *D; RELEASE; *E; *F;| \\
\hline
2 & \verb|*A; {*B; ACQUIRE;} *C; *D; RELEASE; *E; *F;| \\
\hline
3 & \verb|{*F; *A;} *B; ACQUIRE; *C; *D; RELEASE; *E;| \\
\hline
4 & \verb|*A; *B; {ACQUIRE; *C;} *D; {RELEASE; *E;} *F;| \\
\hline
5 & \verb|*B; ACQUIRE; *C; *D; *A; RELEASE; *E; *F;| \\
\hline
6 & \verb|*A; *B; *C; ACQUIRE; *D; RELEASE; *E; *F;| \\
\hline
7 & \verb|*A; *B; ACQUIRE; *C; RELEASE; *D; *E; *F;| \\
\hline
8 & \verb|{*B; *A; ACQUIRE;} {*D; *C;} {RELEASE; *F; *E;}| \\
\hline
9 & \verb|*B; ACQUIRE; *C; *D; RELEASE; {*F; *A;} *E;| \\
\end{tabular}
\caption{Lock-Based Critical Sections}
\label{tab:advsync:Lock-Based Critical Sections}
\end{table}
\QuickQuiz{}
Given that operations grouped in curly braces are executed
concurrently, which of the rows of
Table~\ref{tab:advsync:Lock-Based Critical Sections}
	are legitimate reorderings of the assignments to variables~``A''
	through~``B'' and the ACQUIRE/RELEASE operations?
(The order in the code is {\tt *A, *B, ACQUIRE, *C, *D, RELEASE, *E, *F}.)
Why or why not?
\QuickQuizAnswer{
\begin{enumerate}
\item Legitimate, executed in order.
\item Legitimate, the lock acquisition was executed concurrently
with the last assignment preceding the critical section.
\item Illegitimate, the assignment to~``F'' must follow the ACQUIRE
operation.
\item Illegitimate, the ACQUIRE must complete before any operation in
the critical section. However, the RELEASE may legitimately
be executed concurrently with subsequent operations.
\item Legitimate, the assignment to~``A'' precedes the RELEASE,
as required, and all other operations are in order.
\item Illegitimate, the assignment to~``C'' must follow the ACQUIRE.
\item Illegitimate, the assignment to~``D'' must precede the RELEASE.
\item Legitimate, all assignments are ordered with respect to the
ACQUIRE and RELEASE operations.
\item Illegitimate, the assignment to~``A'' must precede the RELEASE.
\end{enumerate}
} \QuickQuizEnd
\paragraph{Ordering with Multiple Locks:}
Code containing multiple locks still sees ordering constraints from
those locks, but one must be careful to keep track of which lock is which.
For example, consider the code shown in
Table~\ref{tab:advsync:Ordering With Multiple Locks}, which uses
a pair of locks named~``M'' and~``Q''.
\begin{table}[htbp]
\scriptsize\centering{\tt
\begin{tabular}{r|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = a; & E = e; \\
ACQUIRE M; & ACQUIRE Q; \\
B = b; & F = f; \\
C = c; & G = g; \\
RELEASE M; & RELEASE Q; \\
D = d; & H = h; \\
\end{tabular}}
\caption{Ordering With Multiple Locks}
\label{tab:advsync:Ordering With Multiple Locks}
\end{table}
In this example, there are no guarantees as to what order the
assignments to variables~``A'' through~``H'' will appear in, other
than the constraints imposed by the locks themselves, as
described in the previous section.
\QuickQuiz{}
What are the constraints for
Table~\ref{tab:advsync:Ordering With Multiple Locks}?
\QuickQuizAnswer{
All CPUs must see the following ordering constraints:
\begin{enumerate}
\item ACQUIRE M precedes B, C, and D.
\item RELEASE M follows A, B, and C.
\item ACQUIRE Q precedes F, G, and H.
\item RELEASE Q follows E, F, and G.
\end{enumerate}
} \QuickQuizEnd
\paragraph{Ordering with Multiple CPUs on One Lock:}
Suppose, instead of the two different locks as shown in
Table~\ref{tab:advsync:Ordering With Multiple Locks}, both CPUs acquire
the same lock, as shown in
Table~\ref{tab:advsync:Ordering With Multiple CPUs on One Lock}?
\begin{table}[htbp]
\scriptsize\centering{\tt
\begin{tabular}{r|l}
\nf{CPU 1} & \nf{CPU 2} \\
\hline
A = a; & E = e; \\
ACQUIRE M; & ACQUIRE M; \\
B = b; & F = f; \\
C = c; & G = g; \\
RELEASE M; & RELEASE M; \\
D = d; & H = h; \\
\end{tabular}}
\caption{Ordering With Multiple CPUs on One Lock}
\label{tab:advsync:Ordering With Multiple CPUs on One Lock}
\end{table}
In this case, either CPU~1 acquires~M before CPU~2 does, or vice versa.
In the first case, the assignments to~A, B, and~C must precede
those to~F, G, and~H.
On the other hand, if CPU~2 acquires the lock first, then the
assignments to~E, F, and~G must precede those to~B, C, and~D.
\subsection{The Effects of the CPU Cache}
\label{sec:advsync:The Effects of the CPU Cache}
The perceived ordering of memory operations is affected by the caches
that lie between the CPUs and memory, as well as by the cache coherence
protocol that maintains memory consistency and ordering.
From a software viewpoint, these caches are for all intents and purposes
part of memory.
Memory barriers can be thought of as acting on the vertical dotted line in
Figure~\ref{fig:advsync:Memory Architecture}, ensuring that the CPU
presents its values to memory in the proper order, as well as ensuring
that it sees changes made by other CPUs in the proper order.
\begin{figure*}[htb]
\centering
\includegraphics{advsync/MemoryArchitecture}
\caption{Memory Architecture}
\ContributedBy{Figure}{fig:advsync:Memory Architecture}{David Howells}
\end{figure*}
Although the caches can ``hide'' a given CPU's memory accesses from the rest of
the system, the cache-coherence protocol ensures that all other CPUs see
any effects of these hidden accesses, migrating and invalidating cachelines
as required.
Furthermore, the CPU core may execute instructions in any order, restricted
only by the requirement that program causality and memory ordering
appear to be maintained.
Some of these instructions may generate memory accesses that must be queued
in the CPU's memory access queue, but execution may nonetheless continue
until the CPU either fills up its internal resources or until it must
wait for some queued memory access to complete.
\subsubsection{Cache Coherency}
\label{sec:advsync:Cache Coherency}
Although cache-coherence protocols guarantee that a given CPU sees its
own accesses in order, and that all CPUs agree on the order of modifications
to a single variable contained within a single cache line, there is no
guarantee that modifications to different variables will be seen in
the same order by all CPUs---although some computer systems do make
some such guarantees, portable software cannot rely on them.
\begin{figure*}[htb]
\centering
\includegraphics{advsync/SplitCache}
\caption{Split Caches}
\ContributedBy{Figure}{fig:advsync:Split Caches}{David Howells}
\end{figure*}
To see why reordering can occur, consider the two-CPU system shown in
Figure~\ref{fig:advsync:Split Caches}, in which each CPU has a split
cache.
This system has the following properties:
\begin{enumerate}
\item An odd-numbered cache line may be in cache~A, cache~C,
in memory, or some combination of the above.
\item An even-numbered cache line may be in cache~B, cache~D,
in memory, or some combination of the above.
\item While the CPU core is interrogating one of its caches,\footnote{
But note that in ``superscalar'' systems, the CPU
might well be accessing both halves of its cache at
once, and might in fact be performing multiple concurrent
accesses to each of the halves.}
its other cache is not necessarily quiescent.
This other cache may instead be responding to an invalidation
request, writing back a dirty cache line,
processing elements in the CPU's memory-access queue, and
so on.
\item Each cache has queues of operations that need to be applied
to that cache in order to maintain the required coherence
and ordering properties.
\item These queues are not necessarily flushed by loads from or
stores to cache lines affected by entries in those queues.
\end{enumerate}
In short, if cache~A is busy, but cache~B is idle, then CPU~1's
stores to odd-numbered cache lines may be delayed compared to
CPU~2's stores to even-numbered cache lines.
In not-so-extreme cases, CPU~2 may see CPU~1's operations out
of order.
\IfInBook{Much more detail on memory ordering in hardware and software
may be found in Appendix~\ref{chp:app:whymb:Why Memory Barriers?}.
}
{}
\subsection{Where Are Memory Barriers Needed?}
\label{sec:advsync:Where Are Memory Barriers Needed?}
Memory barriers are only required where there's a possibility of interaction
between two CPUs or between a CPU and a device. If it can be guaranteed that
there won't be any such interaction in any particular piece of code, then
memory barriers are unnecessary in that piece of code.
Note that these are the \emph{minimum} guarantees.
Different architectures may give
more substantial guarantees,
\IfInBook{as discussed in Appendix~\ref{chp:app:whymb:Why Memory Barriers?},}{}
but they may \emph{not}
be relied upon outside of code specifically designed to run only on
the corresponding architecture.
However, primitives that implement atomic operations, such as locking
primitives and atomic data-structure manipulation and traversal primitives,
will normally include any needed memory barriers in their definitions.
Nevertheless, there are some exceptions, such as \co{atomic_inc()} in the
Linux kernel, so be sure to review the documentation, and, if possible,
the actual implementations, for your software environment.
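
As a sketch of one way to supply such ordering in the Linux kernel, the
\co{smp_mb__before_atomic()} and \co{smp_mb__after_atomic()} primitives
can be placed around non-value-returning atomics such as \co{atomic_inc()}
(the variable names below are illustrative):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
  1 atomic_t refcount = ATOMIC_INIT(0);
  2 int status;
  3
  4 void example(void)
  5 {
  6   WRITE_ONCE(status, 1);
  7   smp_mb__before_atomic(); /* Order the store to status */
  8   atomic_inc(&refcount);   /*  before the atomic_inc(), */
  9                            /*  which itself implies no */
 10                            /*  memory barrier. */
 11 }
\end{verbatim}
\end{minipage}
\vspace{5pt}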
One final word of advice: use of raw memory-barrier primitives should
be a last resort.
It is almost always better to use an existing primitive that takes
care of memory barriers.