| % advsync/memorybarriers.tex |
| |
| \section{Memory Barriers} |
| \label{sec:advsync:Memory Barriers} |
| \OriginallyPublished{Section}{sec:advsync:Memory Barriers}{Linux Kernel Memory Barriers}{kernel.org}{Howells2009membartxt} |
| |
| Causality and sequencing are deeply intuitive, and hackers often |
| tend to have a much stronger grasp of these concepts than does |
| the general population. |
| These intuitions can be extremely powerful tools when writing, analyzing, |
| and debugging both sequential code and parallel code that makes |
| use of standard mutual-exclusion mechanisms, such as locking and |
| RCU. |
| |
| \begin{table} |
| \centering{\tt |
| \begin{tabular}{l|l} |
| \nf{Thread 1} & \nf{Thread 2} \\ |
| \hline |
| x = 1; & y = 1; \\ |
| r1 = y; & r2 = x; \\ |
| \hline |
| \multicolumn{2}{l}{assert(r1 == 1 || r2 == 1);} \\ |
| \end{tabular}} |
| \caption{Memory Misordering: Dekker} |
| \label{tab:advsync:Memory Misordering: Dekker} |
| \end{table} |
| |
| Unfortunately, these intuitions break down completely in the face of |
| code that makes direct use of explicit memory barriers for data structures |
| in shared memory. |
| For example, the litmus test in |
| Table~\ref{tab:advsync:Memory Misordering: Dekker} |
| appears to guarantee that the assertion never fires. |
| After all, if \nbco{r1 != 1}, we might hope that Thread~1's load from~\co{y} |
| must have happened before Thread~2's store to~\co{y}, which might raise |
| further hopes that Thread~2's load from~\co{x} must happen after |
| Thread~1's store to~\co{x}, so that \nbco{r2 == 1}, as required by the |
| assertion. |
| The example is symmetric, so similar hopeful reasoning might lead |
| us to hope that \nbco{r2 != 1} guarantees that \nbco{r1 == 1}. |
| Unfortunately, the lack of memory barriers in |
| Table~\ref{tab:advsync:Memory Misordering: Dekker} |
| dashes these hopes. |
| Both the compiler and the CPU are within their rights to reorder |
| the statements within both Thread~1 and Thread~2, even on relatively |
| strongly ordered systems such as x86. |
| |
| The following sections show exactly where this intuition breaks down, |
| and then put forward a mental model of memory barriers that can help |
| you avoid these pitfalls. |
| |
| Section~\ref{sec:advsync:Memory Ordering and Memory Barriers} |
| gives a brief overview of memory ordering and memory barriers. |
| Once this background is in place, the next step is to get you to admit |
| that your intuition has a problem. |
| This painful task is taken up by |
| Section~\ref{sec:advsync:If B Follows A, and C Follows B, Why Doesn't C Follow A?}, |
| which shows an intuitively correct code fragment that fails miserably |
| on real hardware, and by |
| Section~\ref{sec:advsync:Variables Can Have More Than One Value}, |
| which presents some code demonstrating that scalar variables can |
| take on multiple values simultaneously. |
| Once your intuition has made it through the grieving process, |
| Section~\ref{sec:advsync:What Can You Trust?} |
| provides the basic rules that memory barriers follow, rules that we |
| will build upon. |
| These rules are further refined in |
| Sections~\ref{sec:advsync:Review of Locking Implementations} |
| through~\ref{sec:advsync:Where Are Memory Barriers Needed?}. |
| |
| \subsection{Memory Ordering and Memory Barriers} |
| \label{sec:advsync:Memory Ordering and Memory Barriers} |
| |
| But why are memory barriers needed in the first place? |
| Can't CPUs keep track of ordering on their own? |
| Isn't that why we have computers in the first place, to keep track of things? |
| |
| Many people do indeed expect their computers to keep track of things, |
| but many also insist that they keep track of things quickly. |
| One difficulty that modern computer-system vendors face is that |
| the main memory cannot keep up with the CPU---modern CPUs can execute |
| hundreds of instructions in the time required to fetch a single variable |
| from memory. |
| CPUs therefore sport increasingly large caches, as shown in |
| Figure~\ref{fig:advsync:Modern Computer System Cache Structure}. |
| Variables that are heavily used by a given CPU will tend to remain |
| in that CPU's cache, allowing high-speed access to the corresponding |
| data. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{appendix/whymb/cacheSC}} |
| \caption{Modern Computer System Cache Structure} |
| \label{fig:advsync:Modern Computer System Cache Structure} |
| \end{figure} |
| |
| Unfortunately, when a CPU accesses data that is not yet in its cache, |
| the result is an expensive ``cache miss'', requiring the data to |
| be fetched from main memory. |
| Doubly unfortunately, running typical code results in a significant |
| number of cache misses. |
| To limit the resulting performance degradation, CPUs have been designed to |
| execute other instructions and memory references while waiting for |
| a cache miss to fetch data from memory. |
| This clearly causes instructions and memory references to execute out |
| of order, which could cause serious confusion, as illustrated in |
| Figure~\ref{fig:advsync:CPUs Can Do Things Out of Order}. |
| Compilers and synchronization primitives (such as locking and RCU) |
| are responsible for maintaining the illusion of ordering through use of |
| ``memory barriers'' (for example, \co{smp_mb()} in the Linux kernel). |
| These memory barriers can be explicit instructions, as they are on |
| ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions, |
| as they are on x86. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}} |
| \caption{CPUs Can Do Things Out of Order} |
| \ContributedBy{Figure}{fig:advsync:CPUs Can Do Things Out of Order}{Melissa Broussard} |
| \end{figure} |
| |
| Since the standard synchronization primitives preserve the illusion of |
| ordering, your path of least resistance is to stop reading |
| this section and simply use these primitives. |
| |
| However, if you need to implement the synchronization primitives |
| themselves, or if you are simply interested in understanding how memory |
| ordering and memory barriers work, read on! |
| |
| The next sections present counter-intuitive scenarios that you might |
| encounter when using explicit memory barriers. |
| |
| \subsection{If B Follows A, and C Follows B, Why Doesn't C Follow A?} |
| \label{sec:advsync:If B Follows A, and C Follows B, Why Doesn't C Follow A?} |
| |
| Memory ordering and memory barriers can be extremely counter-intuitive. |
| For example, consider the functions shown in |
| Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} |
| executing in parallel |
| where variables~A, B, and~C are initially zero: |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 thread0(void) |
| 2 { |
| 3 A = 1; |
| 4 smp_wmb(); |
| 5 B = 1; |
| 6 } |
| 7 |
| 8 thread1(void) |
| 9 { |
| 10 while (B != 1) |
| 11 continue; |
| 12 barrier(); |
| 13 C = 1; |
| 14 } |
| 15 |
| 16 thread2(void) |
| 17 { |
| 18 while (C != 1) |
| 19 continue; |
| 20 barrier(); |
| 21 assert(A != 0); |
| 22 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Parallel Hardware is Non-Causal} |
| \label{fig:advsync:Parallel Hardware is Non-Causal} |
| \end{figure} |
| |
| Intuitively, \co{thread0()} assigns to~B after it assigns to~A, |
| \co{thread1()} waits until \co{thread0()} has assigned to~B before |
| assigning to~C, and \co{thread2()} waits until \co{thread1()} has |
| assigned to~C before referencing~A. |
| Therefore, again intuitively, the assertion on line~21 cannot possibly |
| fire. |
| |
| This line of reasoning, intuitively obvious though it may be, is completely |
| and utterly incorrect. |
| Please note that this is \emph{not} a theoretical assertion: |
| actually running this code on real-world weakly-ordered hardware |
| (a 1.5GHz 16-CPU POWER5 system) resulted in the assertion firing |
| 16~times out of 10~million runs. |
| Clearly, anyone who produces code with explicit memory barriers |
| should do some extreme testing---although a proof of correctness might |
| be helpful, the strongly counter-intuitive nature of the behavior of |
| memory barriers should in turn strongly limit one's trust in such proofs. |
| The requirement for extreme testing should not be taken lightly, given |
| that a number of dirty hardware-dependent tricks were used to |
| greatly \emph{increase} the probability of failure in this run. |
| |
| \QuickQuiz{} |
| How on earth could the assertion on line~21 of the code in |
| Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on |
| page~\pageref{fig:advsync:Parallel Hardware is Non-Causal} |
| \emph{possibly} fail? |
| \QuickQuizAnswer{ |
| The key point that the intuitive analysis missed is that |
| there is nothing preventing the assignment to~C from overtaking |
| the assignment to~A as both race to reach \co{thread2()}. |
| This is explained in the remainder of this section. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Great \ldots So how do I fix it? |
| \QuickQuizAnswer{ |
| The easiest fix is to replace each of the \co{barrier()}s on |
| line~12 and line~20 with an \co{smp_mb()}. |
| |
| Of course, some hardware is more forgiving than other hardware. |
| For example, on x86 the assertion on line~21 of |
| Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on |
| page~\pageref{fig:advsync:Parallel Hardware is Non-Causal} |
| cannot trigger. |
| On PowerPC, only the \co{barrier()} on line~20 need be |
| replaced with \co{smp_mb()} to prevent the assertion from |
| triggering. |
| } \QuickQuizEnd |
| |
| So what should you do? |
| Your best strategy, if possible, is to use existing primitives that |
| incorporate any needed memory barriers, so that you can simply ignore |
| the rest of this chapter. |
| |
| Of course, if you are implementing synchronization primitives, |
| you don't have this luxury. |
| The following discussion of memory ordering and memory barriers |
| is for you. |
| |
| \subsection{Variables Can Have More Than One Value} |
| \label{sec:advsync:Variables Can Have More Than One Value} |
| |
| It is natural to think of a variable as taking on a well-defined |
| sequence of values in a well-defined, global order. |
| Unfortunately, it is time to say ``goodbye'' to this sort of comforting |
| fiction. |
| |
| To see this, consider the program fragment shown in |
| Figure~\ref{fig:advsync:Software Logic Analyzer}. |
| This code fragment is executed in parallel by several CPUs. |
| Line~1 sets a shared variable to the current CPU's ID, line~2 |
| initializes several variables from a \co{gettb()} function that |
| delivers the value of a fine-grained hardware ``timebase'' counter that is |
| synchronized among all CPUs (not available from all CPU architectures, |
| unfortunately!), and the loop from lines~3--8 records the length of |
| time that the variable retains the value that this CPU assigned to it. |
| Of course, one of the CPUs will ``win'', and would thus never exit |
| the loop if not for the check on lines~6--7. |
| |
| \QuickQuiz{} |
| What assumption is the code fragment |
| in Figure~\ref{fig:advsync:Software Logic Analyzer} |
| making that might not be valid on real hardware? |
| \QuickQuizAnswer{ |
| The code assumes that as soon as a given CPU stops |
| seeing its own value, it will immediately see the |
| final agreed-upon value. |
| On real hardware, some of the CPUs might well see several |
| intermediate results before converging on the final value. |
| } \QuickQuizEnd |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 state.variable = mycpu; |
| 2 lasttb = oldtb = firsttb = gettb(); |
| 3 while (state.variable == mycpu) { |
| 4 lasttb = oldtb; |
| 5 oldtb = gettb(); |
| 6 if (lasttb - firsttb > 1000) |
| 7 break; |
| 8 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Software Logic Analyzer} |
| \label{fig:advsync:Software Logic Analyzer} |
| \end{figure} |
| |
| Upon exit from the loop, \co{firsttb} will hold a timestamp |
| taken shortly after the assignment and \co{lasttb} will hold |
| a timestamp taken before the last sampling of the shared variable |
| that still retained the assigned value, or a value equal to \co{firsttb} |
| if the shared variable had changed before entry into the loop. |
| This allows us to plot each CPU's view of the value of \co{state.variable} |
| over a 532-nanosecond time period, as shown in |
| Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}. |
| This data was collected in 2006 on a 1.5GHz POWER5 system with 8 cores, |
| each containing a pair of hardware threads. |
| CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test. |
| The timebase counter period was about 5.32ns, sufficiently fine-grained |
| to allow observations of intermediate cache states. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/MoreThanOneValue}} |
| \caption{A Variable With Multiple Simultaneous Values} |
| \label{fig:advsync:A Variable With Multiple Simultaneous Values} |
| \end{figure} |
| |
| Each horizontal bar represents the observations of a given CPU over time, |
| with the black regions to the left indicating the time before the |
| corresponding CPU's first measurement. |
| During the first 5ns, only CPU~3 has an opinion about the value of the |
| variable. |
| During the next 10ns, CPUs~2 and~3 disagree on the value of the variable, |
| but thereafter agree that the value is~``2'', which is in fact |
| the final agreed-upon value. |
| However, CPU~1 believes that the value is~``1'' for almost 300ns, and |
| CPU~4 believes that the value is~``4'' for almost 500ns. |
| |
| \QuickQuiz{} |
| How could CPUs possibly have different views of the |
| value of a single variable \emph{at the same time?} |
| \QuickQuizAnswer{ |
| Many CPUs have write buffers that record the values of |
| recent writes, which are applied once the corresponding |
| cache line makes its way to the CPU. |
| Therefore, it is quite possible for each CPU to see a |
| different value for a given variable at a single point |
| in time---and for main memory to hold yet another value. |
| One of the reasons that memory barriers were invented was |
| to allow software to deal gracefully with situations like |
| this one. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why do CPUs~2 and~3 come to agreement so quickly, when it |
| takes so long for CPUs~1 and~4 to come to the party? |
| \QuickQuizAnswer{ |
| CPUs~2 and~3 are a pair of hardware threads on the same |
| core, sharing the same cache hierarchy, and therefore have |
| very low communications latencies. |
| This is a NUMA, or, more accurately, a NUCA effect. |
| |
| This leads to the question of why CPUs~2 and~3 ever disagree |
| at all. |
| One possible reason is that they each might have a small amount |
| of private cache in addition to a larger shared cache. |
| Another possible reason is instruction reordering, given the |
| short 10-nanosecond duration of the disagreement and the |
| total lack of memory barriers in the code fragment. |
| } \QuickQuizEnd |
| |
| And if you think that the situation with four CPUs was intriguing, consider |
| Figure~\ref{fig:advsync:A Variable With More Simultaneous Values}, |
| which shows the same situation, but with 15~CPUs each assigning their |
| number to a single shared variable at time~$t=0$. Both diagrams in the |
| figure are drawn in the same way as |
| Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}. |
| The only difference is that the unit of the horizontal axis is timebase ticks, |
| with each tick lasting about 5.3~nanoseconds. |
| The entire sequence therefore lasts a bit longer than the events recorded in |
| Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}, |
| consistent with the increase in number of CPUs. |
| The upper diagram shows the overall picture, while the lower one shows |
| a close-up of the first 50~timebase ticks. |
| |
| Again, CPU~0 coordinates the test, so does not record any values. |
| |
| \begin{figure*} |
| \centering |
| \resizebox{5in}{!}{\includegraphics{advsync/MoreThanOneValue-15CPU}} |
| \caption{A Variable With More Simultaneous Values} |
| \ContributedBy{Figure}{fig:advsync:A Variable With More Simultaneous Values}{Akira Yokosawa} |
| \end{figure*} |
| |
| All CPUs eventually agree on the final value of~9, but not before |
| the values~15 and~12 take early leads. |
| Note that there are fourteen different opinions on the variable's value |
| at time~21, as indicated by the vertical line in the lower diagram. |
| Note also that all CPUs see sequences whose orderings are consistent with |
| the directed graph shown in |
| Figure~\ref{fig:advsync:Possible Global Orders With More Simultaneous Values}. |
| Nevertheless, both figures underscore the importance of |
| proper use of memory barriers for code that cares about memory ordering. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{advsync/store15tred}} |
| \caption{Possible Global Orders With More Simultaneous Values} |
| \label{fig:advsync:Possible Global Orders With More Simultaneous Values} |
| \end{figure} |
| |
| We have entered a regime where we must bid a fond farewell to |
| comfortable intuitions about values of variables and the passage of time. |
| This is the regime where memory barriers are needed. |
| |
| All that aside, it is important to remember the lessons from |
| Chapters~\ref{chp:Hardware and its Habits} |
| and~\ref{cha:Partitioning and Synchronization Design}. |
| Having all CPUs write concurrently to the same variable |
| is absolutely no way to design a parallel program, at least |
| not if performance and scalability are at all important to you. |
| |
| \subsection{What Can You Trust?} |
| \label{sec:advsync:What Can You Trust?} |
| |
| You most definitely cannot trust your intuition. |
| |
| What \emph{can} you trust? |
| |
| It turns out that there are a few reasonably simple rules that |
| allow you to make good use of memory barriers. |
| This section derives those rules, for those who wish to get |
| to the bottom of the memory-barrier story, at least from the viewpoint |
| of portable code. |
| If you just want to be told what the rules are rather than suffering |
| through the actual derivation, |
| please feel free to skip to Section~\ref{sec:advsync:A Few Simple Rules}. |
| |
| The exact semantics of memory barriers vary wildly from one CPU family to |
| another, so portable code must rely only on the least-common-denominator |
| semantics of memory barriers. |
| |
| Fortunately, all CPU families impose the following rules: |
| \begin{enumerate} |
| \item All accesses by a given CPU will appear to that CPU to have |
| occurred in program order. |
| \item All CPUs' accesses to a single variable will be consistent with |
| some global ordering of stores to that variable. |
| \item Memory barriers will operate in a pair-wise fashion. |
| \item Operations will be provided from which exclusive locking |
| primitives may be constructed. |
| \end{enumerate} |
| |
| Therefore, if you need to use memory barriers in portable code, |
| you can rely on all of these properties.\footnote{ |
| Or, better yet, you can avoid explicit use of memory barriers |
| entirely. |
| But that would be the subject of other sections.} |
| Each of these properties is described in the following sections. |
| |
| \subsubsection{Self-References Are Ordered} |
| |
| A given CPU will see its own accesses as occurring in ``program order'', |
| as if the CPU was executing only one instruction at a time with no |
| reordering or speculation. |
| For older CPU families, this restriction is necessary for binary compatibility, |
| and only secondarily for the sanity of us software types. |
| There have been a few CPU families that violate this rule to a limited extent, |
| but in those cases, the compiler has been responsible |
| for ensuring that ordering is explicitly enforced as needed. |
| |
| Either way, from the programmer's viewpoint, the CPU sees its own accesses |
| in program order. |
| |
| \subsubsection{Single-Variable Memory Consistency} |
| \label{sec:advsync:Single-Variable Memory Consistency} |
| |
| Because current commercially available computer systems provide |
| \emph{cache coherence}, |
| if a group of CPUs all do concurrent non-atomic stores to a single variable, |
| the series of values seen by all CPUs will be consistent with at |
| least one global ordering. |
| For example, in the series of accesses shown in |
| Figure~\ref{fig:advsync:A Variable With Multiple Simultaneous Values}, |
| CPU~1 sees the sequence {\tt \{1,2\}}, |
| CPU~2 sees the sequence {\tt \{2\}}, |
| CPU~3 sees the sequence {\tt \{3,2\}}, |
| and |
| CPU~4 sees the sequence {\tt \{4,2\}}. |
| This is consistent with the global sequence {\tt \{3,1,4,2\}}, |
| but also with all five of the other sequences of these four numbers that end |
| in~``2''. |
| Thus, there will be agreement on the sequence of values taken on |
| by a single variable, but there might be ambiguity. |
| |
| In contrast, had the CPUs used atomic operations (such as the Linux kernel's |
| \co{atomic_inc_return()} primitive) rather than simple stores of |
| unique values, their observations would |
| be guaranteed to determine a single globally consistent sequence of values. |
| One of the \co{atomic_inc_return()} invocations would happen first, |
| and would change the value from~0 to~1, the second from~1 to~2, and |
| so on. |
| The CPUs could compare notes afterwards and come to agreement on the |
| exact ordering of the sequence of \co{atomic_inc_return()} invocations. |
| This does not work for the non-atomic stores described earlier because |
| the non-atomic stores do not return any indication of the earlier value, |
| hence the possibility of ambiguity. |
| |
| Please note well that this section applies \emph{only} when all |
| CPUs' accesses are to one single variable. |
| In this single-variable case, cache coherence guarantees the |
| global ordering, at least assuming that some of the more aggressive |
| compiler optimizations are disabled via the Linux kernel's |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} directives or C++11's relaxed |
| atomics~\cite{PeteBecker2011N3242}. |
| In contrast, if there are multiple variables, memory barriers are |
| required for the CPUs to consistently agree on the order for current |
| commercially available computer systems. |
| |
| \subsubsection{Pair-Wise Memory Barriers} |
| |
| Pair-wise memory barriers provide conditional ordering semantics. |
| For example, in the following set of operations, CPU~1's access to |
| A does not unconditionally precede its access to~B from the viewpoint |
| of an external logic analyzer |
| \IfInBook{(see Appendix~\ref{chp:app:whymb:Why Memory Barriers?} |
| for examples). |
| } |
| {(the system is only to act \emph{as if} the accesses |
| are in order; it is not necessarily required to actually |
| force them to be in order).} |
| However, if CPU~2's access to~B sees the result of CPU~1's access |
| to~B, then CPU~2's access to~A is guaranteed to see the result of |
| CPU~1's access to~A. |
| Although some CPU families' memory barriers do in fact provide stronger, |
| unconditional ordering guarantees, portable code may rely only |
| on this weaker if-then conditional ordering guarantee. |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| <access> A & <access> B \\ |
| <memory barrier>& <memory barrier> \\ |
| <access> B & <access> A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| But if the memory barriers do not unconditionally force |
| ordering, how the heck can a device driver reliably execute |
| sequences of loads and stores to MMIO registers? |
| \QuickQuizAnswer{ |
| MMIO registers are a special case because they appear |
| in uncached regions of physical memory. |
| Memory barriers \emph{do} unconditionally force ordering |
| of loads and stores to uncached memory, as discussed in |
| Section~\ref{sec:advsync:Device Operations}. |
| } \QuickQuizEnd |
| |
| Of course, accesses must be either loads or stores, and these |
| do have different properties. |
| Table~\ref{tab:advsync:Memory-Barrier Combinations} |
| shows all possible combinations of loads and stores from a pair |
| of CPUs. |
| Of course, to enforce conditional ordering, there must be |
| a memory barrier between each CPU's pair of operations. |
| |
| \begin{table*} |
| \small |
| \centering |
| \begin{tabular}{r||c|c||c|c||l} |
| \multicolumn{1}{c||}{} & \multicolumn{2}{c||}{CPU 1} & |
| \multicolumn{2}{c||}{CPU 2} & Description \\ |
| \hline |
| \hline |
| 0 & load(A) & load(B) & load(B) & load(A) & |
| Ears to ears. \\ |
| 1 & load(A) & load(B) & load(B) & store(A) & |
| Only one store. \\ |
| 2 & load(A) & load(B) & store(B) & load(A) & |
| Only one store. \\ |
| 3 & load(A) & load(B) & store(B) & store(A) & |
| Pairing 1. \\ |
| \hline |
| 4 & load(A) & store(B) & load(B) & load(A) & |
| Only one store. \\ |
| 5 & load(A) & store(B) & load(B) & store(A) & |
| Pairing 2. \\ |
| 6 & load(A) & store(B) & store(B) & load(A) & |
| Mouth to mouth, ear to ear. \\ |
| 7 & load(A) & store(B) & store(B) & store(A) & |
| Pairing 3. \\ |
| \hline |
| 8 & store(A) & load(B) & load(B) & load(A) & |
| Only one store. \\ |
| 9 & store(A) & load(B) & load(B) & store(A) & |
| Mouth to mouth, ear to ear. \\ |
| A & store(A) & load(B) & store(B) & load(A) & |
| Ears to mouths. \\ |
| B & store(A) & load(B) & store(B) & store(A) & |
| Stores ``pass in the night''. \\ |
| \hline |
| C & store(A) & store(B) & load(B) & load(A) & |
| Pairing 1. \\ |
| D & store(A) & store(B) & load(B) & store(A) & |
| Pairing 3. \\ |
| E & store(A) & store(B) & store(B) & load(A) & |
| Stores ``pass in the night''. \\ |
| F & store(A) & store(B) & store(B) & store(A) & |
| Stores ``pass in the night''. \\ |
| \end{tabular} |
| \caption{Memory-Barrier Combinations} |
| \label{tab:advsync:Memory-Barrier Combinations} |
| \end{table*} |
| |
| \subsubsection{Pair-Wise Memory Barriers: Portable Combinations} |
| |
| The following pairings from |
| Table~\ref{tab:advsync:Memory-Barrier Combinations} |
| enumerate all the combinations of memory-barrier |
| pairings that portable software may depend on. |
| |
| \paragraph{Pairing 1.} |
| In this pairing, one CPU executes a pair of loads separated |
| by a memory barrier, while a second CPU executes a pair |
| of stores also separated by a memory barrier, as follows |
| (both~A and~B are initially equal to zero): |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| STORE A = 1 & Y = LOAD B \\ |
| <memory barrier>& <memory barrier> \\ |
| STORE B = 1 & X = LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| After both CPUs have completed executing these code sequences, |
| if \co{Y==1}, then we must also have \co{X==1}. |
| In this case, the fact that \co{Y==1} means that |
| CPU~2's load prior to its memory barrier has |
| seen the store following CPU~1's memory barrier. |
| Due to the pairwise nature of memory barriers, CPU~2's |
| load following its memory barrier must therefore see |
| the store that precedes CPU~1's memory barrier, so that |
| \co{X==1}. |
| |
| On the other hand, if \co{Y==0}, the memory-barrier condition |
| does not hold, and so in this case, X~could be either~0 or~1. |
| |
| \paragraph{Pairing 2.} |
| In this pairing, each CPU executes a load followed by a |
| memory barrier followed by a store, as follows |
| (both~A and~B are initially equal to zero): |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| X = LOAD A & Y = LOAD B \\ |
| <memory barrier>& <memory barrier> \\ |
| STORE B = 1 & STORE A = 1 \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| After both CPUs have completed executing these code sequences, |
| if \co{X==1}, then we must also have \co{Y==0}. |
| In this case, the fact that \co{X==1} means that |
| CPU~1's load prior to its memory barrier has |
| seen the store following CPU~2's memory barrier. |
| Due to the pairwise nature of memory barriers, CPU~1's |
| store following its memory barrier must therefore see |
| the results of CPU~2's load preceding its memory barrier, |
| so that \co{Y==0}. |
| |
| On the other hand, if \co{X==0}, the memory-barrier condition |
| does not hold, and so in this case, Y~could be either~0 or~1. |
| |
| The two CPUs' code sequences are symmetric, so if \co{Y==1} |
| after both CPUs have finished executing these code sequences, |
| then we must have \co{X==0}. |
| |
| \paragraph{Pairing 3.} |
| In this pairing, one CPU executes a load followed by a |
| memory barrier followed by a store, while the other CPU |
| executes a pair of stores separated by a memory barrier, |
| as follows (both~A and~B are initially equal to zero): |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| X = LOAD A & STORE B = 2 \\ |
| <memory barrier>& <memory barrier> \\ |
| STORE B = 1 & STORE A = 1 \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| After both CPUs have completed executing these code sequences, |
| if \co{X==1}, then we must also have \co{B==1}. |
| In this case, the fact that \co{X==1} means that |
| CPU~1's load prior to its memory barrier has |
| seen the store following CPU~2's memory barrier. |
| Due to the pairwise nature of memory barriers, CPU~1's |
| store following its memory barrier must therefore see |
| the results of CPU~2's store preceding its memory barrier. |
| This means that CPU~1's store to~B will overwrite CPU~2's |
| store to~B, resulting in \co{B==1}. |
| |
| On the other hand, if \co{X==0}, the memory-barrier condition |
| does not hold, and so in this case, B~could be either~1 or~2. |
| |
| \subsubsection{Pair-Wise Memory Barriers: Semi-Portable Combinations} |
| |
| The following pairings from |
| Table~\ref{tab:advsync:Memory-Barrier Combinations} |
| can be used on modern hardware, but might fail on some systems |
| that were produced in the 1900s. |
| However, these \emph{can} safely be used on all mainstream hardware |
| introduced since the year 2000. |
| So if you think that memory barriers are difficult to deal with, please |
| keep in mind that they used to be a \emph{lot} harder on some systems! |
| |
| \paragraph{Ears to Mouths.} |
| Since the stores cannot see the results of the loads |
| (again, ignoring MMIO registers for the moment), |
| it is not always possible to determine whether the memory-barrier |
| condition has been met. |
| However, 21\textsuperscript{st}-century |
| hardware \emph{does} guarantee that at least one |
| of the loads saw the value stored by the corresponding store |
| (or some later value for that same variable). |
| |
| \QuickQuiz{} |
| How do we know that modern hardware guarantees that at least |
| one of the loads will see the value stored by the other thread |
| in the ears-to-mouths scenario? |
| \QuickQuizAnswer{ |
| The scenario is as follows, with~A and~B both initially zero: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 0} & \nf{CPU 1} \\ |
| \hline |
| STORE A = 1 & STORE B = 1 \\ |
| <memory barrier>& <memory barrier> \\ |
| r1 = LOAD B & r2 = LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| If neither of the loads sees the corresponding store, when both |
| CPUs finish, both \co{r1} and \co{r2} will be equal to zero. |
| Let's suppose that \co{r1} is equal to zero. |
| Then we know that CPU~0's load from~B happened before CPU~1's |
| store to~B: After all, we would have had \co{r1} equal to one |
| otherwise. |
| But given that CPU~0's load from~B happened before CPU~1's store |
| to~B, memory-barrier pairing guarantees that CPU~0's store to~A |
| happens before CPU~1's load from~A, which in turn guarantees that |
| \co{r2} will be equal to one, not zero. |
| |
| Therefore, at least one of \co{r1} and \co{r2} must be nonzero, |
| which means that at least one of the loads saw the value from |
| the corresponding store, as claimed. |
| } \QuickQuizEnd |
| |
| \paragraph{Stores ``Pass in the Night''.} |
| In the following example, after both CPUs have finished |
| executing their code sequences, it is quite tempting to |
| conclude that the result {\tt \{A==1,B==2\}} cannot happen. |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| STORE A = 1 & STORE B = 2 \\ |
| <memory barrier>& <memory barrier> \\ |
| STORE B = 1 & STORE A = 2 \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Unfortunately, although this conclusion is correct on |
| 21\textsuperscript{st}-century systems, it does not necessarily hold |
| on all antique 20\textsuperscript{th}-century systems. |
| Suppose that the cache line containing~A is initially owned |
| by CPU~2, and that containing~B is initially owned by CPU~1. |
| Then, in systems that have invalidation queues and store |
| buffers, it is possible for the first assignments to ``pass |
| in the night'', so that the second assignments actually |
| happen first. |
| \IfInBook{This strange effect is explained in |
| Appendix~\ref{chp:app:whymb:Why Memory Barriers?}. |
| } |
| {} |
| |
| This same effect can happen in any memory-barrier pairing |
| where each CPU's memory barrier is preceded by a store, |
| including the ``ears to mouths'' pairing. |
| |
	However, 21\textsuperscript{st}-century hardware \emph{does}
	accommodate these ordering intuitions, and \emph{does} permit
	this combination to be used safely.
| |
| \subsubsection{Pair-Wise Memory Barriers: Dubious Combinations} |
| |
| In the following combinations from |
| Table~\ref{tab:advsync:Memory-Barrier Combinations}, |
| the memory barriers have very limited use in portable code, even |
| on 21\textsuperscript{st}-century hardware. |
However, ``limited use'' is different from ``no use'', so let's see
| what can be done! |
| Avid readers will want to write toy programs that rely on each of |
| these combinations in order to fully understand how this works. |
| |
| \paragraph{Ears to Ears.} |
| Since loads do not change the state of memory |
| (ignoring MMIO registers for the moment), |
| it is not possible for one of the loads to see the |
| results of the other load. |
| However, if we know that CPU~2's load from~B returned a |
| newer value than CPU~1's load from~B, then we also know |
| that CPU~2's load from~A returned either the same value |
| as CPU~1's load from~A or some later value. |
| |
| \paragraph{Mouth to Mouth, Ear to Ear.} |
| One of the variables is only loaded from, and the other |
| is only stored to. |
| Because (once again, ignoring MMIO registers) it is not |
| possible for one load to see the results of the other, |
| it is not possible to detect the conditional ordering |
| provided by the memory barrier. |
| |
	It is, however, possible to determine which store happened
	last, though this requires an additional load from~B.
| If this additional load from~B is executed after both |
| CPUs~1 and~2 complete, and if it turns out that CPU~2's |
| store to~B happened last, then we know |
| that CPU~2's load from~A returned either the same value |
| as CPU~1's load from~A or some later value. |
| |
| \paragraph{Only One Store.} |
| Because there is only one store, only one of the variables |
| permits one CPU to see the results of the other CPU's |
| access. |
| Therefore, there is no way to detect the |
| conditional ordering provided by the memory barriers. |
| |
| At least not straightforwardly. |
| But suppose that in combination~1 from |
| Table~\ref{tab:advsync:Memory-Barrier Combinations}, |
| CPU~1's load from~A returns the value that CPU~2 stored |
	to~A. Then we know that CPU~1's load from~B returned
	either the same value as CPU~2's load from~B or some later value.
| |
| \QuickQuiz{} |
| How can the other ``Only one store'' entries in |
| Table~\ref{tab:advsync:Memory-Barrier Combinations} |
| be used? |
| \QuickQuizAnswer{ |
| For combination~2, if CPU~1's load from~B sees a value prior |
| to CPU~2's store to~B, then we know that CPU~2's load from~A |
| will return the same value as CPU~1's load from~A, or some later |
| value. |
| |
| For combination~4, if CPU~2's load from~B sees the value from |
| CPU~1's store to~B, then we know that CPU~2's load from~A |
| will return the same value as CPU~1's load from~A, or some later |
| value. |
| |
		For combination~8, if CPU~2's load from~A sees CPU~1's store
		to~A, then we know that CPU~1's load from~B will return the same
		value as CPU~2's load from~B, or some later value.
| } \QuickQuizEnd |
| |
| \subsubsection{Semantics Sufficient to Implement Locking} |
| |
| Suppose we have an exclusive lock (\co{spinlock_t} in the Linux |
| kernel, \co{pthread_mutex_t} in pthreads code) that guards a number |
| of variables (in other words, these variables are not accessed except |
| from the lock's critical sections). |
| The following properties must then hold true: |
| \begin{enumerate} |
| \item A given CPU or thread must see all of its own loads and stores |
| as if they had occurred in program order. |
| \item The lock acquisitions and releases must appear to have executed |
| in a single global order.\footnote{ |
| Of course, this order might be different from one run |
| to the next. |
| On any given run, however, all CPUs and threads must |
| have a consistent view of the order of critical sections |
| for a given exclusive lock.} |
| \item Suppose a given variable has not yet been stored in a |
| critical section that is currently executing. |
| Then any load from a given variable performed in that critical section |
| must see the last store to that variable from the last previous |
| critical section that stored to it. |
| \end{enumerate} |
| |
| The difference between the last two properties is a bit subtle: |
| the second requires that the lock acquisitions and releases occur |
| in a well-defined order, while the third requires that the critical |
| sections not ``bleed out'' far enough to cause difficulties for |
other critical sections.
| |
| Why are these properties necessary? |
| |
| Suppose the first property did not hold. |
| Then the assertion in the following code might well fail! |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| a = 1; |
| b = 1 + a; |
| assert(b == 2); |
| \end{verbatim} |
| \end{minipage} |
| \label{codesample:advsync:What Can You Count On? 1} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| How could the assertion {\tt b==2} on |
| page~\pageref{codesample:advsync:What Can You Count On? 1} |
| possibly fail? |
| \QuickQuizAnswer{ |
| If the CPU is not required to see all of its loads and |
| stores in order, then the {\tt b=1+a} might well see an |
| old version of the variable~``a''. |
| |
| This is why it is so very important that each CPU or thread |
| see all of its own loads and stores in program order. |
| } \QuickQuizEnd |
| |
| Suppose that the second property did not hold. |
| Then the following code might leak memory! |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| spin_lock(&mylock); |
| if (p == NULL) |
| p = kmalloc(sizeof(*p), GFP_ATOMIC); |
| spin_unlock(&mylock); |
| \end{verbatim} |
| \end{minipage} |
| \label{codesample:advsync:What Can You Count On? 2} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| How could the code on |
| page~\pageref{codesample:advsync:What Can You Count On? 2} |
| possibly leak memory? |
| \QuickQuizAnswer{ |
| Only the first execution of the critical section should |
| see {\tt p==NULL}. |
| However, if there is no global ordering of critical sections for |
| {\tt mylock}, then how can you say that a particular one was |
| first? |
| If several different executions of that critical section thought |
| that they were first, they would all see {\tt p==NULL}, and |
| they would all allocate memory. |
| All but one of those allocations would be leaked. |
| |
| This is why it is so very important that all the critical sections |
| for a given exclusive lock appear to execute in some well-defined |
| order. |
| } \QuickQuizEnd |
| |
| Suppose that the third property did not hold. |
| Then the counter shown in the following code might well count backwards. |
| This third property is crucial, as it cannot be strictly true with |
| pairwise memory barriers. |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| spin_lock(&mylock); |
| ctr = ctr + 1; |
| spin_unlock(&mylock); |
| \end{verbatim} |
| \end{minipage} |
| \label{codesample:advsync:What Can You Count On? 3} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| How could the code on |
| page~\pageref{codesample:advsync:What Can You Count On? 3} |
| possibly count backwards? |
| \QuickQuizAnswer{ |
| Suppose that the counter started out with the value zero, |
| and that three executions of the critical section had therefore |
| brought its value to three. |
| If the fourth execution of the critical section is not constrained |
| to see the most recent store to this variable, it might well see |
| the original value of zero, and therefore set the counter to |
| one, which would be going backwards. |
| |
| This is why it is so very important that loads from a given variable |
| in a given critical |
| section see the last store from the last prior critical section to |
| store to that variable. |
| } \QuickQuizEnd |
| |
| If you are convinced that these rules are necessary, let's look at how |
| they interact with a typical locking implementation. |
| |
| \subsection{Review of Locking Implementations} |
| \label{sec:advsync:Review of Locking Implementations} |
| |
Naive pseudocode for simple lock and unlock operations is shown below.
| Note that the \co{atomic_xchg()} primitive implies a memory barrier |
| both before and after the atomic exchange operation, and that the |
| implicit barrier after the atomic exchange operation eliminates |
| the need for an explicit memory barrier in \co{spin_lock()}. |
Note also that, despite the names, \co{atomic_read()} and
\co{atomic_set()} do \emph{not}
execute any atomic instructions; instead, they merely execute a
simple load and a simple store, respectively.
| This pseudocode follows a number of Linux implementations for |
| the unlock operation, which is a simple non-atomic store following a |
| memory barrier. |
| These minimal implementations must possess all the locking properties |
| laid out in Section~\ref{sec:advsync:What Can You Trust?}. |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
void spin_lock(spinlock_t *lck)
{
  while (atomic_xchg(&lck->a, 1) != 0)
    while (atomic_read(&lck->a) != 0)
      continue;
}

void spin_unlock(spinlock_t *lck)
{
  smp_mb();
  atomic_set(&lck->a, 0);
}
| \end{verbatim} |
| \end{minipage} |
| \label{codesample:advsync:Naive Lock and Unlock Pseudocode} |
| \vspace{5pt} |
| |
| The \co{spin_lock()} primitive cannot proceed until |
| the preceding \co{spin_unlock()} primitive completes. |
| If CPU~1 is releasing a lock that CPU~2 is attempting to acquire, |
| the sequence of operations might be as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| (critical section) & \tco{atomic_xchg(&lck->a, 1)} $\rightarrow$1 \\ |
| <memory barrier> & LOAD lck->a $\rightarrow$1 \\ |
| STORE lck->a = 0 & LOAD lck->a $\rightarrow$1 \\ |
| & LOAD lck->a $\rightarrow$0 \\ |
| & (implicit memory barrier \#1) \\ |
| & \tco{atomic_xchg(&lck->a, 1)} $\rightarrow$0 \\ |
| & (implicit memory barrier \#2) \\ |
| & (critical section) \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| In this particular case, pairwise memory barriers suffice to keep |
| the two critical sections in place. |
CPU~2's \co{atomic_xchg(&lck->a, 1)} has seen CPU~1's \co{lck->a=0},
and therefore everything in CPU~2's following critical section must see
everything that CPU~1's preceding critical section did.
| Conversely, CPU~1's critical section cannot see anything that CPU~2's |
| critical section will do. |
| |
| \subsection{A Few Simple Rules} |
| \label{sec:advsync:A Few Simple Rules} |
| |
| Probably the easiest way to understand memory barriers is to understand |
| a few simple rules: |
| |
| \begin{enumerate} |
| \item Each CPU sees its own accesses in order. |
| \item If a single shared variable is loaded and stored by multiple |
| CPUs, then the series of values seen by a given CPU will be |
| consistent with the series seen by the other CPUs, and there |
| will be at least one sequence consisting of all values stored |
	to that variable with which each CPU's series will be
| consistent.\footnote{ |
| A given CPU's series may of course be incomplete, |
| for example, if a given CPU never loaded or stored |
| the shared variable, then it can have no opinion about |
| that variable's value.} |
| \item If one CPU does ordered stores to variables~A and~B,\footnote{ |
| For example, by executing the store to~A, a |
| memory barrier, and then the store to~B.} |
| and if a second CPU does ordered loads from~B and~A,\footnote{ |
| For example, by executing the load from~B, a |
| memory barrier, and then the load from~A.} |
| then if the second CPU's load from~B gives the value stored |
| by the first CPU, then the second CPU's load from~A must |
| give the value stored by the first CPU. |
| \item If one CPU does a load from~A ordered before a store to~B, |
| and if a second CPU does a load from~B ordered before a store to~A, |
| and if the second CPU's load from~B gives the value stored by |
| the first CPU, then the first CPU's load from~A must \emph{not} |
| give the value stored by the second CPU. |
| \item If one CPU does a load from~A ordered before a store to~B, |
| and if a second CPU does a store to~B ordered before a |
| store to~A, and if the first CPU's load from~A gives |
| the value stored by the second CPU, then the first CPU's |
| store to~B must happen after the second CPU's store to~B, |
| hence the value stored by the first CPU persists.\footnote{ |
| Or, for the more competitively oriented, the first |
| CPU's store to~B ``wins''.} |
| \end{enumerate} |
| |
| The next section takes a more operational view of these rules. |
| |
| \subsection{Abstract Memory Access Model} |
| |
| Consider the abstract model of the system shown in |
| Figure~\ref{fig:advsync:Abstract Memory Access Model}. |
| |
| \begin{figure}[htb] |
| \centering |
| \includegraphics{advsync/AbstractMemoryAccessModel} |
| \caption{Abstract Memory Access Model} |
| \ContributedBy{Figure}{fig:advsync:Abstract Memory Access Model}{David Howells} |
| \end{figure} |
| |
| Each CPU executes a program that generates memory access operations. In the |
| abstract CPU, memory operation ordering is very relaxed, and a CPU may actually |
| perform the memory operations in any order it likes, provided program causality |
| appears to be maintained. Similarly, the compiler may also arrange the |
| instructions it emits in any order it likes, provided it doesn't affect the |
| apparent operation of the program. |
| |
| So in Figure~\ref{fig:advsync:Abstract Memory Access Model}, |
| the effects of the memory operations performed by a |
| CPU are perceived by the rest of the system as the operations cross the |
| interface between the CPU and rest of the system (the dotted lines). |
| |
| |
| For example, consider the following sequence of events given the |
| initial values of shared variables {\tt \{A~=~1, B~=~2\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| STORE A = 3 & x = LOAD B \\ |
| STORE B = 4 & y = LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The set of accesses as seen by the memory system in the middle can be arranged |
| in 24 different combinations, with loads denoted by ``ld'' and stores |
| denoted by ``st'': |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{llll} |
	st A=3, & st B=4, & x=ld B$\rightarrow$4, & y=ld A$\rightarrow$3 \\
	st A=3, & st B=4, & y=ld A$\rightarrow$3, & x=ld B$\rightarrow$4 \\
	st A=3, & x=ld B$\rightarrow$2, & st B=4, & y=ld A$\rightarrow$3 \\
	st A=3, & x=ld B$\rightarrow$2, & y=ld A$\rightarrow$3, & st B=4 \\
	st A=3, & y=ld A$\rightarrow$3, & st B=4, & x=ld B$\rightarrow$4 \\
	st A=3, & y=ld A$\rightarrow$3, & x=ld B$\rightarrow$2, & st B=4 \\
	st B=4, & st A=3, & x=ld B$\rightarrow$4, & y=ld A$\rightarrow$3 \\
| st B=4, & ... & & \\ |
| ... & & & \\ |
| \end{tabular} |
| \vspace{3pt} |
| \end{minipage} |
| % |
| and can thus result in four different combinations of local values: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{ll} |
| x == 2, & y == 1 \\ |
| x == 2, & y == 3 \\ |
| x == 4, & y == 1 \\ |
| x == 4, & y == 3 \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Furthermore, the stores committed by a CPU to the memory system may not be |
| perceived by the loads made by another CPU in the same order as the stores were |
| committed. |
| |
| As a further example, consider this sequence of events given the |
| initial values of shared variables |
| {\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| STORE B = 4 & Q = LOAD P \\ |
| STORE P = \&B & D = LOAD *Q \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| There is an obvious data dependency here, |
| as the value loaded into~\co{D} depends on |
| the address retrieved from~\co{P} by CPU~2. |
| At the end of the sequence, any of the |
| following results are possible: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{c@{ and }c} |
| (Q == \&A) & (D == 1) \\ |
| (Q == \&B) & (D == 2) \\ |
| (Q == \&B) & (D == 4) \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
Note that CPU~2 will never try to load~C into~D because the CPU will load~P
| into~Q before issuing the load of~*Q.\footnote{Although it might sound |
| counterintuitive, one of the results, |
| {\tt\scriptsize (Q~==~\&B) and (D~==~2)}, |
| does not necessarily mean that CPU~1's |
| {\tt\scriptsize STORE B~=~4} was issued |
| {\em after} {\tt\scriptsize STORE P~=~\&B}. |
| To reach a consensus in ordering, we need a pair of memory barriers |
| as is described in |
| Section~\ref{sec:advsync:Data Dependency Barriers}.} |
| |
| \subsection{Device Operations} |
| \label{sec:advsync:Device Operations} |
| |
| Some devices present their control interfaces as collections of memory |
| locations, but the order in which the control registers are accessed is very |
| important. For instance, imagine an Ethernet card with a set of internal |
| registers that are accessed through an address port register~(A) and a data |
| port register~(D). To read internal register~5, the following code might then |
| be used: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| *A = 5; |
| x = *D; |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| % |
| but this might show up as either of the following two sequences: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| STORE *A = 5, x = LOAD *D |
| x = LOAD *D, STORE *A = 5 |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| % |
the second of which will almost certainly result in a malfunction, since it sets
the address \emph{after} attempting to read the register.
| |
| \subsection{Guarantees} |
| \label{sec:advsync:Guarantees} |
| |
| There are some minimal guarantees that may be expected of a CPU: |
| |
| \begin{enumerate} |
| \item On any given CPU, dependent memory accesses will be issued in order, |
| with respect to itself. This means that for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Q = P; D = *Q; |
| \end{verbatim} |
| \end{minipage} |
| |
| the CPU will issue the following memory operations: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Q = LOAD P, D = LOAD *Q |
| \end{verbatim} |
| \end{minipage} |
| |
| and always in that order. |
| |
| \item Overlapping loads and stores within a particular CPU will appear to be |
| ordered within that CPU. This means that for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| a = *X; *X = b; |
| \end{verbatim} |
| \end{minipage} |
| |
| the CPU will only issue the following sequence of memory operations: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| a = LOAD *X, STORE *X = b |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| |
| And for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| *X = c; d = *X; |
| \end{verbatim} |
| \end{minipage} |
| |
| the CPU will only issue: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| STORE *X = c, d = LOAD *X |
| \end{verbatim} |
| \end{minipage} |
| |
| (Loads and stores overlap if they are targeted at overlapping pieces of |
| memory). |
| \item A series of stores to a single variable will appear to all |
| CPUs to have occurred in a single order, though this order |
| might not be predictable from the code, and in fact the |
| order might vary from one run to another. |
| \end{enumerate} |
| |
| And there are a number of things that \emph{must} or \emph{must not} be assumed: |
| |
| \begin{enumerate} |
| \item It \emph{must not} be assumed that independent loads and stores will |
| be issued in the order given. This means that for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| X = *A; Y = *B; *D = Z; |
| \end{verbatim} |
| \end{minipage} |
| |
| we may get any of the following sequences: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| X = LOAD *A, Y = LOAD *B, STORE *D = Z |
| X = LOAD *A, STORE *D = Z, Y = LOAD *B |
| Y = LOAD *B, X = LOAD *A, STORE *D = Z |
| Y = LOAD *B, STORE *D = Z, X = LOAD *A |
| STORE *D = Z, X = LOAD *A, Y = LOAD *B |
| STORE *D = Z, Y = LOAD *B, X = LOAD *A |
| \end{verbatim} |
| \end{minipage} |
| |
| \item It \emph{must} be assumed that overlapping memory accesses may |
| be merged or discarded. This means that for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| X = *A; Y = *(A + 4); |
| \end{verbatim} |
| \end{minipage} |
| |
| we may get any one of the following sequences: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| X = LOAD *A; Y = LOAD *(A + 4); |
| Y = LOAD *(A + 4); X = LOAD *A; |
| {X, Y} = LOAD {*A, *(A + 4) }; |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| |
| And for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| *A = X; *(A + 4) = Y; |
| \end{verbatim} |
| \end{minipage} |
| |
| we may get any of: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| STORE *A = X; STORE *(A + 4) = Y; |
| STORE *(A + 4) = Y; STORE *A = X; |
| STORE {*A, *(A + 4) } = {X, Y}; |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| |
| Finally, for: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| *A = X; *A = Y; |
| \end{verbatim} |
| \end{minipage} |
| |
| we may get either of: |
| |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| STORE *A = X; STORE *A = Y; |
| STORE *A = Y; |
| \end{verbatim} |
| \end{minipage} |
| |
| \end{enumerate} |
| |
| \subsection{What Are Memory Barriers?} |
| \label{sec:advsync:What Are Memory Barriers?} |
| |
| As can be seen above, independent memory operations are effectively performed |
| in random order, but this can be a problem for CPU-CPU interaction and for I/O. |
| What is required is some way of intervening to instruct the compiler and the |
| CPU to restrict the order. |
| |
| Memory barriers are such interventions. They impose a perceived partial |
| ordering over the memory operations on either side of the barrier. |
| |
| Such enforcement is important because the CPUs and other devices in a system |
| can use a variety of tricks to improve performance---including reordering, |
| deferral and combination of memory operations; speculative loads; speculative |
| branch prediction and various types of caching. Memory barriers are used to |
| override or suppress these tricks, allowing the code to sanely control the |
| interaction of multiple CPUs and/or devices. |
| |
| \subsubsection{Explicit Memory Barriers} |
| \label{sec:advsync:Explicit Memory Barriers} |
| |
| Memory barriers come in four basic varieties: |
| |
| \begin{enumerate} |
| \item Write (or store) memory barriers, |
| \item Data dependency barriers, |
| \item Read (or load) memory barriers, and |
| \item General memory barriers. |
| \end{enumerate} |
| |
| Each variety is described below. |
| |
| \paragraph{Write Memory Barriers} |
| |
| A write memory barrier gives a guarantee that all the STORE operations |
| specified before the barrier will appear to happen before all the STORE |
| operations specified after the barrier with respect to the other |
| components of the system. |
| |
| A write barrier is a partial ordering on stores only; it is not required |
| to have any effect on loads. |
| |
| A CPU can be viewed as committing a sequence of store operations to the |
| memory system as time progresses. All stores before a write barrier will |
| occur in the sequence \emph{before} all the stores after the write barrier. |
| |
| Note that write barriers should normally be paired with read |
| or data dependency barriers; see |
| Section~\ref{sec:advsync:SMP Barrier Pairing}. |
| |
| \paragraph{Data Dependency Barriers} |
| |
| A data dependency barrier is a weaker form of read barrier. In the case |
| where two loads are performed such that the second depends on the result |
| of the first (e.g., the first load retrieves the address to which the second |
| load will be directed), a data dependency barrier would be required to |
| make sure that the target of the second load is updated before the address |
| obtained by the first load is accessed. |
| |
| A data dependency barrier is a partial ordering on interdependent loads |
| only; it is not required to have any effect on stores, independent loads |
| or overlapping loads. |
| |
| As mentioned for write memory barriers, |
| the other CPUs in the system can be viewed as |
| committing sequences of stores to the memory system that the CPU being |
| considered can then perceive. A data dependency barrier issued by the CPU |
| under consideration guarantees that for any load preceding it, if that |
| load touches one of a sequence of stores from another CPU, then by the |
| time the barrier completes, the effects of all the stores prior to that |
| touched by the load will be perceptible to any loads issued after the data |
| dependency barrier. |
| |
	See Section~\ref{sec:advsync:Examples of Memory Barrier Pairings} for
	diagrams showing the ordering constraints.
| |
| Note that the first load really has to have a |
| \emph{data} dependency and |
| not a control dependency. If the address for the second load is dependent |
| on the first load, but the dependency is through a conditional rather than |
| actually loading the address itself, then it's a \emph{control} dependency and |
| a full read barrier or better is required. See |
| Section~\ref{sec:advsync:Control Dependencies} for more information. |
| |
| Note that data dependency barriers should normally be paired with |
| write barriers; see Section~\ref{sec:advsync:SMP Barrier Pairing}. |
| |
| \paragraph{Read Memory Barriers} |
| |
| A read barrier is a data dependency barrier plus a guarantee that all the |
| LOAD operations specified before the barrier will appear to happen before |
| all the LOAD operations specified after the barrier with respect to the |
| other components of the system. |
| |
| A read barrier is a partial ordering on loads only; it is not required to |
| have any effect on stores. |
| |
| Read memory barriers imply data dependency barriers, and so can substitute |
| for them. |
| |
| Note that read barriers should normally be paired with write barriers; |
| see Section~\ref{sec:advsync:SMP Barrier Pairing}. |
| |
| \paragraph{General Memory Barriers} |
| |
| A general memory barrier gives a guarantee that all the LOAD and STORE |
| operations specified before the barrier will appear to happen before all |
| the LOAD and STORE operations specified after the barrier with respect to |
| the other components of the system. |
| |
| A general memory barrier is a partial ordering over both loads and stores. |
| |
| General memory barriers imply both read and write memory barriers, and so |
| can substitute for either. |
| |
| \subsubsection{Implicit Memory Barriers} |
| |
| There are a couple of types of implicit memory barriers, so called |
| because they are embedded into locking primitives: |
| |
| \begin{enumerate} |
| \item ACQUIRE operations and |
| \item RELEASE operations.\footnote{ |
		Note that ACQUIRE/RELEASE operations were at one time
		called ``LOCK/UNLOCK'' operations in the Linux
		kernel community. ``ACQUIRE/RELEASE'' is preferred when
		discussing lockless techniques.}
| \end{enumerate} |
| |
| \paragraph{ACQUIRE Operations} |
| |
An ACQUIRE operation acts as a one-way permeable barrier.
| It guarantees that all memory |
| operations after the ACQUIRE operation will appear to happen after the ACQUIRE |
| operation with respect to the other components of the system. |
| |
| Memory operations that occur before an ACQUIRE operation may appear to happen |
| after it completes. |
| |
| An ACQUIRE operation should almost always be paired with a RELEASE operation. |
| |
| \paragraph{RELEASE Operations} |
| |
A RELEASE operation also acts as a one-way permeable barrier.
It guarantees that all
| memory operations before the RELEASE operation will appear to happen before |
| the RELEASE operation with respect to the other components of the system. |
| |
| Memory operations that occur after a RELEASE operation may appear to |
| happen before it completes. |
| |
| ACQUIRE and RELEASE operations are guaranteed to appear with respect to each |
| other strictly in the order specified. |
| |
| The use of ACQUIRE and RELEASE operations generally precludes the need for |
other sorts of memory barrier (but note the exceptions mentioned in
Section~\ref{sec:advsync:Device Operations}).
| |
| \QuickQuiz{} |
| What effect does the following sequence have on the |
| order of stores to variables~``a'' and~``b''? |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| a = 1; |
| b = 1; |
| <write barrier> |
| \end{verbatim} |
| \end{minipage} |
| \QuickQuizAnswer{ |
| Absolutely none. This barrier {\em would} ensure that the |
| assignments to~``a'' and~``b'' happened before any subsequent |
| assignments, but it does nothing to enforce any order of |
| assignments to~``a'' and~``b'' themselves. |
| } \QuickQuizEnd |
| |
| \subsubsection{What May Not Be Assumed About Memory Barriers?} |
| \label{sec:advsync:What May Not Be Assumed About Memory Barriers?} |
| |
| There are certain things that memory barriers cannot guarantee outside |
| of the confines of a given architecture: |
| |
| \begin{enumerate} |
| \item There is no guarantee that any of the memory accesses specified |
| before a memory barrier will be \emph{complete} by the completion |
| of a memory barrier instruction; the barrier can be considered |
| to draw a line in that CPU's access queue that accesses of the |
| appropriate type may not cross. |
| \item There is no guarantee that issuing a memory barrier on one CPU |
| will have any direct effect on another CPU or any other hardware |
| in the system. The indirect effect will be the order in which |
| the second CPU sees the effects of the first CPU's accesses occur, |
| but see the next point. |
| \item There is no guarantee that a CPU will see the correct order |
| of effects from a second CPU's accesses, even \emph{if} the second CPU |
| uses a memory barrier, unless the first CPU \emph{also} uses a matching |
| memory barrier (see |
| Section~\ref{sec:advsync:SMP Barrier Pairing}). |
| \item There is no guarantee that some intervening piece of off-the-CPU |
| hardware\footnote{ |
| This is of concern primarily in operating-system kernels. |
| For more information on hardware operations and memory |
| ordering, see the files \path{pci.txt}, \path{DMA-API-HOWTO.txt}, |
| and \path{DMA-API.txt} in the \path{Documentation} directory in |
| the Linux source tree~\cite{Torvalds2.6kernel}.} |
| will not reorder the memory accesses. CPU cache |
| coherency mechanisms should propagate the indirect effects of |
| a memory barrier between CPUs, but might not do so in order. |
| \end{enumerate} |
| |
| \subsubsection{Data Dependency Barriers} |
| \label{sec:advsync:Data Dependency Barriers} |
| |
| The usage requirements of data dependency barriers are a little subtle, and |
| it's not always obvious that they're needed. To illustrate, consider the |
| following sequence of events, with initial values |
| {\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| B = 4; & \\ |
| <write barrier> & \\ |
| P = \&B; & \\ |
| & Q = P; \\ |
| & D = *Q; \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| There's a clear data dependency here, and it would seem intuitively |
| obvious that by the end of the sequence, \co{Q}~must be either~\co{&A} |
| or~\co{&B}, and that: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{c@{ implies }c} |
| (Q == \&A) & (D == 1) \\ |
| (Q == \&B) & (D == 4) \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Counter-intuitive though it might be, it is quite possible that |
| CPU~2's perception of~\co{P} might be updated \emph{before} its perception |
| of~\co{B}, thus leading to the following situation: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{c@{ and }c} |
| (Q == \&B) & (D == 2) ???? \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Whilst this may seem like a failure of coherency or causality maintenance, it |
| isn't, and this behaviour can be observed on certain real CPUs (such as the DEC |
| Alpha). |
| |
| To deal with this, a data dependency barrier must be inserted between the |
| address load and the data load (again with initial values of |
| {\tt \{A~=~1, B~=~2, C~=~3, P~=~\&A, Q~=~\&C\}}): |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| B = 4; & \\ |
| <write barrier> & \\ |
| P = \&B; & \\ |
| & Q = P; \\ |
| & <data dependency barrier> \\ |
| & D = *Q; \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This enforces the occurrence of one of the two implications, and prevents the |
| third possibility from arising. |
| |
| Note that this extremely counterintuitive situation arises most easily on |
| machines with split caches, so that, for example, one cache bank processes |
| even-numbered cache lines and the other bank processes odd-numbered cache |
| lines. |
| The pointer~\co{P} might be stored in an odd-numbered cache line, and the |
| variable~\co{B} might be stored in an even-numbered cache line. Then, if the |
| even-numbered bank of the reading CPU's cache is extremely busy while the |
| odd-numbered bank is idle, one can see the new value of the |
| pointer~\co{P} (which is~\co{&B}), |
| but the old value of the variable~\co{B} (which is~2). |
| |
Another example of where data dependency barriers might be required is where a
number is read from memory and then used to calculate the index for an array
access, with initial values
{\tt \{M[0]~=~1, M[1]~=~2, M[3]~=~3, P~=~0, Q~=~3\}}:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| M[1] = 4; & \\ |
| <write barrier> & \\ |
| P = 1; & \\ |
| & Q = P; \\ |
| & <data dependency barrier> \\ |
| & D = M[Q]; \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The data dependency barrier is very important to the Linux kernel's |
| RCU system, for example, |
| see \co{rcu_dereference()} in \path{include/linux/rcupdate.h}. |
| This permits the current |
| target of an RCU'd pointer to be replaced with a new modified target, without |
| the replacement target appearing to be incompletely initialized. |
| |
| See also |
| Section~\ref{sec:advsync:Cache Coherency} |
| for a larger example. |
| |
| \subsubsection{Control Dependencies} |
| \label{sec:advsync:Control Dependencies} |
| |
| Control dependencies are especially tricky because current compilers |
| do not understand them. |
| The rules and examples in this section are intended to help you |
| prevent your compiler's ignorance from breaking your code. |
| |
| A load-load control dependency requires a full read memory barrier, |
| not simply a data dependency barrier. |
| Consider the following bit of code: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 <data dependency barrier> |
| 4 q = READ_ONCE(y); |
| 5 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This will not have the desired effect because there is no actual data |
| dependency, but rather a control dependency that the CPU may short-circuit |
| by attempting to predict the outcome in advance, so that other CPUs see |
| the load from~\co{y} as having happened before the load from~\co{x}. |
| In such a case what's actually required is: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 <read barrier> |
| 4 q = READ_ONCE(y); |
| 5 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| However, stores are not speculated. |
| This means that ordering \emph{is} provided for load-store control |
| dependencies, as in the following example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) |
| 3 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Control dependencies pair normally with other types of barriers. |
| That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()} |
| are optional! |
| Without the \co{READ_ONCE()}, the compiler might combine the load |
| from~\co{x} with other loads from~\co{x}. |
| Without the \co{WRITE_ONCE()}, the compiler might combine the store |
| to~\co{y} with other stores to~\co{y}. |
| Either can result in highly counterintuitive effects on ordering. |
| |
| Worse yet, if the compiler is able to prove (say) that the value of |
| variable~\co{x} is always non-zero, it would be well within its rights |
| to optimize the original example by eliminating the ``\co{if}'' statement |
| as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is tempting to try to enforce ordering on identical stores on both |
| branches of the ``\co{if}'' statement as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 barrier(); |
| 4 WRITE_ONCE(y, 1); |
| 5 do_something(); |
| 6 } else { |
| 7 barrier(); |
| 8 WRITE_ONCE(y, 1); |
| 9 do_something_else(); |
| 10 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Unfortunately, current compilers will transform this as follows at high |
| optimization levels: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 barrier(); |
| 3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */ |
| 4 if (q) { |
| 5 do_something(); |
| 6 } else { |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Now there is no conditional between the load from~\co{x} and the store |
| to~\co{y}, which means that the CPU is within its rights to reorder them: |
| The conditional is absolutely required, and must be present in the |
| assembly code even after all compiler optimizations have been applied. |
| Therefore, if you need ordering in this example, you need explicit |
| memory barriers, for example, a release store: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 smp_store_release(&y, 1); |
| 4 do_something(); |
| 5 } else { |
| 6 smp_store_release(&y, 1); |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The initial \co{READ_ONCE()} is still required to prevent the compiler from |
| proving the value of~\co{x}. |
| |
| In addition, you need to be careful what you do with the local variable~% |
| \co{q}, |
| otherwise the compiler might be able to guess the value and again remove |
| the needed conditional. |
| For example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q % MAX) { |
| 3 WRITE_ONCE(y, 1); |
| 4 do_something(); |
| 5 } else { |
| 6 WRITE_ONCE(y, 2); |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is |
| equal to zero, in which case the compiler is within its rights to |
| transform the above code into the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 2); |
| 3 do_something_else(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Given this transformation, the CPU is not required to respect the ordering |
| between the load from variable~\co{x} and the store to variable~\co{y}. |
| It is tempting to add a \co{barrier()} to constrain the compiler, |
| but this does not help. |
| The conditional is gone, and the barrier won't bring it back. |
| Therefore, if you are relying on this ordering, you should make sure |
| that \co{MAX} is greater than one, perhaps as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 BUILD_BUG_ON(MAX <= 1); |
| 3 if (q % MAX) { |
| 4 WRITE_ONCE(y, 1); |
| 5 do_something(); |
| 6 } else { |
| 7 WRITE_ONCE(y, 2); |
| 8 do_something_else(); |
| 9 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Please note once again that the stores to~\co{y} differ. |
| If they were identical, as noted earlier, the compiler could pull this |
| store outside of the ``\co{if}'' statement. |
| |
| You must also avoid excessive reliance on boolean short-circuit evaluation. |
| Consider this example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q || 1 > 0) |
| 3 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This example underscores the need to ensure that the compiler cannot |
| out-guess your code. |
| More generally, although \co{READ_ONCE()} does force |
| the compiler to actually emit code for a given load, it does not force |
| the compiler to use the results. |
| |
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do
not necessarily apply to code following the if-statement:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 WRITE_ONCE(y, 1); |
| 4 } else { |
| 5 WRITE_ONCE(y, 2); |
| 6 } |
| 7 WRITE_ONCE(z, 1); /* BUG: No ordering. */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is tempting to argue that there in fact is ordering because the |
| compiler cannot reorder volatile accesses and also cannot reorder |
| the writes to~\co{y} with the condition. |
| Unfortunately for this line |
| of reasoning, the compiler might compile the two writes to~\co{y} as |
| conditional-move instructions, as in this fanciful pseudo-assembly |
| language: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 ld r1,x |
| 2 cmp r1,$0 |
| 3 cmov,ne r4,$1 |
| 4 cmov,eq r4,$2 |
| 5 st r4,y |
| 6 st $1,z |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| A weakly ordered CPU would have no dependency of any sort between the load |
| from~\co{x} and the store to~\co{z}. |
| The control dependencies would extend |
| only to the pair of cmov instructions and the store depending on them. |
| In short, control dependencies apply only to the stores in the ``\co{then}'' |
| and ``\co{else}'' of the ``\co{if}'' in question (including functions |
| invoked by those two clauses), not to code following that ``\co{if}''. |
| |
| Finally, control dependencies do \emph{not} provide transitivity.\footnote{ |
| Refer to Section~\ref{sec:advsync:Transitivity} for |
| the meaning of transitivity.} |
| This is demonstrated by two related examples, with the initial values |
| of~\co{x} and~\co{y} both being zero: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 0} & \nf{CPU 1} \\ |
| \hline |
| \tco{r1 = READ_ONCE(x);} & |
| \tco{r2 = READ_ONCE(y);} \\ |
| if (r1 > 0) & |
| if (r2 > 0) \\ |
| ~~~\tco{WRITE_ONCE(y, 1);} & |
| ~~~\tco{WRITE_ONCE(x, 1);} \\ |
| \multicolumn{2}{l}{~} \\ |
| \multicolumn{2}{l}{\tco{assert(!(r1 == 1 && r2 == 1));}} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The above two-CPU example will never trigger the \co{assert()}. |
| However, if control dependencies guaranteed transitivity (which they do |
| not), then adding the following CPU would guarantee a related assertion: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l} |
| \nf{CPU 2} \\ |
| \hline |
| \tco{WRITE_ONCE(y, 1);} \\ |
| \multicolumn{1}{l}{~} \\ |
| \multicolumn{1}{l}{\tco{assert(!(r1 == 1 && r2 == 1 && x == 1));}} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| But because control dependencies do \emph{not} provide transitivity, the above |
| assertion can fail after the combined three-CPU example completes. |
| If you need the three-CPU example to provide ordering, you will need |
| \co{smp_mb()} between the loads and stores in the CPU~0 and CPU~1 code |
| fragments, that is, just before or just after the ``\co{if}'' statements. |
| Furthermore, the original two-CPU example is very fragile and should be avoided. |
| |
| The two-CPU example is known as LB (load buffering) and the three-CPU |
| example as WWC~\cite{Maranget2012TutorialARMPower}. |
| |
| The following list of rules summarizes the lessons of this section: |
| |
| \begin{enumerate} |
| \item Compilers do not understand control dependencies, so it is |
| your job to make sure that the compiler cannot break your code. |
| |
| \item Control dependencies can order prior loads against later stores. |
| However, they do \emph{not} guarantee any other sort of ordering: |
| Not prior loads against later loads, nor prior stores against |
| later anything. |
| If you need these other forms of ordering, use \co{smp_rmb()}, |
| \co{smp_wmb()}, or, in the case of prior stores and later loads, |
| \co{smp_mb()}. |
| |
| \item If both legs of the ``\co{if}'' statement begin with identical stores |
| to the same variable, then those stores must be ordered, |
| either by preceding both of them with \co{smp_mb()} or by using |
| \co{smp_store_release()} to carry out the stores. |
	Please note that it is \emph{not} sufficient to use \co{barrier()}
	at the beginning of each leg of the ``\co{if}'' statement because, as shown
| by the example above, optimizing compilers can destroy the control |
| dependency while respecting the letter of the \co{barrier()} law. |
| |
| \item Control dependencies require at least one run-time conditional |
| between the prior load and the subsequent store, and this |
| conditional must involve the prior load. |
| If the compiler is able to optimize the conditional away, it |
| will have also optimized away the ordering. |
| Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help |
| to preserve the needed conditional. |
| |
| \item Control dependencies require that the compiler avoid reordering |
| the dependency into nonexistence. |
| Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or |
| \co{atomic64_read()} can help to preserve your control |
| dependency. |
| |
| \item Control dependencies apply only to the ``\co{then}'' and |
| ``\co{else}'' of the ``\co{if}'' containing the control |
| dependency, including any functions that these two clauses call. |
| Control dependencies do \emph{not} apply to code following the |
| ``\co{if}'' containing the control dependency. |
| |
| \item Control dependencies pair normally with other types of barriers. |
| |
| \item Control dependencies do \emph{not} provide transitivity. |
| If you need transitivity, use \co{smp_mb()}. |
| \end{enumerate} |
| |
| \subsubsection{SMP Barrier Pairing} |
| \label{sec:advsync:SMP Barrier Pairing} |
| |
| When dealing with CPU-CPU interactions, certain types of memory barrier should |
| always be paired. A lack of appropriate pairing is almost certainly an error. |
| |
| A write barrier should always be paired with a data dependency barrier or read |
| barrier, though a general barrier would also be viable. Similarly a read |
| barrier or a data dependency barrier should always be paired with at least a |
| write barrier, though, again, a general barrier is viable: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| <write barrier> & \\ |
| B = 2; & \\ |
| & X = B; \\ |
| & <read barrier> \\ |
| & Y = A; \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Or: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| <write barrier> & \\ |
| B = \&A; & \\ |
| & X = B; \\ |
| & <data dependency barrier> \\ |
| & Y = *X; \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| One way or another, the read barrier must always be present, even though |
| it might be of a weaker type.\footnote{ |
| By ``weaker'', we mean ``makes fewer ordering guarantees''. |
| A weaker barrier is usually also lower-overhead than is a |
| stronger barrier.} |
| |
| Note that the stores before the write barrier would normally be expected to |
| match the loads after the read barrier or data dependency barrier, and vice |
| versa: |
| |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{advsync/MemoryBarrierPairing}} |
| \end{center} |
| |
| \subsubsection{Examples of Memory Barrier Pairings} |
| \label{sec:advsync:Examples of Memory Barrier Pairings} |
| |
| Firstly, write barriers act as partial orderings on store operations. |
| Consider the following sequence of events: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l} |
| STORE A = 1 \\ |
| STORE B = 2 \\ |
| STORE C = 3 \\ |
| <write barrier> \\ |
| STORE D = 4 \\ |
| STORE E = 5 \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This sequence of events is committed to the memory coherence system in an order |
| that the rest of the system might perceive as the unordered set of |
| {\tt \{A=1,B=2,C=3\}} |
| all occurring before the unordered set of |
| {\tt \{D=4,E=5\}}, as shown in |
| Figure~\ref{fig:advsync:Write Barrier Ordering Semantics}. |
| |
| \begin{figure*}[htb] |
| \centering |
| \includegraphics{advsync/WriteBarrierOrdering} |
| \caption{Write Barrier Ordering Semantics} |
| \ContributedBy{Figure}{fig:advsync:Write Barrier Ordering Semantics}{David Howells} |
| \end{figure*} |
| |
| Secondly, data dependency barriers act as partial orderings on data-dependent |
| loads. Consider the following sequence of events with initial values |
| {\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| B = 2; & \\ |
| <write barrier> & \\ |
| C = \&B; & LOAD X\\ |
| D = 4; & LOAD C \nf{(gets \tco{&B})} \\ |
| & LOAD *C \nf{(reads \tco{B})} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Without intervention, CPU~2 may perceive the events on CPU~1 in some |
| effectively random order, despite the write barrier issued by CPU~1, as |
| shown in Figure~\ref{fig:advsync:Data Dependency Barrier Omitted}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/DataDependencyNeeded} |
| \caption{Data Dependency Barrier Omitted} |
| \ContributedBy{Figure}{fig:advsync:Data Dependency Barrier Omitted}{David Howells} |
| \end{figure*} |
| |
| In the above example, CPU~2 perceives that \co{B} is~7, |
| despite the load of~\co{*C} |
| (which would be~\co{B}) coming after the \co{LOAD} of~\co{C}. |
| |
| If, however, a data dependency barrier were to be placed between the load |
| of~\co{C} and the load of~\co{*C} (i.e.:~\co{B}) on CPU~2, again with initial |
| values of {\tt \{B~=~7, X~=~9, Y~=~8, C~=~\&Y\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| B = 2; & \\ |
| <write barrier> & \\ |
| C = \&B; & LOAD X\\ |
| D = 4; & LOAD C \nf{(gets \tco{&B})} \\ |
| & <data dependency barrier> \\ |
| & LOAD *C \nf{(reads \tco{B})} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| then ordering will be as intuitively expected, as shown in |
| Figure~\ref{fig:advsync:Data Dependency Barrier Supplied}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/DataDependencySupplied} |
| \caption{Data Dependency Barrier Supplied} |
| \ContributedBy{Figure}{fig:advsync:Data Dependency Barrier Supplied}{David Howells} |
| \end{figure*} |
| |
| And thirdly, a read barrier acts as a partial order on loads. Consider the |
| following sequence of events, with initial values |
| {\tt \{A~=~0, B~=~9\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| <write barrier> & \\ |
| B = 2; & \\ |
| & LOAD B \\ |
| & LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Without intervention, CPU~2 may then choose to perceive the events on CPU~1 in |
| some effectively random order, despite the write barrier issued by CPU~1, as |
| shown in Figure~\ref{fig:advsync:Read Barrier Needed}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/ReadBarrierNeeded} |
| \caption{Read Barrier Needed} |
| \ContributedBy{Figure}{fig:advsync:Read Barrier Needed}{David Howells} |
| \end{figure*} |
| |
| If, however, a read barrier were to be placed between the load of~\co{B} |
| and the load of~\co{A} on CPU~2, again with initial values of |
| {\tt \{A~=~0, B~=~9\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| <write barrier> & \\ |
| B = 2; & \\ |
| & LOAD B \\ |
| & <read barrier> \\ |
| & LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| then the partial ordering imposed by CPU~1's write barrier will be |
| perceived correctly by CPU~2, as shown in |
| Figure~\ref{fig:advsync:Read Barrier Supplied}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/ReadBarrierSupplied} |
| \caption{Read Barrier Supplied} |
| \ContributedBy{Figure}{fig:advsync:Read Barrier Supplied}{David Howells} |
| \end{figure*} |
| |
| To illustrate this more completely, consider what could happen if the code |
| contained a load of~\co{A} either side of the read barrier, once again |
| with the same initial values of |
| {\tt \{A~=~0, B~=~9\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = 1; & \\ |
| <write barrier> & \\ |
| B = 2; & \\ |
| & LOAD B \\ |
| & LOAD A \nf{(1\textsuperscript{st})} \\ |
| & <read barrier> \\ |
| & LOAD A \nf{(2\textsuperscript{nd})} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Even though the two loads of~\co{A} |
| both occur after the load of~\co{B}, they may both |
| come up with different values, as shown in |
| Figure~\ref{fig:advsync:Read Barrier Supplied, Double Load}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/ReadBarrierSupplied1} |
| \caption{Read Barrier Supplied, Double Load} |
| \ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Double Load}{David Howells} |
| \end{figure*} |
| |
| Of course, it may well be that CPU~1's update to~\co{A} becomes perceptible |
| to CPU~2 before the read barrier completes, as shown in |
| Figure~\ref{fig:advsync:Read Barrier Supplied, Take Two}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/ReadBarrierSupplied2} |
| \caption{Read Barrier Supplied, Take Two} |
| \ContributedBy{Figure}{fig:advsync:Read Barrier Supplied, Take Two}{David Howells} |
| \end{figure*} |
| |
| The guarantee is that the second load will always come up with \nbco{A == 1} |
| if the |
| load of~\co{B} came up with \nbco{B == 2}. |
| No such guarantee exists for the first load |
| of~\co{A}; that may come up with either \nbco{A == 0} or \nbco{A == 1}. |
| |
| \subsubsection{Read Memory Barriers vs. Load Speculation} |
| \label{sec:advsync:Read Memory Barriers vs. Load Speculation} |
| |
| Many CPUs speculate with loads: that is, they see that they will need to |
| load an item from memory, and they find a time where they're not using |
| the bus for any other loads, and then do the load in advance---even though |
| they haven't actually got to that point in the instruction execution |
| flow yet. |
| Later on, this potentially permits the actual load instruction to |
| complete immediately because the CPU already has the value on hand. |
| |
| It may turn out that the CPU didn't actually need the value (perhaps because a |
| branch circumvented the load) in which case it can discard the value or just |
| cache it for later use. |
| For example, consider the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| & LOAD B \\ |
| & DIVIDE \\ |
| & DIVIDE \\ |
| & LOAD A \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| On some CPUs, divide instructions can take a long time to complete, |
| which means that CPU~2's bus might go idle during that time. |
| CPU~2 might therefore speculatively load~\co{A} before the divides |
| complete. |
In the (hopefully) unlikely event of an exception from one of the divides,
| this speculative load will have been wasted, but in the (again, hopefully) |
| common case, overlapping the load with the divides will permit the load |
| to complete more quickly, as illustrated by |
| Figure~\ref{fig:advsync:Speculative Load}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/SpeculativeLoad} |
| \caption{Speculative Load} |
| \ContributedBy{Figure}{fig:advsync:Speculative Load}{David Howells} |
| \end{figure*} |
| |
| Placing a read barrier or a data dependency barrier just before the second |
| load: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| & LOAD B \\ |
| & DIVIDE \\ |
| & DIVIDE \\ |
| & <read barrier> \\ |
| & LOAD A \\ |
| \end{tabular} |
| \vspace{5pt} |
| \end{minipage} |
| % |
| will force any value speculatively obtained to be reconsidered to an extent |
| dependent on the type of barrier used. If there was no change made to the |
| speculated memory location, then the speculated value will just be used, |
| as shown in |
| Figure~\ref{fig:advsync:Speculative Loads and Barrier}. |
| On the other hand, if there was an update or invalidation to~\co{A} |
| from some other CPU, then the speculation will be cancelled and the |
| value of~\co{A} will be reloaded, |
| as shown in Figure~\ref{fig:advsync:Speculative Loads Cancelled by Barrier}. |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/SpeculativeLoadBarrier} |
| \caption{Speculative Load and Barrier} |
| \ContributedBy{Figure}{fig:advsync:Speculative Loads and Barrier}{David Howells} |
| \end{figure*} |
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{advsync/SpeculativeLoadBarrierCancel} |
| \caption{Speculative Load Cancelled by Barrier} |
| \ContributedBy{Figure}{fig:advsync:Speculative Loads Cancelled by Barrier}{David Howells} |
| \end{figure*} |
| |
| \subsubsection{Transitivity} |
| \label{sec:advsync:Transitivity} |
| |
| ``Transitivity'' is a deeply intuitive notion about ordering that is not |
| always provided by real computer systems. The following example |
| demonstrates transitivity with initial values of |
| {\tt \{X~=~0, Y~=~0\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l|l} |
| \nf{CPU 1} & \nf{CPU 2} & \nf{CPU 3} \\ |
| \hline |
| STORE X = 1 & LOAD X & STORE Y = 1 \\ |
| & <general barrier> & <general barrier> \\ |
| & LOAD Y & LOAD X \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Suppose that CPU~2's load from~\co{X} returns~1 and its load from~\co{Y} returns~0. |
This indicates that CPU~2's load from~\co{X} in some sense follows CPU~1's
store to~\co{X} and that CPU~2's load from~\co{Y} in some sense precedes CPU~3's
store to~\co{Y}. The question is then ``Can CPU~3's load from~\co{X} return~0?''
| |
| Because CPU~2's load from~\co{X} in some sense came after CPU~1's store, it |
| is natural to expect that CPU~3's load from~\co{X} must therefore return~1. |
| This expectation is an example of transitivity: if a load executing on |
| CPU~A follows a load from the same variable executing on CPU~B, then |
| CPU~A's load must either return the same value that CPU~B's load did, |
| or must return some later value. |
| |
| In the Linux kernel, use of general memory barriers guarantees |
| transitivity. Therefore, in the above example, if CPU~2's load from~\co{X} |
| returns~1 and its load from~\co{Y} returns~0, then CPU~3's load from~\co{X} must |
| also return~1. |
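The three-CPU example can be rendered in C11 atomics, with
\co{atomic_thread_fence(memory_order_seq_cst)} standing in for the kernel's
general barrier \co{smp_mb()}; the function and result-variable names are
assumptions for illustration. Executed sequentially in the order
\co{cpu1()}, \co{cpu2()}, \co{cpu3()}, it produces exactly the outcome that
transitivity mandates:

```c
#include <stdatomic.h>

atomic_int X, Y;		/* both initially zero */
int r_x2, r_y2, r_x3;		/* results observed by CPUs 2 and 3 */

void cpu1(void)
{
	atomic_store(&X, 1);			/* STORE X = 1 */
}

void cpu2(void)
{
	r_x2 = atomic_load(&X);			/* LOAD X */
	atomic_thread_fence(memory_order_seq_cst); /* <general barrier> */
	r_y2 = atomic_load(&Y);			/* LOAD Y */
}

void cpu3(void)
{
	atomic_store(&Y, 1);			/* STORE Y = 1 */
	atomic_thread_fence(memory_order_seq_cst); /* <general barrier> */
	r_x3 = atomic_load(&X);			/* LOAD X */
}
```

In this sequential run, \co{r_x2 == 1} and \co{r_y2 == 0}, so transitivity
requires \co{r_x3 == 1}; with general (\co{seq_cst}) barriers, that
requirement holds on any interleaving as well.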
| |
| However, transitivity is {\em not} guaranteed for read or write barriers. |
| For example, suppose that CPU~2's general barrier in the above example |
| is changed to a read barrier as shown below with the same initial values of |
| {\tt \{X~=~0, Y~=~0\}}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|l|l} |
| \nf{CPU 1} & \nf{CPU 2} & \nf{CPU 3} \\ |
| \hline |
| STORE X = 1 & LOAD X & STORE Y = 1 \\ |
| & <read barrier> & <general barrier> \\ |
| & LOAD Y & LOAD X \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This substitution destroys transitivity: in this example, it is perfectly |
| legal for CPU~2's load from~\co{X} to return~1, its load from~\co{Y} to return~0, |
| and CPU~3's load from~\co{X} to return~0. |
| |
| The key point is that although CPU~2's read barrier orders its pair |
| of loads, it does not guarantee to order CPU~1's store. Therefore, if |
| this example runs on a system where CPUs~1 and~2 share a store buffer |
| or a level of cache, CPU~2 might have early access to CPU~1's writes. |
| General barriers are therefore required to ensure that all CPUs agree |
| on the combined order of CPU~1's and CPU~2's accesses. |
| |
| General barriers provide ``global transitivity'', so that all CPUs will |
| agree on the order of operations. In contrast, a chain of release-acquire |
| pairs provides only ``local transitivity'', so that only those CPUs on |
| the chain are guaranteed to agree on the combined order of the accesses. |
For example, switching to C~code in deference to Herman Hollerith,
consider the release-acquire chain shown in
Figure~\ref{fig:advsync:Example of Local Transitivity (Release-Acquire Pair)}:
| |
| \begin{figure}[htbp] |
| \scriptsize |
| \centering |
| \begin{verbbox} |
| 1 int u, v, x, y, z; |
| 2 |
| 3 void cpu0(void) |
| 4 { |
| 5 r0 = smp_load_acquire(&x); |
| 6 WRITE_ONCE(u, 1); |
| 7 smp_store_release(&y, 1); |
| 8 } |
| 9 |
| 10 void cpu1(void) |
| 11 { |
| 12 r1 = smp_load_acquire(&y); |
| 13 r4 = READ_ONCE(v); |
| 14 r5 = READ_ONCE(u); |
| 15 smp_store_release(&z, 1); |
| 16 } |
| 17 |
| 18 void cpu2(void) |
| 19 { |
| 20 r2 = smp_load_acquire(&z); |
| 21 smp_store_release(&x, 1); |
| 22 } |
| 23 |
| 24 void cpu3(void) |
| 25 { |
| 26 WRITE_ONCE(v, 1); |
| 27 smp_mb(); |
| 28 r3 = READ_ONCE(u); |
| 29 } |
| \end{verbbox} |
| \theverbbox |
| \caption{Example of Local Transitivity (Release-Acquire Pair)} |
| \label{fig:advsync:Example of Local Transitivity (Release-Acquire Pair)} |
| \end{figure} |
| |
| Because \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} participate in a local transitive |
| chain of \co{smp_store_release()}-\co{smp_load_acquire()} pairs, the following |
| outcome is prohibited: |
| |
| {\scriptsize |
| \begin{verbatim} |
| r0 == 1 && r1 == 1 && r2 == 1 |
| \end{verbatim} |
| } |
| |
| Furthermore, because of the release-acquire relationship between \co{cpu0()} |
| and \co{cpu1()}, \co{cpu1()} must see \co{cpu0()}'s writes, so that the following |
| outcome is prohibited: |
| |
| {\scriptsize |
| \begin{verbatim} |
| r1 == 1 && r5 == 0 |
| \end{verbatim} |
| } |
| |
| However, the transitivity of release-acquire is local to the participating |
| CPUs and does not apply to \co{cpu3()}. Therefore, the following outcome |
| is possible: |
| |
| {\scriptsize |
| \begin{verbatim} |
| r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 |
| \end{verbatim} |
| } |
| |
| As an aside, the following outcome is also possible: |
| |
| {\scriptsize |
| \begin{verbatim} |
| r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 |
| && r5 == 1 |
| \end{verbatim} |
| } |
| |
| Although \co{cpu0()}, \co{cpu1()}, and \co{cpu2()} will see their respective reads and |
| writes in order, CPUs not involved in the release-acquire chain might |
| well disagree on the order. This disagreement stems from the fact that |
| the weak memory-barrier instructions used to implement \co{smp_load_acquire()} |
| and \co{smp_store_release()} are not required to order prior stores against |
| subsequent loads in all cases. This means that \co{cpu3()} can see \co{cpu0()}'s |
| store to~\co{u} as happening {\em after} \co{cpu1()}'s load from~\co{v}, even though |
| both \co{cpu0()} and \co{cpu1()} agree that these two operations occurred in the |
| intended order. |
| |
| However, please keep in mind that \co{smp_load_acquire()} is not magic. |
| In particular, it simply reads from its argument with ordering. It does |
| {\em not} ensure that any particular value will be read. Therefore, the |
| following outcome is possible: |
| |
| {\scriptsize |
| \begin{verbatim} |
| r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 |
| \end{verbatim} |
| } |
| |
| Note that this outcome can happen even on a mythical sequentially |
| consistent system where nothing is ever reordered. |
| |
| To reiterate, if your code requires global transitivity, use general |
| barriers throughout. |
| |
| \subsection{Locking Constraints} |
| \label{sec:advsync:Locking Constraints} |
| |
| As noted earlier, locking primitives contain implicit memory barriers. |
| These primitives imply the following: |
| \begin{enumerate} |
| \item ACQUIRE operation implication: |
| \begin{itemize} |
| \item Memory operations issued after an ACQUIRE will be completed |
| after the ACQUIRE operation has completed. |
| \item Memory operations issued before an ACQUIRE may be completed |
| after the ACQUIRE operation has completed. |
| \end{itemize} |
| \item RELEASE operation implication: |
| \begin{itemize} |
| \item Memory operations issued before a RELEASE will be |
| completed before the RELEASE operation has completed. |
| \item Memory operations issued after a RELEASE may be completed |
| before the RELEASE operation has completed. |
| \end{itemize} |
| \item ACQUIRE vs ACQUIRE implication: |
| \begin{itemize} |
| \item All ACQUIRE operations issued before another ACQUIRE operation |
| will be completed before that ACQUIRE operation. |
| \end{itemize} |
| \item ACQUIRE vs RELEASE implication: |
| \begin{itemize} |
| \item All ACQUIRE operations issued before a RELEASE operation |
| will be completed before the RELEASE operation. |
| \item All RELEASE operations issued before an ACQUIRE operation |
| will be completed before the ACQUIRE operation. |
| \end{itemize} |
| \item Failed conditional ACQUIRE implication: |
| \begin{itemize} |
| \item Certain variants of ACQUIRE operation may fail, either |
| due to being unable to get the lock immediately, or due |
| to receiving an unblocked signal or exception |
| whilst asleep waiting |
| for the lock to become available. Failed ACQUIREs do not |
| imply any sort of barrier. |
| \end{itemize} |
| \end{enumerate} |
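These implications can be illustrated by a minimal user-space spinlock
built from C11 atomics. This is only a sketch of the semantics, not the
kernel's implementation, and the names are made up: the lock operation is
an ACQUIRE, the unlock a RELEASE, and a failed \co{spin_trylock()} implies
no barrier at all:

```c
#include <stdatomic.h>

/* A one-flag spinlock: held when the flag is set. */
typedef struct {
	atomic_flag held;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

void spin_lock(spinlock_t *l)
{
	/* ACQUIRE: memory operations after this cannot move before it. */
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

void spin_unlock(spinlock_t *l)
{
	/* RELEASE: memory operations before this cannot move after it. */
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

int spin_trylock(spinlock_t *l)
{
	/* Returns nonzero on success; a failed conditional ACQUIRE
	 * implies no sort of barrier. */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

/* Demonstrate mutual exclusion: a second trylock must fail while held. */
int trylock_demo(void)
{
	static spinlock_t l = SPINLOCK_INIT;
	int first = spin_trylock(&l);	/* should succeed */
	int second = spin_trylock(&l);	/* should fail: already held */

	spin_unlock(&l);
	return first && !second;
}
```

Note that nothing here prevents operations preceding \co{spin_lock()} or
following \co{spin_unlock()} from ``bleeding into'' the critical section,
matching the second bullet under each implication above.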
| |
| \subsection{Memory-Barrier Examples} |
| \label{sec:advsync:Memory-Barrier Examples} |
| |
| \subsubsection{Locking Examples} |
| |
| \paragraph{ACQUIRE Followed by RELEASE:} |
| An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier |
| because it is possible for an access preceding the ACQUIRE to happen after the |
| ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and the |
| two accesses can themselves then cross. |
| For example, the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 *A = a; |
| 2 ACQUIRE |
| 3 RELEASE |
| 4 *B = b; |
| \end{verbatim} |
| \vspace{1pt} |
| \end{minipage} |
| % |
| might well execute in the following order: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 2 ACQUIRE |
| 4 *B = b; |
| 1 *A = a; |
| 3 RELEASE |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Again, always remember that ACQUIRE and RELEASE are permitted to let preceding |
| operations and following operations ``bleed in'' to the critical section |
| respectively. |
| |
| \QuickQuiz{} |
| What sequence of ACQUIRE-RELEASE operations \emph{would} |
| act as a full memory barrier? |
| \QuickQuizAnswer{ |
| A series of two back-to-back ACQUIRE-RELEASE operations, or, somewhat |
| less conventionally, a RELEASE operation followed by an ACQUIRE |
| operation. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| What (if any) CPUs have memory-barrier instructions |
| from which these semi-permeable locking primitives might |
| be constructed? |
| \QuickQuizAnswer{ |
| Itanium is one example. |
| The identification of any others is left as an |
| exercise for the reader. |
| } \QuickQuizEnd |
| |
| \paragraph{Lock-Based Critical Sections:} |
| Although an ACQUIRE\-/RELEASE pair does not act as a full memory barrier, |
| these operations \emph{do} affect memory ordering. |
| |
| Consider the following code: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 *A = a; |
| 2 *B = b; |
| 3 ACQUIRE |
| 4 *C = c; |
| 5 *D = d; |
| 6 RELEASE |
| 7 *E = e; |
| 8 *F = f; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This could legitimately execute in the following order, where pairs |
| of operations on the same line indicate that the CPU executed those |
| operations concurrently: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 3 ACQUIRE |
| 1 *A = a; *F = f; |
| 7 *E = e; |
| 4 *C = c; *D = d; |
| 2 *B = b; |
| 6 RELEASE |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \begin{table}[htbp] |
| \scriptsize\centering |
| \begin{tabular}{r|l} |
| \# & Ordering: legitimate or not? \\ |
| \hline |
| \hline |
| 1 & \verb|*A; *B; ACQUIRE; *C; *D; RELEASE; *E; *F;| \\ |
| \hline |
| 2 & \verb|*A; {*B; ACQUIRE;} *C; *D; RELEASE; *E; *F;| \\ |
| \hline |
| 3 & \verb|{*F; *A;} *B; ACQUIRE; *C; *D; RELEASE; *E;| \\ |
| \hline |
| 4 & \verb|*A; *B; {ACQUIRE; *C;} *D; {RELEASE; *E;} *F;| \\ |
| \hline |
| 5 & \verb|*B; ACQUIRE; *C; *D; *A; RELEASE; *E; *F;| \\ |
| \hline |
| 6 & \verb|*A; *B; *C; ACQUIRE; *D; RELEASE; *E; *F;| \\ |
| \hline |
| 7 & \verb|*A; *B; ACQUIRE; *C; RELEASE; *D; *E; *F;| \\ |
| \hline |
| 8 & \verb|{*B; *A; ACQUIRE;} {*D; *C;} {RELEASE; *F; *E;}| \\ |
| \hline |
| 9 & \verb|*B; ACQUIRE; *C; *D; RELEASE; {*F; *A;} *E;| \\ |
| \end{tabular} |
| \caption{Lock-Based Critical Sections} |
| \label{tab:advsync:Lock-Based Critical Sections} |
| \end{table} |
| |
| \QuickQuiz{} |
| Given that operations grouped in curly braces are executed |
| concurrently, which of the rows of |
| Table~\ref{tab:advsync:Lock-Based Critical Sections} |
| are legitimate reorderings of the assignments to variables~``A'' |
through~``F'' and the ACQUIRE/RELEASE operations?
| (The order in the code is {\tt *A, *B, ACQUIRE, *C, *D, RELEASE, *E, *F}.) |
| Why or why not? |
| \QuickQuizAnswer{ |
| \begin{enumerate} |
| \item Legitimate, executed in order. |
| \item Legitimate, the lock acquisition was executed concurrently |
| with the last assignment preceding the critical section. |
| \item Illegitimate, the assignment to~``F'' must follow the ACQUIRE |
| operation. |
| \item Illegitimate, the ACQUIRE must complete before any operation in |
| the critical section. However, the RELEASE may legitimately |
| be executed concurrently with subsequent operations. |
| \item Legitimate, the assignment to~``A'' precedes the RELEASE, |
| as required, and all other operations are in order. |
| \item Illegitimate, the assignment to~``C'' must follow the ACQUIRE. |
| \item Illegitimate, the assignment to~``D'' must precede the RELEASE. |
| \item Legitimate, all assignments are ordered with respect to the |
| ACQUIRE and RELEASE operations. |
| \item Illegitimate, the assignment to~``A'' must precede the RELEASE. |
| \end{enumerate} |
| } \QuickQuizEnd |
| |
| \paragraph{Ordering with Multiple Locks:} |
| Code containing multiple locks still sees ordering constraints from |
| those locks, but one must be careful to keep track of which lock is which. |
| For example, consider the code shown in |
| Table~\ref{tab:advsync:Ordering With Multiple Locks}, which uses |
| a pair of locks named~``M'' and~``Q''. |
| |
| \begin{table}[htbp] |
| \scriptsize\centering{\tt |
| \begin{tabular}{r|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = a; & E = e; \\ |
| ACQUIRE M; & ACQUIRE Q; \\ |
| B = b; & F = f; \\ |
| C = c; & G = g; \\ |
| RELEASE M; & RELEASE Q; \\ |
| D = d; & H = h; \\ |
| \end{tabular}} |
| \caption{Ordering With Multiple Locks} |
| \label{tab:advsync:Ordering With Multiple Locks} |
| \end{table} |
| |
| In this example, there are no guarantees as to what order the |
| assignments to variables~``A'' through~``H'' will appear in, other |
| than the constraints imposed by the locks themselves, as |
| described in the previous section. |
| |
| \QuickQuiz{} |
| What are the constraints for |
| Table~\ref{tab:advsync:Ordering With Multiple Locks}? |
| \QuickQuizAnswer{ |
| All CPUs must see the following ordering constraints: |
| \begin{enumerate} |
| \item ACQUIRE M precedes B, C, and D. |
| \item RELEASE M follows A, B, and C. |
| \item ACQUIRE Q precedes F, G, and H. |
| \item RELEASE Q follows E, F, and G. |
| \end{enumerate} |
| } \QuickQuizEnd |
| |
| \paragraph{Ordering with Multiple CPUs on One Lock:} |
What if, instead of acquiring the two different locks shown in
Table~\ref{tab:advsync:Ordering With Multiple Locks}, both CPUs acquire
the same lock, as shown in
Table~\ref{tab:advsync:Ordering With Multiple CPUs on One Lock}?
| |
| \begin{table}[htbp] |
| \scriptsize\centering{\tt |
| \begin{tabular}{r|l} |
| \nf{CPU 1} & \nf{CPU 2} \\ |
| \hline |
| A = a; & E = e; \\ |
| ACQUIRE M; & ACQUIRE M; \\ |
| B = b; & F = f; \\ |
| C = c; & G = g; \\ |
| RELEASE M; & RELEASE M; \\ |
| D = d; & H = h; \\ |
| \end{tabular}} |
| \caption{Ordering With Multiple CPUs on One Lock} |
| \label{tab:advsync:Ordering With Multiple CPUs on One Lock} |
| \end{table} |
| |
| In this case, either CPU~1 acquires~M before CPU~2 does, or vice versa. |
| In the first case, the assignments to~A, B, and~C must precede |
| those to~F, G, and~H. |
| On the other hand, if CPU~2 acquires the lock first, then the |
| assignments to~E, F, and~G must precede those to~B, C, and~D. |
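The serialization provided by a single lock can be checked with a small
POSIX-threads sketch, with \co{pthread_mutex_t} standing in for lock~M;
the \co{worker()} and \co{run_demo()} helpers are hypothetical. Whichever
thread acquires~M first, the two critical sections can never overlap:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static atomic_int in_cs;	/* threads currently in the critical section */
static atomic_int overlap;	/* set if both were ever inside at once */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000; i++) {
		pthread_mutex_lock(&M);		/* ACQUIRE M */
		if (atomic_fetch_add(&in_cs, 1) != 0)
			atomic_store(&overlap, 1); /* mutual exclusion broken */
		atomic_fetch_sub(&in_cs, 1);
		pthread_mutex_unlock(&M);	/* RELEASE M */
	}
	return NULL;
}

/* Returns 0 if the critical sections never overlapped. */
int run_demo(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return atomic_load(&overlap);
}
```

The lock orders only the critical sections themselves: the assignments
to~A, D, E, and~H, which lie outside them, remain free to bleed inward as
described earlier.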
| |
| \subsection{The Effects of the CPU Cache} |
| \label{sec:advsync:The Effects of the CPU Cache} |
| |
| The perceived ordering of memory operations is affected by the caches |
| that lie between the CPUs and memory, as well as by the cache coherence |
| protocol that maintains memory consistency and ordering. |
| From a software viewpoint, these caches are for all intents and purposes |
| part of memory. |
| Memory barriers can be thought of as acting on the vertical dotted line in |
| Figure~\ref{fig:advsync:Memory Architecture}, ensuring that the CPU |
| presents its values to memory in the proper order, as well as ensuring |
| that it sees changes made by other CPUs in the proper order. |
| |
| \begin{figure*}[htb] |
| \centering |
| \includegraphics{advsync/MemoryArchitecture} |
| \caption{Memory Architecture} |
| \ContributedBy{Figure}{fig:advsync:Memory Architecture}{David Howells} |
| \end{figure*} |
| |
| Although the caches can ``hide'' a given CPU's memory accesses from the rest of |
| the system, the cache-coherence protocol ensures that all other CPUs see |
| any effects of these hidden accesses, migrating and invalidating cachelines |
| as required. |
| Furthermore, the CPU core may execute instructions in any order, restricted |
| only by the requirement that program causality and memory ordering |
| appear to be maintained. |
| Some of these instructions may generate memory accesses that must be queued |
| in the CPU's memory access queue, but execution may nonetheless continue |
| until the CPU either fills up its internal resources or until it must |
| wait for some queued memory access to complete. |
| |
| \subsubsection{Cache Coherency} |
| \label{sec:advsync:Cache Coherency} |
| |
| Although cache-coherence protocols guarantee that a given CPU sees its |
| own accesses in order, and that all CPUs agree on the order of modifications |
| to a single variable contained within a single cache line, there is no |
| guarantee that modifications to different variables will be seen in |
| the same order by all CPUs---although some computer systems do make |
| some such guarantees, portable software cannot rely on them. |
| |
| \begin{figure*}[htb] |
| \centering |
| \includegraphics{advsync/SplitCache} |
| \caption{Split Caches} |
| \ContributedBy{Figure}{fig:advsync:Split Caches}{David Howells} |
| \end{figure*} |
| |
| To see why reordering can occur, consider the two-CPU system shown in |
| Figure~\ref{fig:advsync:Split Caches}, in which each CPU has a split |
| cache. |
| This system has the following properties: |
| \begin{enumerate} |
| \item An odd-numbered cache line may be in cache~A, cache~C, |
| in memory, or some combination of the above. |
| \item An even-numbered cache line may be in cache~B, cache~D, |
| in memory, or some combination of the above. |
| \item While the CPU core is interrogating one of its caches,\footnote{ |
| But note that in ``superscalar'' systems, the CPU |
| might well be accessing both halves of its cache at |
| once, and might in fact be performing multiple concurrent |
| accesses to each of the halves.} |
| its other cache is not necessarily quiescent. |
| This other cache may instead be responding to an invalidation |
| request, writing back a dirty cache line, |
| processing elements in the CPU's memory-access queue, and |
| so on. |
| \item Each cache has queues of operations that need to be applied |
| to that cache in order to maintain the required coherence |
| and ordering properties. |
| \item These queues are not necessarily flushed by loads from or |
| stores to cache lines affected by entries in those queues. |
| \end{enumerate} |
| |
| In short, if cache~A is busy, but cache~B is idle, then CPU~1's |
| stores to odd-numbered cache lines may be delayed compared to |
| CPU~2's stores to even-numbered cache lines. |
| In not-so-extreme cases, CPU~2 may see CPU~1's operations out |
| of order. |
| |
| \IfInBook{Much more detail on memory ordering in hardware and software |
| may be found in Appendix~\ref{chp:app:whymb:Why Memory Barriers?}. |
| } |
| {} |
| |
| \subsection{Where Are Memory Barriers Needed?} |
| \label{sec:advsync:Where Are Memory Barriers Needed?} |
| |
| Memory barriers are only required where there's a possibility of interaction |
| between two CPUs or between a CPU and a device. If it can be guaranteed that |
| there won't be any such interaction in any particular piece of code, then |
| memory barriers are unnecessary in that piece of code. |
| |
| Note that these are the \emph{minimum} guarantees. |
| Different architectures may give |
| more substantial guarantees, |
| \IfInBook{as discussed in Appendix~\ref{chp:app:whymb:Why Memory Barriers?},}{} |
| but they may \emph{not} |
| be relied upon outside of code specifically designed to run only on |
| the corresponding architecture. |
| |
However, primitives that implement atomic operations, such as locking
primitives and atomic data-structure manipulation and traversal primitives,
will normally include any needed memory barriers in their definitions.
There are some exceptions, though, such as \co{atomic_inc()} in the
Linux kernel, so be sure to review the documentation, and, if possible,
the actual implementations, for your software environment.
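The \co{atomic_inc()} exception has a direct C11 analogue: a relaxed
fetch-and-add updates the counter atomically but, like \co{atomic_inc()},
implies no memory ordering, whereas a fully ordered variant also acts as a
barrier. The names below are illustrative:

```c
#include <stdatomic.h>

atomic_int counter;	/* initially zero */

/* Atomic, but implies no ordering: analogous to the kernel's
 * barrier-free atomic_inc(). */
void count_event(void)
{
	atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

/* Atomic and fully ordered: analogous to the kernel's value-returning
 * atomics, which do imply full barriers.  Returns the old value. */
int count_and_publish(void)
{
	return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}
```

Both variants keep the counter consistent; only the second may be relied
upon to order surrounding accesses.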
| |
| One final word of advice: use of raw memory-barrier primitives should |
| be a last resort. |
| It is almost always better to use an existing primitive that takes |
| care of memory barriers. |