| % memorder/memorder.tex |
| |
| \QuickQuizChapter{chp:memorder:Memory Ordering}{Memory Ordering} |
| \OriginallyPublished{Chapter}{chp:memorder:Memory Ordering}{Linux Kernel Memory Barriers}{kernel.org}{Howells2009membartxt} |
| \OriginallyPublished{Chapter}{chp:memorder:Memory Ordering}{A formal kernel memory-ordering model}{Linux Weekly News}{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2} |
| % |
| \Epigraph{The art of progress is to preserve order amid change and to preserve change amid order.}{\emph{Alfred North Whitehead}} |
| |
Causality and sequencing are deeply intuitive, and hackers often
have a much stronger grasp of these concepts than does
| the general population. |
| These intuitions can be extremely powerful tools when writing, analyzing, |
| and debugging both sequential code and parallel code that makes |
| use of standard mutual-exclusion mechanisms, such as locking and |
| RCU. |
Unfortunately, these intuitions break down completely in the face of
| code that fails to use standard mechanisms, one important example |
| of course being the code that implements these standard mechanisms. |
| This chapter will help you gain an understanding of memory ordering |
| sufficient to implement synchronization primitives and other |
| performance-critical code. |
| |
| @@@ roadmap @@@ |
| |
| \section{Ordering: Why and How?} |
| \label{sec:memorder:Ordering: Why and How?} |
| |
| One motivation for memory ordering can be seen in the trivial-seeming |
| litmus test in |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test} |
| (\path{C-SB+o-o+o-o.litmus}), |
| which at first glance might |
| appear to guarantee that the \co{exists} clause never triggers.\footnote{ |
| Purists would instead say that the \co{exists} clause is |
| never \emph{satisfied}, but we use ``trigger'' here by |
| analogy with assertions.} |
| After all, if \nbco{0:r2=0} as shown in the \co{exists} clause,\footnote{ |
| That is, Thread~\co{P0()}'s instance of local variable \co{r2} |
| equals zero. |
| See Section~\ref{sec:formal:Anatomy of a Litmus Test} |
| for documentation of litmus-test nomenclature.} |
| we might hope that Thread~\co{P0()}'s |
| load from~\co{x1} must have happened before Thread~\co{P1()}'s store to~\co{x1}, |
| which might raise |
| further hopes that Thread~\co{P1()}'s load from~\co{x0} must happen after |
| Thread~\co{P0()}'s store to~\co{x0}, so that \nbco{1:r2=2}, |
| thus not triggering the \co{exists} clause. |
| The example is symmetric, so similar reasoning might lead |
| us to hope that \nbco{1:r2=0} guarantees that \nbco{0:r2=2}. |
| Unfortunately, the lack of memory barriers dashes these hopes. |
| The CPU is within its rights to reorder |
| the statements within both Thread~\co{P0()} and Thread~\co{P1()}, |
| even on relatively strongly ordered systems such as x86. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-SB+o-o+o-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| WRITE_ONCE(*x0, 2); |
| r2 = READ_ONCE(*x1); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| WRITE_ONCE(*x1, 2); |
| r2 = READ_ONCE(*x0); |
| } |
| |
| exists (1:r2=0 /\ 0:r2=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Memory Misordering: Store-Buffering Litmus Test} |
| \label{lst:memorder:Memory Misordering: Store-Buffering Litmus Test} |
| \end{listing} |
| |
| \QuickQuiz{} |
| The compiler can also reorder Thread~\co{P0()}'s and |
| Thread~\co{P1()}'s memory accesses in |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}, |
| right? |
| \QuickQuizAnswer{ |
| In general, compiler optimizations carry out more extensive |
| and profound reorderings than CPUs can. |
| However, in this case, the volatile accesses in |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} |
| prevent the compiler from reordering. |
And from doing much else as well, so the examples in this
| section will be making heavy use of |
| \co{READ_ONCE()} and \co{WRITE_ONCE()}. |
| % @@@ forward reference to a section describing READ_ONCE() |
| % @@@ and WRITE_ONCE(). |
| } \QuickQuizEnd |
| |
| This willingness to reorder can be confirmed using tools such as |
| \co{litmus7}~\cite{Alglave:2014:HCM:2594291.2594347}, |
| which found that the counter-intuitive ordering happened |
| 314 times out of 100,000,000 trials on my x86 laptop. |
| Oddly enough, the perfectly legal outcome where both loads return the |
| value 2 occurred less frequently, in this case, only 167 times.\footnote{ |
| Please note that results are sensitive to the exact hardware |
| configuration, |
| how heavily the system is loaded, and much else besides.} |
| The lesson here is clear: Increased counterintuitivity does not necessarily |
| imply decreased probability! |
| % Run on June 23, 2017: |
| % litmus7 -r 1000 -carch X86 C-SB+o-o+o-o.litmus |
| % Test C-SB+o-o+o-o Allowed |
| % Histogram (4 states) |
| % 314 *>0:r2=0; 1:r2=0; |
| % 49999625:>0:r2=2; 1:r2=0; |
| % 49999894:>0:r2=0; 1:r2=2; |
| % 167 :>0:r2=2; 1:r2=2; |
| |
The following sections show exactly where this intuition breaks down,
and then put forward a mental model of memory ordering that can help
you avoid these pitfalls.
| |
| Section~\ref{sec:memorder:Why Hardware Misordering?} |
| gives a brief overview of why hardware misorders memory accesses, and then |
| Section~\ref{sec:memorder:How to Force Ordering?} |
| gives an equally brief overview of how you can thwart evil hardware |
| misordering schemes. |
| Finally, Section~\ref{sec:memorder:Basic Rules of Thumb} |
lists some basic rules of thumb, which will be further refined in
| later sections. |
| |
| \subsection{Why Hardware Misordering?} |
| \label{sec:memorder:Why Hardware Misordering?} |
| |
| But why does memory misordering happen in the first place? |
| Can't CPUs keep track of ordering on their own? |
| Isn't that why we have computers in the first place, to keep track of things? |
| |
| Many people do indeed expect their computers to keep track of things, |
| but many also insist that they keep track of things quickly. |
| However, as seen in Chapter~\ref{chp:Hardware and its Habits}, |
| one difficulty that modern computer-system vendors face is that |
| the main memory cannot keep up with the CPU---modern CPUs can execute |
| hundreds of instructions in the time required to fetch a single variable |
| from memory. |
| CPUs therefore sport increasingly large caches, as seen back in |
| Figure~\ref{fig:cpu:System Hardware Architecture}, which means that |
| although the first load by a given CPU from a given variable will |
| result in an expensive \emph{cache miss} as was discussed in |
| Section~\ref{sec:cpu:Cache Misses}, subsequent |
| repeated loads from that variable by that CPU might execute |
| very quickly because the initial cache miss will have loaded that |
| variable into that CPU's cache. |
| |
| However, it is also necessary to accommodate frequent concurrent stores |
| from multiple CPUs to a set of shared variables. |
| In cache-coherent systems, if the caches hold multiple copies of a given |
| variable, all the copies of that variable must have the same value. |
| This works extremely well for concurrent loads, but not so well for |
| concurrent stores: Each store must do something about all |
| copies of the old value (another cache miss!), which, given the finite |
| speed of light and the atomic nature of matter, will be slower |
| than impatient software hackers would like. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{memorder/SystemArchSB}} |
| \caption{System Architecture With Store Buffers} |
| \label{fig:memorder:System Architecture With Store Buffers} |
| \end{figure} |
| |
| CPUs therefore come equipped with store buffers, as shown in |
| Figure~\ref{fig:memorder:System Architecture With Store Buffers}. |
| When a given CPU does a store to a variable that |
| is not present in that CPU's cache, then the new value |
| is instead placed in that CPU's store buffer. |
| The CPU can then proceed immediately, without having to wait for the |
| store to do something about all the old values of that variable |
| residing in other CPUs' caches. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}} |
| \caption{CPUs Can Do Things Out of Order} |
| \ContributedBy{Figure}{fig:memorder:CPUs Can Do Things Out of Order}{Melissa Broussard} |
| \end{figure} |
| |
| Although store buffers can greatly increase performance, |
| they can cause instructions and memory references to execute out |
| of order, which can in turn cause serious confusion, as illustrated in |
| Figure~\ref{fig:memorder:CPUs Can Do Things Out of Order}. |
| In particular, these store buffers can cause the memory misordering |
| shown in the store-buffering litmus test in |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.1in} |
| \begin{tabular}{r||l|l|l||l|l|l} |
| & \multicolumn{3}{c||}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\ |
| \cline{2-7} |
| & Instruction & Store Buffer & Cache & |
| Instruction & Store Buffer & Cache \\ |
| \hline |
| \hline |
| 1 & (Initial state) & & \tco{x1==0} & |
| (Initial state) & & \tco{x0==0} \\ |
| \hline |
| 2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} & |
| \tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\ |
| \hline |
| 3 & \tco{r2 = x1;} (0) & \tco{x0==2} & \tco{x1==0} & |
| \tco{r2 = x0;} (0) & \tco{x1==2} & \tco{x0==0} \\ |
| \hline |
| 4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} & |
| (Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\ |
| \hline |
| 5 & (Finish store) & & \tco{x0==2} & |
| (Finish store) & & \tco{x1==2} \\ |
| \end{tabular} |
| \caption{Memory Misordering: Store-Buffering Sequence of Events} |
| \label{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| \end{table*} |
| |
| Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| shows how this memory misordering can happen. |
| Row~1 shows the initial state, where CPU~0 has \co{x1} in its cache |
| and CPU~1 has \co{x0} in its cache, both variables having a value of zero. |
| Row~2 shows the state change due to each CPU's store (lines~9 and~18 of |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}). |
| Because neither CPU has the stored-to variable in its cache, both CPUs |
| record their stores in their respective store buffers. |
| |
| \QuickQuiz{} |
| But wait!!! |
| On row~2 of |
| Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| both \co{x0} and \co{x1} each have two values at the same time, |
| namely zero and two. |
| How can that possibly work??? |
| \QuickQuizAnswer{ |
| There is an underlying cache-coherence protocol that straightens |
| things out, which will be discussed later. |
| % @@@ Add forward reference. |
| But if you think that a given variable having two values at |
| the same time is surprising, just wait until you get to |
| Section~\ref{sec:memorder:Variables With Multiple Values}! |
| } \QuickQuizEnd |
| |
| Row~3 shows the two loads (lines~10 and~19 of |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}). |
| Because the variable being loaded by each CPU is in that CPU's cache, |
| each load immediately returns the cached value, which in both cases |
| is zero. |
| |
| But the CPUs are not done yet: Sooner or later, they must empty their |
| store buffers. |
| Because caches move data around in relatively large blocks called |
| \emph{cachelines}, and because each cacheline can hold several |
| variables, each CPU must get the cacheline into its own cache so |
| that it can update the portion of that cacheline corresponding |
| to the variable in its store buffer, but without disturbing any |
| other part of the cacheline. |
| Each CPU must also ensure that the cacheline is not present in any other |
| CPU's cache, for which a read-invalidate operation is used. |
| As shown on row~4, after both read-invalidate operations complete, |
| the two CPUs have traded cachelines, so that CPU~0's cache now contains |
| \co{x0} and CPU~1's cache now contains \co{x1}. |
| Once these two variables are in their new homes, each CPU can flush |
| its store buffer into the corresponding cache line, leaving each |
| variable with its final value as shown on row~5. |
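The sequence of events above can be sketched as a toy single-threaded
model, purely for illustration: each CPU's store first lands in its
store buffer while the loads still hit the stale cached copies, and
only afterwards do the store buffers drain into the caches.
(All structure and function names here are made up for this sketch; no
real coherence protocol is being modeled.)

```c
/* Toy single-threaded model of Table rows 1-5: each CPU has a
 * one-entry store buffer plus a cached copy of the variable that
 * the *other* CPU stores to.  Purely illustrative. */
struct toy_cpu {
	int sb_valid;	/* store buffer holds a pending store? */
	int sb_value;	/* value waiting in the store buffer */
	int cached;	/* locally cached (stale) copy of the other variable */
};

/* Runs the store-buffering sequence; returns the two loaded values. */
void toy_sb_sequence(int *r2_p0, int *r2_p1)
{
	/* Row 1: initial state, both cached copies are zero. */
	struct toy_cpu cpu0 = { 0, 0, 0 };	/* caches x1 == 0 */
	struct toy_cpu cpu1 = { 0, 0, 0 };	/* caches x0 == 0 */

	/* Row 2: both stores land in the store buffers, not the caches. */
	cpu0.sb_valid = 1; cpu0.sb_value = 2;	/* x0 = 2 */
	cpu1.sb_valid = 1; cpu1.sb_value = 2;	/* x1 = 2 */

	/* Row 3: both loads return the stale cached values (zero). */
	*r2_p0 = cpu0.cached;
	*r2_p1 = cpu1.cached;

	/* Rows 4-5: read-invalidate, then the store buffers drain. */
	cpu0.sb_valid = 0;	/* x0 == 2 now in CPU 0's cache */
	cpu1.sb_valid = 0;	/* x1 == 2 now in CPU 1's cache */
}
```

Note that both loads complete (row~3) before either store does
(row~5), which is exactly the counter-intuitive outcome flagged by the
\co{exists} clause.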
| |
| \QuickQuiz{} |
| But don't the values also need to be flushed from the cache |
| to main memory? |
| \QuickQuizAnswer{ |
| Perhaps surprisingly, not necessarily! |
| On some systems, |
| if the two variables are being used heavily, they might |
| be bounced back and forth between the CPUs' caches and never |
| land in main memory. |
| } \QuickQuizEnd |
| |
| In summary, store buffers are needed to allow CPUs to handle |
| store instructions efficiently, but they can result in |
| counter-intuitive memory misordering. |
| |
| But what do you do if your algorithm really needs its memory |
| references to be ordered? |
| For example, suppose that you are communicating with a driver using |
| a pair of flags, one that says whether or not the driver is running |
| and the other that says whether there is a request pending for that |
| driver. |
| The requester needs to set the request-pending flag, then check |
| the driver-running flag, and if false, wake it up. |
| Once the driver has serviced all the pending requests that it knows about, |
| it needs to clear its driver-running flag, then check the request-pending |
| flag to see if it needs to restart. |
| This very reasonable approach cannot work unless there is some way |
| to make sure that the hardware processes the stores and loads in order. |
| This is the subject of the next section. |
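For concreteness, the handshake just described might be sketched in
userspace C11 as follows, with
\co{atomic_thread_fence(memory_order_seq_cst)} standing in for the
ordering machinery that the next section introduces.
All function and variable names here are illustrative, not taken from
any real driver.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flags for the requester/driver handshake. */
static atomic_bool request_pending;
static atomic_bool driver_running;

static void wake_driver(void)
{
	/* Platform-specific wakeup; stubbed out for this sketch. */
	atomic_store_explicit(&driver_running, true, memory_order_relaxed);
}

/* Requester: set request-pending, then check driver-running. */
void submit_request(void)
{
	atomic_store_explicit(&request_pending, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* store before load */
	if (!atomic_load_explicit(&driver_running, memory_order_relaxed))
		wake_driver();
}

/* Driver: clear driver-running, then re-check request-pending. */
bool driver_should_restart(void)
{
	atomic_store_explicit(&driver_running, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* store before load */
	return atomic_load_explicit(&request_pending, memory_order_relaxed);
}
```

Without the two fences, the store-buffering pattern shown earlier
would permit both sides to miss each other: the requester could see
the driver still running while the driver sees no request pending.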
| |
| \subsection{How to Force Ordering?} |
| \label{sec:memorder:How to Force Ordering?} |
| |
| It turns out that there are compiler directives and standard |
| synchronization primitives (such as locking and RCU) |
| that are responsible for maintaining the illusion of ordering through use of |
| \emph{memory barriers} (for example, \co{smp_mb()} in the Linux kernel). |
| These memory barriers can be explicit instructions, as they are on |
| ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions, |
| as they often are on x86. |
| Since these standard synchronization primitives preserve the illusion of |
| ordering, your path of least resistance is to simply use these primitives, |
| thus allowing you to stop reading this section. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-SB+o-mb-o+o-mb-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| WRITE_ONCE(*x0, 2); |
| smp_mb(); |
| r2 = READ_ONCE(*x1); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| WRITE_ONCE(*x1, 2); |
| smp_mb(); |
| r2 = READ_ONCE(*x0); |
| } |
| |
| exists (1:r2=0 /\ 0:r2=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Memory Ordering: Store-Buffering Litmus Test} |
| \label{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| \end{listing} |
| |
| However, if you need to implement the synchronization primitives |
| themselves, or if you are simply interested in understanding how memory |
| ordering works, read on! |
| The first stop on the journey is |
| Listing~\ref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| (\path{C-SB+o-mb-o+o-mb-o.litmus}), |
| which places an \co{smp_mb()} Linux-kernel full memory barrier between |
| the store and load in both \co{P0()} and \co{P1()}, but is otherwise |
| identical to |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
| % Test C-SB+o-mb-o+o-mb-o Allowed |
| % Histogram (3 states) |
| % 49553298:>0:r2=2; 1:r2=0; |
| % 49636449:>0:r2=0; 1:r2=2; |
| % 810253:>0:r2=2; 1:r2=2; |
| % No |
| These barriers prevent the counter-intuitive outcome from happening |
| on 100,000,000 trials on my x86 laptop. |
| Interestingly enough, the added overhead due to these barriers causes the |
| legal outcome where both loads return the value two to happen more |
| than 800,000 times, as opposed to only 167 times for the |
| barrier-free code in |
| Listing~\ref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
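Readers without \co{litmus7} handy can approximate this experiment in
userspace: C11's \co{atomic_thread_fence(memory_order_seq_cst)} plays
the role of \co{smp_mb()}, and relaxed atomic accesses play the role
of \co{READ_ONCE()} and \co{WRITE_ONCE()}.
This is a sketch of the same test, not the kernel code itself.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace analogue of Listing C-SB+o-mb-o+o-mb-o. */
static atomic_int x0, x1;
static int r2_p0, r2_p1;

static void *p0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x0, 2, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	r2_p0 = atomic_load_explicit(&x1, memory_order_relaxed);
	return NULL;
}

static void *p1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x1, 2, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	r2_p1 = atomic_load_explicit(&x0, memory_order_relaxed);
	return NULL;
}

/* Run one trial; returns nonzero if the forbidden 0/0 outcome appeared. */
int sb_trial(void)
{
	pthread_t t0, t1;

	atomic_store(&x0, 0);
	atomic_store(&x1, 0);
	pthread_create(&t0, NULL, p0, NULL);
	pthread_create(&t1, NULL, p1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return r2_p0 == 0 && r2_p1 == 0;
}
```

With the fences in place, the C11 memory model forbids the 0/0
outcome, so \co{sb_trial()} should return zero no matter how many
times it is run; delete the two fences and a sufficiently large number
of trials will eventually produce it on most hardware.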
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.1in} |
| \begin{tabular}{r||l|l|l||l|l|l} |
| & \multicolumn{3}{c||}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\ |
| \cline{2-7} |
| & Instruction & Store Buffer & Cache & |
| Instruction & Store Buffer & Cache \\ |
| \hline |
| \hline |
| 1 & (Initial state) & & \tco{x1==0} & |
| (Initial state) & & \tco{x0==0} \\ |
| \hline |
| 2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} & |
| \tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\ |
| \hline |
| 3 & \tco{smp_mb();} & \tco{x0==2} & \tco{x1==0} & |
| \tco{smp_mb();} & \tco{x1==2} & \tco{x0==0} \\ |
| \hline |
| 4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} & |
| (Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\ |
| \hline |
| 5 & (Finish store) & & \tco{x0==2} & |
| (Finish store) & & \tco{x1==2} \\ |
| \hline |
| 6 & \tco{r2 = x1;} (2) & & \tco{x1==2} & |
| \tco{r2 = x0;} (2) & & \tco{x0==2} \\ |
| \end{tabular} |
| \caption{Memory Ordering: Store-Buffering Sequence of Events} |
| \label{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events} |
| \end{table*} |
| |
| These barriers have a profound effect on ordering, as can be seen in |
| Table~\ref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}. |
| Although the first two rows are the same as in |
| Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| and although the \co{smp_mb()} instructions on row~3 |
| do not change state |
| in and of themselves, they do cause the stores to complete |
| (rows~4 and~5) before the |
| loads (row~6), which rules out the counter-intuitive outcome shown in |
| Table~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}. |
Note that variables \co{x0} and \co{x1} each still have more than one
value on row~2; however, as promised earlier, the \co{smp_mb()}
instances straighten things out in the end.
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.7in} |
| \begin{tabular}{l||c||c|c|c|c||c|c|c|c|c|c|c} |
| & & \multicolumn{4}{c||}{Prior Ordered Operation} & |
| \multicolumn{7}{c}{Subsequent Ordered Operation} \\ |
| \cline{3-6} \cline{7-13} |
| Operation Providing Ordering & C & |
| Self & R & W & RMW & Self & R & W & DR & DW & RMW & SV \\ |
| \hline |
| \hline |
| Store, for example, \tco{WRITE_ONCE()} & & |
| Y & & & & & & & & & & Y \\ |
| \hline |
| Load, for example, \tco{READ_ONCE()} & & |
| Y & & & & & & & & Y & & Y \\ |
| \hline |
| Unsuccessful RMW operation & & |
| Y & & & & & & & & Y & & Y \\ |
| \hline |
| \tco{smp_read_barrier_depends()} & & |
| & Y & & & & & & Y & Y & & \\ |
| \hline |
| \tco{*_dereference()} & & |
| Y & & & & & & & Y & Y & & Y \\ |
| \hline |
| Successful \tco{*_acquire()} & & |
| R & & & & & Y & Y & Y & Y & Y & Y \\ |
| \hline |
| Successful \tco{*_release()} & C & |
| & Y & Y & Y & W & & & & & & Y \\ |
| \hline |
| \tco{smp_rmb()} & & |
| & Y & & R & & Y & & Y & & R & \\ |
| \hline |
| \tco{smp_wmb()} & & |
| & & Y & W & & & Y & & Y & W & \\ |
| \hline |
| \tco{smp_mb()} and \tco{synchronize_rcu()} & CP & |
| & Y & Y & Y & & Y & Y & Y & Y & Y & \\ |
| \hline |
| Successful full-strength non-\tco{void} RMW & CP & |
| Y & Y & Y & Y & Y & Y & Y & Y & Y & Y & Y \\ |
| \hline |
| \tco{smp_mb__before_atomic()} & CP & |
| & Y & Y & Y & & a & a & a & a & Y & \\ |
| \hline |
| \tco{smp_mb__after_atomic()} & CP & |
| & a & a & Y & & Y & Y & Y & Y & & \\ |
| \multicolumn{13}{c}{~} \\ |
| \multicolumn{1}{r}{Key:} & |
| \multicolumn{12}{l}{C: Ordering is cumulative} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{P: Ordering propagates} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{R: Read, for example, \tco{READ_ONCE()}, or read portion of RMW} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{W: Write, for example, \tco{WRITE_ONCE()}, or write portion of RMW} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{Y: Provides the specified ordering} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{a: Provides specified ordering given intervening RMW atomic operation} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{DR: Dependent read (address dependency, Section~\ref{sec:memorder:Address Dependencies})} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{DW: Dependent write (address, data, or control dependency, Sections~\ref{sec:memorder:Address Dependencies}--\ref{sec:memorder:Control Dependencies})} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{RMW: Atomic read-modify-write operation} \\ |
| \multicolumn{1}{r}{} & |
| \multicolumn{12}{l}{SV: Same-variable access} \\ |
| \end{tabular} |
| \caption{Linux-Kernel Memory-Ordering Cheat Sheet} |
| \label{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| \end{table*} |
| |
| Although full barriers such as \co{smp_mb()} have extremely strong |
| ordering guarantees, their strength comes at a high price. |
| A great many situations can be handled with much weaker ordering guarantees |
that use much cheaper memory-ordering instructions, or, in some cases, no
| memory-ordering instructions at all. |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| provides a cheatsheet of the Linux kernel's ordering primitives and their |
| guarantees. |
| Each row corresponds to a primitive or category of primitives that might |
| or might not provide ordering, with the columns labeled |
| ``Prior Ordered Operation'' and ``Subsequent Ordered Operation'' |
| being the operations that might (or might not) be ordered against. |
| Cells containing ``Y'' indicate that ordering is supplied unconditionally, |
| while other characters indicate that ordering is supplied only partially or |
| conditionally. |
| Blank cells indicate that no ordering is supplied. |
| |
| The \co{*_acquire} row covers \co{smp_load_acquire()}, |
| \co{cmpxchg_acquire()}, \co{xchg_acquire()}, and so on; |
| the \co{*_release} row covers \co{smp_store_release()}, |
| \co{cmpxchg_release()}, \co{xchg_release()}, and so on; and |
| the ``Successful full-strength non-\co{void} RMW'' row covers |
| \co{atomic_add_return()}, \co{atomic_add_unless()}, \co{atomic_dec_and_test()}, |
| \co{cmpxchg()}, \co{xchg()}, and so on. |
| The ``Successful'' qualifiers apply to primitives such as |
| \co{atomic_add_unless()}, \co{cmpxchg_acquire()}, and \co{cmpxchg_release()}, |
| which have no effect on either memory or on ordering when they indicate |
| failure, as indicated by the earlier ``Unsuccessful RMW operation'' row. |
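As a userspace illustration of the acquire and release rows, C11's
\co{memory_order_release} and \co{memory_order_acquire} correspond
closely to the kernel's \co{smp_store_release()} and
\co{smp_load_acquire()}.
The classic use case is message passing, sketched below with
illustrative names.

```c
#include <stdatomic.h>

/* Release/acquire message passing: the release store orders the
 * payload store before the flag store, and the acquire load orders
 * the flag load before the payload load. */
static int payload;
static atomic_int ready;

void producer(void)
{
	payload = 42;		/* plain store... */
	atomic_store_explicit(&ready, 1,
			      memory_order_release);	/* ...ordered first */
}

/* Spins until the release store is observed, then reads the payload. */
int consumer(void)
{
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		continue;	/* acquire orders subsequent accesses */
	return payload;		/* guaranteed to see 42 */
}
```

Per the cheat sheet, the acquire load orders only its own thread's
\emph{subsequent} accesses and the release store only its own thread's
\emph{prior} accesses, which is exactly what message passing needs and
is considerably cheaper than a full barrier on weakly ordered hardware.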
| |
| Column ``C'' indicates cumulativity and propagation, as explained in |
| Sections~\ref{sec:memorder:Cumulativity} |
| and~\ref{sec:memorder:Propagation}. |
| In the meantime, this column can usually be ignored when there |
| are at most two threads involved. |
| |
| \QuickQuiz{} |
| The rows in |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| seem quite random and confused. |
| Whatever is the conceptual basis of this table??? |
| \QuickQuizAnswer{ |
| The rows correspond roughly to hardware mechanisms of increasing |
| power and overhead. |
| |
		The \co{WRITE_ONCE()} row captures the fact that accesses to
		a single variable are always fully ordered, as indicated by
		the ``SV'' column.
| Note that all other operations providing ordering against an |
| access to a |
| specific variable also provide this same-variable ordering. |
| |
| The \co{READ_ONCE()} row captures the fact that (as of 2017) compilers |
| and CPUs do not indulge in user-visible speculative stores, so that |
| any store whose address, data, or execution depends on a prior load |
| will happen after that load completes. |
| At least assuming that these dependencies have been constructed |
| carefully as described in |
| Sections~\ref{sec:memorder:Address- and Data-Dependency Restrictions} |
| and~\ref{sec:memorder:Control-Dependency Restrictions}. |
| |
| The ``Unsuccessful RMW operation'' row captures the fact that |
| even an unsuccessful RMW has done a read, and that read is |
| every bit as good as a \co{READ_ONCE()}. |
| |
| The \co{smp_read_barrier_depends()} row captures the fact that, with the |
| notable exception of DEC Alpha, compilers and CPUs do not indulge |
| in user-visible breakage of address dependencies, again assuming |
| that these dependencies have been constructed carefully as described in |
| Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}. |
| |
| The \co{*_dereference()} row captures the address and data |
| dependency ordering provided by \co{lockless_dereference()}, |
| \co{rcu_dereference()}, and friends. |
| |
| The ``Successful \co{*_acquire()}'' row captures the fact that many |
| CPUs have special ``acquire'' forms of loads and of atomic RMW |
| instructions, |
| and that many other CPUs have light-weight memory-barrier |
| instructions that order prior loads against subsequent loads |
| and stores. |
| |
| The ``Successful \co{*_release()}'' row captures the fact that many |
| CPUs have special ``release'' forms of stores and of atomic RMW |
| instructions, and that many other CPUs have light-weight memory-barrier |
| instructions that order prior loads and stores against |
| subsequent stores. |
| |
| The \co{smp_rmb()} row captures the fact that many CPUs have |
| light-weight memory-barrier instructions that order prior loads against |
| subsequent loads. |
| Similarly, |
| the \co{smp_wmb()} row captures the fact that many CPUs have |
| light-weight memory-barrier instructions that order prior stores against |
| subsequent stores. |
| |
| None of the ordering operations thus far require prior stores to be |
| ordered against subsequent loads, which means that these operations |
| need not interfere with store buffers, whose main purpose in life |
| is in fact to reorder prior stores against subsequent loads. |
| The light-weight nature of these operations is precisely due to |
| their policy of store-buffer non-interference. |
| However, as noted earlier, it is sometimes necessary to interfere |
| with the store buffer in order to prevent prior stores from being |
| reordered against later stores, which brings us to the remaining |
| rows in this table. |
| |
		The \co{smp_mb()} row corresponds to the full memory barrier
| available on most platforms, with Itanium being the exception |
| that proves the rule. |
| |
| The ``Successful full-strength non-\co{void} RMW'' row captures |
| the fact that on some platforms (such as x86) atomic RMW instructions |
| provide full ordering both before and after. |
| The Linux kernel therefore requires that full-strength non-\co{void} |
| atomic RMW operations provide full ordering in cases where these |
| operations succeed. |
		(Full-strength atomic RMW operations' names do not end in
		\co{_relaxed}, \co{_acquire}, or \co{_release}.)
| |
| However, the Linux kernel does not require that \co{void} atomic |
| RMW operations provide any ordering whatsoever, with the |
| canonical example being \co{atomic_inc()}. |
		Therefore, these operations, along with failing non-\co{void}
		atomic RMW operations, may be preceded by \co{smp_mb__before_atomic()}
| and followed by \co{smp_mb__after_atomic()} to provide full |
| ordering for any accesses preceding or following both. |
| No ordering need be provided for accesses between the |
| \co{smp_mb__before_atomic()} (or, similarly, the |
| \co{smp_mb__after_atomic()}) and the atomic RMW operation, as |
| indicated by the ``a'' entries on the \co{smp_mb__before_atomic()} |
| and \co{smp_mb__after_atomic()} rows of the table. |
| |
| In short, any randomness in the table is due to the properties |
| of the underlying hardware, which are constrained by nothing other |
| than the laws of physics, as was explained back in |
| Chapter~\ref{chp:Hardware and its Habits}. |
| % forward reference to rf co fr section |
| } \QuickQuizEnd |
| |
| It is important to note that this table is just a cheat sheet, |
| and is therefore in no way a replacement for a good understanding |
| of memory ordering. |
| To begin building such an understanding, the next section will |
| present some basic rules of thumb. |
| |
| \subsection{Basic Rules of Thumb} |
| \label{sec:memorder:Basic Rules of Thumb} |
| |
| This section presents some basic rules of thumb that are ``good and |
| sufficient'' for a great many situations. |
| In fact, you could write a great deal of concurrent code having |
| excellent performance and scalability without needing anything more |
| than these rules of thumb. |
| |
| \QuickQuiz{} |
| But how can I know that a given project can be designed |
| and coded within the confines of these rules of thumb? |
| \QuickQuizAnswer{ |
| Much of the purpose of the remainder of this chapter is |
| to answer exactly that question! |
| } \QuickQuizEnd |
| |
| \paragraph{A given thread sees its own accesses in order.} |
| This rule assumes that loads and stores from/to shared variables use |
| \co{READ_ONCE()} and \co{WRITE_ONCE()}, respectively. |
| Otherwise, the compiler can profoundly scramble\footnote{ |
| Many compiler writers prefer the word ``optimize'' instead of |
| ``scramble'', but we all have our preferences.} |
| your code, and sometimes the CPU can do a bit of scrambling as well. |
| % @@@ Itanium forward reference? |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{memorder/memorybarrier}} |
| \caption{Memory Barriers Provide Conditional If-Then Ordering} |
| \label{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| \end{figure} |
| |
| \paragraph{Ordering has conditional if-then semantics.} |
| Figure~\ref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| illustrates this for memory barriers. |
| Assuming that both memory barriers are strong enough (and when in doubt, you |
can always use \co{smp_mb()}), if CPU~1's access Y1 happens after CPU~0's
access Y0, then CPU~1's access X1 is guaranteed to happen after CPU~0's
access X0.
| |
| \QuickQuiz{} |
| How can you tell which memory barriers are strong enough for |
| a given use case? |
| \QuickQuizAnswer{ |
| Ah, that is a deep question whose answer requires most of the |
| rest of this chapter. |
| } \QuickQuizEnd |
| |
| Listing~\ref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| is a case in point. |
The \co{smp_mb()} instances on lines~10 and~20 serve as the barriers,
the store to \co{x0} on line~9 as X0, the load from \co{x1} on line~11
as Y0, the store to \co{x1} on line~19 as Y1, and the load from
\co{x0} on line~21 as X1.
| Applying the if-then rule step by step, we know that the store to |
| \co{x1} on line~19 happens after the load from \co{x1} on line~11 if |
| \co{P0()}'s local variable \co{r2} is set to the value zero. |
| The if-then rule would then state that the load from \co{x0} on |
| line~21 happens after the store to \co{x0} on line~9. |
| In other words, |
| \co{P1()}'s local variable \co{r2} is guaranteed |
| to end up with the value two \emph{only if} |
| \co{P0()}'s local variable \co{r2} ends up with the value zero. |
| Again, memory ordering guarantees are conditional, not absolute. |
| |
| Although |
| Figure~\ref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| specifically mentions memory barriers, the same rule applies to the |
| rest of the Linux kernel's ordering operations. |
| |
| \paragraph{Ordering operations must be paired.} |
| If you carefully order the operations in one thread, but then fail to do |
| so in another thread, then there is no ordering. |
| Both threads must provide ordering for the if-then rule to apply.\footnote{ |
| In Section~\ref{sec:memorder:Propagation}, pairing will be |
| generalized to cycles.} |
| |
| \paragraph{Ordering operations almost never speed things up.} |
| If you find yourself tempted to add a memory barrier in an attempt |
| to force a prior store to be flushed to memory faster, resist! |
| Adding ordering usually slows things down. |
| Of course, there are situations where adding instructions speeds things |
| up, but careful benchmarking is required in such cases. |
| And even then, it is quite possible that although you sped things up |
| a little bit on \emph{your} system, you might well have slowed things |
| down significantly on your users' systems. |
| Or on your future system. |
| |
| \paragraph{Ordering operations are not magic.} |
| When your program is failing due to some race condition, it is often |
| tempting to toss in a few memory-ordering operations in an attempt |
| to barrier your bugs out of existence. |
| A far better reaction is to use higher-level primitives in a carefully |
| designed manner. |
| With concurrent programming, it is almost always easier to design |
| your bugs out of existence than to hack them out of existence! |
| |
| \paragraph{These are only rough rules of thumb.} |
| Although these rules of thumb cover the vast majority of situations |
| seen in actual practice, as with any set of rules of thumb, they |
| do have their limits. |
| The next section will demonstrate some of these limits by introducing |
| trick-and-trap litmus tests that are intended to insult your |
| intuition while increasing your understanding. |
| These litmus tests will also illuminate many of the concepts |
| represented by the Linux-kernel memory-ordering cheat sheet shown in |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| Section~\ref{sec:memorder:Where is Memory Ordering Needed?} will |
| circle back to this cheat sheet in light of learnings from all the |
| intervening tricks and traps. |
| |
| \section{Tricks and Traps} |
| \label{sec:memorder:Tricks and Traps} |
| |
| Now that you know that hardware can reorder memory accesses and that you |
| can prevent it from doing so, the next step is to get you to admit |
| that your intuition has a problem. |
| This painful task is taken up by |
| Section~\ref{sec:memorder:Variables With Multiple Values}, |
| which presents some code demonstrating that scalar variables can |
| take on multiple values simultaneously, |
| and by |
| Sections~\ref{sec:memorder:Memory-Reference Reordering} through |
| \ref{sec:memorder:Multicopy Atomicity}, |
| which show a series of intuitively correct code fragments that fail miserably |
| on real hardware. |
| Once your intuition has made it through the grieving process, later |
| sections will summarize the basic rules that memory ordering follows. |
| |
| But first, let's take a quick look at just how many values a single |
| variable might have at a single moment of time. |
| |
| \subsection{Variables With Multiple Values} |
| \label{sec:memorder:Variables With Multiple Values} |
| |
| It is natural to think of a variable as taking on a well-defined |
| sequence of values in a well-defined, global order. |
| Unfortunately, the next stop on the journey says ``goodbye'' to this comforting fiction. |
| Hopefully, you already started to say ``goodbye'' in response to row~2 of |
| Tables~\ref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| and~\ref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}, |
| and if so, the purpose of this section is to drive this point home. |
| |
| To this end, consider the program fragment shown in |
| Listing~\ref{lst:memorder:Software Logic Analyzer}. |
| This code fragment is executed in parallel by several CPUs. |
| Line~1 sets a shared variable to the current CPU's ID, line~2 |
| initializes several variables from a \co{gettb()} function that |
| delivers the value of a fine-grained hardware ``timebase'' counter that is |
| synchronized among all CPUs (not available from all CPU architectures, |
| unfortunately!), and the loop from lines~3-8 records the length of |
| time that the variable retains the value that this CPU assigned to it. |
| Of course, one of the CPUs will ``win'', and would thus never exit |
| the loop if not for the check on lines~6-7. |
| |
| \QuickQuiz{} |
| What assumption is the code fragment |
| in Listing~\ref{lst:memorder:Software Logic Analyzer} |
| making that might not be valid on real hardware? |
| \QuickQuizAnswer{ |
| The code assumes that as soon as a given CPU stops |
| seeing its own value, it will immediately see the |
| final agreed-upon value. |
| On real hardware, some of the CPUs might well see several |
| intermediate results before converging on the final value. |
| The actual code used to produce the data in the figures |
| discussed later in this section was therefore somewhat more |
| complex. |
| } \QuickQuizEnd |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 state.variable = mycpu; |
| 2 lasttb = oldtb = firsttb = gettb(); |
| 3 while (state.variable == mycpu) { |
| 4 lasttb = oldtb; |
| 5 oldtb = gettb(); |
| 6 if (lasttb - firsttb > 1000) |
| 7 break; |
| 8 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Software Logic Analyzer} |
| \label{lst:memorder:Software Logic Analyzer} |
| \end{listing} |
| |
| Upon exit from the loop, \co{firsttb} will hold a timestamp |
| taken shortly after the assignment and \co{lasttb} will hold |
| a timestamp taken before the last sampling of the shared variable |
| that still retained the assigned value, or a value equal to \co{firsttb} |
| if the shared variable had changed before entry into the loop. |
| This allows us to plot each CPU's view of the value of \co{state.variable} |
| over a 532-nanosecond time period, as shown in |
| Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}. |
This data was collected in 2006 on a 1.5\,GHz POWER5 system with 8 cores,
| each containing a pair of hardware threads. |
| CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test. |
| The timebase counter period was about 5.32\,ns, sufficiently fine-grained |
| to allow observations of intermediate cache states. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{memorder/MoreThanOneValue}} |
| \caption{A Variable With Multiple Simultaneous Values} |
| \label{fig:memorder:A Variable With Multiple Simultaneous Values} |
| \end{figure} |
| |
| Each horizontal bar represents the observations of a given CPU over time, |
| with the black regions to the left indicating the time before the |
| corresponding CPU's first measurement. |
| During the first 5\,ns, only CPU~3 has an opinion about the value of the |
| variable. |
| During the next 10\,ns, CPUs~2 and~3 disagree on the value of the variable, |
| but thereafter agree that the value is~``2'', which is in fact |
| the final agreed-upon value. |
| However, CPU~1 believes that the value is~``1'' for almost 300\,ns, and |
| CPU~4 believes that the value is~``4'' for almost 500\,ns. |
| |
| \QuickQuiz{} |
| How could CPUs possibly have different views of the |
| value of a single variable \emph{at the same time?} |
| \QuickQuizAnswer{ |
| As discussed in |
| Section~\ref{sec:memorder:Why Hardware Misordering?}, |
| many CPUs have store buffers that record the values of |
| recent stores, which do not become globally visible until |
| the corresponding cache line makes its way to the CPU. |
| Therefore, it is quite possible for each CPU to see a |
| different value for a given variable at a single point |
| in time---and for main memory to hold yet another value. |
| One of the reasons that memory barriers were invented was |
| to allow software to deal gracefully with situations like |
| this one. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why do CPUs~2 and~3 come to agreement so quickly, when it |
| takes so long for CPUs~1 and~4 to come to the party? |
| \QuickQuizAnswer{ |
| CPUs~2 and~3 are a pair of hardware threads on the same |
| core, sharing the same cache hierarchy, and therefore have |
| very low communications latencies. |
| This is a NUMA, or, more accurately, a NUCA effect. |
| |
| This leads to the question of why CPUs~2 and~3 ever disagree |
| at all. |
| One possible reason is that they each might have a small amount |
| of private cache in addition to a larger shared cache. |
| Another possible reason is instruction reordering, given the |
| short 10-nanosecond duration of the disagreement and the |
| total lack of memory-ordering operations in the code fragment. |
| } \QuickQuizEnd |
| |
| And if you think that the situation with four CPUs was intriguing, consider |
| Figure~\ref{fig:memorder:A Variable With More Simultaneous Values}, |
| which shows the same situation, but with 15~CPUs each assigning their |
| number to a single shared variable at time~$t=0$. Both diagrams in the |
| figure are drawn in the same way as |
| Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}. |
The only difference is that the unit of the horizontal axis is timebase ticks,
with each tick lasting about 5.3~nanoseconds.
| The entire sequence therefore lasts a bit longer than the events recorded in |
| Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}, |
| consistent with the increase in number of CPUs. |
The upper diagram shows the overall picture, while the lower one zooms
in on the first 50~timebase ticks.
| |
Again, CPU~0 coordinates the test, and so does not record any values.
| |
| \begin{figure*} |
| \centering |
| \resizebox{5in}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}} |
| \caption{A Variable With More Simultaneous Values} |
| \ContributedBy{Figure}{fig:memorder:A Variable With More Simultaneous Values}{Akira Yokosawa} |
| \end{figure*} |
| |
| All CPUs eventually agree on the final value of~9, but not before |
| the values~15 and~12 take early leads. |
Note that there are fourteen different opinions on the variable's value
at time~21, as indicated by the vertical line in the lower diagram.
| Note also that all CPUs see sequences whose orderings are consistent with |
| the directed graph shown in |
| Figure~\ref{fig:memorder:Possible Global Orders With More Simultaneous Values}. |
| Nevertheless, both figures underscore the importance of |
| proper use of memory-ordering operations, such as memory barriers, |
| for code that cares about memory ordering. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{2.0in}{!}{\includegraphics{memorder/store15tred}} |
| \caption{Possible Global Orders With More Simultaneous Values} |
| \label{fig:memorder:Possible Global Orders With More Simultaneous Values} |
| \end{figure} |
| |
| How many values can a single variable take on at a single point in |
| time? |
| As many as one per store buffer in the system! |
| We have therefore entered a regime where we must bid a fond farewell to |
| comfortable intuitions about values of variables and the passage of time. |
| This is the regime where memory-ordering operations are needed. |
| |
| All that aside, it is important to remember the lessons from |
| Chapters~\ref{chp:Hardware and its Habits} |
| and~\ref{cha:Partitioning and Synchronization Design}. |
| Having all CPUs store concurrently to the same variable |
| is absolutely no way to design a parallel program, at least |
| not if performance and scalability are at all important to you. |
| |
| Unfortunately, memory ordering has many other ways of insulting your |
| intuition, and not all of these ways conflict with performance and |
| scalability. |
The next section will give an overview of the reordering of unrelated
memory references.
| |
| \subsection{Memory-Reference Reordering} |
| \label{sec:memorder:Memory-Reference Reordering} |
| |
| Section~\ref{sec:memorder:Why Hardware Misordering?} |
| showed that even relatively strongly ordered systems like x86 |
| can reorder prior stores with later loads, at least when the |
| store and load are to different variables. |
| This section builds on that result, looking at the other combinations of |
| loads and stores. |
| |
| % @@@ Rationale for further reordering. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-wmb-o+o-o |
| |
| { |
| } |
| |
| |
| P0(int* x0, int* x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, 2); |
| |
| } |
| |
| P1(int* x0, int* x1) { |
| |
| int r2; |
| int r3; |
| |
| r2 = READ_ONCE(*x1); |
| r3 = READ_ONCE(*x0); |
| |
| } |
| |
| exists (1:r2=2 /\ 1:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Message-Passing Litmus Test (No Ordering)} |
| \label{lst:memorder:Message-Passing Litmus Test (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Load Followed By Load} |
| \label{sec:memorder:Load Followed By Load} |
| |
| Listing~\ref{lst:memorder:Message-Passing Litmus Test (No Ordering)} |
| (\path{C-MP+o-wmb-o+o-o.litmus}) |
| shows the classic \emph{message-passing} litmus test, where \co{x0} is |
| the message and \co{x1} is a flag indicating whether or not a message |
| is available. |
In this test, the \co{smp_wmb()} forces \co{P0()}'s stores to be ordered,
| but no ordering is specified for the loads. |
| Relatively strongly ordered architectures, such as x86, do enforce ordering. |
| However, weakly ordered architectures often do |
| not~\cite{JadeAlglave2011ppcmem}. |
| Therefore, the \co{exists} clause on line~25 of the listing \emph{can} |
| trigger. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-wmb-o+o-rmb-o |
| |
| { |
| } |
| |
| P0(int* x0, int* x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, 2); |
| |
| } |
| |
| P1(int* x0, int* x1) { |
| |
| int r2; |
| int r3; |
| |
| r2 = READ_ONCE(*x1); |
| smp_rmb(); |
| r3 = READ_ONCE(*x0); |
| |
| } |
| |
| exists (1:r2=2 /\ 1:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Enforcing Order of Message-Passing Litmus Test} |
| \label{lst:memorder:Enforcing Order of Message-Passing Litmus Test} |
| \end{listing} |
| |
| Thus, portable code relying on ordering in this case must |
| add explicit ordering, for example, the \co{smp_rmb()} shown on |
| line~20 of |
| Listing~\ref{lst:memorder:Enforcing Order of Message-Passing Litmus Test} |
| (\path{C-MP+o-wmb-o+o-rmb-o.litmus}), which prevents |
| the \co{exists} clause from triggering. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+o-o+o-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| WRITE_ONCE(*x0, 2); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x0); |
| WRITE_ONCE(*x1, 2); |
| } |
| |
| exists (1:r2=2 /\ 0:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Load-Buffering Litmus Test (No Ordering)} |
| \label{lst:memorder:Load-Buffering Litmus Test (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Load Followed By Store} |
| \label{sec:memorder:Load Followed By Store} |
| |
| Listing~\ref{lst:memorder:Load-Buffering Litmus Test (No Ordering)} |
| (\path{C-LB+o-o+o-o.litmus}) |
| shows the classic \emph{load-buffering} litmus test. |
| Although relatively strongly ordered systems such as x86 |
| or the IBM Mainframe do not reorder prior loads with subsequent stores, |
| more weakly ordered architectures really do allow such |
| reordering~\cite{JadeAlglave2011ppcmem}. |
| Therefore, the \co{exists} clause on line~22 really can trigger. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+o-r+a-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| smp_store_release(x0, 2); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x0); |
| WRITE_ONCE(*x1, 2); |
| } |
| |
| exists (1:r2=2 /\ 0:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Enforcing Ordering of Load-Buffering Litmus Test} |
| \label{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test} |
| \end{listing} |
| |
| Although it is rare for actual hardware to |
| exhibit this reordering~\cite{LucMaranget2017aarch64}, |
| portable code must nevertheless enforce any required ordering, for example, |
| as shown in |
| Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test} |
| (\path{C-LB+o-r+a-o.litmus}). |
| This guarantees that the \co{exists} clause on line~22 never triggers. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-o+o-rmb-o |
| |
| { |
| } |
| |
| P0(int* x0, int* x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| WRITE_ONCE(*x1, 2); |
| |
| } |
| |
| P1(int* x0, int* x1) { |
| |
| int r2; |
| int r3; |
| |
| r2 = READ_ONCE(*x1); |
| smp_rmb(); |
| r3 = READ_ONCE(*x0); |
| |
| } |
| |
| exists (1:r2=2 /\ 1:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Message-Passing Litmus Test, No Writer Ordering (No Ordering)} |
| \label{lst:memorder:Message-Passing Litmus Test, No Writer Ordering (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Store Followed By Store} |
| \label{sec:memorder:Store Followed By Store} |
| |
| Listing~\ref{lst:memorder:Message-Passing Litmus Test, No Writer Ordering (No Ordering)} |
| (\path{C-MP+o-o+o-rmb-o.litmus}) |
| once again shows the classic message-passing litmus test, but without |
| explicit ordering for \co{P0()}'s stores and with the \co{smp_rmb()} |
| providing ordering for \co{P1()}'s loads. |
| Again, the relatively strongly ordered architectures do enforce ordering, |
| but weakly ordered architectures do not necessarily do |
| so~\cite{JadeAlglave2011ppcmem}, which means that the |
| \co{exists} clause can trigger. |
| Therefore, portable code must explicitly order the stores, for |
| example, as shown in |
| Listing~\ref{lst:memorder:Enforcing Order of Message-Passing Litmus Test}, |
| thus preventing the \co{exists} clause from triggering. |
| |
| \subsection{Address Dependencies} |
| \label{sec:memorder:Address Dependencies} |
| |
| An \emph{address dependency} occurs when the value returned by a load |
| instruction is used to compute the address used by a later memory-reference |
| instruction. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-wmb-o+o-addr-o |
| |
| { |
| int y=1; |
| int *x1 = &y; |
| } |
| |
| P0(int* x0, int** x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, x0); |
| |
| } |
| |
| P1(int** x1) { |
| |
| int *r2; |
| int r3; |
| |
| r2 = READ_ONCE(*x1); |
| r3 = READ_ONCE(*r2); |
| |
| } |
| |
| exists (1:r2=x0 /\ 1:r3=1) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Message-Passing Address-Dependency Litmus Test (No Ordering)} |
| \label{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)} |
| (\path{C-MP+o-wmb-o+o-addr-o.litmus}) |
| shows a linked variant of the message-passing pattern. |
| The head pointer is \co{x1}, which initially |
| references the \co{int} variable \co{y} (line~5), which is in turn |
| initialized to the value $1$ (line~4). |
\co{P0()} updates head pointer \co{x1} to reference \co{x0} (line~12),
but only after initializing \co{x0} to $2$ (line~10) and forcing ordering
(line~11).
| \co{P1()} picks up the head pointer \co{x1} (line~21), and then loads |
| the referenced value (line~22). |
| There is thus an address dependency from the load on line~21 to the |
| load on line~22. |
| In this case, the value returned by line~21 is exactly the address |
| used by line~22, but many variations are possible, |
| including field access using the C-language \co{->} operator, |
| addition, subtraction, and array indexing. |
| |
| One might hope that line~21's load from the head pointer would be ordered |
| before line~22's dereference. |
| However, this is not the case on DEC Alpha, which can in effect use |
| a speculated value for the dependent load, as described in more detail in |
| Section~\ref{sec:memorder:Alpha}. |
| Therefore, |
| Listing~\ref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering)}'s |
| \co{exists} clause \emph{can} trigger. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-wmb-o+ld-addr-o |
| |
| { |
| int y=1; |
| int *x1 = &y; |
| } |
| |
| P0(int* x0, int** x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, x0); |
| |
| } |
| |
| P1(int** x1) { |
| |
| int *r2; |
| int r3; |
| |
| r2 = lockless_dereference(*x1); |
| r3 = READ_ONCE(*r2); |
| |
| } |
| |
| exists (1:r2=x0 /\ 1:r3=1) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Enforced Ordering of Message-Passing Address-Dependency Litmus Test} |
| \label{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test} |
| (\path{C-MP+o-wmb-o+ld-addr-o.litmus}) |
| shows how to make this work portably, even on DEC Alpha, by |
| replacing line~21's \co{READ_ONCE()} with \co{lockless_dereference()}, |
| which acts like \co{READ_ONCE()} on all platforms other than DEC Alpha, |
| where it acts like a \co{READ_ONCE()} followed by an \co{smp_mb()}, |
| thereby forcing the required ordering on all platforms, in turn |
| preventing the \co{exists} clause from triggering. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-S+o-wmb-o+o-addr-o |
| |
| { |
| int y=1; |
| int *x1 = &y; |
| } |
| |
| P0(int* x0, int** x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, x0); |
| |
| } |
| |
| P1(int** x1) { |
| |
| int *r2; |
| |
| r2 = READ_ONCE(*x1); |
| WRITE_ONCE(*r2, 3); |
| |
| } |
| |
| exists (1:r2=x0 /\ x0=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{S Address-Dependency Litmus Test} |
| \label{lst:memorder:S Address-Dependency Litmus Test} |
| \end{listing} |
| |
| But what happens if the dependent operation is a store rather than |
| a load, for example, in the \emph{S} |
| litmus test~\cite{JadeAlglave2011ppcmem} shown in |
| Listing~\ref{lst:memorder:S Address-Dependency Litmus Test} |
| (\path{C-S+o-wmb-o+o-addr-o.litmus})? |
| Because no production-quality platform speculates stores, |
| it is not possible for the \co{WRITE_ONCE()} on line~10 to overwrite |
| the \co{WRITE_ONCE()} on line~21, meaning that the \co{exists} |
| clause on line~25 cannot trigger, even on DEC Alpha, even |
| without the \co{lockless_dereference()} that is required in the |
| dependent-load case. |
| |
| \QuickQuiz{} |
| But how do we know that \emph{all} platforms really avoid |
| triggering the \co{exists} clauses in |
| Listings~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test} |
| and~\ref{lst:memorder:S Address-Dependency Litmus Test}? |
| \QuickQuizAnswer{ |
| Answering this requires identifying three major groups of platforms: |
| (1)~Total-store-order (TSO) platforms, |
		(2)~Weakly ordered platforms, and
| (3)~DEC Alpha. |
| |
| The TSO platforms order all pairs of memory references except for |
| prior stores against later loads. |
| Because the address dependency on lines~21 and~22 of |
| Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test} |
| is instead a load followed by another load, TSO platforms preserve |
| this address dependency. |
| They also preserve the address dependency on lines~20 and~21 of |
| Listing~\ref{lst:memorder:S Address-Dependency Litmus Test} |
| because this is a load followed by a store. |
| Because address dependencies must start with a load, TSO platforms |
| implicitly but completely respect them. |
| |
| Weakly ordered platforms don't necessarily maintain ordering of |
| unrelated accesses. |
		However, the accesses in
		Listings~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}
		and~\ref{lst:memorder:S Address-Dependency Litmus Test}
		are not unrelated: They are linked by address dependencies.
| The hardware tracks dependencies and maintains the needed |
| ordering. |
| |
| There is one (famous) exception to this rule for weakly ordered |
| platforms, and that exception is DEC Alpha for load-to-load |
| address dependencies. |
| And this is why DEC Alpha requires the explicit memory barrier |
		supplied for it by the \co{lockless_dereference()} on line~21 of
		Listing~\ref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test}.
| However, DEC Alpha does track load-to-store address dependencies, |
		which is why line~20 of
		Listing~\ref{lst:memorder:S Address-Dependency Litmus Test}
| does not have a \co{lockless_dereference()}. |
| |
| To sum up, current platforms either respect address dependencies |
| implicitly, as is the case for TSO platforms (x86, mainframe, |
| SPARC,~...), have hardware tracking for address dependencies |
| (ARM, PowerPC, MIPS,~...), or have the required memory barriers |
| supplied by \co{lockless_dereference()} (DEC Alpha). |
| } \QuickQuizEnd |
| |
| However, it is important to note that address dependencies can |
| be fragile and easily broken by compiler optimizations, as discussed in |
| Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}. |
| |
| \subsection{Data Dependencies} |
| \label{sec:memorder:Data Reordering} |
| |
| A \emph{data dependency} occurs when the value returned by a load |
| instruction is used to compute the data stored by a later store |
| instruction. |
| Note well the ``data'' above: If the value returned by a load |
| was instead used to compute the address used by a later store |
| instruction, that would instead be an address dependency. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+o-r+o-data-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| smp_store_release(x0, 2); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x0); |
| WRITE_ONCE(*x1, r2); |
| } |
| |
| exists (1:r2=2 /\ 0:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Load-Buffering Data-Dependency Litmus Test} |
| \label{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| (\path{C-LB+o-r+o-data-o.litmus}) |
| is similar to |
| Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}, |
| except that \co{P1()}'s ordering between lines~18 and~19 is |
| enforced not by an acquire load, but instead by a data dependency: |
| The value loaded by line~18 is what line~19 stores. |
| The ordering provided by this data dependency is sufficient to prevent |
| the \co{exists} clause from triggering. |
| |
| \QuickQuiz{} |
| Why doesn't line~18 of |
| Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| need a \co{lockless_dereference()}? |
| \QuickQuizAnswer{ |
| Data dependencies are always load-to-store dependencies, and |
| so all platforms respect them, even DEC Alpha, and for the |
| same reasons that they respect load-to-store address dependencies. |
| } \QuickQuizEnd |
| |
| Just as with address dependencies, data dependencies are |
| fragile and can be easily broken by compiler optimizations, as discussed in |
| Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions}. |
| In fact, data dependencies can be even more fragile than are address |
| dependencies. |
| The reason for this is that address dependencies normally involve |
| pointer values. |
| In contrast, as shown in |
| Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}, |
| it is tempting to carry data dependencies through integral values, |
| which the compiler has much more freedom to optimize into nonexistence. |
| For but one example, if the integer loaded was multiplied by the constant |
| zero, the compiler would know that the result was zero, and could therefore |
| substitute the constant zero for the value loaded, thus breaking |
| the dependency. |
| |
| \QuickQuiz{} |
| But wait!!! |
| Line~18 of |
| Listing~\ref{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| uses \co{READ_ONCE()}, which marks the load as volatile, |
| which means that the compiler absolutely must emit the load |
| instruction even if the value is later multiplied by zero. |
| So do you really need to work so hard to keep the compiler from |
| breaking your data dependencies? |
| \QuickQuizAnswer{ |
| Yes, the compiler absolutely must emit a load instruction for |
| a volatile load. |
| But if you multiply the value loaded by zero, the compiler is |
| well within its rights to substitute a constant zero for the |
| result of that multiplication, which will break the data |
| dependency on many platforms. |
| |
| Worse yet, if the dependent store does not use \co{WRITE_ONCE()}, |
| the compiler could hoist it above the load, which would cause |
| even TSO platforms to fail to provide ordering. |
| } \QuickQuizEnd |
| |
| In short, you can rely on data dependencies, but only if you take care |
| to prevent your compiler from breaking them. |
| |
| \subsection{Control Dependencies} |
| \label{sec:memorder:Control Dependencies} |
| |
| A \emph{control dependency} occurs when the value returned by a load |
| instruction is tested to determine whether or not a later store instruction |
| is executed. |
| Note well the ``later store instruction'': Many platforms do not respect |
| load-to-load control dependencies. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+o-r+o-ctrl-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| smp_store_release(x0, 2); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x0); |
| if (r2 >= 0) |
| WRITE_ONCE(*x1, 2); |
| } |
| |
| exists (1:r2=2 /\ 0:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Load-Buffering Control-Dependency Litmus Test} |
| \label{lst:memorder:Load-Buffering Control-Dependency Litmus Test} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Load-Buffering Control-Dependency Litmus Test} |
| (\path{C-LB+o-r+o-ctrl-o.litmus}) |
| shows another load-buffering example, this time using a control |
| dependency (line~19) to order the load on line~18 and the store on |
| line~20. |
This ordering is sufficient to prevent the \co{exists} clause from triggering.
| |
| However, control dependencies are even more susceptible to being optimized |
| out of existence than are data dependencies, and |
| Section~\ref{sec:memorder:Control-Dependency Restrictions} |
| describes some of the rules that must be followed in order to prevent |
| your compiler from breaking your control dependencies. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP+o-r+o-ctrl-o |
| |
| { |
| } |
| |
| P0(int* x0, int* x1) { |
| |
| WRITE_ONCE(*x0, 2); |
| smp_store_release(x1, 2); |
| |
| } |
| |
| P1(int* x0, int* x1) { |
| int r2; |
| int r3 = 0; |
| |
| r2 = READ_ONCE(*x1); |
| if (r2 >= 0) |
| r3 = READ_ONCE(*x0); |
| |
| } |
| |
| exists (1:r2=2 /\ 1:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| \label{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| \end{listing} |
| |
| It is worth reiterating that control dependencies provide ordering only |
| from loads to stores. |
| Therefore, the load-to-load control dependency shown on lines~17-19 of |
| Listing~\ref{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| (\path{C-MP+o-r+o-ctrl-o.litmus}) |
| does \emph{not} provide ordering, and therefore does \emph{not} |
| prevent the \co{exists} clause from triggering. |
| |
| In summary, control dependencies can be useful, but they are |
| high-maintenance items. |
| You should therefore use them only when performance considerations |
| permit no other solution. |
| |
| \QuickQuiz{} |
| Wouldn't control dependencies be more robust if they were |
| mandated by language standards??? |
| \QuickQuizAnswer{ |
| In the fullness of time, perhaps they will be so mandated. |
| } \QuickQuizEnd |
| |
| \subsection{Cache Coherence} |
| \label{sec:memorder:Cache Coherence} |
| |
| On cache-coherent platforms, all CPUs agree on the order of loads and |
| stores to a given variable. |
| Fortunately, when \co{READ_ONCE()} and \co{WRITE_ONCE()} are used, |
| almost all platforms are cache-coherent, as indicated by the ``SV'' |
| column of the cheat sheet shown in |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| Unfortunately, this property is so popular that it has been named |
| multiple times, with ``single-variable SC'',\footnote{ |
| Recall that SC stands for sequentially consistent.} |
| ``single-copy atomic''~\cite{Stone:1995:SP:623262.623912}, |
| and just plain ``coherence''~\cite{JadeAlglave2011ppcmem} |
| having seen use. |
| Rather than further compound the confusion by inventing yet another term |
| for this concept, this book uses ``cache coherence'' and ``coherence'' |
| interchangeably. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-CCIRIW+o+o+o-o+o-o |
| |
| { |
| int x = 0; |
| } |
| |
| P0(int *x) |
| { |
| WRITE_ONCE(*x, 1); |
| } |
| |
| P1(int *x) |
| { |
| WRITE_ONCE(*x, 2); |
| } |
| |
| P2(int *x) |
| { |
| int r1; |
| int r2; |
| |
| r1 = READ_ONCE(*x); |
| r2 = READ_ONCE(*x); |
| } |
| |
| P3(int *x) |
| { |
| int r3; |
| int r4; |
| |
| r3 = READ_ONCE(*x); |
| r4 = READ_ONCE(*x); |
| } |
| |
| exists(2:r1=1 /\ 2:r2=2 /\ 3:r3=2 /\ 3:r4=1) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Cache-Coherent IRIW Litmus Test} |
| \label{lst:memorder:Cache-Coherent IRIW Litmus Test} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Cache-Coherent IRIW Litmus Test} |
| (\path{C-CCIRIW+o+o+o-o+o-o.litmus}) |
| shows a litmus test that tests for cache coherence, |
| where ``IRIW'' stands |
| for ``independent reads of independent writes''. |
| Because this litmus test uses only one variable, |
| \co{P2()} and \co{P3()} must agree |
| on the order of \co{P0()}'s and \co{P1()}'s stores. |
| In other words, if \co{P2()} believes that \co{P0()}'s store |
| came first, then \co{P3()} had better not believe that |
| \co{P1()}'s store came first. |
| And in fact the \co{exists} clause on line~35 will trigger if this |
| situation arises. |
| |
| It is tempting to speculate that different-sized overlapping loads |
| and stores to a single region of memory (as might be set up using |
| the C-language \co{union} keyword) would provide similar ordering |
| guarantees. |
| However, Flur et al.~discovered some surprisingly simple |
| litmus tests that demonstrate that these guarantees can be violated on |
| real hardware~\cite{Flur:2017:MCA:3093333.3009839}. |
| It is therefore necessary to restrict code to non-overlapping |
| same-sized aligned accesses to a given variable, at least if portability |
| is a consideration.\footnote{ |
| There is reason to believe that using atomic RMW operations |
| (for example, \co{xchg()}) for all the stores will |
| provide sequentially consistent ordering, but this has not |
| yet been proven either way.} |
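
For example, the following sketch (the names are invented for this
example, and do not come from any litmus test) sets up the kind of
different-sized overlapping accesses that the C-language \co{union}
keyword permits, and that Flur et al.~showed cannot portably be relied
upon for ordering:

```c
#include <stdint.h>

/* Illustrative sketch only: a 32-bit store overlapped by a 16-bit
 * load.  Which half of the word half[0] overlaps depends on the
 * endianness of the platform, which is one more reason to restrict
 * concurrent code to non-overlapping same-sized aligned accesses. */
union mixed_size {
	uint32_t whole;
	uint16_t half[2];
};

static uint16_t low_index_half(uint32_t v)
{
	union mixed_size m;

	m.whole = v;
	return m.half[0]; /* 0x0002 little-endian, 0x0001 big-endian */
}
```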
| |
| Adding more variables and threads increases the scope for reordering |
| and other counterintuitive behavior, as discussed in the next section. |
| |
| \subsection{Multicopy Atomicity} |
| \label{sec:memorder:Multicopy Atomicity} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{memorder/SystemArchBus}} |
| \caption{Global System Bus And Multi-Copy Atomicity} |
| \label{fig:memorder:Global System Bus And Multi-Copy Atomicity} |
| \end{figure} |
| |
| Threads running on a \emph{multicopy atomic}~\cite{Stone:1995:SP:623262.623912} |
| platform are guaranteed |
| to agree on the order of stores, even to different variables. |
| A useful mental model of such a system is the single-bus architecture |
| shown in |
| Figure~\ref{fig:memorder:Global System Bus And Multi-Copy Atomicity}. |
| If each store resulted in a message on the bus, and if the bus could |
| accommodate only one store at a time, then any pair of CPUs would |
| agree on the order of all stores that they observed. |
| Unfortunately, building a computer system as shown in the figure, |
| without store buffers or even caches, would result in glacial computation. |
| CPU vendors interested in providing multicopy atomicity have therefore |
| instead provided the slightly weaker |
| \emph{other-multicopy atomicity}~\cite[Section B2.3]{ARMv8A:2017}, |
| which excludes the CPU doing a given store from the requirement that all |
| CPUs agree on the order of all stores. |
| This means that if only a subset of CPUs are doing stores, the |
| other CPUs will agree on the order of stores, hence the ``other'' |
| in ``other-multicopy atomicity''. |
Unlike on multicopy-atomic platforms, on other-multicopy-atomic
platforms the CPU doing a given store is permitted to observe that
store early, which allows its later loads to obtain the newly stored
value directly from the store buffer.
| This in turn avoids abysmal performance. |
| |
| \QuickQuiz{} |
| Can you give a specific example showing different behavior for |
| multicopy atomic on the one hand and other-multicopy atomic |
| on the other? |
| \QuickQuizAnswer{ |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-MP-OMCA+o-o-o+o-rmb-o |
| |
| { |
| } |
| |
| P0(int *x, int *y) |
| { |
| int r0; |
| |
| WRITE_ONCE(*x, 1); |
| r0 = READ_ONCE(*x); |
| WRITE_ONCE(*y, r0); |
| } |
| |
| P1(int *x, int *y) |
| { |
| int r1; |
| int r2; |
| |
| r1 = READ_ONCE(*y); |
| smp_rmb(); |
| r2 = READ_ONCE(*x); |
| } |
| |
| exists (1:r1=1 /\ 1:r2=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| \label{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| (\path{C-MP-OMCA+o-o-o+o-rmb-o.litmus}) |
| shows such a test. |
| |
| On a multicopy-atomic platform, \co{P0()}'s store to \co{x} on |
| line~10 must become visible to both \co{P0()} and \co{P1()} |
| simultaneously. |
| Because this store becomes visible to \co{P0()} on line~11, before |
| \co{P0()}'s store to \co{y} on line~12, \co{P0()}'s store to |
| \co{x} must become visible before its store to \co{y} everywhere, |
| including \co{P1()}. |
| Therefore, if \co{P1()}'s load from \co{y} on line~20 returns the |
| value 1, so must its load from \co{x} on line~22, given that |
| the \co{smp_rmb()} on line~21 forces these two loads to execute |
| in order. |
| Therefore, the \co{exists} clause on line~25 cannot trigger on a |
| multicopy-atomic platform. |
| |
| In contrast, on an other-multicopy-atomic platform, \co{P0()} |
| could see its own store early, so that there would be no constraint |
on the order of visibility of the two stores to \co{P1()},
| which in turn allows the \co{exists} clause to trigger. |
| } \QuickQuizEnd |
| |
| Perhaps there will come a day when all platforms provide some flavor |
| of multi-copy atomicity, but |
| in the meantime, non-multicopy-atomic platforms do exist, and so software |
| does need to deal with them. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
C C-WRC+o+o-data-o+o-rmb-o
| |
| { |
| } |
| |
| P0(int *x) |
| { |
| WRITE_ONCE(*x, 1); |
| } |
| |
| P1(int *x, int* y) |
| { |
| int r1; |
| |
| r1 = READ_ONCE(*x); |
| WRITE_ONCE(*y, r1); |
| } |
| |
| P2(int *x, int* y) |
| { |
| int r2; |
| int r3; |
| |
| r2 = READ_ONCE(*y); |
| smp_rmb(); |
| r3 = READ_ONCE(*x); |
| } |
| |
| exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{WRC Litmus Test With Dependencies (No Ordering)} |
| \label{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
(\path{C-WRC+o+o-data-o+o-rmb-o.litmus})
| demonstrates multicopy atomicity, that is, on a multicopy-atomic platform, |
| the \co{exists} clause on line~29 cannot trigger. |
| In contrast, on a non-multicopy-atomic |
| platform this \co{exists} clause can trigger, despite |
| \co{P1()}'s accesses being ordered by a data dependency and \co{P2()}'s |
| accesses being ordered by an \co{smp_rmb()}. |
| Recall that the definition of multicopy atomicity requires that all |
| threads agree on the order of stores, which can be thought of as |
| all stores reaching all threads at the same time. |
| Therefore, a non-multicopy-atomic platform can have a store reach |
| different threads at different times. |
| In particular, \co{P0()}'s store might reach \co{P1()} long before it |
| reaches \co{P2()}, which raises the possibility that \co{P1()}'s store |
| might reach \co{P2()} before \co{P0()}'s store does. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{memorder/NonMCAplatform}} |
| \caption{Shared Store Buffers And Multi-Copy Atomicity} |
| \label{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity} |
| \end{figure} |
| |
| This leads to the question of why a real system constrained by the |
| usual laws of physics would ever trigger the \co{exists} clause of |
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}. |
A cartoonish diagram of such a system is shown in
| Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}. |
| CPU~0 and CPU~1 share a store buffer, as do CPUs~2 and~3. |
| This means that CPU~1 can load a value out of the store buffer, thus |
| potentially immediately seeing a value stored by CPU~0. |
| In contrast, CPUs~2 and~3 will have to wait for the corresponding cache |
| line to carry this new value to them. |
| |
| \QuickQuiz{} |
| Then who would even \emph{think} of designing a system with shared |
| store buffers??? |
| \QuickQuizAnswer{ |
| This is in fact a very natural design for any system having |
| multiple hardware threads per core. |
| Natural from a hardware point of view, that is! |
| } \QuickQuizEnd |
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.8in} |
| \begin{tabular}{r||l|l|l|l||l|l|l} |
| & \tco{P0()} & \multicolumn{2}{c|}{\tco{P0()} \& \tco{P1()}} & \tco{P1()} & \multicolumn{3}{c}{\tco{P2()}} \\ |
| \cline{2-8} |
| & Instruction & Store Buffer & Cache & Instruction & |
| Instruction & Store Buffer & Cache \\ |
| \hline |
| \hline |
| 1 & (Initial state) & & \tco{y==0} & |
| (Initial state) & |
| (Initial state) & & \tco{x==0} \\ |
| \hline |
| 2 & \tco{x = 1;} & \tco{x==1} & \tco{y==0} & |
| & & & \tco{x==0} \\ |
| \hline |
| 3 & (Read-Invalidate \tco{x}) & \tco{x==1} & \tco{y==0} & \tco{r1 = x} (1) |
| & & & \tco{x==0} \\ |
| \hline |
| 4 & & \tco{x==1} \tco{y==1} & \tco{y==0} & \tco{y = r1} |
| & \tco{r2 = y} & & \tco{x==0} \\ |
| \hline |
| 5 & & \tco{x==1} & \tco{y==1} & (Finish store) |
| & (Read \tco{y}) & & \tco{x==0} \\ |
| \hline |
| 6 & (Respond \tco{y}) & \tco{x==1} & \tco{y==1} & |
| & (\tco{r2==1}) & & \tco{x==0} \tco{y==1} \\ |
| \hline |
| 7 & & \tco{x==1} & \tco{y==1} & |
| & \tco{smp_rmb()} & & \tco{x==0} \tco{y==1} \\ |
| \hline |
| 8 & & \tco{x==1} & \tco{y==1} & |
| & \tco{r3 = x (0)} & & \tco{x==0} \tco{y==1} \\ |
| \hline |
| 9 & & \tco{x==1} & \tco{x==0} \tco{y==1} & |
| & (Respond \tco{x}) & & \tco{y==1} \\ |
| \hline |
| 10 & (Finish store) & & \tco{x==1} \tco{y==1} & |
| & & & \tco{y==1} \\ |
| \end{tabular} |
| \caption{Memory Ordering: WRC Sequence of Events} |
| \label{tab:memorder:Memory Ordering: WRC Sequence of Events} |
| \end{table*} |
| |
| Table~\ref{tab:memorder:Memory Ordering: WRC Sequence of Events} |
| shows one sequence of events that can result in the \co{exists} clause in |
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| triggering. |
| This sequence of events will depend critically on \co{P0()} and |
| \co{P1()} sharing both cache and a store buffer in the manner shown in |
| Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}. |
| |
| \QuickQuiz{} |
| But just how is it fair that P0 and P1 must share a store buffer |
| and a cache, but P2 gets one each of its very own??? |
| \QuickQuizAnswer{ |
| Presumably there is a P3, as is in fact shown in |
| Figure~\ref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}, |
| that shares P2's store buffer and cache. |
| But not necessarily. |
| Some platforms allow different cores to disable different numbers |
| of threads, allowing the hardware to adjust to the needs of the |
| workload at hand. |
| For example, a single-threaded critical-path portion of the workload |
| might be assigned to a core with only one thread enabled, thus |
| allowing the single thread running that portion of the workload |
| to use the entire capabilities of that core. |
| Other more highly parallel but cache-miss-prone portions of the |
| workload might be assigned to cores with all hardware threads |
| enabled to provide improved throughput. |
| This improved throughput could be due to the fact that while one |
| hardware thread is stalled on a cache miss, the other hardware |
| threads can make forward progress. |
| |
| In such cases, performance requirements override quaint human |
| notions of fairness. |
| } \QuickQuizEnd |
| |
| Row~1 shows the initial state, with the initial value of \co{y} in |
| P0's and P1's shared cache, and the initial value of \co{x} in |
| P2's cache. |
| |
| Row~2 shows the immediate effect of \co{P0()} executing its store on line~8. |
| Because the cacheline containing \co{x} is not in \co{P0()}'s and \co{P1()}'s |
| shared cache, the new value (\co{1}) is stored in the shared store buffer. |
| |
| Row~3 shows two transitions. |
| First, \co{P0()} issues a read-invalidate operation to fetch the cacheline |
| containing \co{x} so that it can flush the new value for \co{x} out of |
| the shared store buffer. |
| Second, \co{P1()} loads from \co{x} (line~15), an operation that completes |
| immediately because the new value of \co{x} is immediately available |
| from the shared store buffer. |
| |
| Row~4 also shows two transitions. |
| First, it shows the immediate effect of \co{P1()} executing its store to |
| \co{y} (line~16), placing the new value into the shared store buffer. |
| Second, it shows the start of \co{P2()}'s load from \co{y} (line~24). |
| |
| Row~5 continues the tradition of showing two transitions. |
| First, it shows \co{P1()} complete its store to \co{y}, flushing |
| from the shared store buffer to the cache. |
| Second, it shows \co{P2()} request the cacheline containing \co{y}. |
| |
| Row~6 shows \co{P2()} receive the cacheline containing \co{y}, allowing |
| it to finish its load into \co{r2}, which takes on the value \co{1}. |
| |
| Row~7 shows \co{P2()} execute its \co{smp_rmb()} (line~25), thus keeping |
| its two loads ordered. |
| |
| Row~8 shows \co{P2()} execute its load from \co{x}, which immediately |
| returns with the value zero from \co{P2()}'s cache. |
| |
| Row~9 shows \co{P2()} \emph{finally} responding to \co{P0()}'s request for |
| the cacheline containing \co{x}, which was made way back up on row~3. |
| |
| Finally, row~10 shows \co{P0()} finish its store, flushing its value of |
| \co{x} from the shared store buffer to the shared cache. |
| |
| Note well that the \co{exists} clause on line~29 has triggered. |
The values of \co{r1} and \co{r2} are both one, and
the final value of \co{r3} is zero.
| This strange result occurred because \co{P0()}'s new value of \co{x} was |
| communicated to \co{P1()} long before it was communicated to \co{P2()}. |
| |
| \QuickQuiz{} |
| Referring to |
| Table~\ref{tab:memorder:Memory Ordering: WRC Sequence of Events}, |
| why on earth would \co{P0()}'s store take so long to complete when |
\co{P1()}'s store completes so quickly?
In other words, does the \co{exists} clause on line~29 of
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| really trigger on real systems? |
| \QuickQuizAnswer{ |
| You need to face the fact that it really can trigger. |
| Akira Yokosawa used the \co{litmus7} tool to run this litmus test |
| on a Power8 system. |
| Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause. |
Thus, triggering the \co{exists} clause is not merely a one-in-a-million
occurrence, but rather roughly a one-in-250-million occurrence.
| But it nevertheless really does trigger on real systems. |
| } \QuickQuizEnd |
| |
| This counter-intuitive result happens because although dependencies |
| do provide ordering, they provide it only within the confines of their |
| own thread. |
| This three-thread example requires stronger ordering, which |
| is the subject of |
| Sections~\ref{sec:memorder:Cumulativity} |
| through~\ref{sec:memorder:Release-Acquire Chains}. |
| |
| \subsubsection{Cumulativity} |
| \label{sec:memorder:Cumulativity} |
| |
| The three-thread example shown in |
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| requires \emph{cumulative} ordering, or \emph{cumulativity}. |
| A cumulative memory-ordering operation orders not just any given |
| access preceding it, but also earlier accesses by any thread to that |
| same variable. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
C C-WRC+o+o-r+a-o
| |
| { |
| } |
| |
| P0(int *x) |
| { |
| WRITE_ONCE(*x, 1); |
| } |
| |
| P1(int *x, int* y) |
| { |
| int r1; |
| |
| r1 = READ_ONCE(*x); |
| smp_store_release(y, r1); |
| } |
| |
| P2(int *x, int* y) |
| { |
| int r2; |
| int r3; |
| |
| r2 = smp_load_acquire(y); |
| r3 = READ_ONCE(*x); |
| } |
| |
| exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{WRC Litmus Test With Release} |
| \label{lst:memorder:WRC Litmus Test With Release} |
| \end{listing} |
| |
| Dependencies do not provide cumulativity, |
| which is why the ``C'' column is blank for |
| both the \co{READ_ONCE()} and the \co{smp_read_barrier_depends()} rows |
| of Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| However, as indicated by the ``C'' in their ``C'' column, |
| release operations do provide cumulativity. |
| Therefore, |
| Listing~\ref{lst:memorder:WRC Litmus Test With Release} |
(\path{C-WRC+o+o-r+a-o.litmus})
| substitutes a release operation for |
| Listing~\ref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}'s |
| data dependency. |
| Because the release operation is cumulative, its ordering applies not only to |
| Listing~\ref{lst:memorder:WRC Litmus Test With Release}'s |
| load from \co{x} by \co{P1()} on line~15, but also to the store to \co{x} |
| by \co{P0()} on line~8---but only if that load returns the value stored, |
| which matches the \co{1:r1=1} in the \co{exists} clause on line~28. |
| This means that \co{P2()}'s load-acquire suffices to force the |
| load from \co{x} on line~25 to happen after the store on line~8, so |
| the value returned is one, which does not match \co{2:r3=0}, which |
| in turn prevents the \co{exists} clause from triggering. |
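
In userspace, the same pattern can be exercised using C11 atomics, with
\co{memory_order_release} and \co{memory_order_acquire} standing in for
\co{smp_store_release()} and \co{smp_load_acquire()}.
The following sketch is not the kernel API, and the function names are
invented for this example; the spin loops force the \co{exists}-clause
preconditions (\co{r1=1} and \co{r2=1}), so that the prohibited outcome
becomes a deterministic guarantee rather than a rare non-event.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative userspace sketch of the WRC-with-release pattern.
 * C11 release/acquire stands in for smp_store_release() and
 * smp_load_acquire(); run_wrc() and the thread functions are made
 * up for this example.  Given the spin loops, cumulativity
 * guarantees that the final load sees x == 1. */
static atomic_int x, y;
static int r3;

static void *wrc_p0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	return NULL;
}

static void *wrc_p1(void *arg)
{
	while (atomic_load_explicit(&x, memory_order_relaxed) == 0)
		continue;	/* wait until r1 would be 1 */
	atomic_store_explicit(&y, 1, memory_order_release);
	return NULL;
}

static void *wrc_p2(void *arg)
{
	while (atomic_load_explicit(&y, memory_order_acquire) == 0)
		continue;	/* wait until r2 would be 1 */
	r3 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

int run_wrc(void)
{
	pthread_t t0, t1, t2;

	atomic_store(&x, 0);
	atomic_store(&y, 0);
	r3 = -1;
	pthread_create(&t0, NULL, wrc_p0, NULL);
	pthread_create(&t1, NULL, wrc_p1, NULL);
	pthread_create(&t2, NULL, wrc_p2, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return r3;	/* guaranteed to be 1, never 0 */
}
```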
| |
| \begin{figure*}[htbp] |
| \centering |
| \includegraphics{memorder/memorybarriercum} |
| \caption{Cumulativity} |
| \label{fig:memorder:Cumulativity} |
| \end{figure*} |
| |
| These ordering constraints are depicted graphically in |
| Figure~\ref{fig:memorder:Cumulativity}. |
Note also that cumulativity is not limited to a single step back in time.
If there was another load from \co{x} or store to \co{x} from any thread
that came before the store on line~8, that prior load or store would also
be ordered before \co{P2()}'s load from \co{x} on line~25, though only
if \co{r1} and \co{r2} both end up containing the value \co{1}.
| |
| In short, use of cumulative ordering operations can suppress |
| non-multicopy-atomic behaviors in some situations. |
| Cumulativity nevertheless has limits, which are examined in the next section. |
| |
| \subsubsection{Propagation} |
| \label{sec:memorder:Propagation} |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-W+RWC+o-r+a-o+o-mb-o |
| |
| { |
| int x = 0; |
| int y = 0; |
| int z = 0; |
| } |
| |
| P0(int *x, int *y) |
| { |
| WRITE_ONCE(*x, 1); |
| smp_store_release(y, 1); |
| } |
| |
| P1(int *y, int *z) |
| { |
| int r1; |
| int r2; |
| |
| r1 = smp_load_acquire(y); |
| r2 = READ_ONCE(*z); |
| } |
| |
| P2(int *z, int *x) |
| { |
| int r3; |
| |
| WRITE_ONCE(*z, 1); |
| smp_mb(); |
| r3 = READ_ONCE(*x); |
| } |
| |
| exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{W+RWC Litmus Test With Release (No Ordering)} |
| \label{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| \end{listing} |
| |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| (\path{C-W+RWC+o-r+a-o+o-mb-o.litmus}) |
| shows the limitations of cumulativity and of store-release, |
| even with a full memory barrier helping out. |
| The problem is that although the \co{smp_store_release()} on |
| line~12 has cumulativity, and although that cumulativity does |
| order \co{P2()}'s load on line~30, the \co{smp_store_release()}'s |
| ordering cannot propagate through the combination of \co{P1()}'s |
| load (line~21) and \co{P2()}'s store (line~28). |
| This means that the \co{exists} clause on line~33 really can trigger. |
| |
| \QuickQuiz{} |
| But it is not necessary to worry about propagation unless |
| there are at least three threads in the litmus test, right? |
| \QuickQuizAnswer{ |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-R+o-wmb-o+o-mb-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| WRITE_ONCE(*x0, 1); |
| smp_wmb(); |
| WRITE_ONCE(*x1, 1); |
| } |
| |
| |
| P1(int *x0, int *x1) |
| { |
| int r2; |
| |
| WRITE_ONCE(*x1, 2); |
| smp_mb(); |
| r2 = READ_ONCE(*x0); |
| } |
| |
| exists (1:r2=0 /\ x1=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{R Litmus Test With Write Memory Barrier (No Ordering)} |
| \label{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| \end{listing} |
| Wrong. |
| |
| Listing~\ref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| (\path{C-R+o-wmb-o+o-mb-o.litmus}) |
| shows a two-thread litmus test that requires propagation due to |
| the fact that it only has store-to-store and load-to-store |
| links between its pair of threads. |
| Even though \co{P0()} is fully ordered by the \co{smp_wmb()} and |
| \co{P1()} is fully ordered by the \co{smp_mb()}, the |
| counter-temporal nature of the links means that |
| the \co{exists} clause on line~22 really can trigger. |
| To prevent this triggering, the \co{smp_wmb()} on line~8 |
| must become an \co{smp_mb()}, bringing propagation into play |
| twice, once for each non-temporal link. |
| } \QuickQuizEnd |
| |
| \begin{figure}[htbp] |
| \centering |
| \resizebox{\columnwidth}{!}{\includegraphics{memorder/fr}} |
| \caption{Load-to-Store is Counter-Temporal} |
| \label{fig:memorder:Load-to-Store is Counter-Temporal} |
| \end{figure} |
| |
| This situation might seem completely counter-intuitive, but keep |
| in mind that the speed of light is finite and computers are of |
| non-zero size. |
It therefore takes time for the effect of \co{P2()}'s store to
| \co{z} to propagate to \co{P1()}, which in turn means that it is possible |
| that \co{P1()}'s read from \co{z} happens much later in time, but |
| nevertheless still sees the old value of zero. |
| This situation is depicted in |
| Figure~\ref{fig:memorder:Load-to-Store is Counter-Temporal}: |
| Just because a load sees the old value does \emph{not} mean that |
| this load executed at an earlier time than did the store of the |
| new value. |
| |
| Note that |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| also shows the limitations of memory-barrier pairing, given that |
| there are not two but three processes. |
| These more complex litmus tests can instead be said to have \emph{cycles}, |
| where memory-barrier pairing is the special case of a two-thread cycle. |
| The cycle in |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| goes through \co{P0()} (lines~11 and~12), \co{P1()} (lines~20 and~21), |
| \co{P2()} (lines~28, 29, and~30), and back to \co{P0()} (line~11). |
| The \co{exists} clause delineates this cycle: |
| the \co{1:r1=1} indicates that the \co{smp_load_acquire()} on line~20 |
| returned the value stored by the \co{smp_store_release()} on line~12, |
| the \co{1:r2=0} indicates that the \co{WRITE_ONCE()} on line~28 came |
| too late to affect the value returned by the \co{READ_ONCE()} on line~21, |
| and finally the \co{2:r3=0} indicates that the |
| \co{WRITE_ONCE()} on line~11 came too late to affect the value returned |
| by the \co{READ_ONCE()} on line~30. |
| In this case, the fact that the \co{exists} clause can trigger means that |
| the cycle is said to be \emph{allowed}. |
| In contrast, in cases where the \co{exists} clause cannot trigger, |
| the cycle is said to be \emph{prohibited}. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-W+RWC+o-mb-o+a-o+o-mb-o |
| |
| { |
| int x = 0; |
| int y = 0; |
| int z = 0; |
| } |
| |
| P0(int *x, int *y) |
| { |
| WRITE_ONCE(*x, 1); |
| smp_mb(); |
| WRITE_ONCE(*y, 1); |
| } |
| |
| P1(int *y, int *z) |
| { |
| int r1; |
| int r2; |
| |
| r1 = smp_load_acquire(y); |
| r2 = READ_ONCE(*z); |
| } |
| |
| P2(int *z, int *x) |
| { |
| int r3; |
| |
| WRITE_ONCE(*z, 1); |
| smp_mb(); |
| r3 = READ_ONCE(*x); |
| } |
| |
| exists(1:r1=1 /\ 1:r2=0 /\ 2:r3=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
\caption{W+RWC Litmus Test With More Barriers}
| \label{lst:memorder:W+WRC Litmus Test With More Barriers} |
| \end{listing} |
| |
But what if we need to keep the \co{exists} clause on line~33 of
Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}
from triggering?
| One solution is to replace \co{P0()}'s \co{smp_store_release()} |
| with an \co{smp_mb()}, which |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| shows to have not only cumulativity, but also propagation. |
| The result is shown in |
| Listing~\ref{lst:memorder:W+WRC Litmus Test With More Barriers} |
| (\path{C-W+RWC+o-mb-o+a-o+o-mb-o.litmus}). |
| |
| \QuickQuiz{} |
| But given that \co{smp_mb()} has the propagation property, |
| why doesn't the \co{smp_mb()} on line~29 of |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| prevent the \co{exists} clause from triggering? |
| \QuickQuizAnswer{ |
| As a rough rule of thumb, the \co{smp_mb()} barrier's |
| propagation property is sufficient to maintain ordering |
through only one load-to-store link between
| processes. |
| Unfortunately, |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
has not one but two load-to-store links, with the
| first being from the \co{READ_ONCE()} on line~21 to the |
| \co{WRITE_ONCE()} on line~28 and the second being from |
| the \co{READ_ONCE()} on line~30 to the \co{WRITE_ONCE()} |
| on line~11. |
| Therefore, preventing the \co{exists} clause from triggering |
| should be expected to require not one but two |
| instances of \co{smp_mb()}. |
| |
| As a special exception to this rule of thumb, a release-acquire |
| chain can have one load-to-store link between processes |
| and still prohibit the cycle. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tbp] |
| \centering |
| \resizebox{\columnwidth}{!}{\includegraphics{memorder/co}} |
| \caption{Store-to-Store is Counter-Temporal} |
| \label{fig:memorder:Store-to-Store is Counter-Temporal} |
| \end{figure} |
| |
| For completeness, |
| Figure~\ref{fig:memorder:Store-to-Store is Counter-Temporal} |
| shows that the ``winning'' store among a group of stores to the |
| same variable is not necessarily the store that started last. |
| This should not come as a surprise to anyone who carefully examined |
| Figure~\ref{fig:memorder:A Variable With More Simultaneous Values}. |
| |
| But sometimes time is on our side, as shown in the next section. |
| |
| \subsubsection{Happens-Before} |
| \label{sec:memorder:Happens-Before} |
| |
| \begin{figure}[tbp] |
| \centering |
| \resizebox{\columnwidth}{!}{\includegraphics{memorder/rf}} |
| \caption{Store-to-Load is Temporal} |
| \label{fig:memorder:Store-to-Load is Temporal} |
| \end{figure} |
| |
| As shown in |
| Figure~\ref{fig:memorder:Store-to-Load is Temporal}, |
| on non-speculative platforms, if a load returns the value from a |
| particular store, then, courtesy of the finite speed of light and |
| the non-zero size of modern computing systems, the store absolutely |
| has to have executed at an earlier time than did the load. |
| This means that carefully constructed programs can rely on the |
passage of time itself as a memory-ordering operation.
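
This temporal property is exactly what makes the classic userspace
flag-passing idiom work.
In the C11 sketch below (the names \co{data}, \co{flag}, and
\co{run_mp()} are invented for this example), once the consumer's
acquire load returns the value from the producer's release store, that
store is known to have already executed, and with it the store to
\co{data}.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative sketch only: C11 release/acquire message passing.
 * Once the acquire load of flag returns 1, the release store, and
 * everything sequenced before it, is guaranteed to be visible. */
static int data;
static atomic_int flag;
static int observed;

static void *mp_producer(void *arg)
{
	data = 42;	/* the payload */
	atomic_store_explicit(&flag, 1, memory_order_release);
	return NULL;
}

static void *mp_consumer(void *arg)
{
	while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
		continue;	/* spin until the flag's new value arrives */
	observed = data;	/* guaranteed to see 42 */
	return NULL;
}

int run_mp(void)
{
	pthread_t prod, cons;

	data = 0;
	atomic_store(&flag, 0);
	pthread_create(&prod, NULL, mp_producer, NULL);
	pthread_create(&cons, NULL, mp_consumer, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	return observed;
}
```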
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+a-o+o-data-o+o-data-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x0); |
| WRITE_ONCE(*x1, 2); |
| } |
| |
| |
| P1(int *x1, int *x2) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| WRITE_ONCE(*x2, r2); |
| } |
| |
| P2(int *x2, int *x0) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x2); |
| WRITE_ONCE(*x0, r2); |
| } |
| |
| exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{LB Litmus Test With One Acquire} |
| \label{lst:memorder:LB Litmus Test With One Acquire} |
| \end{listing} |
| |
| Of course, just the passage of time by itself is not enough, as |
| was seen in |
| Listing~\ref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}, |
| which has nothing but store-to-load links and still can |
| trigger its \co{exists} clause. |
However, as long as each thread provides even the weakest possible
ordering, the \co{exists} clause would be unable to trigger.
| For example, |
| Listing~\ref{lst:memorder:LB Litmus Test With One Acquire} |
| (\path{C-LB+a-o+o-data-o+o-data-o.litmus}) |
| shows \co{P0()} ordered with an \co{smp_load_acquire()} and |
| both \co{P1()} and \co{P2()} ordered with data dependencies. |
| These orderings, which are close to the top of |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}, |
| suffice to prevent the \co{exists} clause from triggering. |
| |
| \QuickQuiz{} |
| Can you construct a litmus test similar to that in |
| Listing~\ref{lst:memorder:LB Litmus Test With One Acquire} |
| that uses \emph{only} dependencies? |
| \QuickQuizAnswer{ |
| Listing~\ref{lst:memorder:LB Litmus Test With No Acquires} |
| shows a somewhat nonsensical but very real example. |
| Creating a more useful (but still real) litmus test is left |
| as an exercise for the reader. |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+o-data-o+o-data-o+o-data-o |
| { |
| int x0=0; |
| int x1=1; |
| int x2=2; |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x0); |
| WRITE_ONCE(*x1, r2); |
| } |
| |
| |
| P1(int *x1, int *x2) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x1); |
| WRITE_ONCE(*x2, r2); |
| } |
| |
| P2(int *x2, int *x0) |
| { |
| int r2; |
| |
| r2 = READ_ONCE(*x2); |
| WRITE_ONCE(*x0, r2); |
| } |
| |
| exists (0:r2=2 /\ 1:r2=0 /\ 2:r2=1) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{LB Litmus Test With No Acquires} |
| \label{lst:memorder:LB Litmus Test With No Acquires} |
| \end{listing} |
| } \QuickQuizEnd |
| |
A more important, and also more useful, way of using time to order
memory accesses is covered in the next section.
| |
| \subsubsection{Release-Acquire Chains} |
| \label{sec:memorder:Release-Acquire Chains} |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-LB+a-r+a-r+a-r+a-r |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x0); |
| smp_store_release(x1, 2); |
| } |
| |
| |
| P1(int *x1, int *x2) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x1); |
| smp_store_release(x2, 2); |
| } |
| |
| P2(int *x2, int *x3) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x2); |
| smp_store_release(x3, 2); |
| } |
| |
| P3(int *x3, int *x0) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x3); |
| smp_store_release(x0, 2); |
| } |
| |
| exists (0:r2=2 /\ 1:r2=2 /\ 2:r2=2 /\ 3:r2=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Long LB Release-Acquire Chain} |
| \label{lst:memorder:Long LB Release-Acquire Chain} |
| \end{listing} |
| |
| A minimal release-acquire chain was shown in |
Listing~\ref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test},
but these chains can be much longer, as shown in
Listing~\ref{lst:memorder:Long LB Release-Acquire Chain}
(\path{C-LB+a-r+a-r+a-r+a-r.litmus}).
| The longer the release-acquire chain, the more benefit is gained |
| from the passage of time, so that no matter how many threads are |
| involved, the corresponding \co{exists} clause cannot trigger. |
| Give or take the increased computational demands of the larger number |
| of threads, of course! |
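For readers who would prefer to experiment outside of litmus tests, the
release-acquire link at the heart of these chains can be sketched in
userspace using C11 atomics in place of the kernel primitives.
In this sketch, the function names (\co{producer()}, \co{consumer()},
\co{run_handoff()}) and the payload value are invented for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;        /* plain, non-atomic payload */
static atomic_int flag; /* release-acquire handoff variable */

/* Producer: store the payload, then release the flag. */
static void *producer(void *arg)
{
	(void)arg;
	data = 42;
	atomic_store_explicit(&flag, 1, memory_order_release);
	return NULL;
}

/* Consumer: spin with acquire loads until the flag is set.
 * The release-acquire pair guarantees that the payload store
 * is visible by the time the flag is observed to be set. */
static int consumer(void)
{
	while (!atomic_load_explicit(&flag, memory_order_acquire))
		continue;
	return data;
}

static int run_handoff(void)
{
	pthread_t t;
	int v;

	pthread_create(&t, NULL, producer, NULL);
	v = consumer();
	pthread_join(t, NULL);
	return v;
}
```

Longer chains simply repeat this pattern: each thread's acquire load
observes the previous thread's release store, transitively ordering
the whole chain.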
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-ISA2+o-r+a-r+a-r+a-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| WRITE_ONCE(*x0, 2); |
| smp_store_release(x1, 2); |
| } |
| |
| |
| P1(int *x1, int *x2) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x1); |
| smp_store_release(x2, 2); |
| } |
| |
| P2(int *x2, int *x3) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x2); |
| smp_store_release(x3, 2); |
| } |
| |
| P3(int *x3, int *x0) |
| { |
| int r1; |
| int r2; |
| |
| r1 = smp_load_acquire(x3); |
| r2 = READ_ONCE(*x0); |
| } |
| |
| exists (1:r2=2 /\ 2:r2=2 /\ 3:r1=2 /\ 3:r2=0) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Long ISA2 Release-Acquire Chain} |
| \label{lst:memorder:Long ISA2 Release-Acquire Chain} |
| \end{listing} |
| |
| Although release-acquire chains are inherently store-to-load creatures, |
| it turns out that they can tolerate one load-to-store step, despite |
| such steps being counter-temporal, as shown in |
| Figure~\ref{fig:memorder:Load-to-Store is Counter-Temporal}. |
| For example, |
| Listing~\ref{lst:memorder:Long ISA2 Release-Acquire Chain} |
| (\path{C-ISA2+o-r+a-r+a-r+a-o.litmus}) |
| shows a three-step release-acquire chain, but where \co{P3()}'s |
| final access is a \co{READ_ONCE()} from \co{x0}, which is |
accessed via \co{WRITE_ONCE()} by \co{P0()}, forming a counter-temporal
load-to-store link between these two processes.
However, because \co{P0()}'s \co{smp_store_release()} (line~8)
| is cumulative, if \co{P3()}'s \co{READ_ONCE()} returns zero, |
| this cumulativity will force the \co{READ_ONCE()} to be ordered |
| before \co{P0()}'s \co{smp_store_release()}. |
In addition, the release-acquire chain (lines~8, 16, 17, 24, 25, and~33)
| forces \co{P3()}'s \co{READ_ONCE()} to be ordered after \co{P0()}'s |
| \co{smp_store_release()}. |
| Because \co{P3()}'s \co{READ_ONCE()} cannot be both before and after |
| \co{P0()}'s \co{smp_store_release()}, either or both of two things must |
| be true: |
| |
| \begin{enumerate} |
| \item \co{P3()}'s \co{READ_ONCE()} came after \co{P0()}'s |
| \co{WRITE_ONCE()}, so that the \co{READ_ONCE()} returned |
| the value two, so that the \co{exists} clause's \co{3:r2=0} |
| is false. |
| \item The release-acquire chain did not form, that is, one or more |
| of the \co{exists} clause's \co{1:r2=2}, \co{2:r2=2}, or \co{3:r1=2} |
| is false. |
| \end{enumerate} |
| |
| Either way, the \co{exists} clause cannot trigger, despite this litmus |
| test containing a notorious load-to-store link between |
| \co{P3()} and \co{P0()}. |
| But never forget that release-acquire chains can tolerate only one |
| load-to-store link, as was seen in |
| Listing~\ref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox}[\LstLineNo] |
| C C-Z6.2+o-r+a-r+a-r+a-o |
| { |
| } |
| |
| P0(int *x0, int *x1) |
| { |
| WRITE_ONCE(*x0, 2); |
| smp_store_release(x1, 2); |
| } |
| |
| |
| P1(int *x1, int *x2) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x1); |
| smp_store_release(x2, 2); |
| } |
| |
| P2(int *x2, int *x3) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x2); |
| smp_store_release(x3, 2); |
| } |
| |
| P3(int *x3, int *x0) |
| { |
| int r2; |
| |
| r2 = smp_load_acquire(x3); |
| WRITE_ONCE(*x0, 3); |
| } |
| |
| exists (1:r2=2 /\ 2:r2=2 /\ 3:r2=2 /\ x0=2) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Long Z6.2 Release-Acquire Chain} |
| \label{lst:memorder:Long Z6.2 Release-Acquire Chain} |
| \end{listing} |
| |
| Release-acquire chains can also tolerate a single store-to-store step, |
| as shown in |
| Listing~\ref{lst:memorder:Long Z6.2 Release-Acquire Chain} |
| (\path{C-Z6.2+o-r+a-r+a-r+a-o.litmus}). |
| As with the previous example, \co{smp_store_release()}'s cumulativity |
| combined with the temporal nature of the release-acquire chain |
| prevents the \co{exists} clause on line~36 from triggering. |
| But beware: Adding a second store-to-store step would allow the correspondingly |
| updated \co{exists} clause to trigger. |
| |
| % @@@ QQ on having a release-acquire chain with one load-to-store and |
| % @@@ one store-to-store link. |
| |
| \QuickQuiz{} |
| There are store-to-load links, load-to-store links, and |
| store-to-store links. |
| But what about load-to-load links? |
| \QuickQuizAnswer{ |
| The problem with the concept of load-to-load links is that |
| if the two loads from the same variable return the same |
| value, there is no way to determine their ordering. |
| The only way to determine their ordering is if they return |
| different values, in which case there had to have been an |
| intervening store. |
| And that intervening store means that there is no load-to-load |
| link, but rather a load-to-store link followed by a |
| store-to-load link. |
| } \QuickQuizEnd |
| |
| In short, properly constructed release-acquire chains form a peaceful |
| island of intuitive bliss surrounded by a strongly counter-intuitive |
| sea of more complex memory-ordering constraints. |
| |
| % Exercises? |
| % Hardware details from Appendix? |
| |
| \section{Compile-Time Consternation} |
| \label{sec:memorder:Compile-Time Consternation} |
| |
| Most languages, including C, were developed on uniprocessor systems |
| by people with little or no parallel-programming experience. |
As a result, unless explicitly told otherwise, these languages assume
| that nothing else is reading or writing memory. |
| This in turn means that these languages' compilers' optimizers |
| are ready, willing, and oh so able to make dramatic changes to the |
| order, number, and sizes of memory references that your program |
| executes. |
| In fact, the reordering carried out by hardware can seem quite tame |
| by comparison. |
| |
| This section will help you tame your compiler, thus avoiding a great |
| deal of compile-time consternation. |
| Section~\ref{sec:memorder:Memory-Reference Restrictions} |
| describes how to keep the compiler from destructively optimizing |
| your code's memory references, |
| Section~\ref{sec:memorder:Address- and Data-Dependency Restrictions} |
| describes how to protect address and data dependencies, |
| and finally, |
| Section~\ref{sec:memorder:Control-Dependency Restrictions} |
| describes how to protect those delicate control dependencies. |
| |
| \subsection{Memory-Reference Restrictions} |
| \label{sec:memorder:Memory-Reference Restrictions} |
| |
| Again, unless told otherwise, compilers assume that nothing else |
| is affecting the variables being accessed by the generated code. |
| This assumption is not simply some design error, but is instead enshrined in |
| various standards.\footnote{ |
| Or perhaps it is a standardized design error.} |
| This assumption means that compilers are within their rights |
| (as defined by the standards) to optimize the following code |
| so as to hoist the load from \co{a} out of the loop, at least |
| in cases where the compiler can prove that \co{do_something()} |
| does not modify \co{a}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 while (a) |
| 2 do_something(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The optimized code might look something like this, essentially |
| fusing an arbitrarily large number of intended loads into a single |
| actual load: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 if (a) |
| 2 for (;;) |
| 3 do_something(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This optimization might come as a fatal surprise to code elsewhere |
| that expected to terminate this loop by storing a zero to \co{a}. |
| Prevent this by using \co{READ_ONCE()}, for example, as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 while (READ_ONCE(a)) |
| 2 do_something(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \co{READ_ONCE()} marks the load with a \co{volatile} cast. |
| Now \co{volatile} was originally designed for accessing |
| memory-mapped I/O (MMIO) registers, which are accessed using |
| the same load and store instructions that are used when |
| accessing normal memory. |
| However, MMIO registers need not act at all like normal memory. |
| Storing a value to an MMIO register does not necessarily mean |
| that a subsequent load from that register will return the value |
| stored. |
| Loading from an MMIO register might well have side effects, |
| for example, changing the device state or affecting the response |
| to subsequent loads and stores involving other MMIO registers. |
| Loads and stores of different sizes to the same MMIO address |
| might well have different effects. |
| |
| This means that, even on a uniprocessor system, changing the |
| order, number, or size of MMIO accesses must be strictly |
| forbidden. |
And this is exactly the purpose of the C-language \co{volatile} keyword:
To constrain the compiler enough to permit the implementation of
reliable device drivers.
| |
| This is why \co{READ_ONCE()} prevents the destructive hoisting of |
| the load from \co{a} out of the loop: Doing so changes the number |
| of \co{volatile} loads from \co{a}, so this optimization is disallowed. |
| However, note well that \co{volatile} does absolutely nothing to |
| constrain the hardware. |
| Therefore, if the code following the loop needs to see the result of |
| any memory references preceding the store of zero that terminated |
| the loop, you will instead need to use something like |
| \co{smp_store_release()} to store the zero and \co{smp_load_acquire()} |
| in the loop condition. |
| But if all you need is to reliably control the loop without any |
| other ordering, \co{READ_ONCE()} can do the job. |
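In the Linux kernel, \co{READ_ONCE()} boils down to a \co{volatile}
cast.
The following minimal userspace sketch (a simplified assumption for
illustration, not the kernel's exact implementation) shows both the
cast and the loop-control usage, with the decrement standing in for
some other thread's store of zero:

```c
#include <assert.h>

/* Minimal sketch of the kernel's READ_ONCE(): the volatile cast
 * forces the compiler to emit one load per execution, preventing
 * it from hoisting the load of a out of the loop. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int a = 3;

/* Spin until a reaches zero, counting iterations.  The decrement
 * simulates the external store that terminates the loop. */
static int loop_iterations(void)
{
	int n = 0;

	while (READ_ONCE(a)) {
		a--;	/* simulates another thread's store */
		n++;
	}
	return n;
}
```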
| |
| Compilers can also replicate loads. |
| For example, consider this all-too-real code fragment: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 tmp = p; |
| 2 if (tmp != NULL && tmp <= q) |
| 3 do_something(tmp); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Here the intent is that the \co{do_something()} function is never |
| passed a \co{NULL} pointer or a pointer that is greater than \co{q}. |
| However, the compiler is within its rights to transform this into the |
| following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 if (p != NULL && p <= q) |
| 2 do_something(p); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| In this transformed code, the value of \co{p} is loaded three |
| separate times. |
| This transformation might seem silly at first glance, but it is quite |
| tempting to compilers when the surrounding code has consumed all of the |
| machine registers. |
| It is possible that the current value of \co{p} passes the test on |
| line~1, but that some other thread stores \co{NULL} to \co{p} before |
| line~2 executes, and the resulting \co{NULL} pointer could be a fatal |
| surprise to \co{do_something()}.\footnote{ |
| Your editor made exactly this mistake in the DYNIX/ptx |
| kernel's memory allocator in the early 1990s. |
| Tracking down the bug consumed a holiday weekend not just |
| for your editor, but also for several of his colleagues. |
| In short, this is not a new problem.} |
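The usual fix is to load the pointer exactly once into a local variable
and to use only that local variable thereafter.
A userspace sketch, again assuming the simplified volatile-cast form of
\co{READ_ONCE()} and with the variable names invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketch of the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int value = 5;
static int *p = &value;	/* pointer that other threads might update */
static int *q = &value;	/* upper bound for the range check */

/* Snapshot p exactly once: the volatile cast forces a single load,
 * so the NULL check, the bounds check, and the dereference all see
 * the very same pointer value, even if another thread concurrently
 * stores NULL to p. */
static int checked_read(void)
{
	int *tmp = READ_ONCE(p);

	if (tmp != NULL && tmp <= q)
		return *tmp;	/* cannot be NULL here */
	return -1;
}
```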
| |
| Compilers can also fuse stores. |
| The most infamous example is probably the progress-bar example |
| shown below: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 while (!am_done()) { |
| 2 do_something(p); |
| 3 progress++; |
| 4 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| If the compiler used a feedback-driven optimizer, it might well |
| notice that the store to the shared variable \co{progress} was |
| quite expensive, resulting in the following well-intentioned |
| optimization: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 while (!am_done()) { |
| 2 do_something(p); |
| 3 tmp++; |
| 4 } |
| 5 progress = tmp; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This might well slightly increase performance, but the poor user |
| watching the progress bar might be forgiven for harboring significant |
| ill will towards this particular optimization. |
| The progress bar will after all be stuck at zero for a long time, |
| then jump at the very end. |
| The following code will usually prevent this problem: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 while (!am_done()) { |
| 2 do_something(p); |
| 3 WRITE_ONCE(progress, progress + 1); |
| 4 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Exceptions can occur if the compiler is able to analyze \co{do_something()} |
| and learn that it has no accesses to atomic or \co{volatile} variables. |
| In these cases the compiler could produce two loops, one invoking |
| \co{do_something()} and the other incrementing \co{progress}. |
| It may be necessary to replace the \co{WRITE_ONCE()} with something |
| like \co{smp_store_release()} in the unlikely event that this occurs. |
| It is important to note that although the compiler is forbidden from |
| changing the number, size, or order of \co{volatile} accesses, it is |
| perfectly within its rights to reorder normal accesses with unrelated |
| \co{volatile} accesses. |
| |
| Oddly enough, the compiler is within its rights to use a variable |
| as temporary storage just before a store to that variable, thus |
| inventing stores to that variable. |
| Fortunately, most compilers avoid this sort of thing, at least outside |
| of stack variables. |
| Nevertheless, using \co{WRITE_ONCE()} (or declaring the variable |
| \co{volatile}) should prevent this sort of thing. |
| But take care: If you have a translation unit that uses that variable, |
| and never makes a volatile access to it, the compiler has no way of |
| knowing that it needs to be careful. |
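\co{WRITE_ONCE()} is the store-side counterpart, again reducible to a
\co{volatile} cast.
A minimal userspace sketch, under the same simplifying assumptions as
before, applied to the earlier progress-counter example:

```c
#include <assert.h>

/* Minimal sketch of the kernel's WRITE_ONCE(): the volatile cast
 * forbids the compiler from inventing, fusing, or tearing stores
 * to the variable. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int progress;

/* Each iteration performs one volatile store to progress, so the
 * compiler cannot defer all the updates to a single final store. */
static int bump_progress(int n)
{
	int i;

	for (i = 0; i < n; i++)
		WRITE_ONCE(progress, progress + 1);
	return progress;
}
```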
| |
| The previous examples involved compiler optimizations that |
| changed the number of accesses. |
| Now, it might seem that preventing the compiler from changing the |
| order of accesses is an act of futility, given that the underlying |
| hardware is free to reorder them. |
| However, modern machines have \emph{exact exceptions} and |
| \emph{exact interrupts}, meaning that any interrupt or exception will |
| appear to have happened at a specific place in the instruction |
| stream, so that the handler will see the effect of all prior |
| instructions, but won't see the effect of any subsequent instructions. |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} can therefore be used to |
| control communication between interrupted code and interrupt handlers. |
| |
| This leaves changes to the size of accesses, which is known as |
| \emph{load tearing} and \emph{store tearing} when the actual size |
| is smaller than desired. |
| For example, storing the constant \co{0x00010002} into a 32-bit |
| variable might seem quite safe. |
| However, there are CPUs that can store small immediate values |
| directly into memory, and on such CPUs, the compiler can be expected |
| to split this into two 16-bit stores in order to avoid the overhead |
| of explicitly forming the 32-bit constant. |
| This could come as a fatal surprise to another thread concurrently |
| loading from this variable, which might not expect to see the |
| result of a half-completed store. |
Use of \co{READ_ONCE()} and \co{WRITE_ONCE()} prevents the compiler
from engaging in load tearing and store tearing, respectively.
| |
In short, \co{READ_ONCE()}, \co{WRITE_ONCE()}, and
\co{volatile} are valuable tools for preventing the compiler from
optimizing your parallel algorithm out of existence.
Compilers are starting to provide other mechanisms for avoiding
load and store tearing, for example, \co{memory_order_relaxed}
atomic loads and stores; however, \co{volatile} is still needed
to avoid fusing and splitting of accesses.
| |
| \subsection{Address- and Data-Dependency Restrictions} |
| \label{sec:memorder:Address- and Data-Dependency Restrictions} |
| |
| Compilers do not understand either address or data dependencies, |
| although there are efforts underway to teach them, or at the very |
| least, standardize the process of teaching |
| them~\cite{PaulEMcKennneyConsumeP0190R0,PaulEMcKenney2017markconsumeP0462R1}. |
| In the meantime, it is necessary to be very careful in order to avoid |
| giving your compiler the opportunity to break your dependencies. |
| |
| \subsubsection{Give your dependency chain a good start} |
| The load that heads your dependency chain must use proper |
| ordering, for example, \co{lockless_dereference()}, |
| \co{rcu_dereference()}, or |
| a \co{READ_ONCE()} followed by \co{smp_read_barrier_depends()}. |
| Failure to follow this rule can have serious side effects: |
| |
| \begin{enumerate} |
| \item On DEC Alpha, a dependent load might not be ordered with |
| the load heading the dependency chain, as described in |
| Section~\ref{sec:memorder:Alpha}. |
| \item If the load heading the dependency chain is a |
| C11 non-volatile \co{memory_order_relaxed} load, |
| the compiler could omit the load, for example, by using a value |
| that it loaded in the past. |
| \item If the load heading the dependency chain is a plain load, |
| the compiler can omit the load, again by using a value |
| that it loaded in the past. |
| Worse yet, it could load twice instead of once, so that |
| different parts of your code use different values---and |
| real compilers do this, especially when under register |
| pressure. |
| \item The value loaded by the head of the dependency chain must |
| be a pointer. |
| In theory, yes, you could load an integer, perhaps to use |
| it as an array index. |
| In practice, the compiler knows too much about integers, |
| and thus has way too many opportunities to break your |
| dependency chain. |
| \end{enumerate} |
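Outside the kernel, the closest C11 analogue for heading a dependency
chain is \co{atomic_load_explicit()} with \co{memory_order_consume},
sketched below with invented names.
Note that, as of this writing, compilers implement \co{consume} by
promoting it to the stronger (and correct, if potentially slower on
weakly ordered hardware) \co{memory_order_acquire}:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload = 9;
static int *_Atomic gp = &payload;	/* published pointer */

/* Head the dependency chain with a consume load, then dereference
 * only the value returned by that load, so that the dereference is
 * dependency-ordered after the load of gp. */
static int read_via_dependency(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == NULL)
		return -1;
	return *p;
}
```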
| |
| \subsubsection{Avoid arithmetic dependency breakage} |
| Although it is just fine to do some arithmetic operations on a pointer in |
| your dependency chain, you need to be careful to avoid giving the |
| compiler too much information. |
| After all, if the compiler learns enough to determine the exact value |
| of the pointer, it can and sometimes will use that exact value instead |
| of the pointer itself. |
| As soon as the compiler does that, the dependency is broken and all |
| ordering is lost. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int reserve_int; |
| 2 int *gp; |
| 3 int *p; |
| 4 |
| 5 p = rcu_dereference(gp); |
| 6 if (p == &reserve_int) |
| 7 handle_reserve(p); |
| 8 do_something_with(*p); /* buggy! */ |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Breakable Dependencies With Comparisons} |
| \label{lst:memorder:Breakable Dependencies With Comparisons} |
| \end{listing} |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int reserve_int; |
| 2 int *gp; |
| 3 int *p; |
| 4 |
| 5 p = rcu_dereference(gp); |
| 6 if (p == &reserve_int) { |
| 7 handle_reserve(&reserve_int); |
| 8 do_something_with(reserve_int); /* buggy! */ |
| 9 } else { |
| 10 do_something_with(*p); /* OK! */ |
| 11 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Broken Dependencies With Comparisons} |
| \label{lst:memorder:Broken Dependencies With Comparisons} |
| \end{listing} |
| |
| \begin{enumerate} |
| \item Although it is permissible to compute offsets from a |
| pointer, these offsets must not result in total cancellation. |
| For example, given a \co{char} pointer \co{cp}, |
\co{cp-(uintptr_t)cp} will cancel and can allow the compiler
| to break your dependency chain. |
| On the other hand, canceling offset values with each other |
| is perfectly safe and legal. |
| For example, if \co{a} and \co{b} are equal, \co{cp+a-b} |
| is an identity function, including preserving the dependency. |
| \item Comparisons can break dependencies. |
| Listing~\ref{lst:memorder:Breakable Dependencies With Comparisons} |
| shows how this can happen. |
| Here global pointer \co{gp} points to a dynamically allocated |
| integer, but if memory is low, it might instead point to |
| the \co{reserve_int} variable. |
| This \co{reserve_int} case might need special handling, as |
shown on lines~6 and~7 of the listing.
| But the compiler could reasonably transform this code into |
| the form shown in |
| Listing~\ref{lst:memorder:Broken Dependencies With Comparisons}, |
| especially on systems where instructions with absolute |
| addresses run faster than instructions using addresses |
| supplied in registers. |
| However, there is clearly no ordering between the pointer |
| load on line~5 and the dereference on line~8. |
| Please note that this is simply an example: There are a great |
| many other ways to break dependency chains with comparisons. |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| Why can't you simply dereference the pointer before comparing it |
| to \co{&reserve_int} on line~6 of |
| Listing~\ref{lst:memorder:Breakable Dependencies With Comparisons}? |
| \QuickQuizAnswer{ |
First, it might be necessary to invoke
\co{handle_reserve()} before \co{do_something_with()}.
| |
| But more relevant to memory ordering, the compiler is often within |
| its rights to hoist the comparison ahead of the dereferences, |
| which would allow the compiler to use \co{&reserve_int} instead |
of the variable \co{p} that the hardware has tagged with
| a dependency. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But it should be safe to compare two pointer variables, |
| right? After all, the compiler doesn't know the value |
| of either, so how can it possibly learn anything from the |
| comparison? |
| \QuickQuizAnswer{ |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int *gp1; |
| 2 int *gp2; |
| 3 int *p; |
| 4 int *q; |
| 5 |
| 6 p = rcu_dereference(gp1); |
| 7 q = READ_ONCE(gp2); |
| 8 if (p == q) |
| 9 handle_equality(p); |
| 10 do_something_with(*p); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Breakable Dependencies With Non-Constant Comparisons} |
| \label{lst:memorder:Breakable Dependencies With Non-Constant Comparisons} |
| \end{listing} |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int *gp1; |
| 2 int *gp2; |
| 3 int *p; |
| 4 int *q; |
| 5 |
| 6 p = rcu_dereference(gp1); |
| 7 q = READ_ONCE(gp2); |
| 8 if (p == q) { |
| 9 handle_equality(q); |
| 10 do_something_with(*q); |
| 11 } else { |
| 12 do_something_with(*p); |
| 13 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Broken Dependencies With Non-Constant Comparisons} |
| \label{lst:memorder:Broken Dependencies With Non-Constant Comparisons} |
| \end{listing} |
| |
| Unfortunately, the compiler really can learn enough to |
| break your dependency chain, for example, as shown in |
| Listing~\ref{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}. |
| The compiler is within its rights to transform this code |
| into that shown in |
| Listing~\ref{lst:memorder:Broken Dependencies With Non-Constant Comparisons}, |
| and might well make this transformation due to register pressure |
| if \co{handle_equality()} was inlined and needed a lot of registers. |
| Line~10 of this transformed code uses \co{q}, which although |
| equal to \co{p}, is not necessarily tagged by the hardware as |
| carrying a dependency. |
| Therefore, this transformed code does not necessarily guarantee |
| that line~10 is ordered after line~6.\footnote{ |
| Kudos to Linus Torvalds for providing this example.} |
| } \QuickQuizEnd |
| |
| % @@@ Quick quiz showing other ways to break dependencies by comparison |
| |
| \subsubsection{Safe comparison of dependent pointers} |
| It turns out that there are several safe ways to compare dependent |
| pointers: |
| |
| \begin{enumerate} |
| \item Comparisons against the \co{NULL} pointer. |
| In this case, all the compiler can learn is that the pointer |
| is \co{NULL}, in which case you are not allowed to |
| dereference it in any case. |
| \item The dependent pointer is never dereferenced, whether before or |
| after the comparison. |
| \item The dependent pointer is compared to a pointer that references |
| objects that were last modified a very long time ago, where |
| the safest value of ``a very long time ago'' is at compile time. |
| \end{enumerate} |
| |
| % @@@ convert to \co{uintptr_t} to set/clear bits and convert back. |
| |
| % @@@ More here |
| |
| In short, some care is required in order to ensure that dependency |
| chains in your source code are still dependency chains once the |
| compiler has gotten done with them. |
| |
| \subsection{Control-Dependency Restrictions} |
| \label{sec:memorder:Control-Dependency Restrictions} |
| |
| Control dependencies are especially tricky because current compilers |
| do not understand them and can easily break them. |
| The rules and examples in this section are intended to help you |
| prevent your compiler's ignorance from breaking your code. |
| |
| A load-load control dependency requires a full read memory barrier, |
| not simply a data dependency barrier. |
| Consider the following bit of code: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 <data dependency barrier> |
| 4 q = READ_ONCE(y); |
| 5 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This will not have the desired effect because there is no actual data |
| dependency, but rather a control dependency that the CPU may short-circuit |
| by attempting to predict the outcome in advance, so that other CPUs see |
| the load from~\co{y} as having happened before the load from~\co{x}. |
| In such a case what's actually required is: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 <read barrier> |
| 4 q = READ_ONCE(y); |
| 5 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| However, stores are not speculated. |
| This means that ordering \emph{is} provided for load-store control |
| dependencies, as in the following example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) |
| 3 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Control dependencies pair normally with other types of ordering operations. |
| That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()} |
| are optional! |
| Without the \co{READ_ONCE()}, the compiler might combine the load |
| from~\co{x} with other loads from~\co{x}. |
| Without the \co{WRITE_ONCE()}, the compiler might combine the store |
| to~\co{y} with other stores to~\co{y}. |
| Either can result in highly counterintuitive effects on ordering. |
| |
| Worse yet, if the compiler is able to prove (say) that the value of |
| variable~\co{x} is always non-zero, it would be well within its rights |
| to optimize the original example by eliminating the ``\co{if}'' statement |
| as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is tempting to try to enforce ordering on identical stores on both |
| branches of the ``\co{if}'' statement as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 barrier(); |
| 4 WRITE_ONCE(y, 1); |
| 5 do_something(); |
| 6 } else { |
| 7 barrier(); |
| 8 WRITE_ONCE(y, 1); |
| 9 do_something_else(); |
| 10 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Unfortunately, current compilers will transform this as follows at high |
| optimization levels: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 barrier(); |
| 3 WRITE_ONCE(y, 1); /* BUG: No ordering!!! */ |
| 4 if (q) { |
| 5 do_something(); |
| 6 } else { |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Now there is no conditional between the load from~\co{x} and the store |
| to~\co{y}, which means that the CPU is within its rights to reorder them: |
| The conditional is absolutely required, and must be present in the |
| assembly code even after all compiler optimizations have been applied. |
| Therefore, if you need ordering in this example, you need explicit |
| memory-ordering operations, for example, a release store: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 smp_store_release(&y, 1); |
| 4 do_something(); |
| 5 } else { |
| 6 smp_store_release(&y, 1); |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The initial \co{READ_ONCE()} is still required to prevent the compiler from |
| proving the value of~\co{x}. |
| |
| In addition, you need to be careful what you do with the local variable~% |
| \co{q}, |
| otherwise the compiler might be able to guess the value and again remove |
| the needed conditional. |
| For example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q % MAX) { |
| 3 WRITE_ONCE(y, 1); |
| 4 do_something(); |
| 5 } else { |
| 6 WRITE_ONCE(y, 2); |
| 7 do_something_else(); |
| 8 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is |
| equal to zero, in which case the compiler is within its rights to |
| transform the above code into the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 2); |
| 3 do_something_else(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Given this transformation, the CPU is not required to respect the ordering |
| between the load from variable~\co{x} and the store to variable~\co{y}. |
| It is tempting to add a \co{barrier()} to constrain the compiler, |
| but this does not help. |
| The conditional is gone, and the \co{barrier()} won't bring it back. |
| Therefore, if you are relying on this ordering, you should make sure |
| that \co{MAX} is greater than one, perhaps as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 BUILD_BUG_ON(MAX <= 1); |
| 3 if (q % MAX) { |
| 4 WRITE_ONCE(y, 1); |
| 5 do_something(); |
| 6 } else { |
| 7 WRITE_ONCE(y, 2); |
| 8 do_something_else(); |
| 9 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Please note once again that the stores to~\co{y} differ. |
| If they were identical, as noted earlier, the compiler could pull this |
| store outside of the ``\co{if}'' statement. |
| |
| You must also avoid excessive reliance on boolean short-circuit evaluation. |
| Consider this example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q || 1 > 0) |
| 3 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
Because the first condition cannot fault and the second condition is
always true, the compiler can transform this example as follows,
defeating the control dependency:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 WRITE_ONCE(y, 1); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| This example underscores the need to ensure that the compiler cannot |
| out-guess your code. |
| More generally, although \co{READ_ONCE()} does force |
| the compiler to actually emit code for a given load, it does not force |
| the compiler to use the results. |
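
Arithmetic and type knowledge can have the same effect as the constant
comparison above.
For example, in this hypothetical fragment, if~\co{q} is unsigned the
compiler knows that \co{q >= 0} always holds, and can therefore delete
the conditional outright:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 unsigned int q = READ_ONCE(x);
2 if (q >= 0)          /* Always true for unsigned q. */
3   WRITE_ONCE(y, 1);  /* BUG: Conditional optimized away. */
\end{verbatim}
\end{minipage}
\vspace{5pt}

Once the conditional is gone, so is the control dependency, exactly as
in the short-circuit example.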
| |
In addition, control dependencies apply only to the then-clause and
else-clause of the if-statement in question.
In particular, they do not
necessarily apply to code following the if-statement:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 q = READ_ONCE(x); |
| 2 if (q) { |
| 3 WRITE_ONCE(y, 1); |
| 4 } else { |
| 5 WRITE_ONCE(y, 2); |
| 6 } |
| 7 WRITE_ONCE(z, 1); /* BUG: No ordering. */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is tempting to argue that there in fact is ordering because the |
| compiler cannot reorder volatile accesses and also cannot reorder |
| the writes to~\co{y} with the condition. |
| Unfortunately for this line |
| of reasoning, the compiler might compile the two writes to~\co{y} as |
| conditional-move instructions, as in this fanciful pseudo-assembly |
| language: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 ld r1,x |
| 2 cmp r1,$0 |
| 3 cmov,ne r4,$1 |
| 4 cmov,eq r4,$2 |
| 5 st r4,y |
| 6 st $1,z |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| A weakly ordered CPU would have no dependency of any sort between the load |
| from~\co{x} and the store to~\co{z}. |
| The control dependencies would extend |
only to the pair of \co{cmov} instructions and the store depending on them.
| In short, control dependencies apply only to the stores in the ``\co{then}'' |
| and ``\co{else}'' of the ``\co{if}'' in question (including functions |
| invoked by those two clauses), not to code following that ``\co{if}''. |
| |
| Finally, control dependencies do \emph{not} provide cumulativity.\footnote{ |
| Refer to Section~\ref{sec:memorder:Cumulativity} for |
| the meaning of cumulativity.} |
| This is demonstrated by two related examples, with the initial values |
| of~\co{x} and~\co{y} both being zero: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l|p{1.5in}} |
| \nf{CPU 0} & \nf{CPU 1} \\ |
| \hline |
| \tco{r1 = READ_ONCE(x);} & |
| \tco{r2 = READ_ONCE(y);} \\ |
| if (r1 > 0) & |
| if (r2 > 0) \\ |
| ~~~\tco{WRITE_ONCE(y, 1);} & |
| ~~~\tco{WRITE_ONCE(x, 1);} \\ |
| \multicolumn{2}{l}{~} \\ |
| \multicolumn{2}{l}{\tco{assert(!(r1 == 1 && r2 == 1));}} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The above two-CPU example will never trigger the \co{assert()}. |
However, if control dependencies guaranteed cumulativity (which they do
not), then adding the following CPU would guarantee a related assertion:
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \tt |
| \scriptsize |
| \begin{tabular}{l} |
| \nf{CPU 2} \\ |
| \hline |
| \tco{WRITE_ONCE(x, 2);} \\ |
| \multicolumn{1}{l}{~} \\ |
| \multicolumn{1}{l}{\tco{assert(!(r1 == 2 && r2 == 1 && x == 2));}} \\ |
| \end{tabular} |
| \end{minipage} |
| \vspace{5pt} |
| |
But because control dependencies do \emph{not} provide cumulativity, the above
| assertion can fail after the combined three-CPU example completes. |
| If you need the three-CPU example to provide ordering, you will need |
| \co{smp_mb()} between the loads and stores in the CPU~0 and CPU~1 code |
| fragments, that is, just before or just after the ``\co{if}'' statements. |
| Furthermore, the original two-CPU example is very fragile and should be avoided. |
| |
| The two-CPU example is known as LB (load buffering) and the three-CPU |
| example as WWC~\cite{Maranget2012TutorialARMPower}. |
| |
| The following list of rules summarizes the lessons of this section: |
| |
| \begin{enumerate} |
| \item Compilers do not understand control dependencies, so it is |
| your job to make sure that the compiler cannot break your code. |
| |
| \item Control dependencies can order prior loads against later stores. |
| However, they do \emph{not} guarantee any other sort of ordering: |
| Not prior loads against later loads, nor prior stores against |
| later anything. |
| If you need these other forms of ordering, use \co{smp_rmb()}, |
| \co{smp_wmb()}, or, in the case of prior stores and later loads, |
| \co{smp_mb()}. |
| |
| \item If both legs of the ``\co{if}'' statement begin with identical stores |
| to the same variable, then those stores must be ordered, |
| either by preceding both of them with \co{smp_mb()} or by using |
| \co{smp_store_release()} to carry out the stores. |
| Please note that it is \emph{not} sufficient to use \co{barrier()} |
	at the beginning of each leg of the ``\co{if}'' statement because, as shown
| by the example above, optimizing compilers can destroy the control |
| dependency while respecting the letter of the \co{barrier()} law. |
| |
| \item Control dependencies require at least one run-time conditional |
| between the prior load and the subsequent store, and this |
| conditional must involve the prior load. |
| If the compiler is able to optimize the conditional away, it |
| will have also optimized away the ordering. |
| Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help |
| to preserve the needed conditional. |
| |
| \item Control dependencies require that the compiler avoid reordering |
| the dependency into nonexistence. |
| Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or |
| \co{atomic64_read()} can help to preserve your control |
| dependency. |
| |
| \item Control dependencies apply only to the ``\co{then}'' and |
| ``\co{else}'' of the ``\co{if}'' containing the control |
| dependency, including any functions that these two clauses call. |
| Control dependencies do \emph{not} apply to code following the |
| ``\co{if}'' containing the control dependency. |
| |
| \item Control dependencies pair normally with other types of |
| memory-ordering operations. |
| |
\item	Control dependencies do \emph{not} provide cumulativity.
	If you need cumulativity, use \co{smp_mb()}.
| \end{enumerate} |
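
Putting these rules together, a control dependency that should survive
compilation might look like the following sketch (with hypothetical
variables \co{a} and~\co{b}), in which the conditional genuinely depends
on the loaded value and the two stores differ:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 q = READ_ONCE(a);
2 if (q) {
3   WRITE_ONCE(b, 1);  /* Ordered after the load from a. */
4   do_something();
5 } else {
6   WRITE_ONCE(b, 2);  /* Also ordered after that load. */
7   do_something_else();
8 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because the stores differ, the compiler cannot hoist them out of the
``\co{if}'' statement, and because the conditional depends on~\co{q},
the compiler cannot optimize it away.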
| |
| In short, many popular languages were designed primarily with |
| single-threaded use in mind. |
| Successfully using these languages to construct multi-threaded software |
| requires that you pay special attention to your memory references and |
| dependencies. |
| |
| \section{Hardware Specifics} |
| \label{sec:memorder:Hardware Specifics} |
| \OriginallyPublished{Appendix}{sec:memorder:Hardware Specifics}{Memory-Barrier Instructions For Specific CPUs}{Linux Journal}{PaulMcKenney2005i,PaulMcKenney2005j} |
| |
| \begin{table*} |
| \small |
| \centering |
| \renewcommand*{\arraystretch}{1.2}\OneColumnHSpace{-.6in} |
| \begin{tabular}{l|p{2in}|c|c|c|c|c|c|c|c|c|c|c} |
| \multicolumn{2}{l|}{~} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{Alpha} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{ARMv8} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{ARMv7-A/R} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{Itanium} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{MIPS} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{(PA-RISC)} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{PA-RISC CPUs} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{POWER} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{SPARC TSO} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{x86} |
| \end{picture} |
| & \begin{picture}(6,60)(0,0) |
| \rotatebox{90}{z~Systems} |
| \end{picture} |
| \\ |
| \hline |
| % Alpha ARMv8 ARMv7 Itanium MIPS PA-RISC -CPUs PPC SPARC x86 z Systems |
| \hline |
| Memory Ordering |
| & Loads Reordered After Loads or Stores? |
| & Y & Y & Y & Y & Y & Y & ~ & Y & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Stores Reordered After Stores? |
| & Y & Y & Y & Y & Y & Y & ~ & Y & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Stores Reordered After Loads? |
| & Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & Y \\ |
| \cline{2-13} |
| & Atomic Instructions Reordered With Loads or Stores? |
| & Y & Y & Y & ~ & Y & ~ & ~ & Y & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Dependent Loads Reordered? |
| & Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Dependent Stores Reordered? |
| & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Non-Sequentially Consistent? |
| & Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & Y \\ |
| \cline{2-13} |
| & Non-Multicopy Atomic? |
| & Y & Y & Y & Y & Y & Y & ~ & Y & Y & Y & ~ \\ |
| \cline{2-13} |
| & Non-Other-Multicopy Atomic? |
| & Y & ~ & Y & Y & Y & Y & ~ & Y & Y & Y & ~ \\ |
| \cline{2-13} |
| & Non-Cache Coherent? |
| & ~ & ~ & ~ & Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\ |
| \hline |
| \hline |
| Instructions |
| & Load-Acquire/Store-Release? |
| & F & i & F & I & ? & ? & ~ & b & ~ & ~ & ~ \\ |
| \cline{2-13} |
| & Atomic RMW Instruction Type? |
| & L & L & L & C & L & ? & ? & L & ? & C & C \\ |
| \cline{2-13} |
| & Incoherent Instruction Cache/Pipeline? |
| & Y & Y & Y & Y & Y & ~ & ~ & Y & Y & Y & Y \\ |
| \end{tabular} |
| \renewcommand*{\arraystretch}{1} |
| \begin{tabular}{llcl} |
| ~ & ~ & ~ & ~\\ |
| { \bf Key: } & ~ & ~ & ~ \\ |
| ~ & \multicolumn{3}{l}{Load-Acquire/Store-Release?} \\ |
| ~ & ~ & b: & Lightweight memory barrier \\ |
| ~ & ~ & F: & Full memory barrier \\ |
| ~ & ~ & i: & Instruction with lightweight ordering \\ |
| ~ & ~ & I: & Instruction with heavyweight ordering \\ |
| ~ &\multicolumn{3}{l}{Atomic RMW Instruction Type?} \\ |
| ~ & ~ & C: & Compare-and-exchange instruction \\ |
| ~ & ~ & L: & Load-linked/store-conditional instruction \\ |
| \end{tabular} |
| \caption{Summary of Memory Ordering} |
| \label{tab:memorder:Summary of Memory Ordering} |
| \end{table*} |
| |
| Each CPU has its own peculiar approach to memory ordering, which |
| can make portability a challenge, as indicated by |
| Table~\ref{tab:memorder:Summary of Memory Ordering}. |
| In fact, some software environments simply prohibit |
| direct use of memory-ordering operations, restricting the programmer |
| to mutual-exclusion primitives that incorporate them to the extent that |
| they are required. Please note that this section is not intended to be |
| a reference manual |
| covering all (or even most) aspects of each CPU family, but rather |
| a high-level overview giving a rough comparison. |
| For full details, see the reference manual for the CPU of interest. |
| |
| Getting back to |
| Table~\ref{tab:memorder:Summary of Memory Ordering}, |
the first group of rows looks at memory-ordering
properties and the second group looks at instruction properties.
| |
| The first three rows indicate whether a given CPU allows the four |
| possible combinations of loads and stores to be reordered, as discussed |
| in |
| Sections~\ref{sec:memorder:Ordering: Why and How?} and |
| \ref{sec:memorder:Load Followed By Load}--\ref{sec:memorder:Store Followed By Store}. |
| The next row (``Atomic Instructions Reordered With Loads or Stores?'') |
| indicates whether a given CPU allows loads and stores |
| to be reordered with atomic instructions. |
| |
The fifth and sixth rows cover reordering of dependent loads and stores,
which was discussed in
Sections~\ref{sec:memorder:Address Dependencies}--\ref{sec:memorder:Control Dependencies}
and which is explained in more detail in
Section~\ref{sec:memorder:Alpha}.
| The short version is that Alpha requires memory barriers for readers |
| as well as updaters of linked data structures. |
| |
| The next row, ``Non-Sequentially Consistent'', indicates whether |
| the CPU's normal load and store instructions are constrained by |
| sequential consistency. |
| Almost all are not constrained in this way for performance reasons. |
| |
| The next two rows cover multicopy atomicity, which was defined in |
| Section~\ref{sec:memorder:Multicopy Atomicity}. |
| The first is full-up (and rare) multicopy atomicity, and the second is the |
| weaker other-multicopy atomicity. |
| |
| The next row, ``Non-Cache Coherent'', covers accesses from multiple |
| threads to a single variable, which was discussed in |
| Section~\ref{sec:memorder:Cache Coherence}. |
| |
| The final three rows cover instruction-level choices and issues. |
| The first row indicates how each CPU implements load-acquire |
| and store-release, the second row classifies CPUs by atomic-instruction |
| type, and the third and final row |
| indicates whether a given CPU has an incoherent |
| instruction cache and pipeline. |
Such CPUs require that special instructions be executed for self-modifying
code.
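
For example, a JIT compiler running on such a CPU might use the Linux
kernel's \co{flush_icache_range()} interface after emitting new
instructions, as in this sketch (\co{buf}, \co{new_insns}, and \co{len}
are hypothetical):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 memcpy(buf, new_insns, len);  /* Emit new code. */
2 flush_icache_range((unsigned long)buf,
3                    (unsigned long)buf + len);
4 /* Only now is it safe to branch to buf. */
\end{verbatim}
\end{minipage}
\vspace{5pt}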
| |
| Parenthesized CPU names indicate modes that are architecturally allowed, |
| but rarely used in practice. |
| |
| The common ``just say no'' approach to memory-ordering operations |
| can be eminently reasonable where it applies, |
| but there are environments, such as the Linux kernel, where direct |
| use of memory-ordering operations is required. |
| Therefore, |
| Linux provides a carefully chosen least-common-denominator |
| set of memory-ordering primitives, which are as follows: |
| \begin{description} |
| \item [\tco{smp_mb()}] (full memory barrier) that orders both loads and |
| stores. |
| This means that loads and stores preceding the memory barrier |
| will be committed to memory before any loads and stores |
| following the memory barrier. |
| \item [\tco{smp_rmb()}] (read memory barrier) that orders only loads. |
| \item [\tco{smp_wmb()}] (write memory barrier) that orders only stores. |
| \item [\tco{smp_read_barrier_depends()}] that forces subsequent operations |
| that depend on prior operations to be ordered. |
| This primitive is a no-op on all platforms except Alpha, but |
| is normally not used directly, but rather as part of |
| something like \co{lockless_dereference()} or |
| \co{rcu_dereference()}. |
| \item [\tco{smp_mb__before_atomic()}] that forces ordering of accesses |
| preceding the \co{smp_mb__before_atomic()} against accesses following |
| a later RMW atomic operation. |
	This is a no-op on systems that fully order atomic RMW operations.
| \item [\tco{smp_mb__after_atomic()}] that forces ordering of accesses |
| preceding an earlier RMW atomic operation against accesses |
| following the \co{smp_mb__after_atomic()}. |
	This is also a no-op on systems that fully order atomic RMW operations.
| \item [\tco{mmiowb()}] that forces ordering on MMIO writes that are guarded |
| by global spinlocks, and is more thoroughly described |
| in a 2016 LWN article on MMIO~\cite{PaulEMcKenney2016LinuxKernelMMIO}. |
| \end{description} |
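
As an illustration of the atomic-operation barriers, the following
sketch (with hypothetical variables \co{x} and \co{my_counter}) forces
the store to~\co{x} to be ordered before the later atomic increment:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 WRITE_ONCE(x, 1);
2 smp_mb__before_atomic();  /* Order the store... */
3 atomic_inc(&my_counter);  /* ...before this RMW operation. */
\end{verbatim}
\end{minipage}
\vspace{5pt}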
| The \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()} |
| primitives also force |
| the compiler to eschew any optimizations that would have the effect |
of reordering memory accesses across the barriers.
| The \co{smp_read_barrier_depends()} primitive has a similar effect, but |
| only on Alpha CPUs. |
| |
These primitives generate code only in SMP kernels; however, each
also has a UP version ({\tt mb()}, {\tt rmb()}, {\tt wmb()},
and \co{read_barrier_depends()},
respectively) that generates a memory barrier even in UP kernels. The \co{smp_}
versions should be used in most cases. However, the UP versions
are useful when writing drivers, because MMIO accesses must remain
ordered even in UP kernels. In the absence of memory-ordering operations,
| both CPUs and compilers would happily rearrange these accesses, which at |
| best would make the device act strangely, and could crash your kernel or, |
| in some cases, even damage your hardware. |
| |
| So most kernel programmers need not worry about the memory-ordering |
| peculiarities of each and every CPU, as long as they stick to these |
| interfaces. |
| If you are working deep in a given CPU's architecture-specific code, |
| of course, all bets are off. |
| |
| Furthermore, |
| all of Linux's locking primitives (spinlocks, reader-writer locks, |
| semaphores, RCU, \ldots) include any needed ordering primitives. |
| So if you are working with code that uses these primitives, you don't |
| even need to worry about Linux's memory-ordering primitives. |
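
For example, in the following sketch (with \co{mylock} and
\co{shared_count} hypothetical), the lock acquisition and release supply
all the ordering that the critical section requires:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 spin_lock(&mylock);     /* Implies acquire ordering. */
2 shared_count++;         /* No explicit barriers needed. */
3 spin_unlock(&mylock);   /* Implies release ordering. */
\end{verbatim}
\end{minipage}
\vspace{5pt}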
| |
| That said, deep knowledge of each CPU's memory-consistency model |
| can be very helpful when debugging, to say nothing of when writing |
| architecture-specific code or synchronization primitives. |
| |
| Besides, they say that a little knowledge is a very dangerous thing. |
| Just imagine the damage you could do with a lot of knowledge! |
| For those who wish to understand more about individual CPUs' |
memory-consistency models, the following sections describe those of the
| most popular and prominent CPUs. |
| Although nothing can replace actually reading a given CPU's documentation, |
| these sections give a good overview. |
| |
| \subsection{Alpha} |
| \label{sec:memorder:Alpha} |
| |
It may seem strange to say much of anything about a CPU whose end of life
has long since passed, but Alpha is interesting because, with one of
| the weakest memory ordering models, it reorders memory operations quite |
| aggressively. |
| It therefore has defined many of the Linux-kernel memory-ordering primitives, |
| which must work on all CPUs, including Alpha. |
| Understanding Alpha is therefore surprisingly important to the Linux kernel |
| hacker. |
| |
| The difference between Alpha and the other CPUs is illustrated by the |
| code shown in |
| Listing~\ref{lst:memorder:Insert and Lock-Free Search}. |
The \co{smp_wmb()} on line~9 of this listing
| guarantees that the element initialization |
| in lines~6-8 is executed before the element is added to the |
| list on line~10, so that the lock-free search will work correctly. |
| That is, it makes this guarantee on all CPUs {\em except} Alpha. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct el *insert(long key, long data) |
| 2 { |
| 3 struct el *p; |
| 4 p = kmalloc(sizeof(*p), GFP_ATOMIC); |
| 5 spin_lock(&mutex); |
| 6 p->next = head.next; |
| 7 p->key = key; |
| 8 p->data = data; |
| 9 smp_wmb(); |
| 10 head.next = p; |
| 11 spin_unlock(&mutex); |
| 12 } |
| 13 |
| 14 struct el *search(long key) |
| 15 { |
| 16 struct el *p; |
| 17 p = READ_ONCE(head.next); |
| 18 while (p != &head) { |
| 19 /* BUG ON ALPHA!!! */ |
| 20 if (p->key == key) { |
| 21 return (p); |
| 22 } |
| 23 p = READ_ONCE(p->next); |
| 24 }; |
| 25 return (NULL); |
| 26 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Insert and Lock-Free Search} |
| \label{lst:memorder:Insert and Lock-Free Search} |
| \end{listing} |
| |
| Alpha has extremely weak memory ordering |
| such that the code on line~20 of |
| Listing~\ref{lst:memorder:Insert and Lock-Free Search} could see the old |
| garbage values that were present before the initialization on lines~6-8. |
| |
| Figure~\ref{fig:memorder:Why smp-read-barrier-depends() is Required} |
| shows how this can happen on |
| an aggressively parallel machine with partitioned caches, so that |
| alternating cache lines are processed by the different partitions |
| of the caches. |
| For example, the load of \co{head.next} on line~17 of |
| Listing~\ref{lst:memorder:Insert and Lock-Free Search} |
| might access cache bank~0, |
| and the load of \co{p->key} on line~20 and of \co{p->next} on line~23 |
| might access cache bank~1. |
| On Alpha, the \co{smp_wmb()} will guarantee that the cache invalidations |
| performed by lines~6-8 of |
| Listing~\ref{lst:memorder:Insert and Lock-Free Search} |
| (for \co{p->next}, \co{p->key}, and \co{p->data}) will reach |
| the interconnect before that of line~10 (for \co{head.next}), but |
| makes absolutely no guarantee about the order of |
| propagation through the reading CPU's cache banks. |
| For example, it is possible that the reading CPU's cache bank~1 is very |
| busy, but cache bank~0 is idle. |
| This could result in the cache invalidations for the new element |
| (\co{p->next}, \co{p->key}, and \co{p->data}) being |
| delayed, so that the reading CPU loads the new value for \co{head.next}, |
| but loads the old cached values for \co{p->key} and \co{p->next}. |
| Yes, this does mean that Alpha can in effect fetch |
| the data pointed to {\em before} it fetches the pointer itself, strange |
| but true. |
| See the documentation~\cite{Compaq01,WilliamPugh2000Gharachorloo} |
| called out earlier for more information, |
| or if you think that I am just making all this up.\footnote{ |
| Of course, the astute reader will have already recognized that |
| Alpha is nowhere near as mean and nasty as it could be, |
| the (thankfully) mythical architecture in |
| Section~\ref{sec:app:whymb:Ordering-Hostile Architecture} |
| being a case in point.} |
| The benefit of this unusual approach to ordering is that Alpha can use |
| simpler cache hardware, which in turn permitted higher clock frequency |
| in Alpha's heyday. |
| |
| \begin{figure}[tbp] |
| \centering |
| \resizebox{\columnwidth}{!}{\includegraphics{memorder/Alpha}} |
| \caption{Why \tco{smp_read_barrier_depends()} is Required} |
| \label{fig:memorder:Why smp-read-barrier-depends() is Required} |
| \end{figure} |
| |
| One could place an \co{smp_rmb()} primitive |
| between the pointer fetch and dereference in order to force Alpha |
| to order the pointer fetch with the later dependent load. |
However, this imposes unneeded overhead on systems (such as ARM,
Itanium, PPC, and SPARC) that respect data dependencies on the read side.
An \co{smp_read_barrier_depends()} primitive has therefore been added to the
Linux kernel to eliminate overhead on these systems.
| This primitive could be inserted in place of line~19 of |
| Listing~\ref{lst:memorder:Insert and Lock-Free Search}, |
| but it is better to use the \co{rcu_dereference()} wrapper macro |
| as shown on lines~17 and~22 of |
| Listing~\ref{lst:memorder:Safe Insert and Lock-Free Search}. |
| |
| It is also possible to implement a software barrier |
| that could be used in place of \co{smp_wmb()}, which would force |
| all reading CPUs to see the writing CPU's writes in order. |
| However, this approach was deemed by the Linux community |
| to impose excessive overhead |
| on extremely weakly ordered CPUs such as Alpha, and furthermore would |
| not be considered a reasonable approach by those whose systems must meet |
| aggressive real-time response requirements. |
| This software barrier could be implemented by sending inter-processor |
| interrupts (IPIs) to all other CPUs. |
| Upon receipt of such an IPI, a CPU would execute a memory-barrier |
| instruction, implementing a memory-barrier shootdown. |
| Additional logic is required to avoid deadlocks. |
| Of course, CPUs that respect data dependencies would define such a barrier |
| to simply be \co{smp_wmb()}. |
| Perhaps this decision should be revisited in the future as Alpha |
| fades off into the sunset, but as of 2017 there is a surprisingly |
| large number of people who run recent Linux kernels on their lovingly |
| preserved DEC Alpha systems. |
| |
| \begin{listing}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct el *insert(long key, long data) |
| 2 { |
| 3 struct el *p; |
| 4 p = kmalloc(sizeof(*p), GFP_ATOMIC); |
| 5 spin_lock(&mutex); |
| 6 p->next = head.next; |
| 7 p->key = key; |
| 8 p->data = data; |
| 9 smp_wmb(); |
| 10 head.next = p; |
| 11 spin_unlock(&mutex); |
| 12 } |
| 13 |
| 14 struct el *search(long key) |
| 15 { |
| 16 struct el *p; |
| 17 p = rcu_dereference(head.next); |
| 18 while (p != &head) { |
| 19 if (p->key == key) { |
| 20 return (p); |
| 21 } |
| 22 p = rcu_dereference(p->next); |
| 23 }; |
| 24 return (NULL); |
| 25 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Safe Insert and Lock-Free Search} |
| \label{lst:memorder:Safe Insert and Lock-Free Search} |
| \end{listing} |
| |
| The Linux memory-barrier primitives took their names from the Alpha |
| instructions, so \co{smp_mb()} is {\tt mb}, \co{smp_rmb()} is {\tt rmb}, |
| and \co{smp_wmb()} is {\tt wmb}. |
| Alpha is the only CPU where \co{smp_read_barrier_depends()} is |
| an \co{smp_mb()} rather than a no-op. |
| |
| \QuickQuiz{} |
| Why is Alpha's \co{smp_read_barrier_depends()} an |
| \co{smp_mb()} rather than \co{smp_rmb()}? |
| \QuickQuizAnswer{ |
| First, Alpha has only \co{mb} and \co{wmb} instructions, |
| so \co{smp_rmb()} would be implemented by the Alpha \co{mb} |
| instruction in either case. |
| |
| More importantly, \co{smp_read_barrier_depends()} must |
| order subsequent stores. |
| For example, consider the following code: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 p = global_pointer; |
| 2 smp_read_barrier_depends(); |
| 3 if (do_something_with(p->a, p->b) == 0) |
| 4 p->hey_look = 1; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Here the store to \co{p->hey_look} must be ordered, |
| not just the loads from \co{p->a} and \co{p->b}. |
| } \QuickQuizEnd |
| |
| For more on Alpha, see its reference manual~\cite{ALPHA2002}. |
| |
| \subsection{ARMv7-A/R} |
| \label{sec:memorder:ARMv7-A/R} |
| |
| The ARM family of CPUs is extremely popular in embedded applications, |
| particularly for power-constrained applications such as cellphones. |
| Its memory model is similar to that of Power |
(see Section~\ref{sec:memorder:POWER / PowerPC}), but ARM uses a
different set of memory-barrier instructions~\cite{ARMv7A:2010}:
| |
| \begin{description} |
| \item [\tco{DMB}] (data memory barrier) causes the specified type of |
| operations to \emph{appear} to have completed before any |
| subsequent operations of the same type. |
| The ``type'' of operations can be all operations or can be |
| restricted to only writes (similar to the Alpha \co{wmb} |
| and the POWER \co{eieio} instructions). |
| In addition, ARM allows cache coherence to have one of three |
| scopes: single processor, a subset of the processors |
| (``inner'') and global (``outer''). |
| \item [\tco{DSB}] (data synchronization barrier) causes the specified |
| type of operations to actually complete before any subsequent |
| operations (of any type) are executed. |
| The ``type'' of operations is the same as that of \co{DMB}. |
| The \co{DSB} instruction was called \co{DWB} (drain write buffer |
| or data write barrier, your choice) in early versions of the |
| ARM architecture. |
| \item [\tco{ISB}] (instruction synchronization barrier) flushes the CPU |
| pipeline, so that all instructions following the \co{ISB} |
| are fetched only after the \co{ISB} completes. |
| For example, if you are writing a self-modifying program |
| (such as a JIT), you should execute an \co{ISB} between |
| generating the code and executing it. |
| \end{description} |
| |
| None of these instructions exactly match the semantics of Linux's |
| \co{rmb()} primitive, which must therefore be implemented as a full |
| \co{DMB}. |
| The \co{DMB} and \co{DSB} instructions have a recursive definition |
| of accesses ordered before and after the barrier, which has an effect |
| similar to that of POWER's cumulativity. |
| |
| ARM also implements control dependencies, so that if a conditional |
| branch depends on a load, then any store executed after that conditional |
| branch will be ordered after the load. |
| However, loads following the conditional branch will \emph{not} |
| be guaranteed to be ordered unless there is an \co{ISB} |
| instruction between the branch and the load. |
| Consider the following example: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 r1 = x; |
| 2 if (r1 == 0) |
| 3 nop(); |
| 4 y = 1; |
| 5 r2 = z; |
| 6 ISB(); |
| 7 r3 = z; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| In this example, load-store control dependency ordering causes |
| the load from \co{x} on line~1 to be ordered before the store to |
| \co{y} on line~4. |
| However, ARM does not respect load-load control dependencies, so that |
| the load on line~1 might well happen \emph{after} the |
| load on line~5. |
| On the other hand, the combination of the conditional branch on line~2 |
| and the \co{ISB} instruction on line~6 ensures that |
| the load on line~7 happens after the load on line~1. |
| Note that inserting an additional \co{ISB} instruction somewhere between |
| lines~3 and 4 would enforce ordering between lines~1 and 5. |
| |
| \subsection{ARMv8} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/r-2014-LDLAR}} |
| \caption{Half Memory Barrier} |
| \ContributedBy{Figure}{fig:memorder:Half Memory Barrier}{Melissa Brossard} |
| \end{figure} |
| |
| ARMv8 is ARM's 64-bit CPU~\cite{ARMv8A:2017}, |
| in contrast to their 32-bit CPU described in |
| Section~\ref{sec:memorder:ARMv7-A/R}. |
ARMv8's memory model closely resembles its 32-bit counterpart,
but adds load-acquire (\co{LDARB}, \co{LDARH}, and \co{LDAR})
and store-release (\co{STLRB}, \co{STLRH}, and \co{STLR})
instructions.
These instructions act as ``half memory barriers'', so that
ARMv8 CPUs can reorder previous accesses with a later \co{LDAR}
instruction, but are prohibited from reordering an earlier \co{LDAR}
instruction with later accesses, as fancifully depicted in
Figure~\ref{fig:memorder:Half Memory Barrier}.
Similarly, ARMv8 CPUs can reorder an earlier \co{STLR} instruction with
subsequent accesses, but are prohibited from reordering
previous accesses with a later \co{STLR} instruction.
| As one might expect, this means that these instructions directly support |
| the C11 notion of load-acquire and store-release. |
| |
| However, ARMv8 goes well beyond the C11 memory model by mandating that |
| the combination of a store-release and load-acquire act as a full |
| barrier under many circumstances. |
| For example, in ARMv8, given a store followed by a store-release followed |
| by a load-acquire followed by a load, all to different variables, all CPUs |
| would agree that the initial store preceded the final load. |
| Interestingly enough, most TSO architectures (including x86 and the |
| mainframe) do not make this guarantee, as the two loads could be |
| reordered before the two stores. |
| |
| ARMv8 also has the distinction of being the first CPU whose vendor publicly |
| defined its memory ordering with a formal model~\cite{ARMv8A:2017}. |
| |
| \subsection{Itanium} |
| |
| Itanium offers a weak consistency model, so that in absence of explicit |
| memory-barrier instructions, Itanium is within its rights to arbitrarily |
| reorder memory references~\cite{IntelItanium02v2}. |
| Itanium has a memory-fence instruction named {\tt mf}, but also has |
| ``half-memory fence'' modifiers to loads, stores, and to some of its atomic |
| instructions~\cite{IntelItanium02v3}. |
| The {\tt acq} modifier prevents subsequent memory-reference instructions |
| from being reordered before the {\tt acq}, but permits |
| prior memory-reference instructions to be reordered after the {\tt acq}, |
| similar to the ARMv8 load-acquire instructions. |
| Similarly, the {\tt rel} modifier prevents prior memory-reference |
| instructions from being reordered after the {\tt rel}, but allows |
| subsequent memory-reference instructions to be reordered before |
| the {\tt rel}. |
| |
| These half-memory fences are useful for critical sections, since |
| it is safe to push operations into a critical section, but can be |
| fatal to allow them to bleed out. |
| However, as one of the only CPUs with this property,\footnote{ |
| ARMv8 has recently added load-acquire and store-release instructions.} |
| Itanium defines |
| Linux's semantics of memory ordering associated with lock acquisition |
| and release. |
| Oddly enough, actual Itanium hardware is rumored to implement |
| both load-acquire and store-release instructions as full barriers. |
| Nevertheless, Itanium was the first mainstream CPU to introduce the concept |
| (if not the reality) of load-acquire and store-release into its |
| instruction set. |
| |
| The Itanium {\tt mf} instruction is used for the \co{smp_rmb()}, |
| \co{smp_mb()}, and \co{smp_wmb()} primitives in the Linux kernel. |
| Oh, and despite rumors to the contrary, the ``mf'' mnemonic really |
| does stand for ``memory fence''. |
| |
| Finally, Itanium offers a global total order for ``release'' operations, |
| including the ``mf'' instruction. |
| This provides the notion of transitivity, where if a given code fragment |
| sees a given access as having happened, any later code fragment will |
| also see that earlier access as having happened. |
| Assuming, that is, that all the code fragments involved correctly use |
| memory barriers. |
| |
| \subsection{MIPS} |
| |
| The MIPS memory model~\cite[Table 6.6]{MIPSvII-A-2015} |
| appears to resemble that of ARM, Itanium, and Power, |
| being weakly ordered by default, but respecting dependencies. |
| MIPS has a wide variety of memory-barrier instructions, but ties them |
| not to hardware considerations, but rather to the use cases provided |
| by the Linux kernel and the C++11 standard~\cite{RichardSmith2015N4527} |
| in a manner similar to the ARM64 additions: |
| |
| \begin{description}[style=nextline] |
| \item[\tco{SYNC}] |
| Full barrier for a number of hardware operations in addition |
| to memory references. |
| \item[\tco{SYNC_WMB}] |
| Write memory barrier, which can be used to implement the |
| \co{smp_wmb()} primitive in the Linux kernel. |
| \item[\tco{SYNC_MB}] |
| Full memory barrier, but only for memory operations. |
| This may be used to implement the Linux-kernel \co{smp_mb()} |
| and the C++ \co{atomic_thread_fence(memory_order_seq_cst)}. |
| \item[\tco{SYNC_ACQUIRE}] |
| Acquire memory barrier, which may be used to implement the |
| Linux-kernel \co{smp_load_acquire()} and the C++ |
| \co{atomic_load_explicit(..., memory_order_acquire)}. |
| \item[\tco{SYNC_RELEASE}] |
| Release memory barrier, which may be used to implement the |
| Linux-kernel \co{smp_store_release()} and the C++ |
| \co{atomic_store_explicit(..., memory_order_release)}. |
| \item[\tco{SYNC_RMB}] |
| Read memory barrier, which can be used to implement the |
| \co{smp_rmb()} primitive in the Linux kernel. |
| \end{description} |
| |
| Informal discussions with MIPS architects indicate that MIPS has a |
| definition of transitivity or cumulativity similar to that of |
| ARM and Power. |
| However, it appears that different MIPS implementations can have |
| different memory-ordering properties, so it is important to consult |
| the documentation for the specific MIPS implementation you are using. |
| |
| \subsection{PA-RISC} |
| |
| Although the PA-RISC architecture permits full reordering of loads and |
| stores, actual CPUs run fully ordered~\cite{GerryKane96a}. |
| This means that the Linux kernel's memory-ordering primitives generate |
| no code; however, they do use the gcc {\tt memory} attribute to disable |
| compiler optimizations that would reorder code across the memory |
| barrier. |
| |
| \subsection{POWER / PowerPC} |
| \label{sec:memorder:POWER / PowerPC} |
| |
| The POWER and PowerPC\textsuperscript{\textregistered} |
| CPU families have a wide variety of memory-barrier |
| instructions~\cite{PowerPC94,MichaelLyons05a}: |
| \begin{description} |
| \item [\tco{sync}] causes all preceding operations to {\em appear to have} |
| completed before any subsequent operations are started. |
| This instruction is therefore quite expensive. |
| \item [\tco{lwsync}] (light-weight sync) orders loads with respect to |
| subsequent loads and stores, and also orders stores. |
| However, it does {\em not} order stores with respect to subsequent |
| loads. |
| Interestingly enough, the {\tt lwsync} instruction enforces |
| the same ordering as does zSeries, and coincidentally, |
| SPARC TSO. |
| The \co{lwsync} instruction may be used to implement |
| load-acquire and store-release operations. |
| \item [\tco{eieio}] (enforce in-order execution of I/O, in case you |
| were wondering) causes all preceding cacheable stores to appear |
| to have completed before all subsequent stores. |
| However, stores to cacheable memory are ordered separately from |
| stores to non-cacheable memory, which means that {\tt eieio} |
| will not force an MMIO store to precede a spinlock release. |
| \item [\tco{isync}] forces all preceding instructions to appear to have |
| completed before any subsequent instructions start execution. |
| This means that the preceding instructions must have progressed |
| far enough that any traps they might generate have either happened |
| or are guaranteed not to happen, and that any side-effects of |
| these instructions (for example, page-table changes) are seen by the |
| subsequent instructions. |
| \end{description} |
| |
| Unfortunately, none of these instructions line up exactly with Linux's |
| {\tt wmb()} primitive, which requires {\em all} stores to be ordered, |
| but does not require the other high-overhead actions of the {\tt sync} |
| instruction. |
| But there is no choice: ppc64 versions of {\tt wmb()} and {\tt mb()} are |
| defined to be the heavyweight {\tt sync} instruction. |
| However, Linux's \co{smp_wmb()} instruction is never used for MMIO |
| (since a driver must carefully order MMIOs in UP as well as |
| SMP kernels, after all), so it is defined to be the lighter weight |
| {\tt eieio} instruction. |
| This instruction may well be unique in having a five-vowel mnemonic. |
| The \co{smp_mb()} instruction is also defined to be the {\tt sync} |
| instruction, but both \co{smp_rmb()} and \co{rmb()} are defined to |
| be the lighter-weight {\tt lwsync} instruction. |
| |
| Power features ``cumulativity'', which can be used to obtain |
| transitivity. |
| When used properly, any code seeing the results of an earlier |
| code fragment will also see the accesses that this earlier code |
| fragment itself saw. |
| Much more detail is available from |
| McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}. |
| |
| Power respects control dependencies in much the same way that ARM |
| does, with the exception that the Power \co{isync} instruction |
| is substituted for the ARM \co{ISB} instruction. |
| |
| Many members of the POWER architecture have incoherent instruction |
| caches, so that a store to memory will not necessarily be reflected |
| in the instruction cache. |
| Thankfully, few people write self-modifying code these days, but JITs |
| and compilers do it all the time. |
| Furthermore, recompiling a recently run program looks just like |
| self-modifying code from the CPU's viewpoint. |
| The {\tt icbi} instruction (instruction cache block invalidate) |
| invalidates a specified cache line from |
| the instruction cache, and may be used in these situations. |
| |
| \subsection{SPARC RMO, PSO, and TSO} |
| |
| Solaris on SPARC uses TSO (total-store order), as does Linux when built for |
| the ``sparc'' 32-bit architecture. |
| However, a 64-bit Linux kernel (the ``sparc64'' architecture) |
| runs SPARC in RMO (relaxed-memory order) mode~\cite{SPARC94}. |
| The SPARC architecture also offers an intermediate PSO (partial store |
| order). |
| Any program that runs in RMO will also run in either PSO or TSO, and similarly, |
| a program that runs in PSO will also run in TSO. |
| Moving a shared-memory parallel program in the other direction may |
| require careful insertion of memory barriers, although, as noted earlier, |
| programs that make standard use of synchronization primitives need not |
| worry about memory barriers. |
| |
| SPARC has a very flexible memory-barrier instruction~\cite{SPARC94} |
| that permits fine-grained control of ordering: |
| \begin{description} |
| \item [\tco{StoreStore}] orders preceding stores before subsequent stores. |
| (This option is used by the Linux \co{smp_wmb()} primitive.) |
| \item [\tco{LoadStore}] orders preceding loads before subsequent stores. |
| \item [\tco{StoreLoad}] orders preceding stores before subsequent loads. |
| \item [\tco{LoadLoad}] orders preceding loads before subsequent loads. |
| (This option is used by the Linux \co{smp_rmb()} primitive.) |
| \item [\tco{Sync}] fully completes all preceding operations before starting |
| any subsequent operations. |
| \item [\tco{MemIssue}] completes preceding memory operations before subsequent |
| memory operations, important for some instances of memory-mapped |
| I/O. |
| \item [\tco{Lookaside}] does the same as MemIssue, |
| but only applies to preceding stores |
| and subsequent loads, and even then only for stores and loads that |
| access the same memory location. |
| \end{description} |
| |
| The Linux \co{smp_mb()} primitive uses the first four options together, |
| as in |
| \co{membar #LoadLoad | #LoadStore | #StoreStore | #StoreLoad}, |
| thus fully ordering memory operations. |
| |
| So, why is \qco{membar #MemIssue} needed? |
| Because a \qco{membar #StoreLoad} could permit a subsequent |
| load to get its value from a store buffer, which would be |
| disastrous if the write was to an MMIO register that induced side effects |
| on the value to be read. |
| In contrast, \qco{membar #MemIssue} would wait until the store buffers |
| were flushed before permitting the loads to execute, |
| thereby ensuring that the load actually gets its value from the MMIO register. |
| Drivers could instead use \qco{membar #Sync}, but the lighter-weight |
| \qco{membar #MemIssue} is preferred in cases where the additional |
| functionality of the more-expensive \qco{membar #Sync} is not required. |
| |
| The \qco{membar #Lookaside} is a lighter-weight version of |
| \qco{membar #MemIssue}, which is useful when writing to a given MMIO register |
| affects the value that will next be read from that register. |
| However, the heavier-weight \qco{membar #MemIssue} must be used when |
| a write to a given MMIO register affects the value that will next be |
| read from {\em some other} MMIO register. |
| |
| It is not clear why SPARC does not define \co{wmb()} to be |
| \qco{membar #MemIssue} and \co{smp_wmb()} to be |
| \qco{membar #StoreStore}, |
| as the current definitions seem vulnerable to bugs in some drivers. |
| It is quite possible that all the SPARC CPUs that Linux runs on |
| implement a more conservative memory-ordering model than the architecture |
| would permit. |
| |
| SPARC requires that a {\tt flush} instruction be used between the time |
| an instruction is stored and the time it is executed~\cite{SPARC94}. |
| This is needed to flush any prior value for that location from |
| the SPARC's instruction cache. |
| Note that {\tt flush} takes an address, and will flush only that address |
| from the instruction cache. |
| On SMP systems, all CPUs' caches are flushed, but there is no |
| convenient way to determine when the off-CPU flushes complete, |
| though there is a reference to an implementation note. |
| |
| \subsection{x86} |
| |
| Since the x86 CPUs provide ``process ordering'' so that all CPUs agree |
| on the order of a given CPU's writes to memory, the \co{smp_wmb()} primitive |
| is a no-op for the CPU~\cite{IntelXeonV3-96a}. |
| However, a compiler directive is required to |
| prevent the compiler from performing optimizations that would result |
| in reordering across the \co{smp_wmb()} primitive. |
| |
| On the other hand, x86 CPUs have traditionally given |
| no ordering guarantees for loads, so |
| the \co{smp_mb()} and \co{smp_rmb()} primitives expand to {\tt lock;addl}. |
| This atomic instruction acts as a barrier to both loads and stores. |
| |
| Intel has also published a memory model for |
| x86~\cite{Intelx86MemoryOrdering2007}. |
| It turns out that Intel's actual CPUs enforced tighter ordering than |
| was claimed in the previous specifications, so this model is in effect |
| simply mandating the earlier de-facto behavior. |
| Even more recently, Intel published an updated memory model for |
| x86~\cite[Section 8.2]{Intel64IA32v3A2011}, which mandates a total global order |
| for stores, although individual CPUs are still permitted to see their |
| own stores as having happened earlier than this total global order |
| would indicate. |
| This exception to the total ordering is needed to allow important |
| hardware optimizations involving store buffers. |
| In addition, memory ordering obeys causality, so that if CPU~0 sees a |
| store by CPU~1, then CPU~0 is guaranteed to see all stores that CPU~1 |
| saw prior to its store. |
| Software may use atomic operations to override these hardware optimizations, |
| which is one reason that atomic operations tend to be more expensive |
| than their non-atomic counterparts. |
| This total store order is \emph{not} guaranteed on older processors. |
| |
| It is also important to note that atomic instructions operating |
| on a given memory location should all be of the same |
| size~\cite[Section 8.1.2.2]{Intel64IA32v3A2011}. |
| For example, if you write a program where one CPU atomically increments |
| a byte while another CPU executes a 4-byte atomic increment on |
| that same location, you are on your own. |
| |
| However, note that some SSE instructions are weakly ordered ({\tt clflush} |
| and non-temporal move instructions~\cite{IntelXeonV2b-96a}). |
| CPUs that have SSE can use {\tt mfence} for \co{smp_mb()}, |
| {\tt lfence} for \co{smp_rmb()}, and {\tt sfence} for \co{smp_wmb()}. |
| |
| A few versions of the x86 CPU have a mode bit that enables out-of-order |
| stores, and for these CPUs, \co{smp_wmb()} must also be defined to |
| be {\tt lock;addl}. |
| |
| Although newer x86 implementations accommodate self-modifying code |
| without any special instructions, to be fully compatible with |
| past and potential future x86 implementations, a given CPU must |
| execute a jump instruction or a serializing instruction (e.g., \co{cpuid}) |
| between modifying the code and executing |
| it~\cite[Section 8.1.3]{Intel64IA32v3A2011}. |
| |
| \subsection{zSeries} |
| |
| The zSeries machines make up the IBM\textsuperscript{\texttrademark} |
| mainframe family, previously |
| known as the 360, 370, and 390~\cite{IBMzSeries04a}. |
| Parallelism came late to zSeries, but given that these mainframes first |
| shipped in the mid 1960s, this is not saying much. |
| The \qco{bcr 15,0} instruction is used for the Linux \co{smp_mb()}, |
| \co{smp_rmb()}, and \co{smp_wmb()} primitives. |
| It also has comparatively strong memory-ordering semantics, as shown in |
| Table~\ref{tab:memorder:Summary of Memory Ordering}, which should allow the |
| \co{smp_wmb()} primitive to be a {\tt nop} (and by the time you read this, |
| this change may well have happened). |
| The table actually understates the situation, as the zSeries memory model |
| is otherwise sequentially consistent, meaning that all CPUs |
| will agree on the order of unrelated stores from different CPUs. |
| |
| As with most CPUs, the zSeries architecture does not guarantee a |
| cache-coherent instruction stream, hence, |
| self-modifying code must execute a serializing instruction between updating |
| the instructions and executing them. |
| That said, many actual zSeries machines do in fact accommodate self-modifying |
| code without serializing instructions. |
| The zSeries instruction set provides a large set of serializing instructions, |
| including compare-and-swap, some types of branches (for example, the |
| aforementioned \qco{bcr 15,0} instruction), and test-and-set, |
| among others. |
| |
| \section{Where is Memory Ordering Needed?} |
| \label{sec:memorder:Where is Memory Ordering Needed?} |
| |
| This section revisits |
| Table~\ref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| in light of the intervening discussion. |
| |
| Memory-ordering operations are only required where there is a possibility |
| of interaction involving at least two variables between at least two |
| threads. |
| As always, if a single-threaded program will provide sufficient |
| performance, why bother with parallelism?\footnote{ |
| Hobbyists and researchers should of course feel free to ignore |
| this and other cautions.} |
| After all, avoiding parallelism also avoids the added cost of |
| memory-ordering operations. |
| |
| If all thread-to-thread communication in a given cycle uses store-to-load links |
| (that is, the next thread's load returning the value that the previous thread |
| stored), minimal ordering suffices. |
| Minimal ordering includes dependencies, acquires, and all stronger |
| ordering operations. |
| |
| If all but one of the links in a given cycle is a store-to-load |
| link, it is sufficient to use release-acquire pairs for each of |
| those store-to-load links. |
| You can replace a given acquire with a dependency in environments permitting |
| this, keeping in mind that the C11 standard's memory model does \emph{not} |
| permit this. |
| Note also that a dependency leading to a load must be headed by |
| a \co{lockless_dereference()} or an \co{rcu_dereference()}: |
| \co{READ_ONCE()} is not sufficient. |
| Never forget to carefully review |
| Sections~\ref{sec:memorder:Address- and Data-Dependency Restrictions} |
| and~\ref{sec:memorder:Control-Dependency Restrictions}, because |
| a dependency broken by your compiler is no help at all! |
| The two threads sharing the sole non-store-to-load link can |
| usually substitute \co{WRITE_ONCE()} plus \co{smp_wmb()} for |
| \co{smp_store_release()} on the one hand, |
| and \co{READ_ONCE()} plus \co{smp_rmb()} for \co{smp_load_acquire()} |
| on the other. |
| |
| If a given cycle contains two or more non-store-to-load links (that is, |
| a total of two or more load-to-store and store-to-store links), you will |
| need at least one full barrier between each pair of non-store-to-load |
| links in that cycle. |
| Full barriers include \co{smp_mb()}, successful full-strength non-\co{void} |
| atomic RMW operations, and other atomic RMW operations in conjunction with |
| either \co{smp_mb__before_atomic()} or \co{smp_mb__after_atomic()}. |
| Any of RCU's grace-period-wait primitives (\co{synchronize_rcu()} and |
| friends) also act as full barriers, but at even greater expense than |
| \co{smp_mb()}. |
| With strength comes expense, though the overhead of full barriers |
| usually hurts performance more than it hurts scalability. |
| |
| Note that these are the \emph{minimum} guarantees. |
| Different architectures may give |
| more substantial guarantees, |
| as discussed in Section~\ref{sec:memorder:Hardware Specifics}, |
| but they may \emph{not} |
| be relied upon outside of code specifically designed to run only on |
| the corresponding architecture. |
| |
| One final word of advice: Again, use of raw memory-ordering primitives is |
| a last resort. |
| It is almost always better to use existing primitives, such as locking |
| or RCU, that take care of memory ordering for you. |