| % memorder/memorder.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{qqzmemorder} |
| \OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{the Linux kernel}{Howells2009membartxt} |
| \OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{Linux Weekly News}{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2} |
| \OriginallyPublished{chp:Advanced Synchronization: Memory Ordering}{Advanced Synchronization: Memory Ordering}{ASPLOS '18}{Alglave:2018:FSC:3173162.3177156} |
| % |
| \Epigraph{The art of progress is to preserve order amid change and to preserve change amid order.}{Alfred North Whitehead} |
| |
| Causality and sequencing are deeply intuitive, and hackers often |
| have a strong grasp of these concepts. |
| These intuitions can be quite helpful when writing, analyzing, and |
| debugging not only sequential code, but also parallel code that makes |
| use of standard mutual-exclusion mechanisms such as locking. |
| Unfortunately, these intuitions break down completely in complex |
concurrent code, such as that in the Linux kernel, which often uses
| weakly ordered atomic operations and memory barriers. |
| One example of such code implements the standard mutual-exclusion |
| mechanisms themselves, while another example implements fast |
| paths that use weaker synchronization. |
| Insults to intuition notwithstanding, some argue that weakness is a |
| virtue~\cite{JadeAlglave2013-WeaknessIsVirtue}. |
| Vice or virtue, this chapter will help you gain an understanding of |
memory ordering that, with practice, will be sufficient to implement
| synchronization primitives and performance-critical fast paths. |
| |
| First, \cref{sec:memorder:Memory-Model Intuitions} |
| provides some reliable intuitions and useful rules of thumb, which can |
| often provide a useful alternative to learning the full Linux-kernel |
| memory model (LKMM)\@. |
However, those working on fast paths or concurrency primitives will
| need the additional detail provided by the following sections. |
| \Cref{sec:memorder:Ordering: Why and How?} |
| will demonstrate that real computer systems can reorder memory references, |
| give some reasons why they do so, and provide some information on how |
| to prevent undesired reordering. |
| \Cref{sec:memorder:Tricks and Traps,% |
| sec:memorder:Compile-Time Consternation} |
| will cover the types of pain that hardware and compilers, respectively, |
| can inflict on unwary parallel programmers. |
| \Cref{sec:memorder:Higher-Level Primitives} |
| gives an overview of the benefits of modeling memory ordering at |
| higher levels of abstraction. |
| Finally, |
| \cref{sec:memorder:Hardware Specifics} |
| follows up with more detail on a few representative hardware platforms. |
| |
| \QuickQuiz{ |
| This chapter has been rewritten since the first edition, |
| and heavily edited since the second edition. |
	Did memory ordering change all \emph{that} much since 2014,
| let alone since 2021? |
| }\QuickQuizAnswer{ |
| The earlier memory-ordering section had its roots in a pair of |
| Linux Journal articles~\cite{PaulMcKenney2005i,PaulMcKenney2005j} |
| dating back to 2005. |
| Since then, the C and C++ memory models~\cite{PeteBecker2011N3242} |
| have been formalized |
| (and critiqued~\cite{MarkBatty2013OOTA-WorkingNote,Boehm:2014:OGA:2618128.2618134,Vafeiadis:2015:CCO:2775051.2676995,conf/esop/BattyMNPS15,Lahav:2017:RSC:3140587.3062352,OlivierGiroux2017-P0668R1}), |
| executable formal memory models for computer systems have become the |
| norm~\cite{Maranget2012TutorialARMPower,PaulEMcKenney2011ppcmem,test6-pdf,JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306,JadeAlglave2013-cav,Alglave:2014:HCM:2594291.2594347,PaulEMcKenney2014weakaxiom,Flur:2017:MCA:3093333.3009839,ARMv8A:2017}, |
| and there is even a memory model for the Linux |
| kernel~\cite{JadeAlglave2017LWN-LKMM-1,JadeAlglave2017LWN-LKMM-2,Alglave:2018:FSC:3173162.3177156}, |
| along with a paper describing differences between the C11 and |
| Linux memory models~\cite{PaulEMcKenney2016P0124R6-LKMM}. |
| |
| The \IXacrf{kcsan}~\cite{MarcoElver2020FearDataRaceDetector1,MarcoElver2020FearDataRaceDetector2}, |
| based in part on |
| RacerD~\cite{SamBlackshear2018RacerD} |
| and implementing \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}, |
| has also been added to the Linux kernel |
| and is now heavily used. |
| |
| Finally, there are now better ways of describing the Linux-kernel |
| memory model (LKMM). |
| |
| Given all this progress, substantial change was required. |
| }\QuickQuizEnd |
| |
| \section{Memory-Model Intuitions} |
| \label{sec:memorder:Memory-Model Intuitions} |
| % |
| \epigraph{Almost all people are intelligent. |
| It is method that they lack.} |
| {F. W. Nichol} |
| |
| This section is for people who would like to avoid learning all the |
| details of the \IXalthmr{Linux-kernel memory model (\acr{lkmm})} |
| {Linux kernel}{memory model}, the better |
| to get on with their concurrency lives. |
| The good news is that learning a very small fraction of LKMM can get |
| you most of its benefits. |
| % @@@ \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| % @@@ and \cref{sec:memorder:Basic Rules of Thumb}, |
| % @@@ summarizing the intervening discussion with some appeals to transitive |
| % @@@ intuitions and with more sophisticated rules of thumb. |
| |
| But first, it is necessary to understand that useful ordering is a |
| property not of any single component of the system, but of the system |
| as a whole. |
| For example, it does not help for the hardware to preserve ordering if |
| the compiler does not also do so. |
| Nor does it help for both the hardware and compiler to provide ways |
| to preserve ordering if developers fail to use them. |
| The Linux kernel adds architecture-specific assembly-language primitives |
| to the mix, which must also preserve the required orderings. |
| In short, all the components of the system must work together to provide |
| any needed ordering. |
| |
| Next, it is necessary to understand the temporal and non-temporal |
| nature of communication from one thread to another when using memory |
| as the communications medium, as will be discussed in detail in |
| \cref{sec:memorder:Multicopy Atomicity}. |
| The key point is that although loads and stores are conceptually simple, |
| on real multicore hardware significant periods of time are required for |
| their effects to become visible to all other threads. |
| Strange though it might seem, this means that a given store's value |
| might not be returned by a load that happens later in wall-clock time, |
| and it also means that a given store's value might be overwritten by a |
| store that happens earlier in wall-clock time. |
| |
| The simple and intuitive case occurs when one thread loads a value that |
| some other thread stored. |
| This straightforward cause-and-effect case exhibits temporal behavior, so |
| that the software can safely assume that the store instruction completed |
| before the load instruction started. |
| In real life, the load instruction might well have started quite some |
| time before the store instruction did, but all modern hardware must |
| carefully hide such cases from the software. |
| Software will thus see the expected temporal cause-and-effect |
| behavior when one thread loads a value that some other thread stores, |
| as will be discussed in \cref{sec:memorder:Happens-Before}. |
| |
| This intuitive temporal behavior provides the basis for the next section's |
| transitive intuitions. |
| |
| \subsection{Transitive Intuitions} |
| \label{sec:memorder:Transitive Intuitions} |
| |
| This section lists intuitions regarding single threads or variables, |
| locking, release-acquire chains, RCU, and fully ordered code. |
| |
| \subsubsection{Singular Intuitive Bliss} |
| \label{sec:memorder:Singular Intuitive Bliss} |
| |
| A program that has only one variable or only one thread will see |
| all accesses in order. |
| There is quite a bit of code that can attain adequate performance |
| when running single-threaded on modern computer systems, but |
| this book is primarily about software that needs multiple CPUs. |
| On, then, to the next section. |
| |
| \subsubsection{Locking Intuitions} |
| \label{sec:memorder:Locking Intuitions} |
| |
| Another transitive intuition involves that much-maligned workhorse, |
| locking, as will be described in more detail in |
| \cref{sec:memorder:Locking}, |
| to say nothing of |
| \cref{chp:Locking}. |
| This section contains a graphical description followed by a verbal |
| description. |
| |
| \begin{figure*} |
| \centering |
| \resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/locktrans}} |
| \caption{Locking Intuitions} |
| \label{fig:memorder:Locking Intuitions} |
| \end{figure*} |
| |
| The graphical description is shown in |
| \cref{fig:memorder:Locking Intuitions}, |
| which shows a lock being acquired and released by CPUs~0, 1, and~2 |
| in that order. |
| The solid black arrows depict the unlock-lock ordering. |
| The dotted lines emanating from them to the wide green arrows |
| show the effects on ordering. |
| In particular: |
| |
| \begin{enumerate} |
| \item The fact that CPU~0's unlock precedes CPU~1's lock ensures that |
| any access executed by CPU~0 within or before its critical |
| section will be seen by accesses executed by CPU~1 within |
| and after its critical section. |
| \item The fact that CPU~0's unlock precedes CPU~2's lock ensures that |
| any access executed by CPU~0 within or before its critical |
| section will be seen by accesses executed by CPU~2 within |
| and after its critical section. |
| \item The fact that CPU~1's unlock precedes CPU~2's lock ensures that |
| any access executed by CPU~1 within or before its critical |
| section will be seen by accesses executed by CPU~2 within and |
| after its critical section. |
| \end{enumerate} |
| |
| In short, lock-based ordering is transitive through CPUs~0, 1, and~2. |
| A key point is that this ordering extends beyond the critical sections, |
| so that everything before an earlier lock release is seen by everything |
| after a later lock acquisition. |
| |
| For those who prefer words to diagrams, code holding a given lock will |
| see the accesses in all prior critical sections for that same lock, |
| transitively. |
| And if such code sees the accesses in a given critical section, it will |
| also see the accesses in all of that CPU's code preceding |
| that critical section. |
| In other words, when a CPU releases a given lock, all |
| of that lock's subsequent critical sections will see the accesses in |
| all of that CPU's code preceding that lock release. |
| |
| Inversely, code holding a given lock will be protected from seeing the |
| accesses in any subsequent critical sections for that same lock, again, |
| transitively. |
| And if such code is protected against seeing the accesses in a given |
| critical section, it will also be protected against seeing the accesses |
| in all of that CPU's code following that critical section. |
| In other words, when a CPU acquires a given lock, all of |
| that lock's previous critical sections will be protected from seeing |
| the accesses in all of that CPU's code following that lock |
| acquisition. |
| |
| But what does it mean to ``see accesses'' and exactly what accesses |
| are seen? |
| |
| To start, an access is either a load or a store, possibly occurring as part |
| of a read-modify-write operation. |
| |
| If a CPU's code prior to its release of a given lock contains |
| an access A to a given variable, then for an access B to that same variable |
| contained in any CPU's code following a later acquisition |
| of that same lock: |
| |
| \begin{enumerate} |
| \item If A and B are both loads, then B will return either the same |
| value that A did or some later value. |
| \item If A is a load and B is a store, then B will overwrite either the |
| value loaded by A or some later value. |
| \item If A is a store and B is a load, then B will return either the |
| value stored by A or some later value. |
| \item If A and B are both stores, then B will overwrite either the value |
| stored by A or some later value. |
| \end{enumerate} |
| |
| Here, ``some later value'' is shorthand for ``the value stored by some |
| intervening access''. |
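
For example, consider the following minimal sketch using Linux-kernel
spinlock primitives, in which the hypothetical functions \co{cpu0()}
and \co{cpu1()} run on CPUs~0 and~1, respectively, and in which
CPU~0's critical section is assumed to complete before CPU~1's begins:

\begin{verbatim}
static DEFINE_SPINLOCK(mylock);
int x;

void cpu0(void)                 /* Releases mylock first. */
{
        spin_lock(&mylock);
        x = 1;                  /* Access A: a store. */
        spin_unlock(&mylock);
}

void cpu1(void)                 /* Acquires mylock later. */
{
        int r1;

        spin_lock(&mylock);
        r1 = x;                 /* Access B: a load. */
        spin_unlock(&mylock);
}
\end{verbatim}

Because A is a store and B is a load, the third rule applies, so that
\co{r1} must be either the value~1 stored by A or some later value.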
| |
| \QuickQuiz{ |
| But what about reader-writer locking? |
| }\QuickQuizAnswer{ |
| Reader-writer locking works the same as does exclusive locking, |
| with the proviso that ordering is guaranteed only between a |
| pair of conflicting critical sections. |
| This means that a read-side critical section will be ordered |
| before a later write-side critical section and vice versa. |
| But this also means that there is no guarantee of ordering |
| between a pair of read-side critical sections unless there is |
| an intervening write-side critical section. |
| |
| As of mid-2023, LKMM does not yet model reader-writer locking, |
| but this is on the to-do list. |
| }\QuickQuizEnd |
| |
| Locking is strongly intuitive, which is one reason why it has survived |
| so many attempts to eliminate it. |
| This is also one reason why you should use it where it applies. |
| |
| \subsubsection{Release-Acquire Intuitions} |
| \label{sec:memorder:Release-Acquire Intuitions} |
| |
| Release-acquire chains also behave in a transitively intuitive manner |
| not unlike that of locking, except that release-acquire chains do not |
| provide mutual exclusion. |
| This section also contains a graphical description followed by a verbal |
| description. |
| |
| \begin{figure*} |
| \centering |
| \resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/relacqtrans}} |
| \caption{Release-Acquire Intuitions} |
| \label{fig:memorder:Release-Acquire Intuitions} |
| \end{figure*} |
| |
| The graphical description is shown in |
| \cref{fig:memorder:Release-Acquire Intuitions}, |
| which shows a release-acquire chain extending through CPUs~0, 1, and~2. |
| The solid black arrows depict the release-acquire ordering. |
| The dotted lines emanating from them to the wide green arrows show the |
| effects on ordering. |
| |
| \begin{enumerate} |
| \item The fact that CPU~0's release of A is read by CPU~1's acquire of A |
| ensures that any accesses executed by CPU~0 prior to its release |
| will be seen by any accesses executed by CPU~1 after its acquire. |
| \item The fact that CPU~1's release of B is read by CPU~2's acquire of B |
| ensures that any accesses executed by CPU~1 prior to its release |
| will be seen by any accesses executed by CPU~2 after its acquire. |
| \item Note also that CPU~0's release of A is read by CPU~1's acquire of |
	A, which precedes CPU~1's release of B, which is read by CPU~2's
| acquire of B\@. |
| Taken together, all this ensures that any accesses executed by |
| CPU~0 prior to its release will be seen by any accesses executed |
| by CPU~2 after its acquire. |
| \end{enumerate} |
| |
| This illustrates that properly constructed release-acquire ordering is |
| transitive through CPUs~0, 1, and~2, and in fact may be extended through |
| as many CPUs as needed.\footnote{ |
| But please note that stray stores to either A or B can break |
| the release-acquire chain, as illustrated by |
| \cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}.} |
| |
| For those who prefer words to diagrams, when an acquire loads the value |
| stored by a release, discussed in |
| \cref{sec:memorder:Release-Acquire Chains}, |
then the code following that acquire will see all accesses preceding
the release.
More precisely, if CPU~0 does an acquire that loads the value stored by
CPU~1's release, then all the subsequent accesses executed by CPU~0 will
see all of CPU~1's accesses prior to its release.
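
For example, in the following minimal sketch (with hypothetical
functions \co{cpu0()} and \co{cpu1()}), if \co{cpu0()}'s
\co{smp_load_acquire()} returns the value stored by \co{cpu1()}'s
\co{smp_store_release()}, then \co{r2} is guaranteed to be~42:

\begin{verbatim}
int data;
int flag;

void cpu1(void)                         /* Releasing CPU. */
{
        data = 42;                      /* Prior access. */
        smp_store_release(&flag, 1);
}

void cpu0(void)                         /* Acquiring CPU. */
{
        int r1, r2;

        r1 = smp_load_acquire(&flag);
        if (r1 == 1)
                r2 = data;              /* Must see cpu1()'s store. */
}
\end{verbatim}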
| |
| Similarly, the accesses preceding that release access will be protected |
| from seeing the accesses following the acquire access. |
| (More precision is left as an exercise to the reader.) |
| |
Releases and acquires can be chained; for example, CPU~0's release
| stores the value loaded by CPU~1's acquire, a later release by CPU~1 |
| stores the value loaded by CPU~2's acquire, and so on. |
| The accesses following a given acquire will see the accesses preceding |
| each prior release in the chain, and, inversely, the accesses preceding a |
| given release will be protected from seeing the accesses following each |
| later acquire in the chain. |
| Some long-chain examples will be illustrated by |
| \cref{lst:memorder:Long LB Release-Acquire Chain,% |
| lst:memorder:Long ISA2 Release-Acquire Chain,% |
| lst:memorder:Long Z6.2 Release-Acquire Chain} |
| on |
| \cpagerefrange{lst:memorder:Long LB Release-Acquire Chain}{lst:memorder:Long Z6.2 Release-Acquire Chain}, |
| respectively. |
| |
| The seeing and not seeing of accesses works the same way as described in |
| \cref{sec:memorder:Locking Intuitions}. |
| |
| However, as will be illustrated by |
| \cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)} |
| on |
| \cpageref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}, |
| the acquire access must load exactly what was stored by the release access. |
| Any intervening store that is not itself part of that same release-acquire |
| chain will break the chain. |
| And again, release-acquire chains do not provide mutual exclusion. |
| |
| Nevertheless, properly constructed release-acquire chains are transitive, |
| intuitive, and useful. |
| |
| \subsubsection{RCU Intuitions} |
| \label{sec:memorder:RCU Intuitions} |
| |
| As noted in \cref{sec:defer:RCU Fundamentals} on |
| \cpageref{sec:defer:RCU Fundamentals}, RCU provides a number |
| of ordering guarantees. |
| |
| The first is the publish-subscribe mechanism described in |
| \cref{sec:defer:Publish-Subscribe Mechanism} |
| on |
| \cpageref{sec:defer:Publish-Subscribe Mechanism}. |
This resembles the release-acquire chains discussed in the previous
| section, but substitutes a member of the \co{rcu_dereference()} family |
| of primitives for the \co{smp_load_acquire()}. |
| Unlike \co{smp_load_acquire()}, the ordering implied by |
| \co{rcu_dereference()} applies only to subsequent accesses that |
| dereference the pointer returned by that \co{rcu_dereference()} |
| as shown in |
| \cref{fig:defer:Publication/Subscription Constraints} |
| on |
| \cpageref{fig:defer:Publication/Subscription Constraints}. |
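
For example, in the following minimal sketch (in which
\co{do_something()} is a hypothetical function), the subscriber's
load of \co{p->a} is ordered after its \co{rcu_dereference()}, but
accesses to unrelated variables are not:

\begin{verbatim}
struct foo {
        int a;
};
struct foo *gp;

void publisher(void)
{
        struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

        p->a = 1;                      /* Initialize before publishing. */
        rcu_assign_pointer(gp, p);     /* Publish. */
}

void subscriber(void)
{
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);       /* Subscribe. */
        if (p)
                do_something(p->a);    /* This dereference is ordered. */
        rcu_read_unlock();
}
\end{verbatim}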
| |
| The second guarantee says that if any part of an RCU read-side |
| critical section precedes the beginning of a grace period, then |
| the entirety of that critical section precedes the end of that |
| grace period, as shown in |
| \cref{fig:defer:RCU Reader and Later Grace Period} |
| on |
| \cpageref{fig:defer:RCU Reader and Later Grace Period}. |
| |
| The third guarantee says that if any part of an RCU read-side critical |
| section follows the end of a grace period, then the entirety of that |
| critical section follows the beginning of that grace period, as shown in |
| \cref{fig:defer:RCU Reader and Earlier Grace Period} |
| on |
| \cpageref{fig:defer:RCU Reader and Earlier Grace Period}. |
| |
Both of these guarantees are discussed in
| \cref{sec:defer:Wait For Pre-Existing RCU Readers} |
| on |
| \cpageref{sec:defer:Wait For Pre-Existing RCU Readers}, |
| with more examples shown in |
| \cref{fig:defer:RCU Reader Within Grace Period,% |
| fig:defer:Summary of RCU Grace-Period Ordering Guarantees} |
| on |
| \cpageref{fig:defer:RCU Reader Within Grace Period,fig:defer:Summary of RCU Grace-Period Ordering Guarantees}. |
| These two guarantees have further version-maintenance consequences that |
| are discussed in |
| \cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects} |
| on |
| \cpageref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}. |
| |
| These guarantees will be discussed somewhat more formally in |
| \cref{sec:memorder:RCU} |
| on |
| \cpageref{sec:memorder:RCU}. |
| |
| Much of the sophistication of RCU lies not in its guarantees, but in its |
| use cases, which are the subject of |
| \cref{sec:defer:RCU Usage} |
| starting on |
| \cpageref{sec:defer:RCU Usage}. |
| |
| \subsubsection{Fully Ordered Intuitions} |
| \label{sec:memorder:Fully Ordered Intuitions} |
| |
| A more extreme example of transitivity places at least one \co{smp_mb()} |
| between each pair of accesses. |
| All accesses seen by any given access will also be seen by all later |
| accesses. |
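
For example, a code fragment adhering to this discipline might look
as follows, assuming that \co{x}, \co{y}, and \co{z} are shared
variables and that \co{r1} is a local:

\begin{verbatim}
WRITE_ONCE(x, 1);
smp_mb();           /* Order the store to x before the load from y. */
r1 = READ_ONCE(y);
smp_mb();           /* Order the load from y before the store to z. */
WRITE_ONCE(z, 1);
\end{verbatim}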
| |
| The resulting program will be fully ordered, if somewhat slow. |
| Such programs will be sequentially consistent and much loved by |
| formal-verification experts who specialize in tried-and-true 1980s |
| proof techniques. |
| But slow or not, \co{smp_mb()} is always there when you need it! |
| |
| Nevertheless, there are situations that cannot be addressed by these |
| intuitive approaches. |
| The next section therefore presents a more complete, if less transitive, |
| set of rules of thumb. |
| |
| \subsection{Rules of Thumb} |
| \label{sec:memorder:Rules of Thumb} |
| |
| The transitive intuitions presented in the previous section are |
| very appealing, at least as \IX{memory model}s go. |
| Unfortunately, hardware is under no obligation to provide temporal |
| cause-and-effect illusions when one thread's store overwrites a value either |
| loaded or stored by some other thread. |
| It is quite possible that, from the software's viewpoint, an earlier store |
| will overwrite a later store's value, but only if those two stores were |
| executed by different threads, as will be illustrated by |
| \cref{fig:memorder:Store-to-Store is Counter-Temporal} |
| on |
| \cpageref{fig:memorder:Store-to-Store is Counter-Temporal}. |
| Similarly, a later load might well read a value overwritten by an |
| earlier store, but again only if that load and store were executed by |
| different threads, as will be illustrated by |
| \cref{fig:memorder:Load-to-Store is Counter-Temporal} |
| on |
| \cpageref{fig:memorder:Load-to-Store is Counter-Temporal}. |
| This counter-intuitive behavior occurs due to the need to buffer |
| stores in order to achieve adequate performance, as will be discussed in |
| \cref{sec:memorder:Propagation} |
| on |
| \cpageref{sec:memorder:Propagation}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{.5\onecolumntextwidth}{!}{\includegraphics{memorder/ifthen}} |
| \caption{If-Then Nature of Memory Ordering} |
| \label{fig:memorder:If-Then Nature of Memory Ordering} |
| \end{figure} |
| |
| As a result, situations where one thread reads a value written by some |
| other thread can make do with far weaker ordering than can situations |
| where one thread overwrites a value loaded or stored by some other thread. |
| These differences are captured by the following rules of thumb. |
| However, it is important to note that none of these rules involve |
| mutual exclusion. |
| There is instead an if-then nature to these rules, as illustrated by |
| \cref{fig:memorder:If-Then Nature of Memory Ordering}. |
In the figure, if CPU~1's reference to Y follows that of CPU~0,
| and if both CPUs provide sufficient ordering between their two |
| accesses, then CPU~0's access to X will also precede that of |
| CPU~1. |
| |
| The first rule of thumb is that memory-ordering operations are only |
| required where there is a possibility of interaction between at least |
| two variables shared among at least two threads, which underlies the |
| singular intuitive bliss presented in |
| \cref{sec:memorder:Singular Intuitive Bliss}. |
| In light of the intervening material, this single sentence encapsulates |
| many of the basic rules of thumb that will be presented in |
| \cref{sec:memorder:Basic Rules of Thumb} |
| on |
| \cpageref{sec:memorder:Basic Rules of Thumb}, |
| for example, keeping in mind that ``memory-barrier pairing'' is a |
| two-thread special case of ``cycle''. |
| And, as always, if a single-threaded program will provide sufficient |
| performance, why bother with parallelism?\footnote{ |
| Hobbyists and researchers should of course feel free to ignore |
| this and many other cautions.} |
| After all, avoiding parallelism also avoids the added cost and complexity |
| of memory-ordering operations. |
| |
| The second rule of thumb involves load-buffering situations: |
If all thread-to-thread communication in a given cycle uses store-to-load
| links (that is, the next thread's load returns the value stored by |
| the previous thread), minimal ordering suffices. |
| Minimal ordering includes dependencies and acquires as well as all stronger |
| ordering operations. |
| Because a lock acquisition must load the lock-word value stored by any |
| prior release of that lock, this rule of thumb underlies the locking |
| intuitions presented in |
| \cref{sec:memorder:Locking Intuitions}. |
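
For example, in the following minimal load-buffering sketch in
litmus-test form (the test name is hypothetical), every link in the
cycle is a store-to-load link, so the \co{smp_load_acquire()}
invocations suffice to prevent the \co{exists} clause from triggering:

\begin{verbatim}
C C-LB+acq-o+acq-o

{}

P0(int *x0, int *x1)
{
        int r2;

        r2 = smp_load_acquire(x0);
        WRITE_ONCE(*x1, 2);
}

P1(int *x0, int *x1)
{
        int r2;

        r2 = smp_load_acquire(x1);
        WRITE_ONCE(*x0, 2);
}

exists (0:r2=2 /\ 1:r2=2)
\end{verbatim}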
| |
| The third rule of thumb involves release-acquire chains: |
If all but one of the links in a given cycle are store-to-load
links, it is sufficient to use release-acquire pairs for each of
| those store-to-load links, as will be illustrated by |
| \cref{lst:memorder:Long ISA2 Release-Acquire Chain,% |
| lst:memorder:Long Z6.2 Release-Acquire Chain} |
| starting on |
| \cpageref{lst:memorder:Long ISA2 Release-Acquire Chain}. |
| This rule underlies the release-acquire intuitions presented in |
| \cref{sec:memorder:Release-Acquire Intuitions}. |
| |
| You can replace a given acquire with a dependency in environments permitting |
| this, keeping in mind that the |
| \IXalthmr{C11 standard's memory model}{C11}{memory model} does \emph{not} |
| fully respect dependencies. |
| Therefore, a dependency leading to a load must be headed by |
| a \co{READ_ONCE()} or an \co{rcu_dereference()}: |
| A plain C-language load is not sufficient. |
| In addition, carefully review |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties,% |
| sec:memorder:Control-Dependency Calamities}, |
| on |
| \cpageref{sec:memorder:Address- and Data-Dependency Difficulties,% |
| sec:memorder:Control-Dependency Calamities}, |
| because a dependency broken by your compiler will not order anything. |
| The two threads sharing the sole non-store-to-load link can |
| sometimes substitute \co{WRITE_ONCE()} plus \co{smp_wmb()} for |
| \co{smp_store_release()} on the one hand, |
| and \co{READ_ONCE()} plus \co{smp_rmb()} for \co{smp_load_acquire()} |
| on the other. |
| However, the wise developer will check such substitutions carefully, |
| for example, using the \co{herd} tool as described in |
| \cref{sec:formal:Axiomatic Approaches} |
| on |
| \cpageref{sec:formal:Axiomatic Approaches}. |
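
For example, the following minimal sketch shows this substitution
applied to both sides of the sole non-store-to-load link, keeping in
mind that \co{smp_wmb()} orders only stores and \co{smp_rmb()} orders
only loads:

\begin{verbatim}
/* Possible substitute for smp_store_release(&x, 1): */
smp_wmb();          /* Order prior stores before the store to x. */
WRITE_ONCE(x, 1);

/* Possible substitute for r1 = smp_load_acquire(&x): */
r1 = READ_ONCE(x);
smp_rmb();          /* Order the load from x before later loads. */
\end{verbatim}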
| |
| \QuickQuiz{ |
| Why is it necessary to use heavier-weight ordering for |
| load-to-store and store-to-store links, but not for |
| store-to-load links? |
| What on earth makes store-to-load links so special??? |
| }\QuickQuizAnswer{ |
| Recall that load-to-store and store-to-store links can be |
| counter-temporal, as illustrated by |
| \cref{fig:memorder:Load-to-Store is Counter-Temporal,% |
| fig:memorder:Store-to-Store is Counter-Temporal} in |
| \cref{sec:memorder:Propagation}. |
| This counter-temporal nature of load-to-store and store-to-store |
| links necessitates strong ordering. |
| |
	In contrast, store-to-load links are temporal, as illustrated by
| \cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test,% |
| lst:memorder:Load-Buffering Control-Dependency Litmus Test}. |
| This temporal nature of store-to-load links permits use of |
| minimal ordering. |
| }\QuickQuizEnd |
| |
| The fourth and final rule of thumb identifies where full memory barriers |
| (or stronger) are required: |
| If a given cycle contains two or more non-store-to-load links (that is, a |
| total of two or more links that are either load-to-store or store-to-store |
| links), you will need at least one full barrier between each pair of |
| non-store-to-load links in that cycle, as will be illustrated by |
| \cref{lst:memorder:W+WRC Litmus Test With More Barriers} |
| on |
| \cpageref{lst:memorder:W+WRC Litmus Test With More Barriers} |
| and most especially by the example presented in |
| \cref{sec:memorder:A Counter-Intuitive Case Study} |
| on |
| \cpageref{sec:memorder:A Counter-Intuitive Case Study}. |
| % @@@ \cref{lst:memorder:W+WRC Litmus Test With More Barriers} |
| % @@@ as well as in the answer to |
| % @@@ \QuickQuizARef{\MemorderQQLitmusTestR}. |
| Full barriers include \co{smp_mb()}, successful full-strength non-\co{void} |
| atomic RMW operations, and other atomic RMW operations in conjunction with |
| either \co{smp_mb__before_atomic()} or \co{smp_mb__after_atomic()}. |
| Any of RCU's grace-period-wait primitives (\co{synchronize_rcu()} and |
| friends) also act as full barriers, but at far greater expense than |
| \co{smp_mb()}. |
| With strength comes expense, though full barriers |
| usually hurt performance more than they hurt scalability. |
| The extreme logical endpoint of this rule of thumb underlies the |
| fully ordered intuitions presented in |
| \cref{sec:memorder:Fully Ordered Intuitions}. |
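
For example, in the following minimal sketch (assuming that \co{x}
and \co{y} are shared integers, that \co{counter} is an \co{atomic_t},
and that \co{r1} is a local), the otherwise-unordered \co{atomic_inc()}
combines with \co{smp_mb__before_atomic()} to act as a full barrier:

\begin{verbatim}
WRITE_ONCE(x, 1);
smp_mb__before_atomic();  /* Upgrade the following RMW to smp_mb(). */
atomic_inc(&counter);     /* Provides no ordering by itself. */
r1 = READ_ONCE(y);        /* Thus ordered after the store to x. */
\end{verbatim}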
| |
| Recapping the rules: |
| |
| \begin{enumerate} |
| \item Memory-ordering operations are required only if at least |
| two variables are shared by at least two threads. |
| \item If all links in a cycle are store-to-load links, then |
| minimal ordering suffices. |
| \item If all but one of the links in a cycle are store-to-load links, |
| then each store-to-load link may use a release-acquire pair. |
| \item Otherwise, at least one full barrier is required between |
| each pair of non-store-to-load links. |
| \end{enumerate} |
| |
| Note that an architecture is permitted to provide stronger guarantees, as |
| will be discussed in |
| \cref{sec:memorder:Hardware Specifics}, |
but these guarantees may be relied upon only in code that runs solely
on that architecture.
| In addition, more accurate |
| \IX{memory model}s~\cite{Alglave:2018:FSC:3173162.3177156} |
| may give stronger guarantees with lower-overhead operations than do |
| these rules of thumb, albeit at the expense of greater complexity. |
| In these more formal memory-ordering papers, a store-to-load link is an |
| example of a reads-from (rf) link, a load-to-store link is an example |
| of a from-reads (fr) link, and a store-to-store link is an example of |
| a coherence (co) link. |
| |
| One final word of advice: |
| Use of raw memory-ordering primitives is a last resort. |
| It is almost always better to use existing primitives, such as locking |
| or RCU, thus letting those primitives do the memory ordering for you. |
| |
| With that said, the next section describes why hardware misorders memory |
| references and some of the things that you can do about it. |
| |
| \section{Ordering: |
| Why and How?} |
| \label{sec:memorder:Ordering: Why and How?} |
| % |
| \epigraph{Nothing is orderly till people take hold of it. |
| Everything in creation lies around loose.} |
| {Henry Ward Beecher, updated} |
| |
| One motivation for memory ordering can be seen in the trivial-seeming |
| litmus test in |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test} |
| (\path{C-SB+o-o+o-o.litmus}), |
| which at first glance might |
| appear to guarantee that the \co{exists} clause never triggers.\footnote{ |
| Purists would instead insist that the \co{exists} clause is |
| never \emph{satisfied}, but we use ``trigger'' here by |
| analogy with assertions.} |
| After all, if \nbco{0:r2=0} as shown in the \co{exists} clause,\footnote{ |
| That is, Thread~\co{P0()}'s instance of local variable \co{r2} |
| equals zero. |
| See \cref{sec:formal:Anatomy of a Litmus Test} |
| for documentation of litmus-test nomenclature.} |
| we might hope that Thread~\co{P0()}'s load from~\co{x1} into \co{r2} |
| must have happened before Thread~\co{P1()}'s store to~\co{x1}, which |
| might raise further hopes that Thread~\co{P1()}'s load from~\co{x0} |
| into \co{r2} must happen after Thread~\co{P0()}'s store to~\co{x0}, |
| so that \nbco{1:r2=2}, thus never triggering the \co{exists} clause. |
| The example is symmetric, so similar reasoning might lead |
| us to hope that \nbco{1:r2=0} guarantees that \nbco{0:r2=2}. |
| Unfortunately, the lack of memory barriers dashes these hopes. |
| The CPU is within its rights to reorder |
| the statements within both Thread~\co{P0()} and Thread~\co{P1()}, |
| even on relatively strongly ordered systems such as x86. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-SB+o-o+o-o@whole.fcv} |
| \caption{Memory Misordering: |
| Store-Buffering Litmus Test} |
| \label{lst:memorder:Memory Misordering: Store-Buffering Litmus Test} |
| \end{listing} |
| |
| \QuickQuiz{ |
| The compiler can also reorder Thread~\co{P0()}'s and |
| Thread~\co{P1()}'s memory accesses in |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}, |
| right? |
| }\QuickQuizAnswer{ |
| In general, compiler optimizations carry out more extensive |
| and profound reorderings than CPUs can. |
| However, in this case, the volatile accesses in |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} |
| prevent the compiler from reordering. |
| And also from doing much else as well, so the examples in this |
| chapter will be making heavy use of |
| \co{READ_ONCE()} and \co{WRITE_ONCE()}. |
| See \cref{sec:memorder:Compile-Time Consternation} |
| for more detail on the need for \co{READ_ONCE()} and \co{WRITE_ONCE()}. |
| }\QuickQuizEnd |
| |
| This willingness to reorder can be confirmed using tools such as |
| \co{litmus7}~\cite{Alglave:2014:HCM:2594291.2594347}, |
| which found that the counter-intuitive ordering happened |
| 314 times out of 100,000,000 trials on an x86 laptop. |
| Oddly enough, the perfectly legal outcome where both loads return the |
| value 2 occurred less frequently, in this case, only 167 times.\footnote{ |
| Please note that results are sensitive to the exact hardware |
| configuration, |
| how heavily the system is loaded, and much else besides. |
| So why not try it out on your own system?} |
| The lesson here is clear: |
| Increased counter-intuitivity does not necessarily imply decreased probability! |
| % Run on June 23, 2017: |
| % litmus7 -r 1000 -carch X86 C-SB+o-o+o-o.litmus |
| % Test C-SB+o-o+o-o Allowed |
| % Histogram (4 states) |
| % 314 *>0:r2=0; 1:r2=0; |
| % 49999625:>0:r2=2; 1:r2=0; |
| % 49999894:>0:r2=0; 1:r2=2; |
| % 167 :>0:r2=2; 1:r2=2; |
| |
| The following sections show exactly how this intuition breaks down, |
| and then put forward some mental models of memory ordering that can help |
| you avoid these pitfalls. |
| |
| \Cref{sec:memorder:Why Hardware Misordering?} |
| gives a brief overview of why hardware misorders memory accesses, and then |
| \cref{sec:memorder:How to Force Ordering?} |
| gives an equally brief overview of how you can thwart such misordering. |
| Finally, \cref{sec:memorder:Basic Rules of Thumb} |
| lists some basic rules of thumb, which will be further refined in |
| later sections. |
| These sections focus on hardware reordering, but rest assured that compilers |
| reorder much more aggressively than hardware ever dreamed of doing. |
| Compiler-induced reordering will be taken up in |
| \cref{sec:memorder:Compile-Time Consternation}. |
| |
| \subsection{Why Hardware Misordering?} |
| \label{sec:memorder:Why Hardware Misordering?} |
| |
| But why does memory misordering happen in the first place? |
| Can't CPUs keep track of ordering on their own? |
| Isn't that why we have computers in the first place, to keep track of things? |
| |
| \begin{figure*} |
| \centering |
| \resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch}} |
| \caption{Intel Core 2 Architecture} |
| \ContributedBy{fig:memorder:Intel Core 2 Architecture}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted} |
| \end{figure*} |
| |
| Many people do indeed expect their computers to keep track of things, |
| but many also insist that they keep track of things quickly. |
| In fact, so intense is the focus on performance that modern CPUs |
| are extremely complex, as can be seen in the simplified block diagram in |
| \cref{fig:memorder:Intel Core 2 Architecture}. |
| Those needing to squeeze the last few percent of performance from their |
| systems will in turn need to pay close attention to the fine details of this |
| figure when tuning their software. |
| Except that this close attention to detail means that when a given CPU |
| degrades with age, the software will no longer run quickly on it. |
| For example, if the leftmost ALU fails, software tuned to take full |
| advantage of all of the ALUs might well run more slowly than untuned |
| software. |
| One solution to this problem is to take systems out of service as soon |
| as any of their CPUs start degrading. |
| |
| \begin{figure*} |
| \centering |
| \resizebox{\textwidth}{!}{\includegraphics{memorder/Intel_Core2_arch-simplified}} |
| \caption{Intel Core 2 Architecture Simplified} |
| \ContributedBy{fig:memorder:Intel Core 2 Architecture Simplified}{Wikipedia user “I, Appaloosa” CC BY-SA 3.0, reformatted} |
| \end{figure*} |
| |
| Another option is to recall the lessons of |
| \cref{chp:Hardware and its Habits}, |
| especially the lesson that for many important workloads, main memory |
| cannot keep up with modern CPUs, which can execute hundreds of |
| instructions in the time required to fetch a single variable from memory. |
| For such workloads, the detailed internal structure of the CPU is |
| irrelevant, and the CPU can instead be approximated by the blue shapes in |
| \cref{fig:memorder:Intel Core 2 Architecture Simplified} |
| labeled CPU, store buffer, and cache. |
| |
| Because of these data-intensive workloads, CPUs sport increasingly large |
| caches, as was seen back in |
| \cref{fig:cpu:System Hardware Architecture}, |
| which means that although the first load by a given CPU from a given |
| variable will result in an expensive \emph{cache miss} as was discussed in |
| \cref{sec:cpu:Cache Misses}, subsequent |
| repeated loads from that variable by that CPU might execute |
| very quickly because the initial cache miss will have loaded that |
| variable into that CPU's cache. |
| |
| However, it is also necessary to accommodate frequent concurrent stores |
| from multiple CPUs to a set of shared variables. |
| In cache-coherent systems, if the caches hold multiple copies of a given |
| variable, all the copies of that variable must have the same value. |
| This works extremely well for concurrent loads, but not so well for |
| concurrent stores: |
| Each store must do something about all copies of the old value |
| (another cache miss!), which, given the finite speed of light and |
| the atomic nature of matter, will be slower than impatient software |
| hackers would like. |
And sequences of such stores are the reason for the blue block
labeled store buffer in
\cref{fig:memorder:Intel Core 2 Architecture Simplified}.
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{memorder/SystemArchSB}} |
| \caption{System Architecture With Store Buffers} |
| \label{fig:memorder:System Architecture With Store Buffers} |
| \end{figure} |
| |
| Removing the internal CPU complexity from |
| \cref{fig:memorder:Intel Core 2 Architecture Simplified}, |
| adding a second CPU, and showing main memory results in |
| \cref{fig:memorder:System Architecture With Store Buffers}. |
| When a given CPU stores to a variable |
| not present in that CPU's cache, then the new value |
| is instead placed in that CPU's store buffer. |
| The CPU can then proceed immediately, without having to wait for the |
| store to do something about all the old values of that variable |
| residing in other CPUs' caches. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Out-of-order}} |
| \caption{CPUs Can Do Things Out of Order} |
| \ContributedBy{fig:memorder:CPUs Can Do Things Out of Order}{Melissa Broussard} |
| \end{figure} |
| |
| Although store buffers can greatly increase performance, they can cause |
| instructions and memory references to execute out of order, which can |
| in turn cause serious confusion, as fancifully illustrated in |
| \cref{fig:memorder:CPUs Can Do Things Out of Order}. |
| |
| \begin{table*} |
| \rowcolors{6}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \small |
| \centering\OneColumnHSpace{-0.1in} |
| \ebresizewidth{ |
| \begin{tabular}{rllllll} |
| \toprule |
| & \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\ |
| \cmidrule(l){2-4} \cmidrule(l){5-7} |
| & Instruction & Store Buffer & Cache & |
| Instruction & Store Buffer & Cache \\ |
| \cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7} |
| 1 & (Initial state) & & \tco{x1==0} & |
| (Initial state) & & \tco{x0==0} \\ |
| 2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} & |
| \tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\ |
| 3 & \tco{r2 = x1;} (0) & \tco{x0==2} & \tco{x1==0} & |
| \tco{r2 = x0;} (0) & \tco{x1==2} & \tco{x0==0} \\ |
| 4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} & |
| (Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\ |
| 5 & (Finish store) & & \tco{x0==2} & |
| (Finish store) & & \tco{x1==2} \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{Memory Misordering: |
| Store-Buffering Sequence of Events} |
| \label{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| \end{table*} |
| |
| In particular, store buffers cause the memory misordering |
| illustrated by |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
| \Cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| shows the steps leading to this misordering. |
| Row~1 shows the initial state, where CPU~0 has \co{x1} in its cache |
| and CPU~1 has \co{x0} in its cache, both variables having a value of zero. |
| \begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole] |
| Row~2 shows the state change due to each CPU's store (\clnref{st0,st1} of |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}). |
| Because neither CPU has the stored-to variable in its cache, both CPUs |
| record their stores in their respective store buffers. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| But wait!!! |
| On row~2 of |
| \cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| both \co{x0} and \co{x1} each have two values at the same time, |
| namely zero and two. |
| How can that possibly work??? |
| }\QuickQuizAnswer{ |
| There is an underlying cache-coherence protocol that straightens |
| things out, which are discussed in |
| \cref{sec:app:whymb:Cache-Coherence Protocols}. |
| But if you think that a given variable having two values at |
| the same time is surprising, just wait until you get to |
| \cref{sec:memorder:Variables With Multiple Values}! |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:formal:C-SB+o-o+o-o:whole] |
| Row~3 shows the two loads (\clnref{ld0,ld1} of |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}). |
| Because the variable being loaded by each CPU is in that CPU's cache, |
| each load immediately returns the cached value, which in both cases |
| is zero. |
| \end{fcvref} |
| |
| But the CPUs are not done yet: |
| Sooner or later, they must empty their store buffers. |
| Because caches move data around in relatively large blocks called |
| \emph{cachelines}, and because each cacheline can hold several |
| variables, each CPU must get the cacheline into its own cache so |
| that it can update the portion of that cacheline corresponding |
| to the variable in its store buffer, but without disturbing any |
| other part of the cacheline. |
| Each CPU must also ensure that the cacheline is not present in any other |
| CPU's cache, for which a read-invalidate operation is used. |
| As shown on row~4, after both read-invalidate operations complete, |
| the two CPUs have traded cachelines, so that CPU~0's cache now contains |
| \co{x0} and CPU~1's cache now contains \co{x1}. |
| Once these two variables are in their new homes, each CPU can flush |
| its store buffer into the corresponding \IX{cache line}, leaving each |
| variable with its final value as shown on row~5. |
| |
| \QuickQuiz{ |
| But don't the values also need to be flushed from the cache |
| to main memory? |
| }\QuickQuizAnswer{ |
| Perhaps surprisingly, not necessarily! |
| On some systems, |
| if the two variables are being used heavily, they might |
| be bounced back and forth between the CPUs' caches and never |
| land in main memory. |
| }\QuickQuizEnd |
| |
| In summary, store buffers are needed to allow CPUs to handle |
| store instructions efficiently, but they can result in |
| counter-intuitive memory misordering. |
| |
| But what do you do if your algorithm really needs its memory |
| references to be ordered? |
| For example, suppose that you are communicating with a driver using |
| a pair of flags, one that says whether or not the driver is running |
| and the other that says whether there is a request pending for that |
| driver. |
| The requester needs to set the request-pending flag, then check |
| the driver-running flag, and if false, wake the driver. |
| Once the driver has serviced all the pending requests that it knows about, |
| it needs to clear its driver-running flag, then check the request-pending |
| flag to see if it needs to restart. |
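
The following minimal sketch illustrates this handshake, with
hypothetical \co{wake_driver()} and \co{restart_driver()} functions
and with comments marking the points where ordering is required:

\begin{verbatim}
int request_pending;
int driver_running;

void requester(void)
{
        WRITE_ONCE(request_pending, 1);
        /* Ordering needed here! */
        if (!READ_ONCE(driver_running))
                wake_driver();          /* Hypothetical. */
}

void driver(void)
{
        /* ... service all known pending requests ... */
        WRITE_ONCE(driver_running, 0);
        /* Ordering needed here! */
        if (READ_ONCE(request_pending))
                restart_driver();       /* Hypothetical. */
}
\end{verbatim}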
| This very reasonable approach cannot work unless there is some way |
| to make sure that the hardware processes the stores and loads in order. |
| This is the subject of the next section. |
| |
| \subsection{How to Force Ordering?} |
| \label{sec:memorder:How to Force Ordering?} |
| |
| It turns out that there are compiler directives and synchronization |
| primitives (such as locking and RCU) that are responsible for maintaining |
| the illusion of ordering through use of \emph{\IXBpl{memory barrier}} (for |
| example, \co{smp_mb()} in the Linux kernel). |
| These memory barriers can be explicit instructions, as they are on |
| \ARM, \Power{}, Itanium, and Alpha, or they can be implied by other instructions, |
| as they often are on x86. |
| Since these standard synchronization primitives preserve the illusion of |
| ordering, your path of least resistance is to simply use these primitives, |
| thus allowing you to stop reading this section. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-SB+o-mb-o+o-mb-o@whole.fcv} |
| \caption{Memory Ordering: |
| Store-Buffering Litmus Test} |
| \label{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| \end{listing} |
| |
| However, if you need to implement the synchronization primitives |
| themselves, or if you are simply interested in understanding how memory |
| ordering works, read on! |
| The first stop on the journey is |
| \cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| (\path{C-SB+o-mb-o+o-mb-o.litmus}), |
| which places an \co{smp_mb()} Linux-kernel \IXh{full}{memory barrier} between |
| the store and load in both \co{P0()} and \co{P1()}, but is otherwise |
| identical to |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
| % Test C-SB+o-mb-o+o-mb-o Allowed |
| % Histogram (3 states) |
| % 49553298:>0:r2=2; 1:r2=0; |
| % 49636449:>0:r2=0; 1:r2=2; |
| % 810253:>0:r2=2; 1:r2=2; |
| % No |
These barriers prevented the counter-intuitive outcome from happening
in 100,000,000 trials on my x86 laptop.
| Interestingly enough, the added overhead due to these barriers causes the |
| legal outcome where both loads return the value two to happen more |
| than 800,000 times, as opposed to only 167 times for the |
| barrier-free code in |
| \cref{lst:memorder:Memory Misordering: Store-Buffering Litmus Test}. |
| |
| \begin{table*} |
| \rowcolors{6}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \small |
| \centering\OneColumnHSpace{-0.1in} |
| \ebresizewidth{ |
| \begin{tabular}{rllllll} |
| \toprule |
| & \multicolumn{3}{c}{CPU 0} & \multicolumn{3}{c}{CPU 1} \\ |
| \cmidrule(l){2-4} \cmidrule(l){5-7} |
| & Instruction & Store Buffer & Cache & |
| Instruction & Store Buffer & Cache \\ |
| \cmidrule{1-1} \cmidrule(l){2-4} \cmidrule(l){5-7} |
| 1 & (Initial state) & & \tco{x1==0} & |
| (Initial state) & & \tco{x0==0} \\ |
| 2 & \tco{x0 = 2;} & \tco{x0==2} & \tco{x1==0} & |
| \tco{x1 = 2;} & \tco{x1==2} & \tco{x0==0} \\ |
| 3 & \tco{smp_mb();} & \tco{x0==2} & \tco{x1==0} & |
| \tco{smp_mb();} & \tco{x1==2} & \tco{x0==0} \\ |
| 4 & (Read-invalidate) & \tco{x0==2} & \tco{x0==0} & |
| (Read-invalidate) & \tco{x1==2} & \tco{x1==0} \\ |
| 5 & (Finish store) & & \tco{x0==2} & |
| (Finish store) & & \tco{x1==2} \\ |
| 6 & \tco{r2 = x1;} (2) & & \tco{x1==2} & |
| \tco{r2 = x0;} (2) & & \tco{x0==2} \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{Memory Ordering: |
| Store-Buffering Sequence of Events} |
| \label{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events} |
| \end{table*} |
| |
| These barriers have a profound effect on ordering, as can be seen in |
| \cref{tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}. |
| Although the first two rows are the same as in |
| \cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events} |
| and although the \co{smp_mb()} instructions on row~3 |
| do not change state |
| in and of themselves, they do cause the stores to complete |
| (rows~4 and~5) before the |
| loads (row~6), which rules out the counter-intuitive outcome shown in |
| \cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events}. |
Note that variables \co{x0} and \co{x1} each still have more than one
value on row~2.
However, as promised earlier, the \co{smp_mb()} invocations straighten
things out in the end.
| |
| Although full barriers such as \co{smp_mb()} have extremely strong |
| ordering guarantees, their strength comes at a high price in terms |
| of foregone hardware and compiler optimizations. |
| A great many situations can be handled with much weaker ordering guarantees |
that use much cheaper memory-ordering instructions, or, in some cases, no
| memory-ordering instructions at all. |
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.7in} |
| \renewcommand*{\arraystretch}{1.1} |
| \rowcolors{7}{lightgray}{} |
| \ebresizewidth{ |
| \begin{tabular}{lcccccccccccc}\toprule |
| & & \multicolumn{4}{c}{Prior Ordered Operation} & |
| \multicolumn{7}{c}{Subsequent Ordered Operation} \\ |
| \cmidrule(l){3-6} \cmidrule(l){7-13} |
| Operation Providing Ordering & C & |
| Self & R & W & RMW & Self & R & W & DR & DW & RMW & SV \\ |
| \cmidrule(r){1-1} \cmidrule{2-2} \cmidrule(l){3-6} \cmidrule(l){7-13} |
| Store, for example, \tco{WRITE_ONCE()} & & |
| Y & & & & & & & & & & Y \\ |
| Load, for example, \tco{READ_ONCE()} & & |
| Y & & & & & & & Y & Y & & Y \\ |
| \tco{_relaxed()} RMW operation & & |
| Y & & & & & & & Y & Y & & Y \\ |
| \tco{*_dereference()} & & |
| Y & & & & & & & Y & Y & & Y \\ |
| Successful \tco{*_acquire()} & & |
| R & & & & & Y & Y & Y & Y & Y & Y \\ |
| Successful \tco{*_release()} & C & |
| & Y & Y & Y & W & & & & & & Y \\ |
| \tco{smp_rmb()} & & |
| & Y & & R & & Y & & Y & & R & \\ |
| \tco{smp_wmb()} & & |
| & & Y & W & & & Y & & Y & W & \\ |
| \tco{smp_mb()} and \tco{synchronize_rcu()} & CP & |
| & Y & Y & Y & & Y & Y & Y & Y & Y & \\ |
| Successful full-strength non-\tco{void} RMW & CP & |
| Y & Y & Y & Y & Y & Y & Y & Y & Y & Y & Y \\ |
| \tco{smp_mb__before_atomic()} & CP & |
| & Y & Y & Y & & a & a & a & a & Y & \\ |
| \tco{smp_mb__after_atomic()} & CP & |
| & a & a & Y & & Y & Y & Y & Y & Y & \\ |
| \bottomrule |
| \end{tabular} |
| } |
| |
| \vspace{5pt}\hfill |
| \ebresizeverb{.8}{ |
| \framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt} |
| \rowcolors{1}{}{} |
| \begin{tabular}{lrl} |
| Key: & C: & Ordering is cumulative \\ |
| & P: & Ordering propagates \\ |
| & R: & Read, for example, \tco{READ_ONCE()}, or read portion of RMW \\ |
| & W: & Write, for example, \tco{WRITE_ONCE()}, or write portion of RMW \\ |
| & Y: & Provides the specified ordering \\ |
| & a: & Provides specified ordering given intervening RMW atomic operation \\ |
| & DR: & Dependent read (address dependency, \cref{sec:memorder:Address Dependencies}) \\ |
| & DW: & Dependent write (address, data, or control dependency, \crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}) \\ |
| & RMW: & \IX{Atomic read-modify-write operation} \\ |
| & Self: & Orders self, as opposed to accesses both before |
| and after \\ |
| & SV: & Orders later accesses to the same variable \\ |
| \multicolumn{3}{l}{\emph{Applies to Linux kernel v4.15 and later.}} \\ |
| \end{tabular} |
| }\OneColumnHSpace{-0.9in} |
| } |
| \caption{Linux-Kernel Memory-Ordering Cheat Sheet} |
| \label{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| \end{table*} |
| |
| \Cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
provides a cheat sheet of the Linux kernel's ordering primitives and their
| guarantees. |
| Each row corresponds to a primitive or category of primitives that might |
| or might not provide ordering, with the columns labeled |
| ``Prior Ordered Operation'' and ``Subsequent Ordered Operation'' |
| being the operations that might (or might not) be ordered against. |
| Cells containing ``Y'' indicate that ordering is supplied unconditionally, |
| while other characters indicate that ordering is supplied only partially or |
| conditionally. |
| Blank cells indicate that no ordering is supplied. |
| |
| The ``Store'' row also covers the store portion of an |
| \IXalt{atomic RMW operation}{atomic read-modify-write operation}. |
| In addition, the ``Load'' row covers the load |
| component of a successful value-returning \co{_relaxed()} RMW atomic |
| operation, although the combined ``\co{_relaxed()} RMW operation'' |
| line provides a convenient combined reference in the value-returning case. |
| A CPU executing unsuccessful value-returning atomic RMW operations must |
| invalidate the corresponding variable from all other CPUs' caches. |
| Therefore, unsuccessful value-returning atomic RMW operations have many |
| of the properties of a store, which means that the ``\co{_relaxed()} |
| RMW operation'' line also applies to unsuccessful value-returning atomic |
| RMW operations. |
| |
| The \co{*_acquire} row covers \co{smp_load_acquire()}, |
| \co{cmpxchg_acquire()}, \co{xchg_acquire()}, and so on; the \co{*_release} |
| row covers \co{smp_store_release()}, \co{rcu_assign_pointer()}, |
| \co{cmpxchg_release()}, \co{xchg_release()}, and so on; and |
| the ``Successful full-strength non-\co{void} RMW'' row covers |
| \co{atomic_add_return()}, \co{atomic_add_unless()}, \co{atomic_dec_and_test()}, |
| \co{cmpxchg()}, \co{xchg()}, and so on. |
| The ``Successful'' qualifiers apply to primitives such as |
| \co{atomic_add_unless()}, \co{cmpxchg_acquire()}, and \co{cmpxchg_release()}, |
| which have no effect on either memory or on ordering when they indicate |
| failure, as indicated by the earlier ``\co{_relaxed()} RMW operation'' row. |
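
For example, the following minimal sketch contrasts the two cases
for \co{cmpxchg()}, assuming that \co{x} is a shared integer and
\co{old} is a local:

\begin{verbatim}
old = cmpxchg(&x, 0, 1);  /* Attempt to change x from 0 to 1. */
if (old == 0) {
        /* Succeeded: full ordering before and after. */
} else {
        /* Failed: only "_relaxed() RMW operation" ordering. */
}
\end{verbatim}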
| |
| Column ``C'' indicates cumulativity and propagation, as explained in |
| \cref{sec:memorder:Cumulativity,% |
| sec:memorder:Propagation}. |
| In the meantime, this column can usually be ignored when there |
| are at most two threads involved. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| The rows in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| seem quite random and confused. |
| Whatever is the conceptual basis of this table??? |
| }\QuickQuizAnswerB{ |
| The rows correspond roughly to hardware mechanisms of increasing |
| power and overhead. |
| |
| The \co{WRITE_ONCE()} row captures the fact that accesses to |
| a single variable are always fully ordered, as indicated by |
| the ``SV'' column. |
| Note that all other operations providing ordering against accesses |
| to multiple variables also provide this same-variable ordering. |
| |
| The \co{READ_ONCE()} row captures the fact that (as of 2021) compilers |
| and CPUs do not indulge in user-visible speculative stores, so that |
| any store whose address, data, or execution depends on a prior load |
| is guaranteed to happen after that load completes. |
| However, this guarantee assumes that these dependencies have |
| been constructed carefully, as described in |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties,% |
| sec:memorder:Control-Dependency Calamities}. |
| |
| The ``\co{_relaxed()} RMW operation'' row captures the fact |
| that a value-returning \co{_relaxed()} RMW has done a load and a |
| store, which are every bit as good as a \co{READ_ONCE()} and a |
| \co{WRITE_ONCE()}, respectively. |
| |
| The \co{*_dereference()} row captures the address and data |
| dependency ordering provided by \co{rcu_dereference()} and friends. |
| Again, these dependencies must be constructed carefully,
| as described in |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties}. |
| |
| The ``Successful \co{*_acquire()}'' row captures the fact that many |
| CPUs have special ``acquire'' forms of loads and of atomic RMW |
| instructions, |
| and that many other CPUs have lightweight memory-barrier |
| instructions that order prior loads against subsequent loads |
| and stores. |
| |
| The ``Successful \co{*_release()}'' row captures the fact that many |
| CPUs have special ``release'' forms of stores and of atomic RMW |
| instructions, and that many other CPUs have lightweight memory-barrier |
| instructions that order prior loads and stores against |
| subsequent stores. |
| |
| The \co{smp_rmb()} row captures the fact that many CPUs have |
| lightweight memory-barrier instructions that order prior loads against |
| subsequent loads. |
| Similarly, |
| the \co{smp_wmb()} row captures the fact that many CPUs have |
| lightweight memory-barrier instructions that order prior stores against |
| subsequent stores. |
| |
| None of the ordering operations thus far require prior stores to be |
| ordered against subsequent loads, which means that these operations |
| need not interfere with store buffers, whose main purpose in life |
| is in fact to reorder prior stores against subsequent loads. |
| The lightweight nature of these operations is precisely due to |
| their policy of store-buffer non-interference. |
| However, as noted earlier, it is sometimes necessary to interfere |
| with the store buffer in order to prevent prior stores from being |
| reordered against later stores, which brings us to the remaining |
| rows in this table. |
| |
| The \co{smp_mb()} row corresponds to the \IXh{full}{memory barrier} |
| available on most platforms, with Itanium being the exception |
| that proves the rule. |
| However, even on Itanium, \co{smp_mb()} provides full ordering |
| with respect to \co{READ_ONCE()} and \co{WRITE_ONCE()}, |
| as discussed in \cref{sec:memorder:Itanium}. |
| |
| The ``Successful full-strength non-\co{void} RMW'' row captures |
| the fact that on some platforms (such as x86) atomic RMW instructions |
| provide full ordering both before and after. |
| The Linux kernel therefore requires that full-strength non-\co{void} |
| atomic RMW operations provide full ordering in cases where these |
| operations succeed. |
| (The names of full-strength atomic RMW operations do not end in
| \co{_relaxed}, \co{_acquire}, or \co{_release}.)
| As noted earlier, the case where these operations do not succeed |
| is covered by the ``\co{_relaxed()} RMW operation'' row. |
| |
| However, the Linux kernel does not require that either \co{void} |
| or \co{_relaxed()} atomic RMW operations provide any ordering |
| whatsoever, with the canonical example being \co{atomic_inc()}. |
| Therefore, these operations, along with failing non-\co{void}
| atomic RMW operations, may be preceded by \co{smp_mb__before_atomic()}
| and followed by \co{smp_mb__after_atomic()} to provide full
| ordering for any accesses preceding or following both.
| No ordering need be provided for accesses between the |
| \co{smp_mb__before_atomic()} (or, similarly, the |
| \co{smp_mb__after_atomic()}) and the atomic RMW operation, as |
| indicated by the ``a'' entries on the \co{smp_mb__before_atomic()} |
| and \co{smp_mb__after_atomic()} rows of the table. |
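| 
| For example, the following sketch supplies full ordering around an
| otherwise-unordered \co{atomic_inc()} (the variables \co{a} and
| \co{b} and the \co{atomic_t} counter \co{ctr} are illustrative):
| 
| \begin{VerbatimL}
| WRITE_ONCE(a, 1);        /* Ordered before the atomic_inc(). */
| smp_mb__before_atomic();
| atomic_inc(&ctr);        /* By itself provides no ordering. */
| smp_mb__after_atomic();
| WRITE_ONCE(b, 1);        /* Ordered after the atomic_inc(). */
| \end{VerbatimL}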
| |
| In short, the structure of this table is dictated by the |
| properties of the underlying hardware, which are constrained by |
| nothing other than the laws of physics, which were covered back in |
| \cref{chp:Hardware and its Habits}. |
| That is, the table is not random, although it is quite possible |
| that you are confused. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Why is |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| missing \co{smp_mb__after_unlock_lock()} and |
| \co{smp_mb__after_spinlock()}? |
| }\QuickQuizAnswerE{ |
| These two primitives are rather specialized, and at present |
| seem difficult to fit into |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| The \co{smp_mb__after_unlock_lock()} primitive is intended to be placed |
| immediately after a lock acquisition, and ensures that all CPUs |
| see all accesses in prior critical sections as happening before |
| all accesses following the \co{smp_mb__after_unlock_lock()} |
| and also before all accesses in later critical sections. |
| Here ``all CPUs'' includes those CPUs not holding that lock, |
| and ``prior critical sections'' includes all prior critical sections |
| for the lock in question as well as all prior critical sections |
| for all other locks that were released by the same CPU that executed |
| the \co{smp_mb__after_unlock_lock()}. |
| |
| The \co{smp_mb__after_spinlock()} primitive provides the same guarantees
| as does \co{smp_mb__after_unlock_lock()}, but also provides |
| additional visibility guarantees for other accesses performed |
| by the CPU that executed the \co{smp_mb__after_spinlock()}. |
| Given any store S performed prior to any earlier lock acquisition |
| and any load L performed after the \co{smp_mb__after_spinlock()}, |
| all CPUs will see S as happening before~L\@. |
| In other words, if a CPU performs a store S, acquires a lock, |
| executes an \co{smp_mb__after_spinlock()}, then performs a |
| load L, all CPUs will see S as happening before~L\@. |
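| 
| In sketch form, with \co{s}, \co{l}, and \co{mylock} being
| illustrative names:
| 
| \begin{VerbatimL}
| WRITE_ONCE(s, 1);         /* Store S. */
| spin_lock(&mylock);
| smp_mb__after_spinlock();
| r1 = READ_ONCE(l);        /* Load L: All CPUs see S before L. */
| \end{VerbatimL}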
| }\QuickQuizEndE |
| } |
| |
| It is important to note that this table is just a cheat sheet, |
| and is therefore in no way a replacement for a good understanding |
| of memory ordering. |
| To begin building such an understanding, the next section will |
| present some basic rules of thumb. |
| |
| \subsection{Basic Rules of Thumb} |
| \label{sec:memorder:Basic Rules of Thumb} |
| |
| This section presents some basic rules of thumb that are ``good and |
| sufficient'' for a great many situations. |
| In fact, you could write a great deal of concurrent code having |
| excellent performance and scalability without needing anything more |
| than these rules of thumb. |
| More sophisticated rules of thumb will be presented in |
| \cref{sec:memorder:Memory-Model Intuitions}. |
| |
| \QuickQuiz{ |
| But how can I know that a given project can be designed |
| and coded within the confines of these rules of thumb? |
| }\QuickQuizAnswer{ |
| Much of the purpose of the remainder of this chapter is |
| to answer exactly that question! |
| }\QuickQuizEnd |
| |
| \paragraph{A given thread sees its own accesses in order.} |
| This rule assumes that loads and stores from/to shared variables use |
| \co{READ_ONCE()} and \co{WRITE_ONCE()}, respectively. |
| Otherwise, the compiler can profoundly scramble\footnote{ |
| Many compiler writers prefer the word ``optimize''.} |
| your code, and sometimes the CPU can do a bit of scrambling as well, |
| as discussed in \cref{sec:memorder:Itanium}. |
| |
| \paragraph{Interrupts and signal handlers are part of a thread.} |
| Both interrupt and signal handlers happen between a pair of adjacent |
| instructions in a thread. |
| This means that a given handler appears to execute atomically |
| from the viewpoint of the interrupted thread, at least at the |
| assembly-language level. |
| However, the C and C++ languages do not define the results of handlers |
| and interrupted threads sharing plain variables. |
| Instead, such shared variables must be \co{sig_atomic_t}, lock-free |
| atomics, or \co{volatile}. |
| |
| On the other hand, because the handler executes within the interrupted |
| thread's context, the memory ordering used to synchronize communication |
| between the handler and the thread can be extremely lightweight. |
| For example, the counterpart of an acquire load is a \co{READ_ONCE()}
| followed by a \co{barrier()} compiler directive, and the counterpart
| of a release store is a \co{barrier()} followed by a \co{WRITE_ONCE()}.
| The counterpart of a full memory barrier is \co{barrier()}. |
| Finally, disabling interrupts or signals (as the case may be) within |
| the thread excludes handlers. |
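| 
| For example, the following sketch passes a message from a thread to
| one of that thread's handlers using nothing heavier than \co{barrier()}
| (the names \co{msg}, \co{ready}, and \co{do_something_with()} are
| illustrative):
| 
| \begin{VerbatimL}
| int msg;
| int ready;  /* Or sig_atomic_t for signal handlers. */
| 
| void thread(void)  /* Release-store counterpart. */
| {
| 	WRITE_ONCE(msg, 42);
| 	barrier();  /* Compiler ordering suffices in-thread. */
| 	WRITE_ONCE(ready, 1);
| }
| 
| void handler(void)  /* Acquire-load counterpart. */
| {
| 	if (READ_ONCE(ready)) {
| 		barrier();
| 		do_something_with(READ_ONCE(msg));
| 	}
| }
| \end{VerbatimL}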
| |
| % @@@ Using flags to avoid expensive interrupt-disabling instructions |
| % @@@ and signal-blocking system calls. Harder than it looks! |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{memorder/memorybarrier}} |
| \caption{Memory Barriers Provide Conditional If-Then Ordering} |
| \label{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| \end{figure} |
| |
| \paragraph{Ordering has conditional if-then semantics.} |
| \Cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| illustrates this for memory barriers. |
| Assuming that both memory barriers are strong enough, if CPU~1's access |
| Y1 happens after CPU~0's access Y0, then CPU~1's access X1 is guaranteed |
| to happen after CPU~0's access X0. |
| |
| \QuickQuiz{ |
| How can you tell which memory barriers are strong enough for |
| a given use case? |
| }\QuickQuizAnswer{ |
| Ah, that is a deep question whose answer requires most of the |
| rest of this chapter. |
| But the short answer is that \co{smp_mb()} is almost always |
| strong enough, albeit at some cost. |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:formal:C-SB+o-mb-o+o-mb-o:whole] |
| \Cref{lst:memorder:Memory Ordering: Store-Buffering Litmus Test} |
| is a case in point. |
| The \co{smp_mb()} on \clnref{P0:mb,P1:mb} serve as the barriers, |
| the store to \co{x0} on \clnref{P0:st} as X0, the load from \co{x1} |
| on \clnref{P0:ld} as Y0, the store to \co{x1} on \clnref{P1:st} as Y1, |
| and the load from \co{x0} on \clnref{P1:ld} as X1. |
| Applying the if-then rule step by step, we know that the store to |
| \co{x1} on \clnref{P1:st} happens after the load from \co{x1} on \clnref{P0:ld} if |
| \co{P0()}'s local variable \co{r2} is set to the value zero. |
| The if-then rule would then state that the load from \co{x0} on |
| \clnref{P1:ld} happens after the store to \co{x0} on \clnref{P0:st}. |
| In other words,
| \co{P1()}'s local variable \co{r2} is guaranteed
| to end up with the value two, but only in executions where
| \co{P0()}'s local variable \co{r2} ends up with the value zero.
| This underscores the point that memory ordering guarantees are |
| conditional, not absolute. |
| \end{fcvref} |
| |
| Although |
| \cref{fig:memorder:Memory Barriers Provide Conditional If-Then Ordering} |
| specifically mentions memory barriers, this same if-then rule applies |
| to the rest of the Linux kernel's ordering operations. |
| |
| \paragraph{Ordering operations must be paired.} |
| If you carefully order the operations in one thread, but then fail to do |
| so in another thread, then there is no ordering. |
| Both threads must provide ordering for the if-then rule to apply.\footnote{ |
| In \cref{sec:memorder:Propagation}, pairing will be |
| generalized to cycles.} |
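| 
| For example, in the following sketch the writer's
| \co{smp_store_release()} is not paired with any reader-side ordering,
| so the reader's loads may be reordered, and the outcome \co{r1 == 1}
| with \co{r2 == 0} really can happen
| (\co{msg} and \co{ready} are illustrative):
| 
| \begin{VerbatimL}
| void writer(void)
| {
| 	WRITE_ONCE(msg, 42);
| 	smp_store_release(&ready, 1);  /* Orders the stores... */
| }
| 
| void reader(void)
| {
| 	r1 = READ_ONCE(ready);  /* ...but nothing orders these two */
| 	r2 = READ_ONCE(msg);    /* loads, so the if-then rule fails. */
| }
| \end{VerbatimL}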
| |
| \paragraph{Ordering operations almost never speed things up.} |
| If you find yourself tempted to add a memory barrier in an attempt |
| to force a prior store to be flushed to memory faster, resist! |
| Adding ordering usually slows things down. |
| Of course, there are situations where adding instructions speeds things |
| up, as was shown by |
| \cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR} on |
| \cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR}, |
| but careful benchmarking is required in such cases. |
| And even then, it is quite possible that although you sped things up |
| a little bit on \emph{your} system, you might well have slowed things |
| down significantly on your users' systems. |
| Or on your future system. |
| |
| \paragraph{Ordering operations are not magic.} |
| When your program is failing due to some \IX{race condition}, it is often |
| tempting to toss in a few memory-ordering operations in an attempt |
| to barrier your bugs out of existence. |
| A far better reaction is to use higher-level primitives in a carefully |
| designed manner. |
| With concurrent programming, it is almost always better to design your |
| bugs out of existence than to hack them down to lower probabilities. |
| |
| \paragraph{These are only rough rules of thumb.} |
| Although these rules of thumb cover the vast majority of situations |
| seen in actual practice, as with any set of rules of thumb, they |
| do have their limits. |
| The next section will demonstrate some of these limits by introducing |
| trick-and-trap litmus tests that are intended to insult your |
| intuition while increasing your understanding. |
| These litmus tests will also illuminate many of the concepts |
| represented by the Linux-kernel memory-ordering cheat sheet shown in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}, |
| and can be automatically analyzed given proper |
| tooling~\cite{Alglave:2018:FSC:3173162.3177156}. |
| \Cref{sec:memorder:Memory-Model Intuitions} will |
| circle back to this cheat sheet, presenting a more sophisticated set of |
| rules of thumb in light of lessons learned from all the intervening tricks
| and traps. |
| |
| \QuickQuiz{ |
| Wait!!! |
| Where do I find this tooling that automatically analyzes |
| litmus tests??? |
| }\QuickQuizAnswer{ |
| Get version v4.17 (or later) of the Linux-kernel source code, |
| then follow the instructions in \path{tools/memory-model/README} |
| to install the needed tools. |
| Then follow the further instructions to run these tools on the |
| litmus test of your choice. |
| }\QuickQuizEnd |
| |
| \section{Tricks and Traps} |
| \label{sec:memorder:Tricks and Traps} |
| % |
| \epigraph{Knowing where the trap is---that's the first step in evading it.} |
| {Duke Leto Atreides, \emph{Dune}, Frank Herbert} |
| |
| Now that you know that hardware can reorder memory accesses and that you |
| can prevent it from doing so, the next step is to get you to admit |
| that your intuition has a problem. |
| This painful task is taken up by |
| \cref{sec:memorder:Variables With Multiple Values}, |
| which presents some code demonstrating that scalar variables can |
| take on multiple values simultaneously, |
| and by |
| \crefthro{sec:memorder:Memory-Reference Reordering} |
| {sec:memorder:Multicopy Atomicity}, |
| which show a series of intuitively correct code fragments that fail miserably |
| on real hardware. |
| Once your intuition has made it through the grieving process, later |
| sections will summarize the basic rules that memory ordering follows. |
| |
| But first, let's take a quick look at just how many values a single |
| variable might have at a single point in time. |
| |
| \subsection{Variables With Multiple Values} |
| \label{sec:memorder:Variables With Multiple Values} |
| |
| It is natural to think of a variable as taking on a well-defined |
| sequence of values in a well-defined, global order. |
| Unfortunately, the next stop on the journey says ``goodbye'' to this comforting fiction. |
| Hopefully, you already started to say ``goodbye'' in response to row~2 of |
| \cref{tab:memorder:Memory Misordering: Store-Buffering Sequence of Events,% |
| tab:memorder:Memory Ordering: Store-Buffering Sequence of Events}, |
| and if so, the purpose of this section is to drive this point home. |
| |
| \begin{fcvref}[ln:memorder:Software Logic Analyzer] |
| To this end, consider the program fragment shown in |
| \cref{lst:memorder:Software Logic Analyzer}. |
| This code fragment is executed in parallel by several CPUs. |
| \Clnref{setid} sets a shared variable to the current CPU's ID, \clnref{init} |
| initializes several variables from a \co{gettb()} function that |
| delivers the value of a fine-grained hardware ``timebase'' counter that is |
| synchronized among all CPUs (not available from all CPU architectures, |
| unfortunately!), and the loop from \clnrefrange{loop:b}{loop:e} |
| records the length of |
| time that the variable retains the value that this CPU assigned to it. |
| Of course, one of the CPUs will ``win'', and would thus never exit |
| the loop if not for the check on \clnrefrange{iftmout}{break}. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| What assumption is the code fragment |
| in \cref{lst:memorder:Software Logic Analyzer} |
| making that might not be valid on real hardware? |
| }\QuickQuizAnswer{ |
| The code assumes that as soon as a given CPU stops |
| seeing its own value, it will immediately see the |
| final agreed-upon value. |
| On real hardware, some of the CPUs might well see several |
| intermediate results before converging on the final value. |
| The actual code used to produce the data in the figures |
| discussed later in this section was therefore somewhat more |
| complex. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Software Logic Analyzer] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| state.variable = mycpu; \lnlbl[setid] |
| lasttb = oldtb = firsttb = gettb(); \lnlbl[init] |
| while (state.variable == mycpu) { \lnlbl[loop:b] |
| lasttb = oldtb; |
| oldtb = gettb(); |
| if (lasttb - firsttb > 1000) \lnlbl[iftmout] |
| break; \lnlbl[break] |
| } \lnlbl[loop:e] |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Software Logic Analyzer} |
| \label{lst:memorder:Software Logic Analyzer} |
| \end{listing} |
| |
| Upon exit from the loop, \co{firsttb} will hold a timestamp |
| taken shortly after the assignment and \co{lasttb} will hold |
| a timestamp taken before the last sampling of the shared variable |
| that still retained the assigned value, or a value equal to \co{firsttb} |
| if the shared variable had changed before entry into the loop. |
| This allows us to plot each CPU's view of the value of \co{state.variable} |
| over a 532-nanosecond time period, as shown in |
| \cref{fig:memorder:A Variable With Multiple Simultaneous Values}. |
| This data was collected in 2006 on a 1.5\,GHz \Power{5} system with 8 cores,
| each containing a pair of hardware threads. |
| CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test. |
| The timebase counter period was about 5.32\,ns, sufficiently fine-grained |
| to allow observations of intermediate cache states. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{memorder/MoreThanOneValue}} |
| \caption{A Variable With Multiple Simultaneous Values} |
| \label{fig:memorder:A Variable With Multiple Simultaneous Values} |
| \end{figure} |
| |
| Each horizontal bar represents the observations of a given CPU over time, |
| with the gray regions to the left indicating the time before the |
| corresponding CPU's first measurement. |
| During the first 5\,ns, only CPU~3 has an opinion about the value of the |
| variable. |
| During the next 10\,ns, CPUs~2 and~3 disagree on the value of the variable, |
| but thereafter agree that the value is~``2'', which is in fact |
| the final agreed-upon value. |
| However, CPU~1 believes that the value is~``1'' for almost 300\,ns, and |
| CPU~4 believes that the value is~``4'' for almost 500\,ns. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How could CPUs possibly have different views of the |
| value of a single variable \emph{at the same time?} |
| }\QuickQuizAnswerB{ |
| As discussed in |
| \cref{sec:memorder:Why Hardware Misordering?}, |
| many CPUs have store buffers that record the values of |
| recent stores, which do not become globally visible until |
| the corresponding cache line makes its way to the CPU\@. |
| Therefore, it is quite possible for each CPU to see its own value |
| for a given variable (in its own store buffer) at a single point |
| in time---and for main memory to hold yet another value. |
| One of the reasons that memory barriers were invented was |
| to allow software to deal gracefully with situations like |
| this one. |
| |
| Fortunately, software rarely cares about the fact that multiple |
| CPUs might see multiple values for the same variable. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Why do CPUs~2 and~3 come to agreement so quickly, when it |
| takes so long for CPUs~1 and~4 to come to the party? |
| }\QuickQuizAnswerE{ |
| CPUs~2 and~3 are a pair of hardware threads on the same |
| core, sharing the same cache hierarchy, and therefore have |
| very low communications latencies. |
| This is a \IXacr{numa}, or, more accurately, a \IXacr{nuca} effect. |
| |
| This leads to the question of why CPUs~2 and~3 ever disagree |
| at all. |
| One possible reason is that they each might have a small amount |
| of private cache in addition to a larger shared cache. |
| Another possible reason is instruction reordering, given the |
| short 10-nanosecond duration of the disagreement and the |
| total lack of memory-ordering operations in the code fragment. |
| }\QuickQuizEndE |
| } |
| |
| And if you think that the situation with four CPUs was intriguing, consider |
| \cref{fig:memorder:A Variable With More Simultaneous Values}, |
| which shows the same situation, but with 15~CPUs each assigning their |
| number to a single shared variable at time~$t=0$. Both diagrams in the |
| figure are drawn in the same way as |
| \cref{fig:memorder:A Variable With Multiple Simultaneous Values}. |
| The only difference is that the unit of the horizontal axis is timebase ticks,
| with each tick lasting about 5.3~nanoseconds. |
| The entire sequence therefore lasts a bit longer than the events recorded in |
| \cref{fig:memorder:A Variable With Multiple Simultaneous Values}, |
| consistent with the increase in number of CPUs. |
| The upper diagram shows the overall picture, while the lower one zooms |
| in on the first 50~timebase ticks. |
| Again, CPU~0 coordinates the test, so does not record any values. |
| |
| \begin{figure*} |
| \centering |
| \IfEbookSize{ |
| \resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}} |
| }{ |
| \resizebox{5in}{!}{\includegraphics{memorder/MoreThanOneValue-15CPU}} |
| } |
| \caption{A Variable With More Simultaneous Values} |
| \ContributedBy{fig:memorder:A Variable With More Simultaneous Values}{Akira Yokosawa} |
| \end{figure*} |
| |
| All CPUs eventually agree on the final value of~9, but not before |
| the values~15 and~12 take early leads. |
| Note that there are fourteen different opinions on the variable's value
| at time~21, as indicated by the vertical line in the lower diagram.
| Note also that all CPUs see sequences whose orderings are consistent with |
| the directed graph shown in |
| \cref{fig:memorder:Possible Global Orders With More Simultaneous Values}. |
| Nevertheless, these figures underscore the importance of |
| proper use of memory-ordering operations. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.0in}{!}{\includegraphics{memorder/store15tred}} |
| \caption{Possible Global Orders With More Simultaneous Values} |
| \label{fig:memorder:Possible Global Orders With More Simultaneous Values} |
| \end{figure} |
| |
| How many values can a single variable take on at a single point in |
| time? |
| As many as one per store buffer in the system! |
| We have therefore entered a regime where we must bid a fond farewell to |
| comfortable intuitions about values of variables and the passage of time. |
| This is the regime where memory-ordering operations are needed. |
| |
| But remember well the lessons from |
| \cref{chp:Hardware and its Habits,% |
| chp:Partitioning and Synchronization Design}. |
| Having all CPUs store concurrently to the same variable |
| is no way to design a parallel program, at least |
| not if performance and scalability are at all important to you. |
| |
| Unfortunately, memory ordering has many other ways of insulting your |
| intuition, and not all of these ways conflict with performance and |
| scalability. |
| The next section provides an overview of the reordering of unrelated
| memory references.
| |
| \subsection{Memory-Reference Reordering} |
| \label{sec:memorder:Memory-Reference Reordering} |
| |
| \Cref{sec:memorder:Why Hardware Misordering?} |
| showed that even relatively strongly ordered systems like x86 |
| can reorder prior stores with later loads, at least when the |
| store and load are to different variables. |
| This section builds on that result, looking at the other combinations of |
| loads and stores. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-o@whole.fcv} |
| \caption{Message-Passing Litmus Test (No Ordering)} |
| \label{lst:memorder:Message-Passing Litmus Test (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Load Followed By Load} |
| \label{sec:memorder:Load Followed By Load} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-o:whole] |
| \Cref{lst:memorder:Message-Passing Litmus Test (No Ordering)} |
| (\path{C-MP+o-wmb-o+o-o.litmus}) |
| shows the classic \emph{message-passing} litmus test, where \co{x0} is |
| the message and \co{x1} is a flag indicating whether or not a message |
| is available. |
| In this test, the \co{smp_wmb()} forces \co{P0()}'s stores to be ordered,
| but no ordering is specified for the loads. |
| Relatively strongly ordered architectures, such as x86, do enforce ordering. |
| However, weakly ordered architectures often do |
| not~\cite{JadeAlglave2011ppcmem}. |
| Therefore, the \co{exists} clause on \clnref{exists} of the listing \emph{can} |
| trigger. |
| \end{fcvref} |
| |
| One rationale for reordering loads from different locations is that doing |
| so allows execution to proceed when an earlier load misses the cache, |
| but the values for later loads are already present. |
| |
| \QuickQuiz{ |
| But why make load-load reordering visible to the user? |
| Why not just use speculative execution to allow execution to |
| proceed in the common case where there are no intervening |
| stores, in which case the reordering cannot be visible anyway? |
| }\QuickQuizAnswer{ |
| They can and many do, otherwise systems containing |
| strongly ordered CPUs would be slow indeed. |
| However, speculative execution does have its downsides, especially |
| if speculation must be rolled back frequently, particularly |
| on battery-powered systems. |
| Speculative execution can also introduce side channels, which |
| might in turn be exploited to exfiltrate information. |
| But perhaps future systems will be able to overcome these |
| disadvantages. |
| Until then, we can expect vendors to continue producing |
| weakly ordered CPUs. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-rmb-o@whole.fcv} |
| \caption{Enforcing Order of Message-Passing Litmus Test} |
| \label{lst:memorder:Enforcing Order of Message-Passing Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-rmb-o:whole] |
| Thus, portable code relying on ordered loads must |
| add explicit ordering, for example, the \co{smp_rmb()} shown on |
| \clnref{rmb} of |
| \cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test} |
| (\path{C-MP+o-wmb-o+o-rmb-o.litmus}), which prevents |
| the \co{exists} clause from triggering. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-o+o-o@whole.fcv} |
| \caption{Load-Buffering Litmus Test (No Ordering)} |
| \label{lst:memorder:Load-Buffering Litmus Test (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Load Followed By Store} |
| \label{sec:memorder:Load Followed By Store} |
| |
| \begin{fcvref}[ln:formal:C-LB+o-o+o-o:whole] |
| \Cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)} |
| (\path{C-LB+o-o+o-o.litmus}) |
| shows the classic \emph{load-buffering} litmus test. |
| Although relatively strongly ordered systems such as x86 |
| or the IBM Mainframe do not reorder prior loads with subsequent stores, |
| many weakly ordered architectures really do allow such |
| reordering~\cite{JadeAlglave2011ppcmem}. |
| Therefore, the \co{exists} clause on \clnref{exists} really can trigger. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-r+a-o@whole.fcv} |
| \caption{Enforcing Ordering of Load-Buffering Litmus Test} |
| \label{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-LB+o-r+a-o:whole] |
| Although it is rare for actual hardware to |
| exhibit this reordering~\cite{LucMaranget2017aarch64}, |
| one situation where it might be desirable to do so is when a load |
| misses the cache, the store buffer is nearly full, and the cacheline for |
| a subsequent store is ready at hand. |
| Therefore, portable code must enforce any required ordering, for example, |
| as shown in |
| \cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test} |
| (\path{C-LB+o-r+a-o.litmus}). |
| The \co{smp_store_release()} and \co{smp_load_acquire()} guarantee that |
| the \co{exists} clause on \clnref{exists} never triggers. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| Why should it be rare for systems to reorder prior loads with |
| later stores? |
| }\QuickQuizAnswer{ |
| Keep in mind that conventional computer systems are not permitted |
| to make a given CPU's speculative stores visible to other CPUs. |
| |
| This constraint has consequences. |
| |
| For example, suppose that address translation has not yet completed |
| for that prior load. |
| In that case, it is quite possible that this address translation |
| will result in a segmentation violation, which might in turn mean |
| that the later store would never execute. |
| Therefore, a given store cannot make its value visible to other |
| CPUs until address translation has completed for each and every |
| prior load. |
| |
| Of course, memory latencies can be quite long compared to |
| address-translation delays, especially if all the information |
| required for a given address translation is already present |
| in the translation lookaside buffer (TLB)\@. |
| Although these long memory latencies should provide ample
| opportunity for load-to-store reordering, the fact is that such
| reordering is demonstrably rare, especially on the large systems
| that would be expected to have the longest memory latencies.
| |
| And the reason for that is that large systems are most likely |
| to have parity or error-correcting code (ECC) checking on memory. |
| |
| An uncorrectable parity or ECC error would cause a load to take a |
| fault, which would prevent subsequent stores from ever executing |
| just as surely as would a segmentation violation. |
| Therefore, on systems featuring parity or ECC, a subsequent store |
| cannot expose its value to other CPUs until after parity or |
| ECC checking has completed for all prior loads. |
| A given load's checking clearly cannot be carried out until that |
| load's value has been returned (check bits and all), which has |
| the side effect of prohibiting reordering of prior loads with |
| later stores. |
| |
| So the only systems that can exhibit load-to-store reordering are |
| those lacking both parity and ECC, for example, small embedded |
| systems and GPGPUs. |
| |
| Given that load-to-store reordering is not possible on common-case |
| computer systems, why do hardware memory models continue to |
| permit it? |
| Only CPU architects know for sure, but here are some possible |
| reasons: |
| |
| \begin{enumerate} |
| \item There might be increased demand for small systems that |
| do not require parity or ECC. |
| \item Memory might become more reliable, so that larger systems |
| no longer require parity or ECC. |
| \item Systems having multiple hardware threads per core might |
| permit cross-hardware-thread speculation, which would |
| permit speculative stores to make their values visible |
| to other threads in that core. |
| \item Operating systems and hardware might work together to |
| terminate any processes affected by an uncorrectable |
| parity or ECC error. |
| This might allow speculative stores to make their |
| values visible to other CPUs, give or take things |
| like debuggers and assertions. |
| \end{enumerate} |
| |
| In all of these cases, there might well be compelling performance |
| advantages to once again permitting load-to-store reordering. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-o+o-rmb-o@whole.fcv} |
| \caption{Message-Passing Litmus Test, No Writer Ordering (No Ordering)} |
| \label{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)} |
| \end{listing} |
| |
| \subsubsection{Store Followed By Store} |
| \label{sec:memorder:Store Followed By Store} |
| |
| \Cref{lst:memorder:Message-Passing Litmus Test; No Writer Ordering (No Ordering)} |
| (\path{C-MP+o-o+o-rmb-o.litmus}) |
| once again shows the classic message-passing litmus test, with the |
| \co{smp_rmb()} providing ordering for \co{P1()}'s loads, but without |
| any ordering for \co{P0()}'s stores. |
| Again, the relatively strongly ordered architectures do enforce ordering, |
| but weakly ordered architectures do not necessarily do |
| so~\cite{JadeAlglave2011ppcmem}, which means that the |
| \co{exists} clause can trigger. |
| One situation in which such reordering could be beneficial is when |
| the store buffer is full, another store is ready to execute, but the |
| cacheline needed by the oldest store is not yet available. |
| In this situation, allowing stores to complete out of order would |
| allow execution to proceed. |
| Therefore, portable code must explicitly order the stores, for |
| example, as shown in |
| \cref{lst:memorder:Enforcing Order of Message-Passing Litmus Test}, |
| thus preventing the \co{exists} clause from triggering. |
| |
| \QuickQuiz{ |
| Why should strongly ordered systems pay the performance price |
| of unnecessary \co{smp_rmb()} and \co{smp_wmb()} invocations? |
| Shouldn't weakly ordered systems shoulder the full cost of |
| their misordering choices??? |
| }\QuickQuizAnswer{ |
| That is in fact exactly what happens. |
| On strongly ordered systems, \co{smp_rmb()} and \co{smp_wmb()} |
| emit no instructions, but instead just constrain the compiler. |
| Thus, in this case, weakly ordered systems do in fact shoulder |
| the full cost of their memory-ordering choices. |
| }\QuickQuizEnd |
| |
| \subsection{Address Dependencies} |
| \label{sec:memorder:Address Dependencies} |
| |
| An \emph{\IXBh{address}{dependency}} occurs when the value returned by a load |
| instruction is used to compute the address used by a later memory-reference |
| instruction. |
| This means that the exact same sequence of instructions used to traverse |
| a linked data structure in single-threaded code provides weak but extremely |
| useful ordering in concurrent code. |
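| 
| For example, in the following sketch (with \co{gp} being an
| illustrative global pointer), the address used by the second load is
| the value returned by the first:
| 
| \begin{VerbatimL}
| struct foo *p;
| 
| p = rcu_dereference(gp);  /* Load the pointer... */
| r1 = READ_ONCE(p->a);     /* ...then dereference it. */
| \end{VerbatimL}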
| |
| \QuickQuiz{ |
| Must an address dependency begin with a load instruction? |
| Why not something like \co{xchg_relaxed()}, which also loads |
| a value from memory? |
| }\QuickQuizAnswer{ |
| Yes, pretty much any instruction that loads a value from memory |
| can start an address dependency, including certain atomic operations |
| such as \co{xchg_relaxed()}. |
| The same is true of the control and data dependencies discussed in |
| \cref{sec:memorder:Control Dependencies,sec:memorder:Data Dependencies}. |
| |
| However, in all three cases, it is often the case that if you |
| are going to start your dependency with an atomic operation, |
| it will be more convenient and maintainable to instead use |
| an atomic operation with acquire semantics, in this example, |
| \co{xchg_acquire()} or maybe even just the fully ordered |
| \co{xchg()}. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-wmb-o+o-addr-o@whole.fcv} |
| \caption{Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)} |
| \label{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole] |
| \Cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)} |
| (\path{C-MP+o-wmb-o+o-addr-o.litmus}) |
| shows a linked variant of the message-passing pattern. |
| The head pointer is \co{x1}, which initially |
| references the \co{int} variable \co{y} (\clnref{init:x1}), which is in turn |
| initialized to the value $1$ (\clnref{init:y}). |
| \co{P0()} updates head pointer \co{x1} to reference \co{x0} (\clnref{P0:x1}),
| but only after initializing \co{x0} to $2$ (\clnref{P0:x0}) and forcing
| ordering (\clnref{P0:wmb}).
| \co{P1()} picks up the head pointer \co{x1} (\clnref{P1:x1}), and then loads |
| the referenced value (\clnref{P1:ref}). |
| There is thus an address dependency from the load on \clnref{P1:x1} to the |
| load on \clnref{P1:ref}. |
| In this case, the value returned by \clnref{P1:x1} is exactly the address |
| used by \clnref{P1:ref}, but many variations are possible, |
| \end{fcvref} |
| including field access using the C-language \co{->} operator, |
| addition, subtraction, and array indexing.\footnote{ |
| But note that in the Linux kernel, the address dependency must |
| be carried through the pointer to the array, not through the |
| array index.} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole] |
| One might hope that \clnref{P1:x1}'s load from the head pointer would be ordered |
| before \clnref{P1:ref}'s dereference, which is in fact the case on Linux v4.15 |
| and later. |
| However, prior to v4.15, this was not the case on DEC Alpha, which could |
| in effect use a speculated value for the dependent load, as described |
| in more detail in \cref{sec:memorder:Alpha}. |
| Therefore, on older versions of Linux, |
| \cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)}'s |
| \co{exists} clause \emph{can} trigger. |
| \end{fcvref} |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)] |
| \begin{VerbatimL}[commandchars=\@\[\]] |
| C C-MP+o-wmb-o+ld-addr-o |
| |
| { |
| y=1; |
| x1=y; |
| } |
| |
| P0(int* x0, int** x1) { |
| WRITE_ONCE(*x0, 2); |
| smp_wmb(); |
| WRITE_ONCE(*x1, x0); |
| } |
| |
| P1(int** x1) { |
| int *r2; |
| int r3; |
| |
| r2 = lockless_dereference(*x1); // Obsolete @lnlbl[deref] |
| r3 = READ_ONCE(*r2); @lnlbl[read] |
| } |
| |
| exists (1:r2=x0 /\ 1:r3=1) |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)} |
| \label{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-wmb-o+o-addr-o:whole] |
| \Cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)} |
| % \path{C-MP+o-wmb-o+ld-addr-o.litmus} available at commit bc4b1c3f3b35 |
| % ("styleguide: Loosen restriction on comment in litmus test") |
| shows how to make this work reliably on pre-v4.15 Linux kernels running on |
| DEC Alpha, by replacing \co{READ_ONCE()} on \clnref{P1:x1} of |
| \cref{lst:memorder:Message-Passing Address-Dependency Litmus Test (No Ordering Before v4.15)} |
| with \apikh{lockless_dereference()},\footnote{ |
| Note that \co{lockless_dereference()} is not needed on v4.15 and |
| later, and therefore is not available in these later Linux kernels. |
| Nor is it needed in versions of this book containing this footnote.} |
| which acts like \co{READ_ONCE()} on all platforms other than DEC Alpha, |
| where it acts like a \co{READ_ONCE()} followed by an \co{smp_mb()}, |
| thereby forcing the required ordering on all platforms, in turn |
| preventing the \co{exists} clause from triggering. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-S+o-wmb-o+o-addr-o@whole.fcv} |
| \caption{S Address-Dependency Litmus Test} |
| \label{lst:memorder:S Address-Dependency Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole] |
| But what happens if the dependent operation is a store rather than |
| a load, for example, in the \emph{S} |
| litmus test~\cite{JadeAlglave2011ppcmem} shown in |
| \cref{lst:memorder:S Address-Dependency Litmus Test} |
| (\path{C-S+o-wmb-o+o-addr-o.litmus})? |
| Because no production-quality platform speculates stores, |
| it is not possible for the \co{WRITE_ONCE()} on \clnref{P0:x0} to overwrite |
| the \co{WRITE_ONCE()} on \clnref{P1:r2}, meaning that the \co{exists} |
| clause on \clnref{exists} cannot trigger, even on DEC Alpha, even |
| in pre-v4.15 Linux kernels. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| But how do we know that \emph{all} platforms really avoid |
| triggering the \co{exists} clauses in |
| \cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),% |
| lst:memorder:S Address-Dependency Litmus Test}? |
| }\QuickQuizAnswerB{ |
| Answering this requires identifying three major groups of platforms: |
| \begin{enumerate*}[(1)] |
| \item Total-store-order (TSO) platforms, |
| \item Weakly ordered platforms, and |
| \item DEC Alpha. |
| \end{enumerate*} |
| |
| \begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)] |
| The TSO platforms order all pairs of memory references except for |
| prior stores against later loads. |
| Because the address dependency on \clnref{deref,read} of |
| \cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)} |
| is instead a load followed by another load, TSO platforms preserve |
| this address dependency. |
| \end{fcvref} |
| \begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole] |
| They also preserve the address dependency on \clnref{P1:x1,P1:r2} of |
| \cref{lst:memorder:S Address-Dependency Litmus Test} |
| because this is a load followed by a store. |
| Because address dependencies must start with a load, TSO platforms |
| implicitly but completely respect them, give or take compiler |
| optimizations, hence the need for \co{READ_ONCE()}. |
| \end{fcvref} |
| |
| Weakly ordered platforms don't necessarily maintain ordering of |
| unrelated accesses. |
| However, the address dependencies in |
| \cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),% |
| lst:memorder:S Address-Dependency Litmus Test} are not unrelated: |
| There is an address dependency. |
| The hardware tracks dependencies and maintains the needed |
| ordering. |
| |
| \begin{fcvref}[ln:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)] |
| There is one (famous) exception to this rule for weakly ordered |
| platforms, and that exception is DEC Alpha for load-to-load |
| address dependencies. |
| And this is why, in Linux kernels predating v4.15, DEC Alpha |
| requires the explicit memory barrier supplied for it by the |
| now-obsolete \apikh{lockless_dereference()} on \clnref{deref} of |
| \cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15)}. |
| \end{fcvref} |
| \begin{fcvref}[ln:formal:C-S+o-wmb-o+o-addr-o:whole] |
| However, DEC Alpha does track load-to-store address dependencies, |
| which is why \clnref{P1:x1} of |
| \cref{lst:memorder:S Address-Dependency Litmus Test} |
| does not need a \co{lockless_dereference()}, even in Linux |
| kernels predating v4.15. |
| \end{fcvref} |
| |
| To sum up, current platforms either respect address dependencies |
| implicitly, as is the case for TSO platforms (x86, mainframe, |
| SPARC,~\dots), have hardware tracking for address dependencies |
| (\ARM, PowerPC, MIPS,~\dots), have the required memory barriers |
| supplied by \co{READ_ONCE()} (DEC Alpha in Linux kernel v4.15 and |
| later), or supplied by |
| \co{rcu_dereference()} (DEC Alpha in Linux kernel v4.14 and earlier). |
| }\QuickQuizEndB |
| % |
| \QuickQuizM{ |
| Why the use of \co{smp_wmb()} in |
| \cref{lst:memorder:Enforced Ordering of Message-Passing Address-Dependency Litmus Test (Before v4.15),% |
| lst:memorder:S Address-Dependency Litmus Test}? |
| Wouldn't \co{smp_store_release()} be a better choice? |
| }\QuickQuizAnswerM{ |
| In most cases, \co{smp_store_release()} is indeed a better choice. |
| However, \co{smp_wmb()} was there first in the Linux kernel, |
| so it is still good to understand how to use it. |
| }\QuickQuizEndM |
| % |
| \QuickQuizE{ |
| SB, MP, LB, and now~S\@.
| Where do all these litmus-test abbreviations come from and |
| how can anyone keep track of them? |
| }\QuickQuizAnswerE{ |
| The best scorecard is the infamous |
| \co{test6.pdf}~\cite{test6-pdf}. |
| Unfortunately, not all of the abbreviations have catchy |
| expansions like SB (store buffering), MP (message passing), |
| and LB (load buffering), but at least the list of abbreviations |
| is readily available. |
| }\QuickQuizEndE |
| } |
| |
| However, it is important to note that address dependencies can |
| be fragile and easily broken by compiler optimizations, as discussed in |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties}. |
| |
| \subsection{Data Dependencies} |
| \label{sec:memorder:Data Dependencies} |
| |
| A \emph{\IXBh{data}{dependency}} occurs when the value returned by a load |
| instruction is used to compute the data stored by a later store |
| instruction. |
| Note well the ``data'' above: |
| If the value returned by a load was instead used to compute the address |
| used by a later store instruction, that would instead be an address |
| dependency, which was covered in |
| \cref{sec:memorder:Address Dependencies}. |
| However, the existence of data dependencies means that the exact same |
| sequence of instructions used to update a linked data structure in |
| single-threaded code provides weak but extremely useful ordering in |
| concurrent code. |
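| 
| In sketch form, the value returned by the load supplies the data
| stored by the later store:
| 
| \begin{VerbatimL}
| r1 = READ_ONCE(*x);  /* Load a value... */
| WRITE_ONCE(*y, r1);  /* ...and store that same value. */
| \end{VerbatimL}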
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-r+o-data-o@whole.fcv} |
| \caption{Load-Buffering Data-Dependency Litmus Test} |
| \label{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole] |
| \Cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| (\path{C-LB+o-r+o-data-o.litmus}) |
| is similar to |
| \cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}, |
| except that \co{P1()}'s ordering between \clnref{ld,st} is |
| enforced not by an \IX{acquire load}, but instead by a data dependency: |
| The value loaded by \clnref{ld} is what \clnref{st} stores. |
| The ordering provided by this data dependency is sufficient to prevent |
| the \co{exists} clause from triggering. |
| \end{fcvref} |
| |
| Just as with address dependencies, data dependencies are |
| fragile and can be easily broken by compiler optimizations, as discussed in |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties}. |
| In fact, data dependencies can be even more fragile than are address |
| dependencies. |
| The reason for this is that address dependencies normally involve |
| pointer values. |
| In contrast, as shown in |
| \cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test}, |
| it is tempting to carry data dependencies through integral values, |
| which the compiler has much more freedom to optimize into nonexistence. |
| For but one example, if the integer loaded was multiplied by the constant |
| zero, the compiler would know that the result was zero, and could therefore |
| substitute the constant zero for the value loaded, thus breaking |
| the dependency. |
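| 
| The following hypothetical sketch shows just how little it takes:
| 
| \begin{VerbatimL}
| r1 = READ_ONCE(*x);      /* This load must be emitted, but... */
| WRITE_ONCE(*y, r1 * 0);  /* ...may become WRITE_ONCE(*y, 0). */
| \end{VerbatimL}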
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:formal:C-LB+o-r+o-data-o:whole] |
| But wait!!! |
| \Clnref{ld} of |
| \cref{lst:memorder:Load-Buffering Data-Dependency Litmus Test} |
| uses \co{READ_ONCE()}, which marks the load as volatile, |
| which means that the compiler absolutely must emit the load |
| instruction even if the value is later multiplied by zero. |
| So how can the compiler possibly break this data dependency? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| Yes, the compiler absolutely must emit a load instruction for |
| a volatile load. |
| But if you multiply the value loaded by zero, the compiler is |
| well within its rights to substitute a constant zero for the |
| result of that multiplication, which will break the data |
| dependency on many platforms. |
| |
| Worse yet, if the dependent store does not use \co{WRITE_ONCE()}, |
| the compiler could hoist it above the load, which would cause |
| even TSO platforms to fail to provide ordering. |
| }\QuickQuizEnd |
| |
| In short, you can rely on data dependencies only if you prevent the |
| compiler from breaking them. |
| |
| \subsection{Control Dependencies} |
| \label{sec:memorder:Control Dependencies} |
| |
| A \emph{\IXBh{control}{dependency}} occurs when the value returned by a load |
| instruction is tested to determine whether or not a later store instruction |
| is executed. |
| In other words, a simple conditional branch or conditional-move |
| instruction can act as a weak but low-overhead memory-barrier instruction. |
| However, note well the ``later store instruction'': |
| Although all platforms respect load-to-store dependencies, many platforms |
| do \emph{not} respect load-to-load control dependencies. |
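| 
| In sketch form, the value returned by the load determines whether the
| dependent store executes at all:
| 
| \begin{VerbatimL}
| if (READ_ONCE(*x))          /* The branch tests the loaded value, */
| 	WRITE_ONCE(*y, 1);  /* ordering this store after that load. */
| \end{VerbatimL}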
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-r+o-ctrl-o@whole.fcv} |
| \caption{Load-Buffering Control-Dependency Litmus Test} |
| \label{lst:memorder:Load-Buffering Control-Dependency Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-LB+o-r+o-ctrl-o:whole] |
| \Cref{lst:memorder:Load-Buffering Control-Dependency Litmus Test} |
| (\path{C-LB+o-r+o-ctrl-o.litmus}) |
| shows another load-buffering example, this time using a control |
| dependency (\clnref{if}) to order the load on \clnref{ld} and the store on |
| \clnref{st}. |
| The ordering is sufficient to prevent the \co{exists} clause from triggering.
| \end{fcvref} |
| |
| However, control dependencies are even more susceptible to being optimized |
| out of existence than are data dependencies, and |
| \cref{sec:memorder:Control-Dependency Calamities} |
| describes some of the rules that must be followed in order to prevent |
| your compiler from breaking your control dependencies. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-r+o-ctrl-o@whole.fcv} |
| \caption{Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| \label{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-MP+o-r+o-ctrl-o:whole] |
| It is worth reiterating that control dependencies provide ordering only |
| from loads to stores. |
| Therefore, the load-to-load control dependency shown on \clnrefrange{ld1}{ld2} of |
| \cref{lst:memorder:Message-Passing Control-Dependency Litmus Test (No Ordering)} |
| (\path{C-MP+o-r+o-ctrl-o.litmus}) |
| does \emph{not} provide ordering, and therefore does \emph{not} |
| prevent the \co{exists} clause from triggering. |
| \end{fcvref} |
| |
| In summary, control dependencies can be useful, but they are |
| high-maintenance items. |
| You should therefore use them only when performance considerations |
| permit no other solution. |
| |
| \QuickQuiz{ |
| Wouldn't control dependencies be more robust if they were |
| mandated by language standards??? |
| }\QuickQuizAnswer{ |
| But of course! |
| And perhaps in the fullness of time they will be so mandated. |
| }\QuickQuizEnd |
| |
| \subsection{Cache Coherence} |
| \label{sec:memorder:Cache Coherence} |
| |
| On cache-coherent platforms, all CPUs agree on the order of loads and |
| stores to a given variable. |
| Fortunately, when \co{READ_ONCE()} and \co{WRITE_ONCE()} are used, |
| almost all platforms are cache-coherent, as indicated by the ``SV'' |
| column of the cheat sheet shown in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| Unfortunately, this property is so popular that it has been named |
| multiple times, with ``single-variable SC'',\footnote{ |
| Recall that SC stands for sequentially consistent.} |
| ``single-copy atomic''~\cite{Stone:1995:SP:623262.623912}, |
| and just plain ``coherence''~\cite{JadeAlglave2011ppcmem} |
| having seen use. |
| Rather than further compound the confusion by inventing yet another term |
| for this concept, this book uses ``\IX{cache coherence}'' and ``coherence'' |
| interchangeably. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-CCIRIW+o+o+o-o+o-o@whole.fcv} |
| \caption{Cache-Coherent IRIW Litmus Test} |
| \label{lst:memorder:Cache-Coherent IRIW Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-CCIRIW+o+o+o-o+o-o:whole] |
| \Cref{lst:memorder:Cache-Coherent IRIW Litmus Test} |
| (\path{C-CCIRIW+o+o+o-o+o-o.litmus}) |
| shows a litmus test that tests for cache coherence, |
| where ``IRIW'' stands |
| for ``independent reads of independent writes''. |
| Because this litmus test uses only one variable, |
| \co{P2()} and \co{P3()} must agree |
| on the order of \co{P0()}'s and \co{P1()}'s stores. |
| In other words, if \co{P2()} believes that \co{P0()}'s store |
| came first, then \co{P3()} had better not believe that |
| \co{P1()}'s store came first. |
| And in fact the \co{exists} clause on \clnref{exists} will trigger if this |
| situation arises. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| But in |
| \cref{lst:memorder:Cache-Coherent IRIW Litmus Test}, |
| wouldn't it be just as bad if \co{P2()}'s \co{r1} and \co{r2}
| obtained the values~2 and~1, respectively, while \co{P3()}'s |
| \co{r3} and \co{r4} obtained the values~1 and~2, respectively? |
| }\QuickQuizAnswer{ |
| Yes, it would. |
| Feel free to modify the \co{exists} clause to |
| check for that outcome and see what happens. |
| }\QuickQuizEnd |
| |
| It is tempting to speculate that different-sized overlapping loads |
| and stores to a single region of memory (as might be set up using |
| the C-language \co{union} keyword) would provide similar ordering |
| guarantees. |
| However, Flur et al.~\cite{Flur:2017:MCA:3093333.3009839} discovered some |
| surprisingly simple litmus tests that demonstrate that such guarantees |
| can be violated on real hardware. |
| It is therefore necessary to restrict code to non-overlapping |
| same-sized aligned accesses to a given variable, at least if portability |
| is a consideration.\footnote{ |
| There is reason to believe that using atomic RMW operations |
| (for example, \co{xchg()}) for all the stores will |
| provide sequentially consistent ordering, but this has not |
| yet been proven either way.} |
| |
| Adding more variables and threads increases the scope for reordering |
| and other counter-intuitive behavior, as discussed in the next section. |
| |
| \subsection{Multicopy Atomicity} |
| \label{sec:memorder:Multicopy Atomicity} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{memorder/SystemArchBus}} |
| \caption{Global System Bus And Multi-Copy Atomicity} |
| \label{fig:memorder:Global System Bus And Multi-Copy Atomicity} |
| \end{figure} |
| |
| Threads running on a fully |
| \emph{multicopy atomic}~\cite{Stone:1995:SP:623262.623912} |
| platform are guaranteed |
| to agree on the order of stores, even to different variables. |
| A useful mental model of such a system is the single-bus architecture |
| shown in |
| \cref{fig:memorder:Global System Bus And Multi-Copy Atomicity}. |
| If each store resulted in a message on the bus, and if the bus could |
| accommodate only one store at a time, then any pair of CPUs would |
| agree on the order of all stores that they observed. |
| Unfortunately, building a computer system as shown in the figure, without |
| store buffers or even caches, would result in glacially slow computation. |
| Most CPU vendors interested in providing |
| \IXBalt{multicopy atomicity}{multi-copy atomic} therefore |
| instead provide the slightly weaker |
| \emph{other-multicopy atomicity}~\cite[Section B2.3]{ARMv8A:2017}, |
| which excludes the CPU doing a given store from the requirement that all |
| CPUs agree on the order of all stores.\footnote{ |
| As of early 2021, \ARMv8 and x86 provide other-multicopy atomicity, |
| IBM mainframe provides full multicopy atomicity, and PPC |
| provides no multicopy atomicity at all. |
| More detail is shown in |
| \cref{tab:memorder:Summary of Memory Ordering} |
| on |
| \cpageref{tab:memorder:Summary of Memory Ordering}.} |
| This means that if only a subset of CPUs are doing stores, the |
| other CPUs will agree on the order of stores, hence the ``other'' |
| in ``\IXBalthy{other-multicopy atomicity}{other-}{multi-copy atomic}''. |
Unlike multicopy-atomic platforms, other-multicopy-atomic platforms
permit the CPU doing a given store to observe that store early,
which allows its later loads to obtain the newly stored value
directly from the store buffer, thus improving performance.
| |
| \QuickQuiz{ |
| Can you give a specific example showing different behavior for |
| multicopy atomic on the one hand and other-multicopy atomic |
| on the other? |
| }\QuickQuizAnswer{ |
| \Cref{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| (\path{C-MP-OMCA+o-o-o+o-rmb-o.litmus}) |
| shows such a test. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP-OMCA+o-o-o+o-rmb-o@whole.fcv} |
| \caption{Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| \label{lst:memorder:Litmus Test Distinguishing Multicopy Atomic From Other Multicopy Atomic} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-MP-OMCA+o-o-o+o-rmb-o:whole] |
| On a multicopy-atomic platform, \co{P0()}'s store to \co{x} on |
| \clnref{P0:st} must become visible to both \co{P0()} and \co{P1()} |
| simultaneously. |
| Because this store becomes visible to \co{P0()} on \clnref{P0:ld}, before |
| \co{P0()}'s store to \co{y} on \clnref{P0:y}, \co{P0()}'s store to |
| \co{x} must become visible before its store to \co{y} everywhere, |
| including \co{P1()}. |
| Therefore, if \co{P1()}'s load from \co{y} on \clnref{P1:y} returns the |
| value 1, so must its load from \co{x} on \clnref{P1:x}, given that |
| the \co{smp_rmb()} on \clnref{P1:rmb} forces these two loads to execute |
| in order. |
| Therefore, the \co{exists} clause on \clnref{exists} cannot trigger on a |
| multicopy-atomic platform. |
| \end{fcvref} |
| |
| In contrast, on an other-multicopy-atomic platform, \co{P0()} |
| could see its own store early, so that there would be no constraint |
| on the order of visibility of the two stores from \co{P1()}, |
| which in turn allows the \co{exists} clause to trigger. |
| }\QuickQuizEnd |
| |
| Perhaps there will come a day when all platforms provide some flavor |
| of multi-copy atomicity, but |
| in the meantime, \IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic} |
| platforms do exist, and so software must deal with them. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-WRC+o+o-data-o+o-rmb-o@whole.fcv} |
| \caption{WRC Litmus Test With Dependencies (No Ordering)} |
| \label{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole] |
| \Cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| (\path{C-WRC+o+o-data-o+o-rmb-o.litmus}) |
| demonstrates multicopy atomicity, that is, on a multicopy-atomic platform, |
| the \co{exists} clause on \clnref{exists} cannot trigger. |
| In contrast, on a non-multicopy-atomic |
| platform this \co{exists} clause can trigger, despite |
| \co{P1()}'s accesses being ordered by a data dependency and \co{P2()}'s |
| accesses being ordered by an \co{smp_rmb()}. |
| Recall that the definition of multicopy atomicity requires that all |
| threads agree on the order of stores, which can be thought of as |
| all stores reaching all threads at the same time. |
| Therefore, a non-multicopy-atomic platform can have a store reach |
| different threads at different times. |
| In particular, \co{P0()}'s store might reach \co{P1()} long before it |
| reaches \co{P2()}, which raises the possibility that \co{P1()}'s store |
| might reach \co{P2()} before \co{P0()}'s store does. |
| \end{fcvref} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{memorder/NonMCAplatform}} |
| \caption{Shared Store Buffers And Multi-Copy Atomicity} |
| \label{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity} |
| \end{figure} |
| |
| This leads to the question of why a real system constrained by the |
| usual laws of physics would ever trigger the \co{exists} clause of |
| \cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}. |
A cartoonish diagram of such a real system is shown in
| \cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}. |
| CPU~0 and CPU~1 share a store buffer, as do CPUs~2 and~3. |
| This means that CPU~1 can load a value out of the store buffer, thus |
| potentially immediately seeing a value stored by CPU~0. |
| In contrast, CPUs~2 and~3 will have to wait for the corresponding \IX{cache |
| line} to carry this new value to them. |
| |
| \QuickQuiz{ |
| Then who would even \emph{think} of designing a system with shared |
| store buffers??? |
| }\QuickQuizAnswer{ |
| This is in fact a very natural design for any system having |
| multiple hardware threads per core. |
| Natural from a hardware point of view, that is! |
| }\QuickQuizEnd |
| |
| \begin{table*} |
| \small |
| \centering\OneColumnHSpace{-0.8in} |
| \renewcommand*{\arraystretch}{1.1} |
| \rowcolors{10}{}{lightgray} |
| \ebresizewidth{ |
| \begin{tabular}{rlllllll}\toprule |
| & \multicolumn{1}{c}{\tco{P0()}} & \multicolumn{2}{c}{\tco{P0()} \& \tco{P1()}} & |
| \multicolumn{1}{c}{\tco{P1()}} & \multicolumn{3}{c}{\tco{P2()}} \\ |
| \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(lr){5-5} \cmidrule(l){6-8} |
| & Instruction & Store Buffer & Cache & Instruction & |
| Instruction & Store Buffer & Cache \\ |
| \cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} |
| \cmidrule(lr){5-5} \cmidrule(l){6-8} |
| 1 & (Initial state) & & \tco{y==0} & |
| (Initial state) & |
| (Initial state) & & \tco{x==0} \\ |
| 2 & \tco{x = 1;} & \tco{x==1} & \tco{y==0} & |
| & & & \tco{x==0} \\ |
| 3 & (Read-Invalidate \tco{x}) & \tco{x==1} & \tco{y==0} & \tco{r1 = x} (1) |
| & & & \tco{x==0} \\ |
| 4 & & \tco{x==1} \tco{y==1} & \tco{y==0} & \tco{y = r1} |
| & \tco{r2 = y} & & \tco{x==0} \\ |
| 5 & & \tco{x==1} & \tco{y==1} & (Finish store) |
| & (Read \tco{y}) & & \tco{x==0} \\ |
| 6 & (Respond \tco{y}) & \tco{x==1} & \tco{y==1} & |
| & (\tco{r2==1}) & & \tco{x==0} \tco{y==1} \\ |
| 7 & & \tco{x==1} & \tco{y==1} & |
| & \tco{smp_rmb()} & & \tco{x==0} \tco{y==1} \\ |
| 8 & & \tco{x==1} & \tco{y==1} & |
	& \tco{r3 = x} (0) & & \tco{x==0} \tco{y==1} \\
| 9 & & \tco{x==1} & \tco{x==0} \tco{y==1} & |
| & (Respond \tco{x}) & & \tco{y==1} \\ |
| 10 & (Finish store) & & \tco{x==1} \tco{y==1} & |
| & & & \tco{y==1} \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{Memory Ordering: |
| WRC Sequence of Events} |
| \label{tab:memorder:Memory Ordering: WRC Sequence of Events} |
| \end{table*} |
| |
| \Cref{tab:memorder:Memory Ordering: WRC Sequence of Events} |
| shows one sequence of events that can result in the \co{exists} clause in |
| \cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| triggering. |
| This sequence of events will depend critically on \co{P0()} and |
| \co{P1()} sharing both cache and a store buffer in the manner shown in |
| \cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}. |
| |
| \QuickQuiz{ |
| But just how is it fair that \co{P0()} and \co{P1()} must share a store |
| buffer and a cache, but \co{P2()} gets one each of its very own??? |
| }\QuickQuizAnswer{ |
| Presumably there is a \co{P3()}, as is in fact shown in |
| \cref{fig:memorder:Shared Store Buffers And Multi-Copy Atomicity}, |
| that shares \co{P2()}'s store buffer and cache. |
| But not necessarily. |
| Some platforms allow different cores to disable different numbers |
| of threads, allowing the hardware to adjust to the needs of the |
| workload at hand. |
| For example, a single-threaded critical-path portion of the workload |
| might be assigned to a core with only one thread enabled, thus |
| allowing the single thread running that portion of the workload |
| to use the entire capabilities of that core. |
| Other more highly parallel but cache-miss-prone portions of the |
| workload might be assigned to cores with all hardware threads |
| enabled to provide improved throughput. |
| This improved throughput could be due to the fact that while one |
| hardware thread is stalled on a cache miss, the other hardware |
| threads can make forward progress. |
| |
| In such cases, performance requirements override quaint human |
| notions of fairness. |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole] |
| Row~1 shows the initial state, with the initial value of \co{y} in |
| \co{P0()}'s and \co{P1()}'s shared cache, and the initial value of \co{x} in |
| \co{P2()}'s cache. |
| |
| Row~2 shows the immediate effect of \co{P0()} executing its store on \clnref{P0:x}. |
| Because the cacheline containing \co{x} is not in \co{P0()}'s and \co{P1()}'s |
| shared cache, the new value (\co{1}) is stored in the shared store buffer. |
| |
| Row~3 shows two transitions. |
| First, \co{P0()} issues a read-invalidate operation to fetch the cacheline |
| containing \co{x} so that it can flush the new value for \co{x} out of |
| the shared store buffer. |
| Second, \co{P1()} loads from \co{x} (\clnref{P1:x}), an operation that completes |
| immediately because the new value of \co{x} is immediately available |
| from the shared store buffer. |
| |
| Row~4 also shows two transitions. |
| First, it shows the immediate effect of \co{P1()} executing its store to |
| \co{y} (\clnref{P1:y}), placing the new value into the shared store buffer. |
| Second, it shows the start of \co{P2()}'s load from \co{y} (\clnref{P2:y}). |
| |
| Row~5 continues the tradition of showing two transitions. |
| First, it shows \co{P1()} complete its store to \co{y}, flushing |
| from the shared store buffer to the cache. |
| Second, it shows \co{P2()} request the cacheline containing \co{y}. |
| |
| Row~6 shows \co{P2()} receive the cacheline containing \co{y}, allowing |
| it to finish its load into \co{r2}, which takes on the value \co{1}. |
| |
| Row~7 shows \co{P2()} execute its \co{smp_rmb()} (\clnref{P2:rmb}), thus keeping |
| its two loads ordered. |
| |
| Row~8 shows \co{P2()} execute its load from \co{x}, which immediately |
| returns with the value zero from \co{P2()}'s cache. |
| |
| Row~9 shows \co{P2()} \emph{finally} responding to \co{P0()}'s request for |
| the cacheline containing \co{x}, which was made way back up on row~3. |
| |
| Finally, row~10 shows \co{P0()} finish its store, flushing its value of |
| \co{x} from the shared store buffer to the shared cache. |
| |
| Note well that the \co{exists} clause on \clnref{exists} has triggered. |
| The values of \co{r1} and \co{r2} are both the value one, and |
the final value of \co{r3} is the value zero.
| This strange result occurred because \co{P0()}'s new value of \co{x} was |
| communicated to \co{P1()} long before it was communicated to \co{P2()}. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:formal:C-WRC+o+o-data-o+o-rmb-o:whole] |
| Referring to |
| \cref{tab:memorder:Memory Ordering: WRC Sequence of Events}, |
| why on earth would \co{P0()}'s store take so long to complete when |
	\co{P1()}'s store completed so quickly?
| In other words, does the \co{exists} clause on \clnref{exists} of |
| \cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| really trigger on real systems? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| You need to face the fact that it really can trigger. |
| Akira Yokosawa used the \co{litmus7} tool to run this litmus test |
| on a \Power{8} system. |
| Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause. |
	Thus, triggering the \co{exists} clause is not merely a one-in-a-million
	occurrence, but rather a one-in-two-hundred-fifty-million occurrence.
| But it nevertheless really does trigger on real systems. |
| }\QuickQuizEnd |
| |
| This counter-intuitive result happens because although dependencies |
| do provide ordering, they provide it only within the confines of their |
| own thread. |
| This three-thread example requires stronger ordering, which |
| is the subject of |
| \crefthro{sec:memorder:Cumulativity} |
| {sec:memorder:Release-Acquire Chains}. |
| |
| \subsubsection{Cumulativity} |
| \label{sec:memorder:Cumulativity} |
| |
| The three-thread example shown in |
| \cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)} |
| requires \emph{cumulative} ordering, or \emph{cumulativity}. |
| A cumulative memory-ordering operation orders not just any given |
| access preceding it, but also earlier accesses by any thread to that |
| same variable. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-WRC+o+o-r+a-o@whole.fcv} |
| \caption{WRC Litmus Test With Release} |
| \label{lst:memorder:WRC Litmus Test With Release} |
| \end{listing} |
| |
| Dependencies do not provide cumulativity, |
| which is why the ``C'' column is blank for the \co{READ_ONCE()} row |
| of \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| on |
| \cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| However, as indicated by the ``C'' in their ``C'' column, |
| release operations do provide cumulativity. |
| Therefore, |
| \cref{lst:memorder:WRC Litmus Test With Release} |
| (\path{C-WRC+o+o-r+a-o.litmus}) |
| substitutes a release operation for |
| \cref{lst:memorder:WRC Litmus Test With Dependencies (No Ordering)}'s |
| data dependency. |
| \begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole] |
| Because the release operation is cumulative, its ordering applies not only to |
| \cref{lst:memorder:WRC Litmus Test With Release}'s |
| load from \co{x} by \co{P1()} on \clnref{P1:x}, but also to the store to \co{x} |
| by \co{P0()} on \clnref{P0:x}---but only if that load returns the value stored, |
| which matches the \co{1:r1=1} in the \co{exists} clause on \clnref{exists}. |
| This means that \co{P2()}'s load-acquire suffices to force the |
| load from \co{x} on \clnref{P2:x} to happen after the store on \clnref{P0:x}, so |
| the value returned is one, which does not match \co{2:r3=0}, which |
| in turn prevents the \co{exists} clause from triggering. |
| \end{fcvref} |
| |
| \begin{figure*} |
| \centering |
| \includegraphics{memorder/memorybarriercum} |
| \caption{Cumulativity} |
| \label{fig:memorder:Cumulativity} |
| \end{figure*} |
| |
| \begin{fcvref}[ln:formal:C-WRC+o+o-r+a-o:whole] |
| These ordering constraints are depicted graphically in |
| \cref{fig:memorder:Cumulativity}. |
| Note also that cumulativity is not limited to a single step back in time. |
If there were another load from \co{x} or store to \co{x} from any thread
that came before the store on \clnref{P0:x}, that prior load or store would also
be ordered before the load on \clnref{P2:x}, though only if \co{r1} and
\co{r2} both end up containing the value \co{1}.
| \end{fcvref} |
| |
| In short, use of cumulative ordering operations can suppress |
| non-multicopy-atomic behaviors in some situations. |
| Cumulativity nevertheless has limits, which are examined in the next section. |
| |
| \subsubsection{Propagation} |
| \label{sec:memorder:Propagation} |
| |
| \begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole] |
| \Cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| (\path{C-W+RWC+o-r+a-o+o-mb-o.litmus}) |
| shows the limitations of cumulativity and store-release, |
| even with a full memory barrier. |
| The problem is that although the \co{smp_store_release()} on |
| \clnref{P0:sr} has cumulativity, and although that cumulativity does |
| order \co{P2()}'s load on \clnref{P2:ld}, the \co{smp_store_release()}'s |
| ordering cannot propagate through the combination of \co{P1()}'s |
| load (\clnref{P1:ld}) and \co{P2()}'s store (\clnref{P2:st}). |
| This means that the \co{exists} clause on \clnref{exists} really can trigger. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-W+RWC+o-r+a-o+o-mb-o@whole.fcv} |
| \caption{W+RWC Litmus Test With Release (No Ordering)} |
| \label{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| \end{listing} |
| |
| \QuickQuiz{ |
| But it is not necessary to worry about propagation unless |
| there are at least three threads in the litmus test, right? |
| }\QuickQuizAnswer{ |
| Wrong. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-R+o-wmb-o+o-mb-o@whole.fcv} |
| \caption{R Litmus Test With Write Memory Barrier (No Ordering)} |
| \label{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-R+o-wmb-o+o-mb-o:whole] |
| \Cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| (\path{C-R+o-wmb-o+o-mb-o.litmus}) |
| shows a two-thread litmus test that requires propagation due to |
| the fact that it only has store-to-store and load-to-store |
| links between its pair of threads. |
| Even though \co{P0()} is fully ordered by the \co{smp_wmb()} and |
| \co{P1()} is fully ordered by the \co{smp_mb()}, the |
| counter-temporal nature of the links means that |
| the \co{exists} clause on \clnref{exists} really can trigger. |
| To prevent this triggering, the \co{smp_wmb()} on \clnref{wmb} |
| must become an \co{smp_mb()}, bringing propagation into play |
	twice, once for each counter-temporal link.
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| \QuickQuizLabel{\MemorderQQLitmusTestR} |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/fr}} |
| \caption{Load-to-Store is Counter-Temporal} |
| \label{fig:memorder:Load-to-Store is Counter-Temporal} |
| \end{figure} |
| |
| This situation might seem completely counter-intuitive, but keep |
| in mind that the speed of light is finite and computers are of |
| non-zero size. |
| It therefore takes time for the effect of the \co{P2()}'s store to |
| \co{x} to propagate to \co{P1()}, which in turn means that it is possible |
| that \co{P1()}'s read from \co{x} happens much later in time, but |
| nevertheless still sees the old value of zero. |
| This situation is depicted in |
| \cref{fig:memorder:Load-to-Store is Counter-Temporal}: |
| Just because a load sees the old value does \emph{not} mean that |
| this load executed at an earlier time than did the store of the |
| new value. |
| |
| Note that |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| also shows the limitations of memory-barrier pairing, given that |
| there are not two but three processes. |
| These more complex litmus tests can instead be said to have \emph{cycles}, |
| where memory-barrier pairing is the special case of a two-thread cycle. |
| \begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole] |
| The cycle in |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| goes through \co{P0()} (\clnref{P0:st,P0:sr}), \co{P1()} (\clnref{P1:la,P1:ld}), |
| \co{P2()} (\clnref{P2:st,P2:mb,P2:ld}), and back to \co{P0()} (\clnref{P0:st}). |
| The \co{exists} clause delineates this cycle: |
| The \co{1:r1=1} indicates that the \co{smp_load_acquire()} on \clnref{P1:la} |
| returned the value stored by the \co{smp_store_release()} on \clnref{P0:sr}, |
| the \co{1:r2=0} indicates that the \co{WRITE_ONCE()} on \clnref{P2:st} came |
| too late to affect the value returned by the \co{READ_ONCE()} on \clnref{P1:ld}, |
| and finally the \co{2:r3=0} indicates that the |
| \co{WRITE_ONCE()} on \clnref{P0:st} came too late to affect the value returned |
| by the \co{READ_ONCE()} on \clnref{P2:ld}. |
| In this case, the fact that the \co{exists} clause can trigger means that |
| the cycle is said to be \emph{allowed}. |
| In contrast, in cases where the \co{exists} clause cannot trigger, |
| the cycle is said to be \emph{prohibited}. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-W+RWC+o-mb-o+a-o+o-mb-o@whole.fcv} |
\caption{W+RWC Litmus Test With More Barriers}
| \label{lst:memorder:W+WRC Litmus Test With More Barriers} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole] |
| But what if we need to prohibit the cycle corresponding to the \co{exists} |
| clause on \clnref{exists} of |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}? |
| One solution is to replace \co{P0()}'s \co{smp_store_release()} |
| with an \co{smp_mb()}, which |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| shows to have not only cumulativity, but also propagation. |
| \end{fcvref} |
| The result is shown in |
| \cref{lst:memorder:W+WRC Litmus Test With More Barriers} |
| (\path{C-W+RWC+o-mb-o+a-o+o-mb-o.litmus}). |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole] |
| But given that \co{smp_mb()} has the propagation property, |
| why doesn't the \co{smp_mb()} on \clnref{P2:mb} of |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| prevent the \co{exists} clause from triggering? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:formal:C-W+RWC+o-r+a-o+o-mb-o:whole] |
| As a rough rule of thumb, the \co{smp_mb()} barrier's |
| propagation property is sufficient to maintain ordering |
| through only one load-to-store link between |
| processes. |
| Unfortunately, |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)} |
| has not one but two load-to-store links, with the |
| first being from the \co{READ_ONCE()} on \clnref{P1:ld} to the |
| \co{WRITE_ONCE()} on \clnref{P2:st} and the second being from |
| the \co{READ_ONCE()} on \clnref{P2:ld} to the \co{WRITE_ONCE()} |
| on \clnref{P0:st}. |
| Therefore, preventing the \co{exists} clause from triggering |
| should be expected to require not one but two |
| instances of \co{smp_mb()}. |
| \end{fcvref} |
| |
| As a special exception to this rule of thumb, a release-acquire |
| chain can have one load-to-store link between processes |
| and still prohibit the cycle. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/co}} |
| \caption{Store-to-Store is Counter-Temporal} |
| \label{fig:memorder:Store-to-Store is Counter-Temporal} |
| \end{figure} |
| |
| For completeness, |
| \cref{fig:memorder:Store-to-Store is Counter-Temporal} |
| shows that the ``winning'' store among a group of stores to the |
| same variable is not necessarily the store that started last. |
| This should not come as a surprise to anyone who carefully examined |
| \cref{fig:memorder:A Variable With More Simultaneous Values} |
| on |
| \cpageref{fig:memorder:A Variable With More Simultaneous Values}. |
| One way to rationalize the counter-temporal properties of both |
| load-to-store and store-to-store ordering is to clearly distinguish |
| between the temporal order in which the store instructions executed on |
| the one hand, and the order in which the corresponding cacheline visited |
| the CPUs that executed those instructions on the other. |
| It is the cacheline-visitation order that defines the externally |
| visible ordering of the actual stores. |
| This cacheline-visitation order is not directly visible to the code |
| executing the store instructions, which results in the counter-intuitive |
| counter-temporal nature of load-to-store and store-to-store ordering.\footnote{ |
| In some hardware-multithreaded systems, the store would become |
| visible to other CPUs in that same core as soon as the store |
| reached the shared store buffer. |
| As a result, such systems are non-multicopy atomic.} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-2+2W+o-wmb-o+o-wmb-o@whole.fcv} |
| \caption{2+2W Litmus Test With Write Barriers} |
| \label{lst:memorder:2+2W Litmus Test With Write Barriers} |
| \end{listing} |
| |
| \QuickQuiz{ |
| But for litmus tests having only ordered stores, as shown in |
| \cref{lst:memorder:2+2W Litmus Test With Write Barriers} |
| (\path{C-2+2W+o-wmb-o+o-wmb-o.litmus}), |
| research shows that the cycle is prohibited, even in weakly |
| ordered systems such as \ARM\ and Power~\cite{test6-pdf}. |
| Given that, is store-to-store ordering really \emph{always} |
| counter-temporal??? |
| }\QuickQuizAnswer{ |
| This litmus test is indeed a very interesting curiosity. |
| Its ordering apparently occurs naturally given typical |
| weakly ordered hardware design, which would normally be |
| considered a great gift from the relevant laws of physics |
| and cache-coherency-protocol mathematics. |
| |
| \begin{fcvref}[ln:formal:C-2+2W+o-wmb-o+o-wmb-o:whole] |
| Unfortunately, no one has been able to come up with a software use |
| case for this gift that does not have a much better alternative |
| implementation. |
| Therefore, neither the C11 nor the Linux kernel memory models |
| provide any guarantee corresponding to |
| \cref{lst:memorder:2+2W Litmus Test With Write Barriers}. |
| This means that the \co{exists} clause on \clnref{exists} can |
| trigger. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-2+2W+o-o+o-o@whole.fcv} |
| \caption{2+2W Litmus Test (No Ordering)} |
| \label{lst:memorder:2+2W Litmus Test (No Ordering)} |
| \end{listing} |
| |
| Of course, without the barrier, there are no ordering |
| guarantees, even on real weakly ordered hardware, as shown in |
| \cref{lst:memorder:2+2W Litmus Test (No Ordering)} |
| (\path{C-2+2W+o-o+o-o.litmus}). |
| }\QuickQuizEnd |
| |
| But sometimes time really is on our side. |
| Read on! |
| |
| \subsubsection{Happens-Before} |
| \label{sec:memorder:Happens-Before} |
| |
| As shown in |
| \cref{fig:memorder:Store-to-Load is Temporal}, |
| on platforms without user-visible speculation, if a load returns the value |
| from a particular store, then, courtesy of the finite speed of light and |
| the non-zero size of modern computing systems, the store absolutely has |
| to have executed at an earlier time than did the load. |
| This means that carefully constructed programs can rely on the |
| passage of time itself as a memory-ordering operation. |
| |
| \QuickQuiz{ |
| Why don't we just stick to sanely ordered CPU families like x86, |
| so that time will \emph{always} be on our side??? |
| }\QuickQuizAnswer{ |
| Sorry to be the one to break it to you, but x86 CPUs have |
| store buffers and they are not afraid to use them. |
| |
| The data shown in |
| \crefrange{fig:memorder:x86 CPUs Can Disagree}{fig:memorder:Store-to-Load is Temporal on x86} |
| was obtained from a series of torture tests |
| (\path{perftemporal.sh}) running on a two-socket x86 system |
| having a total of 80 hardware |
| threads.\footnote{ |
| \co{Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz}, |
| for those wanting more details.} |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe-nvals}} |
| \caption{x86 CPUs Can Disagree} |
| \label{fig:memorder:x86 CPUs Can Disagree} |
| \end{figure} |
| |
| \Cref{fig:memorder:x86 CPUs Can Disagree} |
| is from a torture-test run similar to the one that generated |
| \cref{fig:memorder:A Variable With More Simultaneous Values} |
| on |
| \cpageref{fig:memorder:A Variable With More Simultaneous Values}, |
| but showing the number of distinct opinions per unit time, where |
| each timestamp period is 0.5~nanoseconds.\footnote{ |
| Recall that each CPU writes its own number to a single |
| shared variable, which each CPU repeatedly polls, |
| recording time and value at each change in value.} |
| As you can see, there is a significant period of time during |
| which there are more than 40 distinct opinions as to the value |
| of a single shared variable. |
| Even on x86. |
| |
| \begin{figure} |
| \centering |
| \resizebox{.85\twocolumnwidth}{!}{\includegraphics{memorder/co-hopes}} |
| \caption{Is Store-to-Store Counter-Temporal on x86?} |
| \label{fig:memorder:Is Store-to-Store Counter-Temporal on x86?} |
| \end{figure} |
| |
| But perhaps we might hope that on x86, the last CPU to execute its |
| store in global time order would ``win'', that is, the final value |
| of the shared variable would be that of that last store. |
| If so, any store starting after the ``winning'' store finished would |
| overwrite the winning store. |
| In contrast, we know that on weakly ordered systems, a |
| counter-intuitive store such as that shown at the bottom of |
| \cref{fig:memorder:Is Store-to-Store Counter-Temporal on x86?} |
| could be overwritten by the old value, despite the fact that it |
| started $t$ timestamp periods after the winning store completed, |
| as depicted in the figure by the time interval that is labelled |
| $-t$.\footnote{ |
		That is, completed from the viewpoint of the instruction
| stream containing that store. |
| The value stored might well remain in the store buffer |
| long afterwards.} |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/coe}} |
| \caption{Store-to-Store is Counter-Temporal on x86} |
| \label{fig:memorder:Store-to-Store is Counter-Temporal on x86} |
| \end{figure} |
| |
| However, |
| \cref{fig:memorder:Store-to-Store is Counter-Temporal on x86} |
| dashes any fond hope of x86 refusing to indulge in |
| counter-intuitive counter-temporal stores. |
| The data in this figure summarizes the results from 79,000 |
| torture-test runs of the type that generated |
| \cref{fig:memorder:x86 CPUs Can Disagree}, |
| and is a histogram of the minimum time elapsed between the start of |
| a non-winning CPU's store and the end of the winning CPU's store. |
| Of course, if this value is negative, the winning store completed |
| (though its value might not have propagated past the store buffer) |
| before the other store even started. |
| And the negative-time data points in that figure show that this |
| counter-temporal behavior really can be made to happen most of |
| the time, even on x86. |
| |
| \begin{figure} |
| \centering |
| \resizebox{.6\twocolumnwidth}{!}{\includegraphics{memorder/fr-hopes}} |
| \caption{Is Load-to-Store Counter-Temporal on x86?} |
| \label{fig:memorder:Is Load-to-Store Counter-Temporal on x86?} |
| \end{figure} |
| |
| But perhaps we could instead hope that once a store had |
| executed, any future load from that same variable might |
| be guaranteed to return the new value. |
| In contrast, we know that on weakly ordered systems, a |
| counter-intuitive load such as that shown at the bottom |
| of |
| \cref{fig:memorder:Is Load-to-Store Counter-Temporal on x86?} |
| could return the old value, despite having started $t$ |
| timestamp periods after the end of the store, again, as |
| depicted in the figure by the $-t$. |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/fre}} |
| \caption{Load-to-Store is Counter-Temporal on x86} |
| \label{fig:memorder:Load-to-Store is Counter-Temporal on x86} |
| \end{figure} |
| |
| However, |
| \cref{fig:memorder:Load-to-Store is Counter-Temporal on x86} |
| dashes any fond hopes that loads executing after a given store |
| would see that store's value (or some later value). |
| This data is generated by 1,000 runs of a program in which the |
| parent thread spawns the children, each of which polls a shared |
| variable. |
| After all the children are running, the parent writes a new |
| value to the shared variable. |
| The quantity histogrammed in the figure is the time from just |
| after the store until just before the last load of the old value. |
| And most of these data points lie on the negative x~axis, |
| clearly demonstrating that a torture test can easily provoke |
| the counter-temporal nature of cross-thread load-to-store. |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf-hopes}} |
| \caption{Is Store-to-Load Counter-Temporal on x86?} |
| \label{fig:memorder:Is Store-to-Load Counter-Temporal on x86?} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{CodeSamples/cpu/data/rfe}} |
| \caption{Store-to-Load is Temporal on x86} |
| \label{fig:memorder:Store-to-Load is Temporal on x86} |
| \end{figure} |
| |
| It is only reasonable to assume that a load that ends before |
| a store starts will be unable to load that store's value, |
| as shown in |
| \cref{fig:memorder:Is Store-to-Load Counter-Temporal on x86?}, |
| even on weakly ordered systems. |
| And the data points from all 1,000 runs shown in |
	\cref{fig:memorder:Store-to-Load is Temporal on x86}
| lie on the positive x~axis, so in at least this case, our temporal |
| intuitions, hopes, and dreams have been fulfilled. |
| |
| These results should be no surprise. |
| Even on x86, the cache line shuttles between cores and sockets, and |
| \cref{fig:memorder:x86 CPUs Can Disagree} |
| indicates that a given store can remain in its CPU's store buffer |
| for more than a microsecond, which is quite enough time for |
| today's multi-GHz CPUs to detect the resulting counter-temporal |
| behavior. |
| |
| However, for the store-to-load case, the laws of physics |
| guarantee temporal ordering because the finite speed of light |
| and the non-zero size of atoms work together to guarantee that |
| some time will pass between a store on one CPU and a load of |
| that store's value on some other CPU. |
| |
| Alert readers may have noticed that the distribution shown in |
| \cref{fig:memorder:Store-to-Store is Counter-Temporal on x86} |
	is nearly monomodal, while those in
| \cref{fig:memorder:Load-to-Store is Counter-Temporal on x86} |
| and |
| \cref{fig:memorder:Store-to-Load is Temporal on x86} |
| are decidedly trimodal. |
| Such readers are encouraged to consider why that might be, |
| perhaps referring to \path{CodeSamples/cpu/perftemporal.sh} and |
| the scripts and programs that it invokes. |
| |
| And all readers are encouraged to note that even on relatively |
| strongly ordered CPUs such as x86, store-to-store and |
| load-to-store IPC links can be strongly counter-temporal. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/rf}} |
| \caption{Store-to-Load is Temporal} |
| \label{fig:memorder:Store-to-Load is Temporal} |
| \end{figure} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+a-o+o-data-o+o-data-o@whole.fcv} |
| \caption{LB Litmus Test With One Acquire} |
| \label{lst:memorder:LB Litmus Test With One Acquire} |
| \end{listing} |
| |
| Of course, just the passage of time by itself is not enough, as |
| was seen in |
| \cref{lst:memorder:Load-Buffering Litmus Test (No Ordering)} |
| on |
| \cpageref{lst:memorder:Load-Buffering Litmus Test (No Ordering)}, |
| which has nothing but store-to-load links and, because it provides |
| absolutely no ordering, still can trigger its \co{exists} clause. |
| However, as long as each thread provides even the weakest possible |
ordering, the \co{exists} clause would not be able to trigger.
| For example, |
| \cref{lst:memorder:LB Litmus Test With One Acquire} |
| (\path{C-LB+a-o+o-data-o+o-data-o.litmus}) |
| shows \co{P0()} ordered with an \co{smp_load_acquire()} and |
| both \co{P1()} and \co{P2()} ordered with data dependencies. |
| These orderings, which are close to the top of |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}, |
| suffice to prevent the \co{exists} clause from triggering. |
| |
| \QuickQuiz{ |
| Can you construct a litmus test like that in |
| \cref{lst:memorder:LB Litmus Test With One Acquire} |
| that uses \emph{only} dependencies? |
| }\QuickQuizAnswer{ |
| \Cref{lst:memorder:LB Litmus Test With No Acquires} |
| shows a somewhat nonsensical but very real example. |
| Creating a more useful (but still real) litmus test is left |
| as an exercise for the reader. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-data-o+o-data-o+o-data-o@whole.fcv} |
| \caption{LB Litmus Test With No Acquires} |
| \label{lst:memorder:LB Litmus Test With No Acquires} |
| \end{listing} |
| }\QuickQuizEnd |
| |
| An important use of time for ordering memory accesses is covered in the |
| next section. |
| |
| \subsubsection{Release-Acquire Chains} |
| \label{sec:memorder:Release-Acquire Chains} |
| |
| A minimal release-acquire chain was shown in |
| \cref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test} |
| on |
| \cpageref{lst:memorder:Enforcing Ordering of Load-Buffering Litmus Test}, |
| but these chains can be much longer, as shown in |
| \cref{lst:memorder:Long LB Release-Acquire Chain} |
| (\path{C-LB+a-r+a-r+a-r+a-r.litmus}). |
| The longer the release-acquire chain, the more ordering is gained |
| from the passage of time, so that no matter how many threads are |
| involved, the corresponding \co{exists} clause cannot trigger. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+a-r+a-r+a-r+a-r@whole.fcv} |
| \caption{Long LB Release-Acquire Chain} |
| \label{lst:memorder:Long LB Release-Acquire Chain} |
| \end{listing} |
| |
| Although release-acquire chains are inherently store-to-load creatures, |
| it turns out that they can tolerate one load-to-store step, despite |
| such steps being counter-temporal, as shown in |
| \cref{fig:memorder:Load-to-Store is Counter-Temporal} |
| on |
| \cpageref{fig:memorder:Load-to-Store is Counter-Temporal}. |
| For example, |
| \cref{lst:memorder:Long ISA2 Release-Acquire Chain} |
| (\path{C-ISA2+o-r+a-r+a-r+a-o.litmus}) |
| shows a three-step release-acquire chain, but where \co{P3()}'s |
| final access is a \co{READ_ONCE()} from \co{x0}, which is |
accessed via \co{WRITE_ONCE()} by \co{P0()}, forming a counter-temporal
| load-to-store link between these two processes. |
| \begin{fcvref}[ln:formal:litmus:C-ISA2+o-r+a-r+a-r+a-o:whole] |
| However, because \co{P0()}'s \co{smp_store_release()} (\clnref{P0:rel}) |
| is cumulative, if \co{P3()}'s \co{READ_ONCE()} returns zero, |
| this cumulativity will force the \co{READ_ONCE()} to be ordered |
| before \co{P0()}'s \co{smp_store_release()}. |
| In addition, the release-acquire chain |
| (\clnref{P0:rel,P1:acq,P1:rel,P2:acq,P2:rel,P3:acq}) |
| forces \co{P3()}'s \co{READ_ONCE()} to be ordered after \co{P0()}'s |
| \co{smp_store_release()}. |
| Because \co{P3()}'s \co{READ_ONCE()} cannot be both before and after |
| \co{P0()}'s \co{smp_store_release()}, either or both of two things must |
| be true: |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-ISA2+o-r+a-r+a-r+a-o@whole.fcv} |
| \caption{Long ISA2 Release-Acquire Chain} |
| \label{lst:memorder:Long ISA2 Release-Acquire Chain} |
| \end{listing} |
| |
| \begin{enumerate} |
| \item \co{P3()}'s \co{READ_ONCE()} came after \co{P0()}'s |
| \co{WRITE_ONCE()}, so that the \co{READ_ONCE()} returned |
| the value two, so that the \co{exists} clause's \co{3:r2=0} |
| is false. |
| \item The release-acquire chain did not form, that is, one or more |
| of the \co{exists} clause's \co{1:r2=2}, \co{2:r2=2}, or \co{3:r1=2} |
| is false. |
| \end{enumerate} |
| |
| Either way, the \co{exists} clause cannot trigger, despite this litmus |
| test containing a notorious load-to-store link between |
| \co{P3()} and \co{P0()}. |
| But never forget that release-acquire chains can tolerate only one |
| load-to-store link, as was seen in |
| \cref{lst:memorder:W+RWC Litmus Test With Release (No Ordering)}. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-r+a-r+a-o@whole.fcv} |
| \caption{Long Z6.2 Release-Acquire Chain} |
| \label{lst:memorder:Long Z6.2 Release-Acquire Chain} |
| \end{listing} |
| |
| Release-acquire chains can also tolerate a single store-to-store step, |
| as shown in |
| \cref{lst:memorder:Long Z6.2 Release-Acquire Chain} |
| (\path{C-Z6.2+o-r+a-r+a-r+a-o.litmus}). |
| \begin{fcvref}[ln:formal:C-Z6.2+o-r+a-r+a-r+a-o:whole] |
| As with the previous example, \co{smp_store_release()}'s cumulativity |
| combined with the temporal nature of the release-acquire chain |
| prevents the \co{exists} clause on \clnref{exists} from triggering. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-Z6.2+o-r+a-o+o-mb-o@whole.fcv} |
| \caption{Z6.2 Release-Acquire Chain (Ordering?)} |
| \label{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)} |
| \end{listing} |
| |
| \QuickQuiz{ |
| Suppose we have a short release-acquire chain along with one |
| load-to-store link and one store-to-store link, like that shown in |
| \cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}. |
| Given that there is only one of each type of non-store-to-load |
	link, the \co{exists} clause cannot trigger, right?
| }\QuickQuizAnswer{ |
| Wrong. |
| It is the number of non-store-to-load links that matters. |
| If there is only one non-store-to-load link, a release-acquire |
| chain can prevent the \co{exists} clause from triggering. |
| However, if there is more than one non-store-to-load link, |
| be they store-to-store, load-to-store, or any combination |
| thereof, it is necessary to have at least one full barrier |
| (\co{smp_mb()} or better) between each non-store-to-load link. |
| In |
| \cref{lst:memorder:Z6.2 Release-Acquire Chain (Ordering?)}, |
| preventing the \co{exists} clause from triggering therefore requires |
| an additional full barrier between either \co{P0()}'s or |
| \co{P1()}'s accesses. |
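
	For example, assuming hypothetical variable names \co{x} and
	\co{y} (the actual litmus file may differ), \co{P0()} could be
	strengthened as follows:

\begin{VerbatimN}[samepage=true]
P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_mb();	/* Full barrier between the two accesses. */
	smp_store_release(y, 1);
}
\end{VerbatimN}

	The \co{smp_store_release()} still heads the release-acquire
	chain, and the added \co{smp_mb()} supplies the needed
	full-barrier ordering between \co{P0()}'s accesses.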
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MP+o-r+a-o@whole.fcv} |
| \caption{A Release-Acquire Chain Ordering Multiple Accesses} |
| \label{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses} |
| \end{listing} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-MPO+o-r+a-o+o@whole.fcv} |
| \caption{A Release-Acquire Chain With Added Store (Ordering?)} |
| \label{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)} |
| \end{listing} |
| |
| But beware: |
| Adding a second store-to-store link allows the correspondingly updated |
| \co{exists} clause to trigger. |
| To see this, review |
| \cref{lst:memorder:A Release-Acquire Chain Ordering Multiple Accesses,% |
| lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)}, |
| which have identical \co{P0()} and \co{P1()} processes. |
| The only code difference is that |
| \cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)} |
| has an additional \co{P2()} that does an \co{smp_store_release()} to |
| the \co{x2} variable that \co{P0()} releases and \co{P1()} acquires. |
| The \co{exists} clause is also adjusted to exclude executions in which |
| \co{P2()}'s \co{smp_store_release()} precedes that of \co{P0()}. |
| |
| Running the litmus test in |
| \cref{lst:memorder:A Release-Acquire Chain With Added Store (Ordering?)} |
| shows that the addition of \co{P2()} can totally destroy the |
| ordering from the release-acquire chain. |
| Therefore, when constructing release-acquire chains, please take care |
| to construct them properly. |
| |
| \QuickQuiz{ |
| There are store-to-load links, load-to-store links, and |
| store-to-store links. |
| But what about load-to-load links? |
| }\QuickQuizAnswer{ |
| The problem with the concept of load-to-load links is that |
| if the two loads from the same variable return the same |
| value, there is no way to determine their ordering. |
| The only way to determine their ordering is if they return |
| different values, in which case there had to have been an |
| intervening store. |
| And that intervening store means that there is no load-to-load |
| link, but rather a load-to-store link followed by a |
| store-to-load link. |
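
	For example, given the following hypothetical fragment, there
	is no way to tell which of the two loads executed first:

\begin{VerbatimN}[samepage=true]
r1 = READ_ONCE(x);	/* Returns 42. */
r2 = READ_ONCE(x);	/* Also returns 42, so which came first? */
\end{VerbatimN}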
| }\QuickQuizEnd |
| |
| In short, properly constructed release-acquire chains form a peaceful |
| (if rather tightly constrained) |
| island of intuitive bliss surrounded by a strongly counter-intuitive |
| sea of more complex memory-ordering constraints. |
| |
| \subsection{A Counter-Intuitive Case Study} |
| \label{sec:memorder:A Counter-Intuitive Case Study} |
| |
| This section will revisit |
| \cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| on \cpageref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}, |
| which was presented in the answer to |
| \QuickQuizARef{\MemorderQQLitmusTestR}. |
| This litmus test has only two threads, with the stores in \co{P0()} |
| being ordered by \co{smp_wmb()} and the accesses in \co{P1()} being |
| ordered by \co{smp_mb()}. |
| Despite this litmus test's small size and heavy ordering, the |
| counter-intuitive outcome shown in the \co{exists} clause is in fact |
| allowed. |
| |
| One way to look at this was presented in the answer to |
| \QuickQuizARef{\MemorderQQLitmusTestR}, namely that the link from |
\co{P0()} to \co{P1()} is a store-to-store link, and that the link back
from \co{P1()} to \co{P0()} is a load-to-store link.
| Both links are counter-temporal, thus requiring full memory barriers |
| in both processes. |
| Revisiting |
| \cref{fig:memorder:Store-to-Store is Counter-Temporal,% |
| fig:memorder:Store-to-Load is Temporal} |
| shows that these counter-temporal links give the hardware considerable |
| latitude. |
| |
| But that raises the question of exactly how hardware would go about using |
| this latitude to satisfy the \co{exists} clause in |
| \cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}. |
| There is no known ``toy'' hardware implementation that can do this, so |
| let us instead study the sequence of steps that the PowerPC architecture |
| goes through to make this happen. |
| |
| The first step in this study is to translate |
| \cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)} |
| to a PowerPC assembly language litmus test |
| (\cref{sec:formal:Anatomy of a Litmus Test} on |
| \cpageref{sec:formal:Anatomy of a Litmus Test}): |
| |
| \begin{fcvlabel}[ln:memorder:inline:ppcasmr] |
| \begin{VerbatimN}[samepage=true,commandchars=\@\[\]] |
| PPC R+lwsync+sync |
| { |
| 0:r1=1; 0:r2=x; 0:r4=y; @lnlbl[init0] |
| 1:r1=2; 1:r2=y; 1:r4=x; @lnlbl[init1] |
| } |
| P0 | P1 ; @lnlbl[procs] |
| stw r1,0(r2) | stw r1,0(r2) ; @lnlbl[stores] |
| lwsync | sync ; @lnlbl[barriers] |
| stw r1,0(r4) | lwz r3,0(r4) ; @lnlbl[storeload] |
| exists (y=2 /\ 1:r3=0) @lnlbl[exists] |
| \end{VerbatimN} |
| \end{fcvlabel} |
| |
| \begin{fcvref}[ln:memorder:inline:ppcasmr] |
| The first line identifies the type of test (\co{PPC}) and gives |
| the test's name. |
| \Clnref{init0,init1} initialize \co{P0()}'s and \co{P1()}'s registers, |
| respectively. |
| \Clnrefrange{procs}{storeload} show the PowerPC assembly |
| statements corresponding to the C code from |
| \cref{lst:memorder:R Litmus Test With Write Memory Barrier (No Ordering)}, |
| with the first column being the code for \co{P0()} and the second column |
| being the code for \co{P1()}. |
| \Clnref{stores} shows the initial \co{WRITE_ONCE()} calls in both columns; |
| the columns of \clnref{barriers} show the \co{smp_wmb()} and \co{smp_mb()} |
| for \co{P0()} and \co{P1()}, respectively; |
the columns of \clnref{storeload} show \co{P0()}'s \co{WRITE_ONCE()} and
| \co{P1()}'s \co{READ_ONCE()}, respectively; |
| and finally \clnref{exists} shows the \co{exists} clause. |
| \end{fcvref} |
| |
| In order for this \co{exists} clause to be satisfied, \co{P0()}'s |
| \co{stw} to \co{y} must precede that of \co{P1()}, but \co{P1()}'s |
| later \co{lwz} from \co{x} must precede \co{P0()}'s \co{stw} to \co{x}. |
| Seeing how this can happen requires a rough understanding of the |
| following PowerPC terminology. |
| |
| \begin{description}[style=nextline] |
| |
| \item[Instruction commit:] |
| This can be thought of as the execution of that instruction as opposed |
| to the memory-system consequences of having executed that instruction. |
| |
| \item[Write reaching coherence point:] |
| This can be thought of as the value written being deposited into the |
| corresponding cache line. |
| |
| \item[Partial coherence commit:] |
| This can be thought of as the system having worked out the order in which |
| a pair of values written will be deposited into the corresponding cache |
| line, but potentially well before that cache line arrives. |
| Some might argue that the data in |
| \cref{fig:memorder:A Variable With More Simultaneous Values} |
| suggests that real PowerPC hardware does in fact use partial coherence |
| commits to handle concurrent stores by multiple hardware threads within |
| a single core. |
| |
| \item[Write propagate to thread:] |
| This occurs when a second hardware thread becomes aware of the first |
| hardware thread's write. |
| The time at which a write propagates to a given thread might not have |
| any relation to cache-line movement. |
| For example, if a pair of threads share a store buffer, they might see |
| each others' writes long before the cache line gets involved. |
| On the other hand, if a pair of hardware threads are widely separated, |
| the first thread's write's value might have been deposited into the |
| corresponding cache line long before the second thread learns of that |
| write. |
| |
| \item[Barrier propagate to thread:] |
| Hardware threads make each other aware of memory-barrier instructions |
| as needed by propagating them to each other. |
| |
| \item[Acknowledge \tco{sync}:] |
| The PowerPC \co{sync} instruction implements the Linux kernel's |
| \co{smp_mb()} full barrier. |
| And one reason that the \co{sync} instruction provides such strong |
| ordering is that each \co{sync} is not only propagated to other hardware |
| threads, but these other threads must also acknowledge each \co{sync}. |
| This two-way communication allows the hardware threads to cooperate |
| to produce the required strong global ordering. |
| |
| \end{description} |
| |
| \begin{figure*}[tbp] |
| \centering |
| \resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM0.png}} |
| \caption{PPCMEM Initial R State} |
| \label{fig:memorder:PPCMEM Initial R State} |
| \end{figure*} |
| |
| \begin{figure*}[tbp] |
| \centering |
| \resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEM1.png}} |
| \caption{PPCMEM First R Step} |
| \label{fig:memorder:PPCMEM First R Step} |
| \end{figure*} |
| |
| We are now ready to step through the PowerPC sequence of events that |
| satisfies the above \co{exists} clause. |
| |
| To best understand this, please follow along at |
| \url{https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html}, |
| carefully copying the above assembly-language litmus test into the pane. |
| The result should look as shown in |
| \cref{fig:memorder:PPCMEM Initial R State}, give or take space characters. |
| Click on the ``Interactive'' button in the lower left, which, after a |
| short delay, should produce a display as shown in |
| \cref{fig:memorder:PPCMEM First R Step}. |
| If the ``Interactive'' button refuses to do anything, this usually means |
| that there is a syntax error, for example, a spurious newline character |
| might have been introduced during the copy-paste operation. |
| |
| This display has one clickable link in each section displaying thread |
| state, and as the ``Commit'' in each link suggests, these links commit |
| each thread's first \co{stw} instruction. |
| If you prefer, you can instead click on the corresponding links listed |
| under ``Enabled transitions'' near the bottom of the screen. |
| Note well that some of the later memory-system transitions will appear |
| in the upper ``Storage subsystem state'' section of this display. |
| |
| The following sequence of clicks demonstrates how the \co{exists} clause |
| can be satisfied: |
| |
| \begin{enumerate} |
| \item Commit \co{P0()}'s first \co{stw} instruction (to \co{x}). |
| \item Commit \co{P1()}'s \co{stw} instruction. |
| \item Commit \co{P0()}'s \co{lwsync} instruction. |
| \item Commit \co{P0()}'s second \co{stw} instruction (to \co{y}). |
| \item Commit \co{P1()}'s \co{sync} instruction. |
| \item At this point, there should be no clickable links in either of |
| the two sections displaying thread state, but there should be |
| quite a few of them up in the ``Storage subsystem state''. |
| The following steps tell you which of them to click on. |
| \item \co{Partial coherence commit: c:W y=1 -> d:W y=2}. |
| This commits the system to processing \co{P0()}'s store to |
| \co{y} before \co{P1()}'s store even though neither store |
| has reached either the coherence point or any other thread. |
| One might imagine partial coherence commits happening within a |
| store buffer that is shared by multiple hardware threads |
| that are writing to the same variable. |
| \item \co{Write propagate to thread: d:W y=2 to Thread 0}. |
| This is necessary to allow \co{P1()}'s \co{sync} instruction |
| to propagate to \co{P0()}. |
| \item \co{Barrier propagate to thread: e:Sync to Thread 0}. |
| \item \co{Write reaching coherence point: a:W x=1}. |
| \item \co{Write reaching coherence point: c:W y=1}. |
| \item \co{Write reaching coherence point: d:W y=2}. |
| These three operations were required in order to allow \co{P0()} |
| to acknowledge \co{P1()}'s \co{sync} instruction. |
| \item \co{Acknowledge sync: Sync e:Sync}. |
| \item Back down in thread \co{P1()}'s state, click on \co{Read i:W |
| x=0}, which loads the value zero, thus satisfying the \co{exists} |
| clause. |
| All that remains is cleanup, which can be carried out in any order. |
| \item Commit \co{P1()}'s \co{lwz} instruction. |
| \item \co{Write propagate to thread: a:W x=1 to Thread 1}. |
| \item \co{Barrier propagate to thread: b:Lwsync to Thread 1}. |
| \end{enumerate} |
| |
| \begin{figure*}[tbp] |
| \centering |
| \resizebox{\textwidth}{!}{\includegraphics{memorder/PPCMEMfinal.png}} |
| \caption{PPCMEM Final R State} |
| \label{fig:memorder:PPCMEM Final R State} |
| \end{figure*} |
| |
| At this point, you should see something like |
| \cref{fig:memorder:PPCMEM Final R State}. |
Note that the satisfied \co{exists} clause is shown in blue near the
bottom, confirming that this counter-intuitive outcome really can happen.
| If you wish, you can click on ``Undo'' to explore other options or |
| click on ``Reset'' to start over. |
| It can be very helpful to carry out these steps in different orders |
| to better understand how a |
| \IXalthy{non-multicopy-atomic}{non-}{multi-copy atomic} |
| architecture operates. |
| |
| \QuickQuiz{ |
| What happens if that \co{lwsync} instruction is instead a |
| \co{sync} instruction? |
| }\QuickQuizAnswer{ |
| The counter-intuitive outcome cannot happen. |
| (Try it!) |
| }\QuickQuizEnd |
| |
| Although a full understanding of how this counter-intuitive outcome |
| happens would require hardware details that are beyond the scope of |
| this book, this exercise should provide some helpful intuitions. |
| Or perhaps more accurately, destroy some counter-productive intuitions. |
| |
| % @@@ Exercises? |
| % @@@ Hardware details from Appendix? |
| |
| \section{Compile-Time Consternation} |
| \label{sec:memorder:Compile-Time Consternation} |
| % |
| \epigraph{Science increases our power in proportion as it lowers our pride.} |
| {Claude Bernard} |
| |
| Most languages, including C, were developed on uniprocessor systems |
| by people with little or no parallel-programming experience. |
| As a result, unless explicitly told otherwise, these languages assume |
| that the current CPU is the only thing that is reading or writing memory. |
| This in turn means that these languages' compilers' optimizers |
| are ready, willing, and oh so able to make dramatic changes to the |
| order, number, and sizes of memory references that your program |
| executes. |
| In fact, the reordering carried out by hardware can seem quite tame |
| by comparison. |
| |
| This section will help you tame your compiler, thus avoiding a great |
| deal of compile-time consternation. |
| \Cref{sec:memorder:Memory-Reference Restrictions} |
| describes how to keep the compiler from destructively optimizing |
| your code's memory references, |
| \cref{sec:memorder:Address- and Data-Dependency Difficulties} |
| describes how to protect address and data dependencies, |
| and finally, |
| \cref{sec:memorder:Control-Dependency Calamities} |
| describes how to protect those delicate control dependencies. |
| |
| \subsection{Memory-Reference Restrictions} |
| \label{sec:memorder:Memory-Reference Restrictions} |
| |
| As noted in \cref{sec:toolsoftrade:Accessing Shared Variables}, |
| unless told otherwise, compilers assume that nothing else |
| is affecting the variables that the code is accessing. |
| Furthermore, this assumption is not simply some design error, but is |
| instead enshrined in various standards.\footnote{ |
| Or perhaps it is a standardized design error.} |
| It is worth summarizing this material in preparation for the following |
| sections. |
| |
\IXplx{Plain access}{es}, as in plain-access C-language assignment statements such
as \qco{r1 = a} or \qco{b = 1}, are subject to the
| shared-variable shenanigans described in |
| \cref{sec:toolsoftrade:Shared-Variable Shenanigans}. |
| Ways of avoiding these shenanigans are described in |
| \crefrange{sec:toolsoftrade:A Volatile Solution}{sec:toolsoftrade:Avoiding Data Races} |
| starting on |
| \cpageref{sec:toolsoftrade:A Volatile Solution}: |
| |
| \begin{enumerate} |
| \item Plain accesses can tear, for example, the compiler could choose |
| to access an eight-byte pointer one byte at a time. |
| Tearing of aligned machine-sized accesses can be prevented by |
| using \co{READ_ONCE()} and \co{WRITE_ONCE()}. |
| \item Plain loads can fuse, for example, if the results of an earlier |
| load from that same object are still in a machine register, |
| the compiler might opt to reuse the value in that register |
| instead of reloading from memory. |
| Load fusing can be prevented by using \co{READ_ONCE()} or by |
| enforcing ordering between the two loads using \co{barrier()}, |
| \co{smp_rmb()}, and other means shown in |
\cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}.
(A sketch of load fusing and its prevention follows this list.)
| \item Plain stores can fuse, so that a store can be omitted entirely |
| if there is a later store to that same variable. |
| Store fusing can be prevented by using \co{WRITE_ONCE()} or by |
| enforcing ordering between the two stores using \co{barrier()}, |
| \co{smp_wmb()}, and other means shown in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| \item Plain accesses can be reordered in surprising ways by modern |
| optimizing compilers. |
| This reordering can be prevented by enforcing ordering as |
| called out above. |
| \item Plain loads can be invented, for example, register pressure might |
| cause the compiler to discard a previously loaded value from |
| its register, and then reload it later on. |
| Invented loads can be prevented by using \co{READ_ONCE()} or by |
| enforcing ordering as called out above between the load and a |
| later use of its value using \co{barrier()}. |
| \item Stores can be invented before a plain store, for example, by |
| using the stored-to location as temporary storage. |
| This can be prevented by use of \co{WRITE_ONCE()}. |
| \item Stores can be transformed into a load-check-store sequence, |
| which can defeat control dependencies. |
This can be prevented by use of \co{smp_store_release()}.
| \end{enumerate} |
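
As an example of load fusing, consider the following sketch of a
polling loop, in which the hypothetical \co{need_to_stop} flag is set
by some other thread and \co{do_something_quickly()} stands in for
the body of the loop:

\begin{VerbatimN}
while (!need_to_stop)  /* BUG: Compiler can fuse loads!!! */
        do_something_quickly();
\end{VerbatimN}

Because \co{need_to_stop} is accessed using a plain load, the compiler
is within its rights to load it just once and to reuse the resulting
value on every pass through the loop, possibly looping forever.
Marking the load with \co{READ_ONCE()} forces the compiler to reload
on each iteration:

\begin{VerbatimN}
while (!READ_ONCE(need_to_stop))
        do_something_quickly();
\end{VerbatimN}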
| |
| \QuickQuiz{ |
| Why not place a \co{barrier()} call immediately before |
| a plain store to prevent the compiler from inventing stores? |
| }\QuickQuizAnswer{ |
| Because it would not work. |
| Although the compiler would be prevented from inventing a |
| store prior to the \co{barrier()}, nothing would prevent |
| it from inventing a store between that \co{barrier()} and |
| the plain store. |
| }\QuickQuizEnd |
| |
| Please note that all of these shared-memory shenanigans can instead be |
| avoided by avoiding \IXpl{data race} on plain accesses, as described in |
| \cref{sec:toolsoftrade:Avoiding Data Races}. |
| After all, if there are no data races, then each and every one of the |
| compiler optimizations mentioned above is perfectly safe. |
| But for code containing data races, this list is subject to change |
| without notice as compiler optimizations continue becoming increasingly |
| aggressive. |
| |
| In short, use of \co{READ_ONCE()}, \co{WRITE_ONCE()}, \co{barrier()}, |
| \co{volatile}, and other primitives called out in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| on |
| \cpageref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet} |
| are valuable tools in preventing the compiler from |
| optimizing your parallel algorithm out of existence. |
Compilers are starting to provide other mechanisms for avoiding
load and store tearing, for example, \co{memory_order_relaxed}
atomic loads and stores; however, work is still
needed~\cite{JonathanCorbet2016C11atomics}.
| In addition, compiler issues aside, \co{volatile} is still needed |
| to avoid fusing and invention of accesses, including C11 atomic accesses. |
| |
Please note that it is possible to overdo the use of \co{READ_ONCE()} and
\co{WRITE_ONCE()}.
| For example, if you have prevented a given variable from changing |
| (perhaps by holding the lock guarding all updates to that |
| variable), there is no point in using \co{READ_ONCE()}. |
| Similarly, if you have prevented any other CPUs or threads from |
| reading a given variable (perhaps because you are initializing |
| that variable before any other CPU or thread has access to it), |
| there is no point in using \co{WRITE_ONCE()}. |
| However, in my experience, developers need to use things like |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} more often than they think that |
| they do, and the overhead of unnecessary uses is quite low. |
| In contrast, the penalty for failing to use them when needed can be quite high. |
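
For example, in the following sketch, the hypothetical lock
\co{mylock} is assumed to guard all accesses to the shared variable
\co{a}, so that plain C-language accesses suffice within the critical
section:

\begin{VerbatimN}
spin_lock(&mylock);  /* All accesses to a hold mylock, */
r1 = a;              /* so these plain accesses cannot */
a = r1 + 1;          /* participate in data races.     */
spin_unlock(&mylock);
\end{VerbatimN}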
| |
| \subsection{Address- and Data-Dependency Difficulties} |
| \label{sec:memorder:Address- and Data-Dependency Difficulties} |
| \OriginallyPublished{sec:memorder:Address- and Data-Dependency Difficulties}{Address- and Data-Dependency Difficulties}{the Linux kernel}{PaulEMcKenney2014rcu-dereference} |
| |
| The low overheads of the \IXalth{address}{address}{dependency} |
| and \IXalth{data dependencies}{data}{dependency} discussed in |
| \cref{sec:memorder:Address Dependencies,% |
| sec:memorder:Data Dependencies}, |
| respectively, makes their use extremely attractive. |
| Unfortunately, compilers do not understand either address or data |
| dependencies, although there are efforts underway to teach them, or at |
| the very least, standardize the process of teaching |
| them~\cite{PaulEMcKennneyConsumeP0190R4,PaulEMcKenney2017markconsumeP0462R1}. |
| In the meantime, it is necessary to be very careful in order to prevent |
| your compiler from breaking your dependencies. |
| |
| \subsubsection{Give your dependency chain a good start} |
The load that heads your dependency chain must use proper
ordering, for example \co{rcu_dereference()} or \co{READ_ONCE()},
as sketched following this list.
Failure to follow this rule can have serious side effects:
| |
| \begin{enumerate} |
| \item On DEC Alpha, a dependent load might not be ordered with |
| the load heading the dependency chain, as described in |
| \cref{sec:memorder:Alpha}. |
| \item If the load heading the dependency chain is a |
| C11 non-volatile \co{memory_order_relaxed} load, |
| the compiler could omit the load, for example, by using a value |
| that it loaded in the past. |
| \item If the load heading the dependency chain is a plain load, |
| the compiler can omit the load, again by using a value |
| that it loaded in the past. |
| Worse yet, it could load twice instead of once, so that |
| different parts of your code use different values---and |
| compilers really do this, especially when under register |
| pressure. |
| \item The value loaded by the head of the dependency chain must |
| be a pointer. |
| In theory, yes, you could load an integer, perhaps to use |
| it as an array index. |
| In practice, the compiler knows too much about integers, |
| and thus has way too many opportunities to break your |
| dependency chain~\cite{PaulEMcKennneyConsumeP0190R4}. |
| \end{enumerate} |
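
For example, the following sketch properly heads its dependency chain,
assuming an RCU-protected global pointer \co{gp} as in the listings
that follow:

\begin{VerbatimN}
p = rcu_dereference(gp);  /* Heads the chain properly. */
if (p)
        do_something_with(*p);  /* Ordered after the load from gp. */
\end{VerbatimN}

In contrast, heading the chain with a plain load provides no ordering
whatsoever:

\begin{VerbatimN}
p = gp;  /* BUG: Can be omitted, repeated, or torn!!! */
if (p)
        do_something_with(*p);
\end{VerbatimN}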
| |
| \subsubsection{Avoid arithmetic dependency breakage} |
| Although it is just fine to do some arithmetic operations on a pointer in |
| your dependency chain, you need to be careful to avoid giving the |
| compiler too much information. |
| After all, if the compiler learns enough to determine the exact value |
| of the pointer, it can use that exact value instead of the pointer itself. |
| As soon as the compiler does that, the dependency is broken and all |
| ordering is lost. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Breakable Dependencies With Comparisons] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| int reserve_int; |
| int *gp; |
| int *p; |
| |
| p = rcu_dereference(gp); |
| if (p == &reserve_int) \lnlbl[cmp] |
| handle_reserve(p); \lnlbl[handle] |
| do_something_with(*p); /* buggy! */ |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Breakable Dependencies With Comparisons} |
| \label{lst:memorder:Breakable Dependencies With Comparisons} |
| \end{listing} |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Broken Dependencies With Comparisons] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| int reserve_int; |
| int *gp; |
| int *p; |
| |
| p = rcu_dereference(gp); \lnlbl[deref1] |
| if (p == &reserve_int) { |
| handle_reserve(&reserve_int); |
| do_something_with(reserve_int); /* buggy! */ \lnlbl[deref2] |
| } else { |
| do_something_with(*p); /* OK! */ |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Broken Dependencies With Comparisons} |
| \label{lst:memorder:Broken Dependencies With Comparisons} |
| \end{listing} |
| |
| \begin{enumerate} |
| \item Although it is permissible to compute offsets from a |
| pointer, these offsets must not result in total cancellation. |
| For example, given a \co{char} pointer \co{cp}, |
| \co{cp-(uintptr_t)cp} will cancel and can allow the compiler |
| to break your dependency chain. |
| On the other hand, canceling offset values with each other |
| is perfectly safe and legal. |
For example, if \co{a} and \co{b} are equal, \co{cp+a-b}
is an identity function, including preserving the dependency,
as sketched following this list.
| \item Comparisons can break dependencies. |
| \Cref{lst:memorder:Breakable Dependencies With Comparisons} |
| shows how this can happen. |
| Here global pointer \co{gp} points to a dynamically allocated |
| integer, but if memory is low, it might instead point to |
| the \co{reserve_int} variable. |
| \begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons] |
| This \co{reserve_int} case might need special handling, as |
| shown on \clnref{cmp,handle} of the listing. |
| \end{fcvref} |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Comparisons] |
| But the compiler could reasonably transform this code into |
| the form shown in |
| \cref{lst:memorder:Broken Dependencies With Comparisons}, |
| especially on systems where instructions with absolute |
| addresses run faster than instructions using addresses |
| supplied in registers. |
| However, there is clearly no ordering between the pointer |
| load on \clnref{deref1} and the dereference on \clnref{deref2}. |
| Please note that this is simply an example: |
| There are a great many other ways to break dependency chains |
| with comparisons. |
| \end{fcvref} |
| \end{enumerate} |
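
The following deliberately contrived sketch illustrates both cases,
assuming a hypothetical RCU-protected \co{char} pointer \co{gcp},
offsets \co{a} and \co{b} known to be equal, and an unrelated
\co{char} pointer \co{base}:

\begin{VerbatimN}
cp = rcu_dereference(gcp);
r1 = *(cp + a - b);  /* OK: Offsets cancel each other. */
cq = cp - (uintptr_t)cp + (uintptr_t)base;
r2 = *cq;  /* BUG: cp cancels totally!!! */
\end{VerbatimN}

Although \co{cq} is numerically equal to \co{base}, the total
cancellation of \co{cp} means that \co{cq} need not carry a dependency,
so that the load into \co{r2} is not necessarily ordered after the
\co{rcu_dereference()}.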
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| \begin{fcvref}[ln:memorder:Breakable Dependencies With Comparisons] |
| Why can't you simply dereference the pointer before comparing it |
| to \co{&reserve_int} on \clnref{cmp} of |
| \cref{lst:memorder:Breakable Dependencies With Comparisons}? |
| \end{fcvref} |
| }\QuickQuizAnswerB{ |
First, it might be necessary to invoke
\co{handle_reserve()} before \co{do_something_with()}.
| |
| But more relevant to memory ordering, the compiler is often within |
| its rights to hoist the comparison ahead of the dereferences, |
| which would allow the compiler to use \co{&reserve_int} instead |
| of the variable \co{p} that the hardware has tagged with |
| a dependency. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But it should be safe to compare two pointer variables, right? |
| After all, the compiler doesn't know the value |
| of either, so how can it possibly learn anything from the |
| comparison? |
| }\QuickQuizAnswerE{ |
| % |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Breakable Dependencies With Non-Constant Comparisons] |
| \begin{VerbatimL} |
| int *gp1; |
| int *p; |
| int *q; |
| |
| p = rcu_dereference(gp1); |
| q = get_a_pointer(); |
| if (p == q) |
| handle_equality(p); |
| do_something_with(*p); |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Breakable Dependencies With Non-Constant Comparisons} |
| \label{lst:memorder:Breakable Dependencies With Non-Constant Comparisons} |
| \end{listing}% |
| % |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Broken Dependencies With Non-Constant Comparisons] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| int *gp1; |
| int *p; |
| int *q; |
| |
| p = rcu_dereference(gp1); \lnlbl[p] |
| q = get_a_pointer(); |
| if (p == q) { |
| handle_equality(q); |
| do_something_with(*q); \lnlbl[q] |
| } else { |
| do_something_with(*p); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Broken Dependencies With Non-Constant Comparisons} |
| \label{lst:memorder:Broken Dependencies With Non-Constant Comparisons} |
| \end{listing}% |
| % |
| Unfortunately, the compiler really can learn enough to |
| break your dependency chain, for example, as shown in |
| \cref{lst:memorder:Breakable Dependencies With Non-Constant Comparisons}. |
| The compiler is within its rights to transform this code |
| into that shown in |
| \cref{lst:memorder:Broken Dependencies With Non-Constant Comparisons}, |
| and might well make this transformation due to register pressure |
| if \co{handle_equality()} was inlined and needed a lot of registers. |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Non-Constant Comparisons] |
| \Clnref{q} of this transformed code uses \co{q}, which although |
| equal to \co{p}, is not necessarily tagged by the hardware as |
| carrying a dependency. |
| Therefore, this transformed code does not necessarily guarantee |
| that \clnref{q} is ordered after \clnref{p}.\footnote{ |
| Kudos to \ppl{Linus}{Torvalds} for providing this example.} |
| \end{fcvref} |
| }\QuickQuizEndE |
| } |
| |
| Note that a series of inequality comparisons might, when taken together, |
| give the compiler enough information to determine the exact value of |
| the pointer, at which point the dependency is broken. |
| Furthermore, the compiler might be able to combine information from |
| even a single inequality comparison with other information to learn |
| the exact value, again breaking the dependency. |
| Pointers to elements in arrays are especially susceptible to this latter |
| form of dependency breakage. |
| |
| \subsubsection{Safe comparison of dependent pointers} |
| It turns out that there are several safe ways to compare dependent |
| pointers: |
| |
| \begin{enumerate} |
| \item Comparisons against the \co{NULL} pointer. |
| In this case, all the compiler can learn is that the pointer |
| is \co{NULL}, in which case you are not allowed to |
| dereference it anyway. |
| \item The dependent pointer is never dereferenced, whether before or |
| after the comparison. |
| \item The dependent pointer is compared to a pointer that references |
| objects that were last modified a very long time ago, where |
| the only unconditionally safe value of ``a very long time ago'' is |
| ``at compile time''. |
| The key point is that something other than the address or data |
| dependency guarantees ordering. |
| \item Comparisons between two pointers, each of which carries |
| an appropriate dependency. |
For example, you have a pair of pointers, each carrying a
dependency, to data structures each containing a lock, and you
want to avoid \IX{deadlock} by acquiring the locks in address
order, as sketched following this list.
| \item The comparison is not-equal, and the compiler does not have |
| enough other information to deduce the value of the |
| pointer carrying the dependency. |
| \end{enumerate} |
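
For example, the deadlock-avoidance idiom mentioned in the fourth item
might look as follows, assuming a hypothetical structure type
containing a \co{lock} field along with a pair of RCU-protected global
pointers \co{gp1} and \co{gp2}:

\begin{VerbatimN}
p = rcu_dereference(gp1);
q = rcu_dereference(gp2);
if (p < q) {  /* Both p and q carry dependencies, */
        spin_lock(&p->lock);  /* so comparing them is safe. */
        spin_lock(&q->lock);
} else if (p > q) {
        spin_lock(&q->lock);
        spin_lock(&p->lock);
} else {
        spin_lock(&p->lock);  /* p == q: Just the one lock. */
}
\end{VerbatimN}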
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Broken Dependencies With Pointer Comparisons] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| struct foo { \lnlbl[foo:b] |
| int a; |
| int b; |
| int c; |
| }; \lnlbl[foo:e] |
| struct foo *gp1; \lnlbl[gp1] |
| struct foo *gp2; \lnlbl[gp2] |
| |
| void updater(void) \lnlbl[upd:b] |
| { |
| struct foo *p; |
| |
| p = malloc(sizeof(*p)); \lnlbl[upd:alloc] |
| BUG_ON(!p); \lnlbl[upd:bug] |
| p->a = 42; \lnlbl[upd:init:a] |
| p->b = 43; |
| p->c = 44; \lnlbl[upd:init:c] |
| rcu_assign_pointer(gp1, p); \lnlbl[upd:assign1] |
| WRITE_ONCE(p->b, 143); \lnlbl[upd:upd:b] |
| WRITE_ONCE(p->c, 144); \lnlbl[upd:upd:c] |
| rcu_assign_pointer(gp2, p); \lnlbl[upd:assign2] |
| } \lnlbl[upd:e] |
| |
| void reader(void) \lnlbl[read:b] |
| { |
| struct foo *p; |
| struct foo *q; |
| int r1, r2 = 0; |
| |
| p = rcu_dereference(gp2); \lnlbl[read:gp2] |
| if (p == NULL) \lnlbl[read:nulchk] |
| return; \lnlbl[read:nulret] |
| r1 = READ_ONCE(p->b); \lnlbl[read:pb] |
| q = rcu_dereference(gp1); \lnlbl[read:gp1] |
| if (p == q) { \lnlbl[read:equ] |
| r2 = READ_ONCE(p->c); \lnlbl[read:pc] |
| } |
| do_something_with(r1, r2); |
| } \lnlbl[read:e] |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Broken Dependencies With Pointer Comparisons} |
| \label{lst:memorder:Broken Dependencies With Pointer Comparisons} |
| \end{listing} |
| |
| Pointer comparisons can be quite tricky, and so it is well worth working |
| through the example shown in |
| \cref{lst:memorder:Broken Dependencies With Pointer Comparisons}. |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons] |
| This example uses a simple \co{struct foo} shown on \clnrefrange{foo:b}{foo:e} |
| and two global pointers, \co{gp1} and \co{gp2}, shown on \clnref{gp1,gp2}, |
| respectively. |
| This example uses two threads, namely \co{updater()} on |
| \clnrefrange{upd:b}{upd:e} and \co{reader()} on \clnrefrange{read:b}{read:e}. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:upd] |
| The \co{updater()} thread allocates memory on \clnref{alloc}, and complains |
| bitterly on \clnref{bug} if none is available. |
| \Clnrefrange{init:a}{init:c} initialize the newly allocated structure, |
| and then \clnref{assign1} assigns the pointer to \co{gp1}. |
\Clnref{upd:b,upd:c} then update two of the structure's fields, and do
so \emph{after} \clnref{assign1} has made those fields visible to readers.
| Please note that unsynchronized update of reader-visible fields |
| often constitutes a bug. |
| Although there are legitimate use cases doing just this, such use cases |
| require more care than is exercised in this example. |
| |
| Finally, \clnref{assign2} assigns the pointer to \co{gp2}. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read] |
| The \co{reader()} thread first fetches \co{gp2} on \clnref{gp2}, with |
| \clnref{nulchk,nulret} checking for \co{NULL} and returning if so. |
| \Clnref{pb} fetches field \co{->b} and |
| \clnref{gp1} fetches \co{gp1}. |
| If \clnref{equ} sees that the pointers fetched on \clnref{gp2,gp1} |
| are equal, \clnref{pc} fetches \co{p->c}. |
| Note that \clnref{pc} uses pointer \co{p} fetched on \clnref{gp2}, not |
| pointer \co{q} fetched on \clnref{gp1}. |
| |
| But this difference might not matter. |
| An equals comparison on \clnref{equ} might lead the compiler to (incorrectly) |
| conclude that both pointers are equivalent, when in fact they carry |
| different dependencies. |
| This means that the compiler might well transform \clnref{pc} to instead |
| be \co{r2 = READ_ONCE(q->c)}, which might well cause the value 44 to be loaded |
| instead of the expected value 144. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read] |
| But doesn't the condition in \clnref{equ} supply a control dependency |
| that would keep \clnref{pc} ordered after \clnref{gp1}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:memorder:Broken Dependencies With Pointer Comparisons:read] |
| Yes, but no. |
| Yes, there is a control dependency, but control dependencies do |
| not order later loads, only later stores. |
| If you really need ordering, you could place an \co{smp_rmb()} |
| between \clnref{equ,pc}. |
| Or better yet, have \co{updater()} |
| allocate two structures instead of reusing the structure. |
| For more information, see |
| \cref{sec:memorder:Control-Dependency Calamities}. |
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| In short, great care is required to ensure that dependency |
| chains in your source code are still dependency chains in the |
| compiler-generated assembly code. |
| |
| \subsection{Control-Dependency Calamities} |
| \label{sec:memorder:Control-Dependency Calamities} |
| |
| The \IXalth{control dependencies}{control}{dependency} described in |
| \cref{sec:memorder:Control Dependencies} |
| are attractive due to their low overhead, but are also especially |
| tricky because current compilers do not understand them and can easily |
| break them. |
| The rules and examples in this section are intended to help you |
| prevent your compiler's ignorance from breaking your code. |
| |
| A load-load control dependency requires a full \IXh{read}{memory barrier}, |
| not simply a data dependency barrier. |
| Consider the following bit of code: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) { |
| <data dependency barrier> |
| q = READ_ONCE(y); |
| } |
| \end{VerbatimN} |
| |
| This will not have the desired effect because there is no actual data |
| dependency, but rather a control dependency that the CPU may short-circuit |
| by attempting to predict the outcome in advance, so that other CPUs see |
| the load from~\co{y} as having happened before the load from~\co{x}. |
| In such a case what's actually required is: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) { |
| <read barrier> |
| q = READ_ONCE(y); |
| } |
| \end{VerbatimN} |
| |
| However, stores are not speculated. |
| This means that ordering \emph{is} provided for load-store control |
| dependencies, as in the following example: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) |
| WRITE_ONCE(y, 1); |
| \end{VerbatimN} |
| |
| Control dependencies pair normally with other types of ordering operations. |
| That said, please note that neither \co{READ_ONCE()} nor \co{WRITE_ONCE()} |
| are optional! |
| Without the \co{READ_ONCE()}, the compiler might fuse the load |
| from~\co{x} with other loads from~\co{x}. |
| Without the \co{WRITE_ONCE()}, the compiler might fuse the store |
| to~\co{y} with other stores to~\co{y}, or, worse yet, read the |
| value, compare it, and only conditionally do the store. |
| Any of these can result in highly counter-intuitive effects on ordering. |
| |
| Worse yet, if the compiler is able to prove (say) that the value of |
| variable~\co{x} is always non-zero, it would be well within its rights |
| to optimize the original example by eliminating the \qco{if} statement |
| as follows: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| WRITE_ONCE(y, 1); /* BUG: CPU can reorder!!! */ |
| \end{VerbatimN} |
| |
| \QuickQuiz{ |
| But there is a \co{READ_ONCE()}, so how can the compiler |
| prove anything about the value of \co{q}? |
| }\QuickQuizAnswer{ |
| Given the simple \co{if} statement comparing against zero, |
| it is hard to imagine the compiler proving anything. |
| But suppose that later code executed a division by \co{q}. |
| Because division by zero is undefined behavior, as of 2023, |
| many compilers will assume that the value of \co{q} must |
| be non-zero, and will thus remove that \co{if} statement, |
| thus unconditionally executing the \co{WRITE_ONCE()}, in |
| turn destroying the control dependency. |
| |
| There are some who argue (correctly, in Paul's view) that |
| back-propagating undefined behavior across volatile accesses |
| constitutes a compiler bug, but many compiler writers insist |
| that this is not a bug, but rather a valuable optimization. |
| }\QuickQuizEnd |
| |
| It is tempting to try to enforce ordering on identical stores on both |
| branches of the \qco{if} statement as follows: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) { |
| barrier(); |
| WRITE_ONCE(y, 1); |
| do_something(); |
| } else { |
| barrier(); |
| WRITE_ONCE(y, 1); |
| do_something_else(); |
| } |
| \end{VerbatimN} |
| |
| Unfortunately, current compilers will transform this as follows at high |
| optimization levels: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| barrier(); |
| WRITE_ONCE(y, 1); /* BUG: No ordering!!! */ |
| if (q) |
| do_something(); |
| else |
| do_something_else(); |
| \end{VerbatimN} |
| |
| Now there is no conditional between the load from~\co{x} and the store |
| to~\co{y}, which means that the CPU is within its rights to reorder them: |
| The conditional is absolutely required, and must be present in the |
| assembly code even after all compiler optimizations have been applied. |
| Therefore, if you need ordering in this example, you need explicit |
| memory-ordering operations, for example, a \IX{release store}: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) { |
| smp_store_release(&y, 1); |
| do_something(); |
| } else { |
| smp_store_release(&y, 1); |
| do_something_else(); |
| } |
| \end{VerbatimN} |
| |
| The initial \co{READ_ONCE()} is still required to prevent the compiler from |
| guessing the value of~\co{x}. |
| In addition, you need to be careful what you do with the local variable~% |
| \co{q}, |
| otherwise the compiler might be able to guess its value and again remove |
| the needed conditional. |
| For example: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q % MAX) { |
| WRITE_ONCE(y, 1); |
| do_something(); |
| } else { |
| WRITE_ONCE(y, 2); |
| do_something_else(); |
| } |
| \end{VerbatimN} |
| |
| If \co{MAX} is defined to be~1, then the compiler knows that \co{(q\%MAX)} is |
| equal to zero, in which case the compiler is within its rights to |
| transform the above code into the following: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| WRITE_ONCE(y, 2); |
| do_something_else(); |
| \end{VerbatimN} |
| |
| Given this transformation, the CPU is not required to respect the ordering |
| between the load from variable~\co{x} and the store to variable~\co{y}. |
| It is tempting to add a \co{barrier()} to constrain the compiler, |
| but this does not help. |
| The conditional is gone, and the \co{barrier()} won't bring it back. |
| Therefore, if you are relying on this ordering, you should make sure |
| that \co{MAX} is greater than one, perhaps as follows: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| BUILD_BUG_ON(MAX <= 1); |
| if (q % MAX) { |
| WRITE_ONCE(y, 1); |
| do_something(); |
| } else { |
| WRITE_ONCE(y, 2); |
| do_something_else(); |
| } |
| \end{VerbatimN} |
| |
| Please note once again that the stores to~\co{y} differ. |
| If they were identical, as noted earlier, the compiler could pull this |
| store outside of the \qco{if} statement. |
| |
| You must also avoid excessive reliance on boolean short-circuit evaluation. |
| Consider this example: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q || 1 > 0) |
| WRITE_ONCE(y, 1); |
| \end{VerbatimN} |
| |
| Because the first condition cannot fault and the second condition is |
always true, the compiler can transform this example as follows,
| defeating the control dependency: |
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| WRITE_ONCE(y, 1); |
| \end{VerbatimN} |
| |
| This example underscores the need to ensure that the compiler cannot |
| out-guess your code. |
| Never forget that, although \co{READ_ONCE()} does force |
| the compiler to actually emit code for a given load, it does not force |
| the compiler to use the value loaded. |
| |
| In addition, control dependencies apply only to the then-clause and |
| else-clause of the if-statement in question. |
In particular, they do
not necessarily apply to code following the if-statement:
| |
| \begin{VerbatimN} |
| q = READ_ONCE(x); |
| if (q) |
| WRITE_ONCE(y, 1); |
| else |
| WRITE_ONCE(y, 2); |
| WRITE_ONCE(z, 1); /* BUG: No ordering. */ |
| \end{VerbatimN} |
| |
| It is tempting to argue that there in fact is ordering because the |
| compiler cannot reorder volatile accesses and also cannot reorder |
| the writes to~\co{y} with the condition. |
| Unfortunately for this line |
| of reasoning, the compiler might compile the two writes to~\co{y} as |
| conditional-move instructions, as in this fanciful pseudo-assembly |
| language: |
| |
| \begin{VerbatimN} |
| ld r1,x |
| cmp r1,$0 |
| cmov,ne r4,$1 |
| cmov,eq r4,$2 |
| st r4,y |
| st $1,z |
| \end{VerbatimN} |
| |
| A weakly ordered CPU would have no dependency of any sort between the load |
| from~\co{x} and the store to~\co{z}. |
| The control dependencies would extend |
| only to the pair of \co{cmov} instructions and the store depending on them. |
| In short, control dependencies apply only to the stores in the \qco{then} |
| and \qco{else} of the \qco{if} in question (including functions invoked by |
| those two clauses), and not necessarily to code following that \qco{if}. |
| |
| Finally, control dependencies do \emph{not} provide cumulativity.\footnote{ |
| Refer to \cref{sec:memorder:Cumulativity} for |
| the meaning of cumulativity.} |
| This is demonstrated by two related litmus tests, namely |
| \cref{lst:memorder:LB Litmus Test With Control Dependency,% |
| lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)} |
| with the initial values |
| of~\co{x} and~\co{y} both being zero. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-LB+o-cgt-o+o-cgt-o@whole.fcv} |
| \caption{LB Litmus Test With Control Dependency} |
| \label{lst:memorder:LB Litmus Test With Control Dependency} |
| \end{listing} |
| |
| The \co{exists} clause in the two-thread example of |
| \cref{lst:memorder:LB Litmus Test With Control Dependency} |
| (\path{C-LB+o-cgt-o+o-cgt-o.litmus}) |
| will never trigger. |
| If control dependencies guaranteed cumulativity (which they do |
| not), then adding a thread to the example as in |
| \cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)} |
| (\path{C-WWC+o-cgt-o+o-cgt-o+o.litmus}) |
| would guarantee the related \co{exists} clause never to trigger. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/litmus/C-WWC+o-cgt-o+o-cgt-o+o@whole.fcv} |
| \caption{WWC Litmus Test With Control Dependency (Cumulativity?)} |
| \label{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)} |
| \end{listing} |
| |
| But because control dependencies do \emph{not} provide cumulativity, the |
| \co{exists} clause in the three-thread litmus test can trigger. |
| If you need the three-thread example to provide ordering, you will need |
| \co{smp_mb()} between the load and store in \co{P0()}, |
| that is, just before or just after the \qco{if} statements. |
| Furthermore, the original two-thread example is very fragile and should be avoided. |
| |
| \QuickQuiz{ |
| Can't you instead add an \co{smp_mb()} to \co{P1()} in |
| \cref{lst:memorder:WWC Litmus Test With Control Dependency (Cumulativity?)}? |
| }\QuickQuizAnswer{ |
| Not given the Linux kernel memory model. |
| (Try it!) |
| However, you can instead replace \co{P0()}'s |
| \co{WRITE_ONCE()} with \co{smp_store_release()}, |
| which usually has less overhead than does adding an \co{smp_mb()}. |
| }\QuickQuizEnd |
| |
| The following list of rules summarizes the lessons of this section: |
| |
| \begin{enumerate} |
| \item Compilers do not understand control dependencies, so it is |
| your job to make sure that the compiler cannot break your code. |
| |
| \item Control dependencies can order prior loads against later stores. |
| However, they do \emph{not} guarantee any other sort of ordering: |
| Not prior loads against later loads, nor prior stores against |
| later anything. |
| If you need these other forms of ordering, use \co{smp_rmb()}, |
| \co{smp_wmb()}, or, in the case of prior stores and later loads, |
| \co{smp_mb()}. |
| |
| \item If both legs of the \qco{if} statement begin with identical stores |
| to the same variable, then the control dependency will not order |
| those stores. |
| If ordering is needed, precede both of them with \co{smp_mb()} or |
| use \co{smp_store_release()}. |
| Please note that it is \emph{not} sufficient to use \co{barrier()} |
at the beginning of each leg of the \qco{if} statement because, as shown
| by the example above, optimizing compilers can destroy the control |
| dependency while respecting the letter of the \co{barrier()} law. |
| |
| \item Control dependencies require at least one run-time conditional |
| between the prior load and the subsequent store, and this |
| conditional must involve the prior load. |
| If the compiler is able to optimize the conditional away, it |
| will have also optimized away the ordering. |
| Careful use of \co{READ_ONCE()} and \co{WRITE_ONCE()} can help |
| to preserve the needed conditional. |
| |
| \item Control dependencies require that the compiler avoid reordering |
| the dependency into nonexistence. |
| Careful use of \co{READ_ONCE()}, \co{atomic_read()}, or |
| \co{atomic64_read()} can help to preserve your control |
| dependency. |
| |
| \item Control dependencies apply only to the \qco{then} and |
| \qco{else} of the \qco{if} containing the control |
| dependency, including any functions that these two clauses call. |
| Control dependencies do \emph{not} apply to code following the |
| end of the \qco{if} statement containing the control dependency. |
| |
| \item Control dependencies pair normally with other types of |
| memory-ordering operations. |
| |
| \item Control dependencies do \emph{not} provide cumulativity. |
| If you need cumulativity, use something that provides it, |
| such as \co{smp_store_release()} or \co{smp_mb()}. |
| \end{enumerate} |
| |
| Again, many popular languages were designed with single-threaded use |
| in mind. |
| Successful multithreaded use of these languages requires you to pay |
| special attention to your memory references and dependencies. |
| |
| \section{Higher-Level Primitives} |
| \label{sec:memorder:Higher-Level Primitives} |
| % |
| \epigraph{Method will teach you to win time.} |
| {Johann Wolfgang von Goethe} |
| |
| The answer to one of the quick quizzes in |
| \cref{sec:formal:Axiomatic Approaches and Locking} |
| demonstrated exponential speedups due to verifying programs |
| modeled at higher levels of abstraction. |
| This section will look into how higher levels of abstraction can |
| also provide a deeper understanding of the synchronization primitives |
| themselves. |
| \Cref{sec:memorder:Memory Allocation} |
| takes a look at memory allocation, |
| \cref{sec:memorder:Locking} |
| examines the surprisingly varied semantics of locking, and |
| \cref{sec:memorder:RCU} |
| digs more deeply into RCU\@. |
| |
| \subsection{Memory Allocation} |
| \label{sec:memorder:Memory Allocation} |
| |
| \Cref{sec:SMPdesign:Parallel Fastpath for Resource Allocation} |
| touched upon memory allocation, and this section expands upon the relevant |
| memory-ordering issues. |
| |
| The key requirement is that any access executed on a given block of |
| memory before freeing that block must be ordered before any access |
| executed after that same block is reallocated. |
| It would after all be a cruel and unusual memory-allocator bug if a store |
| preceding the free were to be reordered after another store following |
| the reallocation! |
| However, it would also be cruel and unusual to require developers to use |
| \co{READ_ONCE()} and \co{WRITE_ONCE()} to access dynamically allocated |
| memory. |
| Full ordering must therefore be provided for plain accesses, in spite of |
| all the shared-variable shenanigans called out in |
| \cref{sec:toolsoftrade:Shared-Variable Shenanigans}. |
| |
Of course, each CPU sees its own accesses in order and the compiler
has always fully accounted for intra-CPU shenanigans, give or take
the occasional compiler bug.
| These facts are what enables the lockless fastpaths in |
| \co{memblock_alloc()} and \co{memblock_free()}, which are shown in |
| \cref{lst:SMPdesign:Allocator-Cache Allocator Function,% |
| lst:SMPdesign:Allocator-Cache Free Function}, |
| respectively. |
| However, this is also why the developer is responsible for providing |
| appropriate ordering (for example, by using \co{smp_store_release()}) |
| when publishing a pointer to a newly allocated block of memory. |
| After all, in the CPU-local case, the allocator has not necessarily |
| provided any cross-CPU ordering. |
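
For example, the publication step might look like the following
sketch, which assumes a hypothetical global pointer \co{gp} and a
\co{->data} field within the newly allocated block:

\begin{VerbatimN}
p = memblock_alloc();  /* Lockless fastpath. */
if (p) {
        p->data = 42;  /* Plain initialization access... */
        smp_store_release(&gp, p);  /* ...ordered before publication. */
}
\end{VerbatimN}

A reader traversing \co{gp} would then use \co{rcu_dereference()} or
\co{smp_load_acquire()} in order to respect this ordering.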
| |
| This means that the allocator must provide ordering when rebalancing |
| its per-thread pools. |
| This ordering is provided by the calls to \co{spin_lock()} and |
| \co{spin_unlock()} from \co{memblock_alloc()} and \co{memblock_free()}. |
| For any block that has migrated from one thread to another, the old |
| thread will have executed \co{spin_unlock(&globalmem.mutex)} after |
| placing the block in the \co{globalmem} pool, and the new thread will |
| have executed \co{spin_lock(&globalmem.mutex)} before moving that |
| block to its per-thread pool. |
| This \co{spin_unlock()} and \co{spin_lock()} ensures that both the |
| old and new threads see the old thread's accesses as having happened |
| before those of the new thread. |
| |
| \QuickQuiz{ |
| But doesn't PowerPC have weak unlock-lock ordering properties |
| within the Linux kernel, allowing a write before the unlock to |
| be reordered with a read after the lock? |
| }\QuickQuizAnswer{ |
| Yes, but only from the perspective of a third thread not holding |
| that lock. |
| In contrast, memory allocators need only concern themselves with |
| the two threads migrating the memory. |
| It is after all the developer's responsibility to properly |
| synchronize with any other threads that need access to the newly |
| migrated block of memory. |
| }\QuickQuizEnd |
| |
| Therefore, the ordering required by conventional uses of memory allocation |
| can be provided solely by non-fastpath locking, allowing the fastpath to |
| remain synchronization-free. |
| |
| \subsection{Locking} |
| \label{sec:memorder:Locking} |
| |
| Locking is a well-known synchronization primitive with which the |
| parallel-programming community has had decades of experience. |
| As such, locking's semantics are quite simple. |
| |
| That is, they are quite simple until you start trying to mathematically |
| model them. |
| |
| The simple part is that any CPU or thread holding a given lock is |
| guaranteed to see any accesses executed by CPUs or threads while they |
| were previously holding that same lock. |
| Similarly, any CPU or thread holding a given lock is guaranteed not |
| to see accesses that will be executed by other CPUs or threads while |
| subsequently holding that same lock. |
| And what else is there? |
| |
| As it turns out, quite a bit: |
| |
| \begin{enumerate} |
| \item Are CPUs, threads, or compilers allowed to pull memory accesses |
| into a given lock-based critical section? |
| \item Will a CPU or thread holding a given lock also be guaranteed |
| to see accesses executed by CPUs and threads before they last |
| acquired that same lock, and vice versa? |
| \item Suppose that a given CPU or thread executes one access |
| (call it ``A''), releases a lock, reacquires that same lock, |
| then executes another access (call it ``B'')\@. |
| Is some other CPU or thread not holding that lock guaranteed to |
| see A and B in order? |
| \item As above, but with the lock reacquisition carried out by some |
| other CPU or thread? |
| \item As above, but with the lock reacquisition being some other lock? |
| \item What ordering guarantees are provided by \co{spin_is_locked()}? |
| \end{enumerate} |
| |
| The reaction to some or even all of these questions might well be ``Why |
| would anyone do \emph{that}?'' |
| However, any complete mathematical definition of locking must have |
| answers to all of these questions. |
| Therefore, the following sections address these questions in the context |
| of the Linux kernel. |
| |
| \subsubsection{Accesses Into Critical Sections?} |
| \label{sec:memorder:Accesses Into Critical Sections?} |
| |
| Can memory accesses be reordered into lock-based critical sections? |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-before-into@whole.fcv} |
| \caption{Prior Accesses Into Critical Section (Ordering?)} |
| \label{lst:memorder:Prior Accesses Into Critical Section (Ordering?)} |
| \end{listing} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-after-into@whole.fcv} |
| \caption{Subsequent Accesses Into Critical Section (Ordering?)} |
| \label{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)} |
| \end{listing} |
| |
| Within the context of the Linux-kernel memory model, the simple answer |
| is ``yes''. |
| This may be verified by running the litmus tests shown in |
| \cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?),% |
| lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)} |
| (\path{C-Lock-before-into.litmus} and \path{C-Lock-after-into.litmus}, |
| respectively), both of which will yield the \co{Sometimes} result. |
| This result indicates that the \co{exists} clause can be satisfied, that |
is, that the final values of both \co{P0()}'s and \co{P1()}'s \co{r1} variables
can be zero.
| This means that neither \co{spin_lock()} nor \co{spin_unlock()} |
| are required to act as a \IXh{full}{memory barrier}. |
| |
| However, other environments might make other choices. |
| For example, locking implementations that run only on the x86 CPU |
| family will have lock-acquisition primitives that fully order the lock |
| acquisition with any prior and any subsequent accesses. |
| Therefore, on such systems the ordering shown in |
| \cref{lst:memorder:Prior Accesses Into Critical Section (Ordering?)} |
| comes for free. |
| There are x86 lock-release implementations that are weakly ordered, |
| thus failing to provide the ordering shown in |
| \cref{lst:memorder:Subsequent Accesses Into Critical Section (Ordering?)}, |
| but an implementation could nevertheless choose to guarantee this ordering. |
| |
| For their part, weakly ordered systems might well choose to execute |
| the memory-barrier instructions required to guarantee both orderings, |
| possibly simplifying code making advanced use of combinations of locked |
| and lockless accesses. |
| However, as noted earlier, \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model} |
| chooses not to provide these additional |
| orderings, in part to avoid imposing performance penalties on the simpler |
| and more prevalent locking use cases. |
| Instead, the \co{smp_mb__after_spinlock()} and \co{smp_mb__after_unlock_lock()} |
| primitives are provided for those more complex use cases, as discussed |
| in \cref{sec:memorder:Hardware Specifics}. |
| |
| Thus far, this section has discussed only hardware reordering. |
| Can the compiler also reorder memory references into lock-based |
| critical sections? |
| |
| The answer to this question in the context of the Linux kernel is a |
| resounding ``No!'' |
| One reason for this otherwise inexplicable favoring of hardware reordering |
| over compiler optimizations is that the hardware will avoid reordering |
| a page-faulting access into a lock-based critical section. |
| In contrast, compilers have no clue about page faults, and would |
| therefore happily reorder a page fault into a critical section, which |
| could crash the kernel. |
| The compiler is also unable to reliably determine which accesses |
| will result in cache misses, so that compiler reordering into critical |
| sections could also result in excessive lock contention. |
| Therefore, the Linux kernel prohibits the compiler (but not the CPU) |
| from moving accesses into lock-based critical sections. |
| |
| \subsubsection{Accesses Outside of Critical Section?} |
| \label{sec:memorder:Accesses Outside of Critical Section?} |
| |
| If a given CPU or thread holds a given lock, it is guaranteed to see |
| accesses executed during all prior critical sections for that same |
| lock. |
| Similarly, such a CPU or thread is guaranteed not to see accesses |
| that will be executed during all subsequent critical sections for |
| that same lock. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-outside-across@whole.fcv} |
| \caption{Accesses Outside of Critical Sections} |
| \label{lst:memorder:Accesses Outside of Critical Sections} |
| \end{listing} |
| |
| But what about accesses preceding prior critical sections and |
| following subsequent critical sections? |
| |
| This question can be answered for the Linux kernel by referring to |
| \cref{lst:memorder:Accesses Outside of Critical Sections} |
| (\path{C-Lock-outside-across.litmus}). |
Running this litmus test yields the \co{Never} result,
which means that accesses in code leading up to a prior critical section
are also visible to the current CPU or thread holding that same lock.
Similarly, accesses in code following a subsequent critical section
are never visible to the current CPU or thread holding that same lock.
| |
| As a result, the Linux kernel cannot allow accesses to be moved |
| across the entirety of a given critical section. |
| Other environments might well wish to allow such code motion, but please |
| be advised that doing so is likely to yield profoundly counter-intuitive |
| results. |
| |
| In short, the ordering provided by \co{spin_lock()} extends not only |
| throughout the critical section, but also indefinitely beyond the end |
| of that critical section. |
| Similarly, the ordering provided by \co{spin_unlock()} extends not |
| only throughout the critical section, but also indefinitely beyond the |
| beginning of that critical section. |
| |
| \subsubsection{Ordering for Non-Lock Holders?} |
| \label{sec:memorder:Ordering for Non-Lock Holders?} |
| |
| Does a CPU or thread that is not holding a given lock see that lock's |
| critical sections as being ordered? |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-1@whole.fcv} |
| \caption{Accesses Between Same-CPU Critical Sections (Ordering?)} |
| \label{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)} |
| \end{listing} |
| |
| This question can be answered for the Linux kernel by referring to |
| \cref{lst:memorder:Accesses Between Same-CPU Critical Sections (Ordering?)} |
| (\path{C-Lock-across-unlock-lock-1.litmus}), which |
shows an example where \co{P0()} places its write and read in two
| different critical sections for the same lock. |
| Running this litmus test shows that the \co{exists} can be satisfied, |
| which means that the answer is ``no'', and that CPUs can reorder accesses |
| across consecutive critical sections. |
| In other words, not only are \co{spin_lock()} and \co{spin_unlock()} |
| weaker than a full barrier when considered separately, they are also |
| weaker than a full barrier when taken together. |
| |
If the ordering of a given lock's critical sections is to be observed,
then either the observer must hold that lock, or either
\co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()}
must be executed just after the second lock acquisition.
| |
| But what if the two critical sections run on different CPUs or threads? |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-2@whole.fcv} |
| \caption{Accesses Between Different-CPU Critical Sections (Ordering?)} |
| \label{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)} |
| \end{listing} |
| |
| This question is answered for the Linux kernel by referring to |
| \cref{lst:memorder:Accesses Between Different-CPU Critical Sections (Ordering?)} |
| (\path{C-Lock-across-unlock-lock-2.litmus}), |
| in which the first lock acquisition is executed by \co{P0()} and the |
| second lock acquisition is executed by \co{P1()}. |
| Note that \co{P1()} must read \co{x} to reject executions in which |
| \co{P1()} executes before \co{P0()} does. |
| Running this litmus test shows that the \co{exists} can be satisfied, |
| which means that the answer is ``no'', and that CPUs can reorder accesses |
| across consecutive critical sections, even if each of those critical |
| sections runs on a different CPU or thread. |
| |
| \QuickQuiz{ |
| But if there are three critical sections, isn't it true that |
| CPUs not holding the lock will observe the accesses from the |
| first and the third critical section as being ordered? |
| }\QuickQuizAnswer{ |
| No. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock-across-unlock-lock-3@whole.fcv} |
| \caption{Accesses Between Multiple Different-CPU Critical Sections} |
| \label{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections} |
| \end{listing} |
| |
| \Cref{lst:memorder:Accesses Between Multiple Different-CPU Critical Sections} |
| shows an example three-critical-section chain |
(\path{C-Lock-across-unlock-lock-3.litmus}).
| Running this litmus test shows that the \co{exists} clause can |
| still be satisfied, so this additional critical section is still |
| not sufficient to force ordering. |
| |
| However, as the reader can verify, placing an |
| \co{smp_mb__after_spinlock()} after either \co{P1()}'s or |
| \co{P2()}'s lock acquisition does suffice to force ordering. |
| }\QuickQuizEnd |
| |
As before, if the ordering of a given lock's critical sections is to
| be observed, then either the observer must hold that lock or either |
| \co{smp_mb__after_spinlock()} or \co{smp_mb__after_unlock_lock()} must |
| be executed just after \co{P1()}'s lock acquisition. |
| |
| Given that ordering is not guaranteed when both critical sections are |
| protected by the same lock, there is no hope of any ordering guarantee |
| when different locks are used. |
| However, readers are encouraged to construct the corresponding litmus |
| test and see this for themselves. |
| |
| This situation can seem counter-intuitive, but it is rare for code to |
| care. |
| This approach also allows certain weakly ordered systems to implement |
| locks more efficiently. |
| |
| \subsubsection{Ordering for \tco{spin_is_locked()}?} |
| \label{sec:memorder:Ordering for spin-is-locked()?} |
| |
| The Linux kernel's \co{spin_is_locked()} primitive returns |
| \co{true} if the specified lock is held and \co{false} otherwise. |
| Note that \co{spin_is_locked()} returns \co{true} when some other |
| CPU or thread holds the lock, not just when the current CPU or thread |
| holds that lock. |
| This raises the question of what ordering guarantees \co{spin_is_locked()} |
| might provide. |
| |
| In the Linux kernel, the answer has varied over time. |
| Initially, \co{spin_is_locked()} was unordered, but a few interesting |
| use cases motivated strong ordering. |
| Later discussions surrounding the Linux-kernel memory model concluded |
| that \co{spin_is_locked()} should be used only for debugging. |
| Part of the reason for this is that even a fully ordered |
| \co{spin_is_locked()} might return \co{true} because some other CPU or |
| thread was just about to release the lock in question. |
| In this case, there is little that can be learned from that return value |
| of \co{true}, which means that reliable use of \co{spin_is_locked()} |
| is surprisingly complex. |
| Other approaches almost always work better, for example, use of explicit |
| shared variables or the \co{spin_trylock()} primitive. |
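
For example, the following sketch (featuring a hypothetical
\co{mylock} and hypothetical helper functions) relies on
\co{spin_trylock()}, whose return value, unlike that of
\co{spin_is_locked()}, does come with useful guarantees:

\begin{VerbatimN}
if (spin_trylock(&mylock)) {
        do_locked_work();  /* Fully ordered: Lock is held. */
        spin_unlock(&mylock);
} else {
        do_fallback_work();  /* Lock was held at some point */
}                            /* during the attempt. */
\end{VerbatimN}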
| |
| This situation resulted in the current state, namely that |
| \co{spin_is_locked()} provides no ordering guarantees, except that if |
| it returns \co{false}, the current CPU or thread cannot be holding the |
| corresponding lock. |
| |
| \QuickQuiz{ |
| But if \co{spin_is_locked()} returns \co{false}, don't we also |
| know that no other CPU or thread is holding the corresponding |
| lock? |
| }\QuickQuizAnswer{ |
| No. |
| By the time that the code inspects the return value from |
| \co{spin_is_locked()}, some other CPU or thread might well have |
| acquired the corresponding lock. |
| }\QuickQuizEnd |
| |
| \subsubsection{Why Mathematically Model Locking?} |
| \label{sec:memorder:Why Mathematically Model Locking?} |
| |
| Given all these possible choices, why model locking in general? |
| Why not simply model a simple implementation? |
| |
| One reason is modeling performance, as shown in |
| \cref{tab:formal:Locking: Modeling vs. Emulation Time (s)} |
| on |
| \cpageref{tab:formal:Locking: Modeling vs. Emulation Time (s)}. |
| Directly modeling locking in general is orders of magnitude faster |
| than emulating even a trivial implementation. |
| This should be no surprise, given the combinatorial explosion experienced |
| by present-day formal-verification tools with increases in the number of |
| memory accesses executed by the code being modeled. |
| Splitting the modeling at API boundaries can therefore result in |
| combinatorial implosion. |
| |
| Another reason is that a trivial implementation might needlessly constrain |
| either real implementations or real use cases. |
| In contrast, modeling a platonic lock allows the widest variety of |
| implementations while providing specific guidance to locks' users. |
| |
| \subsection{RCU} |
| \label{sec:memorder:RCU} |
| |
| As described in |
| \cref{sec:defer:RCU Fundamentals}, |
| the fundamental property of RCU grace periods is this straightforward |
| two-part guarantee: |
| \begin{enumerate*}[(1)] |
| \item If any part of a given RCU read-side critical section precedes |
| the beginning of a given \IX{grace period}, then the entirety of that |
| critical section precedes the end of that grace period. |
| \item If any part of a given RCU read-side critical section follows |
| the end of a given grace period, then the entirety of that |
| critical section follows the beginning of that grace period. |
| \end{enumerate*} |
| These guarantees are summarized in |
| \cref{fig:memorder:RCU Grace-Period Ordering Guarantees}, |
| where the grace period is denoted by the dashed arrow between the |
| \co{call_rcu()} invocation in the upper right and the corresponding |
| RCU callback invocation in the lower left.\footnote{ |
| For more detail, please see |
| \crefrange{fig:defer:RCU Reader and Later Grace Period}{fig:defer:RCU Reader Within Grace Period} |
| starting on |
| \cpageref{fig:defer:RCU Reader and Later Grace Period}.} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{memorder/RCUGPordering}} |
| \caption{RCU Grace-Period Ordering Guarantees} |
| \label{fig:memorder:RCU Grace-Period Ordering Guarantees} |
| \end{figure} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul@whole.fcv} |
| \caption{RCU Fundamental Property} |
| \label{lst:memorder:RCU Fundamental Property} |
| \end{listing} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+i-rl-o-o-rul@whole.fcv} |
| \caption{RCU Fundamental Property and Reordering} |
| \label{lst:memorder:RCU Fundamental Property and Reordering} |
| \end{listing} |
| |
| In short, an RCU read-side critical section is guaranteed never to |
| completely overlap an RCU grace period, as demonstrated by |
| \cref{lst:memorder:RCU Fundamental Property} |
| (\path{C-SB+o-rcusync-o+rl-o-o-rul.litmus}). |
| Either or neither of the \co{r2} registers can have the final value of zero, |
| but at least one of them must be non-zero (that is, the cycle identified |
| by the \co{exists} clause is prohibited), courtesy of RCU's fundamental |
| grace-period guarantee, as can be seen by running \co{herd} on this litmus test. |
| Note that this guarantee is insensitive to the ordering of the accesses |
| within \co{P1()}'s critical section, so the litmus test shown in |
| \cref{lst:memorder:RCU Fundamental Property and Reordering}\footnote{ |
| Dependencies can of course limit the ability to reorder accesses |
| within RCU read-side critical sections.} |
| also forbids this same cycle. |
| |
| However, this definition is incomplete, as can be seen from the following |
| list of questions:\footnote{ |
| Several of which were introduced to Paul by \ppl{Jade}{Alglave} during |
| early work on LKMM, and a few more of which came from other |
| LKMM participants~\cite{Alglave:2018:FSC:3173162.3177156}.} |
| |
| \begin{enumerate} |
| \item What ordering is provided by \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()}, independent of RCU grace periods? |
| \item What ordering is provided by \co{synchronize_rcu()} |
| and \co{synchronize_rcu_expedited()}, independent of RCU read-side |
| critical sections? |
| \item If the entirety of a given RCU read-side critical section |
| precedes the end of a given RCU grace period, what about |
| accesses preceding that critical section? |
| \item If the entirety of a given RCU read-side critical section |
| follows the beginning of a given RCU grace period, what about |
| accesses following that critical section? |
| \item What happens in situations involving more than one RCU read-side |
| critical section and/or more than one RCU grace period? |
| \item What happens when RCU is combined with other memory-ordering |
| mechanisms? |
| \end{enumerate} |
| |
| These questions are addressed in the following sections. |
| |
| \subsubsection{RCU Read-Side Ordering} |
| \label{sec:memorder:RCU Read-Side Ordering} |
| |
| On their own, RCU's read-side primitives \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} provide no ordering whatsoever. |
| In particular, despite their names, they do not act like locks, as can |
| be seen in |
| \cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering} |
| (\path{C-LB+rl-o-o-rul+rl-o-o-rul.litmus}). |
| This litmus test's cycle is allowed: |
| Both instances of the \co{r1} register can have final values of 1. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-LB+rl-o-o-rul+rl-o-o-rul@whole.fcv} |
| \caption{RCU Readers Provide No Lock-Like Ordering} |
| \label{lst:memorder:RCU Readers Provide No Lock-Like Ordering} |
| \end{listing} |
| |
| Nor do these primitives have barrier-like ordering properties, |
| at least not unless there is a grace period in the mix, as can be seen in |
| \cref{lst:memorder:RCU Readers Provide No Barrier-Like Ordering} |
| (\path{C-LB+o-rl-rul-o+o-rl-rul-o.litmus}). |
| This litmus test's cycle is also allowed. |
| (Try it!) |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-LB+o-rl-rul-o+o-rl-rul-o@whole.fcv} |
| \caption{RCU Readers Provide No Barrier-Like Ordering} |
| \label{lst:memorder:RCU Readers Provide No Barrier-Like Ordering} |
| \end{listing} |
| |
| Of course, lack of ordering in both these litmus tests should be absolutely |
| no surprise, given that both \co{rcu_read_lock()} and \co{rcu_read_unlock()} |
| are no-ops in the \IXacr{qsbr} implementation of RCU\@. |
| |
| \subsubsection{RCU Update-Side Ordering} |
| \label{sec:memorder:RCU Update-Side Ordering} |
| |
| In contrast with RCU readers, the RCU update-side functions |
| \co{synchronize_rcu()} and \co{synchronize_rcu_expedited()} |
| provide memory ordering at least as strong as \co{smp_mb()},\footnote{ |
| And also way more expensive!} |
| as can be seen by running \co{herd} on the litmus test shown in |
| \cref{lst:memorder:RCU Updaters Provide Full Ordering}. |
| This test's cycle is prohibited, just as it would be with \co{smp_mb()}.
| This should be no surprise given the information presented in |
| \cref{tab:memorder:Linux-Kernel Memory-Ordering Cheat Sheet}. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o@whole.fcv} |
| \caption{RCU Updaters Provide Full Ordering} |
| \label{lst:memorder:RCU Updaters Provide Full Ordering} |
| \end{listing} |
| |
| \subsubsection{RCU Readers: |
| Before and After} |
| \label{sec:memorder:RCU Readers: Before and After} |
| |
| Before reading this section, it would be well to reflect on the distinction |
| between guarantees that are available and guarantees that maintainable |
| software should rely on. |
| Keeping that firmly in mind, this section presents a few of the |
| more exotic RCU guarantees. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-o-rul@whole.fcv} |
| \caption{What Happens Before RCU Readers?} |
| \label{lst:memorder:What Happens Before RCU Readers?} |
| \end{listing} |
| |
| \Cref{lst:memorder:What Happens Before RCU Readers?} |
| (\path{C-SB+o-rcusync-o+o-rl-o-rul.litmus}) |
| shows a litmus test similar to that in |
| \cref{lst:memorder:RCU Fundamental Property}, |
| but with the RCU reader's first access preceding the RCU read-side critical |
| section, rather than the more conventional (and maintainable!\@) approach of |
| being contained within it. |
| Perhaps surprisingly, running \co{herd} on this litmus test gives the |
| same result as for that in |
| \cref{lst:memorder:RCU Fundamental Property}: |
| The cycle is forbidden. |
| |
| Why would this be the case? |
| |
| Because both of \co{P1()}'s accesses are volatile, |
| as discussed in |
| \cref{sec:toolsoftrade:A Volatile Solution}, |
| the compiler is not permitted to reorder them. |
| This means that the code emitted for \co{P1()}'s \co{WRITE_ONCE()} will |
| precede that of \co{P1()}'s \co{READ_ONCE()}. |
| Therefore, RCU implementations that place memory-barrier instructions in |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} will preserve the ordering |
| of \co{P1()}'s two accesses all the way down to the hardware level. |
| On the other hand, RCU implementations that rely on interrupt-based
| state machines will also fully preserve this ordering
| \emph{relative to the grace period} because interrupts take
| place at a precise location in the execution of the interrupted code.
| |
| This in turn means that if the \co{WRITE_ONCE()} follows the end of a |
| given RCU grace period, then the accesses within \emph{and following} |
| that RCU read-side critical section must follow the beginning of that |
| same grace period. |
| Similarly, if the \co{READ_ONCE()} precedes the beginning of the grace |
| period, everything within \emph{and preceding} that critical section |
| must precede the end of that same grace period. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-rul-o@whole.fcv} |
| \caption{What Happens After RCU Readers?} |
| \label{lst:memorder:What Happens After RCU Readers?} |
| \end{listing} |
| |
| \Cref{lst:memorder:What Happens After RCU Readers?} |
| (\path{C-SB+o-rcusync-o+rl-o-rul-o.litmus}) |
| is similar, but instead looks at accesses after the RCU read-side |
| critical section. |
| This test's cycle is also forbidden, as can be checked with the \co{herd} |
| tool. |
| The reasoning is similar to that for |
| \cref{lst:memorder:What Happens Before RCU Readers?}, |
| and is left as an exercise for the reader. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rl-rul-o@whole.fcv} |
| \caption{What Happens With Empty RCU Readers?} |
| \label{lst:memorder:What Happens With Empty RCU Readers?} |
| \end{listing} |
| |
| \Cref{lst:memorder:What Happens With Empty RCU Readers?} |
| (\path{C-SB+o-rcusync-o+o-rl-rul-o.litmus}) |
| takes things one step farther, moving \co{P1()}'s \co{WRITE_ONCE()} |
| to precede the RCU read-side critical section and moving |
| \co{P1()}'s \co{READ_ONCE()} to follow it, resulting in an |
| empty RCU read-side critical section. |
| |
| Perhaps surprisingly, despite the empty critical section, RCU
| nevertheless manages to forbid the cycle.
| This can again be checked using the \co{herd} tool. |
| Furthermore, the reasoning is once again similar to that for |
| \cref{lst:memorder:What Happens Before RCU Readers?}. |
| Recapping, if \co{P1()}'s \co{WRITE_ONCE()} follows the end of a given |
| grace period, then \co{P1()}'s RCU read-side critical section---and |
| everything following it---must follow the beginning of that same grace |
| period. |
| Similarly, if \co{P1()}'s \co{READ_ONCE()} precedes the beginning of a |
| given grace period, then \co{P1()}'s RCU read-side critical section---and |
| everything preceding it---must precede the end of that same grace period. |
| In both cases, the critical section's emptiness is irrelevant. |
| |
| \QuickQuiz{ |
| Wait a minute! |
| In QSBR implementations of RCU, no code is emitted for |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()}. |
| This means that the RCU read-side critical section in |
| \cref{lst:memorder:What Happens With Empty RCU Readers?} |
| isn't just empty, it is completely nonexistent!!! |
| So how can something that doesn't exist at all possibly have |
| any effect whatsoever on ordering??? |
| }\QuickQuizAnswer{ |
| Because in QSBR, RCU read-side critical sections don't |
| actually disappear. |
| Instead, they are extended in both directions until a quiescent |
| state is encountered. |
| For example, in the Linux kernel, the critical section might |
| be extended back to the most recent \co{schedule()} call and |
| ahead to the next \co{schedule()} call. |
| Of course, in non-QSBR implementations, \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} really do emit code, which can clearly |
| provide ordering. |
| And within the Linux kernel, even the QSBR implementation |
| has a compiler \co{barrier()} in \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()}, which is necessary to prevent |
| the compiler from moving memory accesses that might result |
| in page faults into the RCU read-side critical section. |
| |
| Therefore, strange though it might seem, empty RCU read-side |
| critical sections really can and do provide some degree of |
| ordering. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-o@whole.fcv} |
| \caption{What Happens With No RCU Readers?} |
| \label{lst:memorder:What Happens With No RCU Readers?} |
| \end{listing} |
| |
| This situation leads to the question of what happens if |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} are omitted |
| entirely, as shown in |
| \cref{lst:memorder:What Happens With No RCU Readers?} |
| (\path{C-SB+o-rcusync-o+o-o.litmus}). |
| As can be checked with \co{herd}, this litmus test's cycle is allowed, |
| that is, both instances of \co{r2} can have final values of zero. |
| |
| This might seem strange in light of the fact that empty RCU |
| read-side critical sections can provide ordering. |
| And it is true that QSBR implementations of RCU would in fact forbid |
| this outcome, due to the fact that there is no quiescent state anywhere |
| in \co{P1()}'s function body, so that \co{P1()} would run |
| within an implicit RCU read-side critical section. |
| However, RCU also has non-QSBR implementations, which have no implied |
| RCU read-side critical section, and in turn no way for RCU to enforce |
| ordering. |
| Therefore, this litmus test's cycle is allowed. |
| |
| \QuickQuiz{ |
| Can \co{P1()}'s accesses be reordered in the litmus tests shown in |
| \cref{lst:memorder:What Happens Before RCU Readers?,% |
| lst:memorder:What Happens After RCU Readers?,% |
| lst:memorder:What Happens With Empty RCU Readers?} |
| in the same way that they were reordered going from |
| \cref{lst:memorder:RCU Fundamental Property} |
| to |
| \cref{lst:memorder:RCU Fundamental Property and Reordering}? |
| }\QuickQuizAnswer{ |
| No, because none of these later litmus tests have more than one |
| access within their RCU read-side critical sections. |
| But what about swapping the accesses, for example, in |
| \cref{lst:memorder:What Happens Before RCU Readers?}, |
| placing \co{P1()}'s \co{WRITE_ONCE()} within its critical |
| section and the \co{READ_ONCE()} before its critical section? |
| |
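| For concreteness, here is a sketch of the swapped \co{P1()},
| with the rest of the litmus test unchanged and with variable
| names following the conventions of the earlier litmus tests:
|
| \begin{VerbatimN}
| P1(int *x0, int *x1)
| {
| 	int r2;
|
| 	r2 = READ_ONCE(*x0);
| 	rcu_read_lock();
| 	WRITE_ONCE(*x1, 2);
| 	rcu_read_unlock();
| }
| \end{VerbatimN}
|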
| Swapping the accesses allows both instances of \co{r2} to
| have a final value of zero.
| In other words, although RCU read-side
| critical sections' ordering properties can extend outside of
| those critical sections, the same is not true of their
| reordering properties.
| Checking this with \co{herd} and explaining why is left as an |
| exercise for the reader. |
| }\QuickQuizEnd |
| |
| \subsubsection{Multiple RCU Readers and Updaters} |
| \label{sec:memorder:Multiple RCU Readers and Updaters} |
| |
| Because \co{synchronize_rcu()} has ordering semantics that are at least |
| as strong as \co{smp_mb()}, no matter how many processes there are in |
| an SB litmus test |
| (such as \cref{lst:memorder:RCU Updaters Provide Full Ordering}), |
| placing \co{synchronize_rcu()} between each process's |
| accesses prohibits the cycle. |
| In addition, the cycle is prohibited in an SB test where one process |
| uses \co{synchronize_rcu()} and the other uses \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()}, as shown by |
| \cref{lst:memorder:RCU Fundamental Property}. |
| However, if both processes use \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()}, the cycle will be allowed, as shown by |
| \cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}. |
| |
| Is it possible to say anything general about which RCU-protected |
| litmus tests will be prohibited and which will be allowed? |
| This section takes up that question. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv} |
| \caption{One RCU Grace Period and Two Readers} |
| \label{lst:memorder:One RCU Grace Period and Two Readers} |
| \end{listing} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+o-rcusync-o+o-rcusync-o+rl-o-o-rul+rl-o-o-rul@whole.fcv} |
| \caption{Two RCU Grace Periods and Two Readers} |
| \label{lst:memorder:Two RCU Grace Periods and Two Readers} |
| \end{listing} |
| |
| More specifically, what if the litmus test has one RCU grace |
| period and two RCU readers, as shown in |
| \cref{lst:memorder:One RCU Grace Period and Two Readers}? |
| The \co{herd} tool says that this cycle is allowed, but it would be |
| good to know \emph{why}.\footnote{ |
| Especially given that Paul changed his mind several times about |
| this particular litmus test when working with \ppl{Jade}{Alglave} to |
| generalize RCU ordering semantics.} |
| |
| \begin{figure*} |
| \centering |
| \resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2R}} |
| \caption{Cycle for One RCU Grace Period and Two RCU Readers} |
| \label{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers} |
| \end{figure*} |
| |
| \begin{figure*} |
| \centering |
| \resizebox{\onecolumntextwidth}{!}{\includegraphics{memorder/RCU2G2R}} |
| \caption{No Cycle for Two RCU Grace Periods and Two RCU Readers} |
| \label{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers} |
| \end{figure*} |
| |
| The key point is that even strongly ordered CPUs such as x86 can |
| and will reorder \co{P1()}'s and \co{P2()}'s \co{WRITE_ONCE()} and |
| \co{READ_ONCE()}. |
| With that reordering, |
| \cref{fig:memorder:Cycle for One RCU Grace Period and Two RCU Readers} |
| shows how the cycle forms: |
| |
| \begin{enumerate} |
| \item \co{P0()}'s read from \co{x1} precedes \co{P1()}'s write, as |
| depicted by the dashed arrow near the bottom of the diagram. |
| \item Because \co{P1()}'s write follows the end of \co{P0()}'s grace period, |
| \co{P1()}'s read from \co{x2} cannot precede the beginning of |
| \co{P0()}'s grace period. |
| \item \co{P1()}'s read from \co{x2} precedes \co{P2()}'s write. |
| \item Because \co{P2()}'s write to \co{x2} precedes the end of |
| \co{P0()}'s grace period, it is completely legal for \co{P2()}'s |
| read from \co{x0} to precede the beginning of \co{P0()}'s grace period. |
| \item Therefore, \co{P2()}'s read from \co{x0} can precede \co{P0()}'s |
| write, thus allowing the cycle to form. |
| \end{enumerate} |
| |
| But what happens when another grace period is added? |
| This situation is shown in |
| \cref{lst:memorder:Two RCU Grace Periods and Two Readers}, |
| an SB litmus test in which \co{P0()} and \co{P1()} have RCU grace periods |
| and \co{P2()} and \co{P3()} have RCU readers. |
| Again, the CPUs can reorder the accesses within RCU read-side critical |
| sections, as shown in |
| \cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers}. |
| For this cycle to form, \co{P2()}'s critical section must
| end after the end of \co{P1()}'s grace period and \co{P3()}'s must
| end after the beginning of that same grace period, which happens to
| also be after the end of \co{P0()}'s grace period.
| Therefore, \co{P3()}'s critical section must start after the beginning |
| of \co{P0()}'s grace period, which in turn means that \co{P3()}'s |
| read from \co{x0} cannot possibly precede \co{P0()}'s write. |
| Therefore, the cycle is forbidden because RCU read-side critical sections |
| cannot span full RCU grace periods. |
| |
| However, a closer look at |
| \cref{fig:memorder:No Cycle for Two RCU Grace Periods and Two RCU Readers} |
| makes it clear that adding a third reader would allow the cycle. |
| This is because this third reader could end before the end of \co{P0()}'s |
| grace period, and thus start before the beginning of that same grace |
| period. |
| This in turn suggests the general rule, which is: |
| In these sorts of RCU-only litmus tests, if there are at least as many |
| RCU grace periods as there are RCU read-side critical sections, |
| the cycle is forbidden.\footnote{ |
| Interestingly enough, Alan Stern proved that within the context |
| of LKMM, the two-part fundamental property of RCU expressed |
| in \cref{sec:defer:RCU Fundamentals} actually implies |
| this seemingly more general result, which is called the RCU |
| axiom~\cite{Alglave:2018:FSC:3173162.3177156}.} |
| |
| \subsubsection{RCU and Other Ordering Mechanisms} |
| \label{sec:memorder:RCU and Other Ordering Mechanisms} |
| |
| But what about litmus tests that combine RCU with other ordering |
| mechanisms? |
| |
| The general rule is that it takes only one mechanism to forbid a cycle. |
| |
| For example, refer back to |
| \cref{lst:memorder:RCU Readers Provide No Lock-Like Ordering}. |
| Applying the general rule from the previous section, because this litmus |
| test has two RCU read-side critical sections and no RCU grace periods, |
| the cycle is allowed. |
| But what if \co{P0()}'s \co{WRITE_ONCE()} is replaced by an |
| \co{smp_store_release()} and \co{P1()}'s \co{READ_ONCE()} is replaced |
| by an \co{smp_load_acquire()}? |
| |
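| The resulting litmus test might be sketched as follows, with variable
| names following the conventions of the earlier litmus tests and with
| this sketch not having been run through \co{herd}:
|
| \begin{VerbatimN}
| P0(int *x0, int *x1)
| {
| 	int r1;
|
| 	rcu_read_lock();
| 	r1 = READ_ONCE(*x0);
| 	smp_store_release(x1, 1);
| 	rcu_read_unlock();
| }
|
| P1(int *x0, int *x1)
| {
| 	int r1;
|
| 	rcu_read_lock();
| 	r1 = smp_load_acquire(x1);
| 	WRITE_ONCE(*x0, 1);
| 	rcu_read_unlock();
| }
|
| exists (0:r1=1 /\ 1:r1=1)
| \end{VerbatimN}
|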
| RCU would still allow the cycle, but the release-acquire pair would |
| forbid it. |
| Because it only takes one mechanism to forbid a cycle, the release-acquire |
| pair would prevail, thus forbidding the cycle. |
| |
| \begin{figure*} |
| \centering |
| \resizebox{0.75\onecolumntextwidth}{!}{\includegraphics{memorder/RCU1G2Rmb}} |
| \caption{Cycle for One RCU Grace Period, Two RCU Readers, and Memory Barrier} |
| \label{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier} |
| \end{figure*} |
| |
| For another example, refer back to |
| \cref{lst:memorder:One RCU Grace Period and Two Readers}. |
| Because this litmus test has two RCU readers but only one grace period, |
| its cycle is allowed. |
| But suppose that an \co{smp_mb()} was placed between \co{P1()}'s |
| pair of accesses. |
| In this new litmus test, because of the addition of the \co{smp_mb()}, |
| \co{P2()}'s as well as \co{P1()}'s critical sections would extend beyond the |
| end of \co{P0()}'s grace period, which in turn would prevent \co{P2()}'s |
| read from \co{x0} from preceding \co{P0()}'s write, as depicted by the |
| red dashed arrow in |
| \cref{fig:memorder:Cycle for One RCU Grace Period; Two RCU Readers; and Memory Barrier}. |
| In this case, RCU and the \IXh{full}{memory barrier} work together to forbid |
| the cycle, with RCU preserving ordering between \co{P0()} and both |
| \co{P1()} and \co{P2()}, and with the \co{smp_mb()} preserving |
| ordering between \co{P1()} and \co{P2()}. |
| |
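| As a sketch (not a verified litmus test), the modified \co{P1()}
| would look as follows, with everything else unchanged from
| \cref{lst:memorder:One RCU Grace Period and Two Readers}:
|
| \begin{VerbatimN}
| P1(int *x1, int *x2)
| {
| 	int r2;
|
| 	rcu_read_lock();
| 	WRITE_ONCE(*x1, 2);
| 	smp_mb();
| 	r2 = READ_ONCE(*x2);
| 	rcu_read_unlock();
| }
| \end{VerbatimN}
|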
| \QuickQuiz{ |
| What would happen if the \co{smp_mb()} was instead added between |
| \co{P2()}'s accesses in |
| \cref{lst:memorder:One RCU Grace Period and Two Readers}? |
| }\QuickQuizAnswer{ |
| The cycle would again be forbidden. |
| Further analysis is left as an exercise for the reader. |
| }\QuickQuizEnd |
| |
| In short, where RCU's semantics were once purely pragmatic, they are |
| now fully |
| formalized~\cite{PaulMcKenney2005RCUSemantics,MathieuDesnoyers2012URCU,AlexeyGotsman2013ESOPRCU,Alglave:2018:FSC:3173162.3177156}. |
| |
| % \subsection{SRCU} |
| % \label{sec:memorder:SRCU} |
| % @@@ After LWN article |
| |
| % Nesting vs. value passed from \co{srcu_read_lock()} to |
| % \co{srcu_read_unlock()}. |
| |
| % When augmented by \co{smp_mb__after_srcu_read_unlock()}. |
| |
| \subsection{Higher-Level Primitives: |
| Discussion} |
| \label{sec:memorder:Higher-Level Primitives: Discussion} |
| |
| It is quite beneficial to verify code in terms of a higher-level primitive |
| instead of in terms of the low-level memory accesses used in a particular |
| implementation of that primitive. |
| First, this allows code using those primitives to be
| verified against an abstract representation of those primitives,
| thus making that code less vulnerable to implementation changes.
| Second, partitioning the verification at API boundaries results in |
| combinatorial implosion, greatly reducing the overhead of formal |
| verification. |
| |
| It is hoped that verifying against detailed semantics for higher-level |
| primitives will greatly increase the effectiveness of static analysis |
| and model checking. |
| |
| \section{Hardware Specifics} |
| \label{sec:memorder:Hardware Specifics} |
| \OriginallyPublished{sec:memorder:Hardware Specifics}{Memory-Barrier Instructions For Specific CPUs}{Linux Journal}{PaulMcKenney2005i,PaulMcKenney2005j} |
| % |
| \epigraph{Rock beats paper!}{Derek Williams} |
| |
| Each CPU family has its own peculiar approach to memory ordering, which |
| can make portability a challenge, as you can see in |
| \cref{tab:memorder:Summary of Memory Ordering}. |
| |
| \begin{table*}[tb] % @@@ Omitting 'p' prevents unordered floats in 2c builds |
| \rowcolors{4}{}{lightgray} |
| \small |
| \centering |
| \newcommand{\cpufml}[1]{\begin{picture}(6,50)(0,0)\rotatebox{90}{#1}\end{picture}} |
| \renewcommand*{\arraystretch}{1.2}\OneColumnHSpace{-.35in} |
| \ebresizewidth{ |
| \begin{tabular}{llccccccccc} |
| \toprule |
| \multicolumn{2}{l}{~} & \multicolumn{9}{c}{CPU Family} \\ |
| \cmidrule{3-11} |
| \multicolumn{2}{c}{\raisebox{.5ex}{Property}} |
| & \cpufml{Alpha} |
| & \cpufml{\ARMv7-A/R} |
| & \cpufml{\ARMv8} |
| & \cpufml{Itanium} |
| & \cpufml{MIPS} |
| & \cpufml{\Power{}} |
| & \cpufml{SPARC TSO} |
| & \cpufml{x86} |
| & \cpufml{z~Systems} |
| \\ |
| \cmidrule(r){1-2} \cmidrule{3-11} |
| % Alpha ARMv8 ARMv7 Itanium MIPS PPC SPARC x86 z Systems |
| \cellcolor{white} |
| Memory Ordering |
| & Loads Reordered After Loads or Stores? |
| & Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\ |
| & Stores Reordered After Stores? |
| & Y & Y & Y & Y & Y & Y & ~ & ~ & ~ \\ |
| \cellcolor{white} |
| & Stores Reordered After Loads? |
| & Y & Y & Y & Y & Y & Y & Y & Y & Y \\ |
| & \parbox[c][6ex]{2in}{\raggedright Atomic Instructions Reordered With\par Loads or Stores?} |
| & Y & Y & Y & ~ & Y & Y & ~ & ~ & ~ \\ |
| \cellcolor{white} |
| & Dependent Loads Reordered? |
| & Y & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\ |
| & Dependent Stores Reordered? |
| & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ & ~ \\ |
| \cellcolor{white} |
| & Non-Sequentially Consistent? |
| & Y & Y & Y & Y & Y & Y & Y & Y & Y \\ |
| & Non-Multicopy Atomic? |
| & Y & Y & Y & Y & Y & Y & Y & Y & ~ \\ |
| \cellcolor{white} |
| & Non-Other-Multicopy Atomic? |
| & Y & Y & ~ & Y & Y & Y & ~ & ~ & ~ \\ |
| & Non-Cache Coherent? |
| & ~ & ~ & ~ & Y & ~ & ~ & ~ & ~ & ~ \\ |
| \cmidrule(r){1-2} \cmidrule{3-11} |
| \cellcolor{white} |
| Instructions |
| & Load-Acquire/Store-Release? |
| & F & F & i & I & F & b & ~ & ~ & ~ \\ |
| & Atomic RMW Instruction Type? |
| & L & L & L & C & L & L & C & C & C \\ |
| \cellcolor{white} |
| & Incoherent Instruction Cache/Pipeline? |
| & Y & Y & Y & Y & Y & Y & Y & Y & Y \\ |
| \bottomrule |
| \end{tabular} |
| } |
| |
| \vspace{5pt}\hfill |
| \ebresizeverb{.6}{ |
| \framebox[\width]{\footnotesize\setlength{\tabcolsep}{3pt} |
| \rowcolors{1}{}{} |
| \renewcommand*{\arraystretch}{1} |
| \begin{tabular}{llcl} |
| { \bf Key: } |
| & \multicolumn{3}{l}{Load-Acquire/Store-Release?} \\ |
| ~ & ~ & b: & Lightweight memory barrier \\ |
| ~ & ~ & F: & Full memory barrier \\ |
| ~ & ~ & i: & Instruction with lightweight ordering \\ |
| ~ & ~ & I: & Instruction with heavyweight ordering \\ |
| ~ &\multicolumn{3}{l}{Atomic RMW Instruction Type?} \\ |
| ~ & ~ & C: & Compare-and-exchange instruction \\ |
| ~ & ~ & L: & Load-linked/store-conditional instruction \\ |
| \end{tabular} |
| } |
| }\OneColumnHSpace{-0.4in} |
| \caption{Summary of Memory Ordering} |
| \label{tab:memorder:Summary of Memory Ordering} |
| \end{table*} |
| |
| In fact, some software environments simply prohibit |
| direct use of memory-ordering operations, restricting the programmer |
| to mutual-exclusion primitives that incorporate them to the extent that |
| they are required. |
| Please note that this section is not intended to be a reference manual |
| covering all (or even most) aspects of each CPU family, but rather |
| a high-level overview providing a rough comparison. |
| For full details, see the reference manual for the CPU of interest. |
| |
| Getting back to |
| \cref{tab:memorder:Summary of Memory Ordering}, |
| the first group of rows looks at memory-ordering
| properties and the second group looks at instruction properties.
| Please note that these properties hold at the machine-instruction |
| level. |
| Compilers can and do reorder far more aggressively than does hardware. |
| Use marked accesses such as \co{READ_ONCE()} and \co{WRITE_ONCE()} |
| to constrain the compiler's optimizations and prevent undesirable
| reordering. |
| |
| The first three rows indicate whether a given CPU allows the four |
| possible combinations of loads and stores to be reordered, as discussed |
| in |
| \cref{sec:memorder:Ordering: Why and How?} and |
| \crefrange{sec:memorder:Load Followed By Load}{sec:memorder:Store Followed By Store}. |
| The next row (``Atomic Instructions Reordered With Loads or Stores?\@'') |
| indicates whether a given CPU allows loads and stores |
| to be reordered with atomic instructions. |
| |
| The fifth and sixth rows cover reordering and dependencies,
| which were covered in
| \crefrange{sec:memorder:Address Dependencies}{sec:memorder:Control Dependencies}
| and which are explained in more detail in
| \cref{sec:memorder:Alpha}.
| The short version is that Alpha requires memory barriers for readers
| as well as updaters of linked data structures.
| However, these memory barriers are provided by the Alpha
| architecture-specific code in v4.15 and later Linux kernels.
| |
| The next row, ``Non-Sequentially Consistent'', indicates whether |
| the CPU's normal load and store instructions are constrained by |
| sequential consistency. |
| Performance considerations have dictated that no modern mainstream |
| system is sequentially consistent. |
| |
| The next three rows cover \IXalt{multicopy atomicity}{multi-copy atomic}, |
| which was defined in |
| \cref{sec:memorder:Multicopy Atomicity}. |
| The first is full-up (and rare) |
| \IXalt{multicopy atomicity}{multi-copy atomic}, the second is the |
| weaker \IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic}, |
| and the third is the weakest |
| \IXalthy{non-multicopy atomicity}{non-}{multi-copy atomic}. |
| |
| The next row, ``Non-Cache Coherent'', covers accesses from multiple |
| threads to a single variable, which was discussed in |
| \cref{sec:memorder:Cache Coherence}. |
| |
| The final three rows cover instruction-level choices and issues. |
| The first row indicates how each CPU implements load-acquire |
| and store-release, the second row classifies CPUs by atomic-instruction |
| type, and the third and final row |
| indicates whether a given CPU has an incoherent |
| instruction cache and pipeline. |
| Such CPUs require special instructions be executed for self-modifying |
| code. |
| |
| %Parenthesized CPU names indicate modes that are architecturally allowed, |
| %but rarely used in practice. |
| |
| The common ``just say no'' approach to memory-ordering operations |
| can be eminently reasonable where it applies, |
| but there are environments, such as the Linux kernel, where direct |
| use of memory-ordering operations is required. |
| Therefore, |
| Linux provides a carefully chosen least-common-denominator |
| set of memory-ordering primitives, which are as follows: |
| \begin{description} |
| \item [\tco{smp_mb()}] (\IXh{full}{memory barrier}) that orders both loads and |
| stores. |
| This means that loads and stores preceding the memory barrier |
| will be committed to memory before any loads and stores |
| following the memory barrier. |
| \item [\tco{smp_rmb()}] (\IXh{read}{memory barrier}) that orders only loads. |
| \item [\tco{smp_wmb()}] (\IXh{write}{memory barrier}) that orders only stores. |
| \item [\tco{smp_mb__before_atomic()}] that forces ordering of accesses |
| preceding the \co{smp_mb__before_atomic()} against accesses following |
| a later RMW atomic operation. |
| This is a noop on systems that fully order atomic RMW operations. |
| \item [\tco{smp_mb__after_atomic()}] that forces ordering of accesses |
| preceding an earlier RMW atomic operation against accesses |
| following the \co{smp_mb__after_atomic()}. |
| This is also a noop on systems that fully order atomic RMW operations. |
| \item [\tco{smp_mb__after_spinlock()}] that forces ordering of accesses |
| preceding a lock acquisition against accesses |
| following the \co{smp_mb__after_spinlock()}. |
| This is also a noop on systems that fully order lock acquisitions. |
| \item [\tco{mmiowb()}] that forces ordering on MMIO writes that are guarded |
| by global spinlocks, and is more thoroughly described |
| in a 2016 LWN article on MMIO~\cite{PaulEMcKenney2016LinuxKernelMMIO}. |
| \end{description} |
| The \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()}
| primitives also force
| the compiler to eschew any optimizations that would have the effect
| of reordering memory accesses across the barriers.
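|
| For example, a non-value-returning atomic operation might be paired
| with \co{smp_mb__after_atomic()} as follows.
| This is only a sketch, and its \co{ctr} and \co{flag} variables
| are made up for illustration:
|
| \begin{VerbatimN}
| atomic_inc(&ctr);       /* RMW atomic, possibly unordered. */
| smp_mb__after_atomic(); /* Order the increment (and earlier */
|                         /* accesses) before later accesses. */
| WRITE_ONCE(flag, 1);
| \end{VerbatimN}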
| |
| \QuickQuiz{ |
| What happens to code between an atomic operation and an |
| \co{smp_mb__after_atomic()}? |
| }\QuickQuizAnswer{ |
| First, please don't do this! |
| |
| But if you do, this intervening code will either be ordered |
| after the atomic operation or before the |
| \co{smp_mb__after_atomic()}, depending on the architecture, |
| but not both. |
| This also applies to \co{smp_mb__before_atomic()} and |
| \co{smp_mb__after_spinlock()}, that is, both the uncertain |
| ordering of the intervening code and the plea to avoid such code. |
| }\QuickQuizEnd |
| |
| These primitives generate code only in SMP kernels.
| However, several have UP versions (\co{mb()}, \co{rmb()},
| and \co{wmb()}, respectively) that generate a memory barrier
| even in UP kernels.
| The \co{smp_} versions should be used in most cases. |
| However, these latter primitives are useful when writing drivers, |
| because MMIO accesses must remain ordered even in UP kernels. |
| In the absence of memory-ordering operations, both CPUs and compilers would
| happily rearrange these accesses, which at best would make the device |
| act strangely, and could crash your kernel or even damage your hardware. |
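|
| For example, a driver using relaxed MMIO accessors must supply the
| needed ordering itself, even in UP kernels.
| In the following sketch, the DMA descriptor, the \co{DEV_READY} and
| \co{CMD_GO} values, and the \co{REG_DOORBELL} register offset are
| all made up:
|
| \begin{VerbatimN}
| desc->status = DEV_READY; /* Normal-memory DMA descriptor. */
| wmb(); /* Commit descriptor before ringing doorbell. */
| writel_relaxed(CMD_GO, dev_base + REG_DOORBELL);
| \end{VerbatimN}
|
| On most architectures, substituting plain \co{writel()} would allow
| the \co{wmb()} to be dropped, as \co{writel()} orders prior
| normal-memory writes before the MMIO write.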
| |
| So most kernel programmers need not worry about the memory-ordering |
| peculiarities of each and every CPU, as long as they stick to these |
| interfaces and to the fully ordered atomic operations.\footnote{ |
| For a full list, expand the patterns in |
| \path{Documentation/atomic_t.txt}.} |
| If you are working deep in a given CPU's architecture-specific code, |
| of course, all bets are off. |
| |
| Furthermore, |
| all of Linux's locking primitives (spinlocks, reader-writer locks, |
| semaphores, RCU, \ldots) include any needed ordering primitives. |
| So if you are working with code that uses these primitives properly, |
| you need not worry about Linux's memory-ordering primitives. |
| |
| That said, deep knowledge of each CPU's |
| \IXalt{memory-consistency model}{memory model} |
| can be very helpful when debugging, to say nothing of when writing |
| architecture-specific code or synchronization primitives. |
| |
| Besides, they say that a little knowledge is a very dangerous thing. |
| Just imagine the damage you could do with a lot of knowledge! |
| For those who wish to understand more about individual CPUs' |
| memory consistency models, the next sections describe those of a few |
| popular and prominent CPUs. |
| Although there is no substitute for actually reading a given CPU's |
| documentation, these sections do give a good overview. |
| |
| \subsection{Alpha} |
| \label{sec:memorder:Alpha} |
| |
| It may seem strange to say much of anything about a CPU whose end of life |
| has long since passed, but Alpha is interesting because it is the only |
| mainstream CPU that reorders dependent loads, and has thus had outsized |
| influence on concurrency APIs, including within the Linux kernel. |
| The need for core Linux-kernel code to accommodate Alpha ended |
| with version v4.15 of the Linux kernel, and all traces of this |
| accommodation were removed in v5.9 with the removal of the |
| \apikh{smp_read_barrier_depends()} and \co{read_barrier_depends()} APIs. |
| This section is nevertheless retained in the Third Edition
| because here in early 2023 a few Linux kernel hackers are
| still working on pre-v4.15 versions of the Linux kernel.
| In addition, the modifications to \co{READ_ONCE()} that permitted |
| these APIs to be removed have not necessarily propagated to all |
| userspace projects that might still support Alpha. |
| |
| \begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)] |
| The dependent-load difference between Alpha and the other CPUs is |
| illustrated by the code shown in |
| \cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}. |
| This \co{smp_store_release()} |
| guarantees that the element initialization |
| in \clnrefrange{init:b}{init:e} is executed before the element is added to the |
| list on \clnref{add}, so that the lock-free search will work correctly. |
| That is, it makes this guarantee on all CPUs {\em except} Alpha. |
| \end{fcvref} |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Insert and Lock-Free Search (No Ordering)] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| struct el *insert(long key, long data) |
| { |
| struct el *p; |
| p = kmalloc(sizeof(*p), GFP_ATOMIC); |
| spin_lock(&mutex); |
| p->next = head.next; \lnlbl[init:b] |
| p->key = key; |
| p->data = data; \lnlbl[init:e] |
| smp_store_release(&head.next, p); \lnlbl[add] |
| spin_unlock(&mutex); |
| } |
| |
| struct el *search(long searchkey) |
| { |
| struct el *p; |
| p = READ_ONCE_OLD(head.next); \lnlbl[h:next] |
| while (p != &head) { |
| /* Prior to v4.15, BUG ON ALPHA!!! */ \lnlbl[BUG] |
| if (p->key == searchkey) { \lnlbl[key] |
| return (p); |
| } |
| p = READ_ONCE_OLD(p->next); \lnlbl[next] |
| }; |
| return (NULL); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Insert and Lock-Free Search (No Ordering)} |
| \label{lst:memorder:Insert and Lock-Free Search (No Ordering)} |
| \end{listing} |
| |
| \begin{fcvref}[ln:memorder:Insert and Lock-Free Search (No Ordering)] |
| Given the pre-v4.15 implementation of \co{READ_ONCE()}, indicated by |
| \co{READ_ONCE_OLD()} in the listing, Alpha actually allows the code on |
| \clnref{key} of |
| \cref{lst:memorder:Insert and Lock-Free Search (No Ordering)} |
| to see the old garbage values that were present before the initialization |
| on \clnrefrange{init:b}{init:e}. |
| |
| \Cref{fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
| shows how this can happen on |
| an aggressively parallel machine with partitioned caches, so that |
| alternating \IXpl{cache line} are processed by the different partitions |
| of the caches. |
| For example, the load of \co{head.next} on \clnref{h:next} of |
| \cref{lst:memorder:Insert and Lock-Free Search (No Ordering)} |
| might access cache bank~0, |
| and the load of \co{p->key} on \clnref{key} and of \co{p->next} on \clnref{next} |
| might access cache bank~1. |
| On Alpha, the \co{smp_store_release()} will guarantee that the cache |
| invalidations performed by \clnrefrange{init:b}{init:e} of |
| \cref{lst:memorder:Insert and Lock-Free Search (No Ordering)} |
| (for \co{p->next}, \co{p->key}, and \co{p->data}) will reach |
| the interconnect before that of \clnref{add} (for \co{head.next}), but |
| makes absolutely no guarantee about the order of |
| propagation through the reading CPU's cache banks. |
| For example, it is possible that the reading CPU's cache bank~1 is very |
| busy, but cache bank~0 is idle. |
| This could result in the cache invalidations for the new element |
| (\co{p->next}, \co{p->key}, and \co{p->data}) being |
| delayed, so that the reading CPU loads the new value for \co{head.next}, |
| but loads the old cached values for \co{p->key} and \co{p->next}. |
| Yes, this does mean that Alpha can in effect fetch |
| the data pointed to {\em before} it fetches the pointer itself, strange |
| but true. |
| \end{fcvref} |
| See the documentation~\cite{Compaq01,WilliamPugh2000Gharachorloo} |
| called out earlier for more information, |
| or if you think that I am just making all this up.\footnote{ |
| Of course, the astute reader will have already recognized that |
| Alpha is nowhere near as mean and nasty as it could be, |
| the (thankfully) mythical architecture in |
| \cref{sec:app:whymb:Ordering-Hostile Architecture} |
| being a case in point.} |
| The benefit of this unusual approach to ordering is that Alpha can use |
| simpler cache hardware, which in turn permitted higher clock frequencies |
| in Alpha's heyday. |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{memorder/Alpha}} |
| \caption{Why \tco{smp_read_barrier_depends()} is Required in Pre-v4.15 Linux Kernels} |
| \label{fig:memorder:Why smp-read-barrier-depends() is Required in Pre-v4.15 Linux Kernels}
| \end{figure} |
| |
| One could place an \co{smp_rmb()} primitive |
| between the pointer fetch and dereference in order to force Alpha |
| to order the pointer fetch with the later dependent load. |
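|
| A sketch of this approach, based on the search loop in
| \cref{lst:memorder:Insert and Lock-Free Search (No Ordering)}:
|
| \begin{VerbatimN}
| p = READ_ONCE_OLD(head.next);
| while (p != &head) {
| 	smp_rmb(); /* Order pointer fetch before dereference. */
| 	if (p->key == searchkey)
| 		return (p);
| 	p = READ_ONCE_OLD(p->next);
| }
| \end{VerbatimN}
|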
| However, this imposes unneeded overhead on systems (such as \ARM, |
| Itanium, PPC, and SPARC) that respect |
| \IXalth{address dependencies}{address}{dependency} on the read side. |
| An \co{smp_read_barrier_depends()} primitive was therefore added to the
| Linux kernel to eliminate overhead on these systems, but was removed |
| in v5.9 of the Linux kernel in favor of augmenting Alpha's definition |
| of \co{READ_ONCE()}. |
| Thus, as of v5.9, core kernel code no longer needs to concern itself |
| with this aspect of DEC Alpha. |
| \begin{fcvref}[ln:memorder:Safe Insert and Lock-Free Search] |
| However, it is better to use \co{rcu_dereference()} |
| as shown on \clnref{deref1,deref2} of |
| \cref{lst:memorder:Safe Insert and Lock-Free Search}, |
| which works safely and efficiently for all recent kernel versions. |
| \end{fcvref} |
| |
| It is also possible to implement a software mechanism |
| that could be used in place of \co{smp_store_release()} to force |
| all reading CPUs to see the writing CPU's writes in order. |
| This software barrier could be implemented by sending \IXacrfpl{ipi} |
| to all other CPUs. |
| Upon receipt of such an IPI, a CPU would execute a memory-barrier |
| instruction, implementing a system-wide memory barrier similar to that |
| provided by the Linux kernel's \co{sys_membarrier()} system call. |
| Additional logic is required to avoid deadlocks. |
| Of course, CPUs that respect address dependencies would define such a barrier |
| to simply be \co{smp_store_release()}. |
| However, this approach was deemed by the Linux community |
| to impose excessive overhead~\cite{McKenney01f}, and to their point would |
| be completely inappropriate for systems having |
| aggressive real-time response requirements. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:Safe Insert and Lock-Free Search] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| struct el *insert(long key, long data) |
| { |
| struct el *p; |
| p = kmalloc(sizeof(*p), GFP_ATOMIC); |
| spin_lock(&mutex); |
| p->next = head.next; |
| p->key = key; |
| p->data = data; |
| smp_store_release(&head.next, p); |
| spin_unlock(&mutex); |
| } |
| |
| struct el *search(long searchkey) |
| { |
| struct el *p; |
| p = rcu_dereference(head.next); \lnlbl[deref1] |
| while (p != &head) { |
| if (p->key == searchkey) { |
| return (p); |
| } |
| p = rcu_dereference(p->next); \lnlbl[deref2] |
| }; |
| return (NULL); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Safe Insert and Lock-Free Search} |
| \label{lst:memorder:Safe Insert and Lock-Free Search} |
| \end{listing} |
| |
| The Linux memory-barrier primitives took their names from the Alpha |
| instructions, so \co{smp_mb()} is \co{mb}, \co{smp_rmb()} is \co{rmb}, |
| and \co{smp_wmb()} is \co{wmb}. |
| Alpha is the only CPU whose \co{READ_ONCE()} includes an \co{smp_mb()}. |
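|
| The following loose sketch conveys the idea, but please note that
| it is emphatically not the Linux kernel's actual macro:
|
| \begin{VerbatimN}
| #define READ_ONCE(x) \
| ({ \
| 	__typeof__(x) ___x = *(volatile __typeof__(x) *)&(x); \
| 	smp_mb(); /* Alpha: order later dependent accesses. */ \
| 	___x; \
| })
| \end{VerbatimN}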
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Why does Alpha's \co{READ_ONCE()} include an |
| \co{mb()} rather than \co{rmb()}? |
| }\QuickQuizAnswerB{ |
| Alpha has only \co{mb} and \co{wmb} instructions, |
| so \co{smp_rmb()} would be implemented by the Alpha \co{mb} |
| instruction in either case. |
| In addition, at the time that the Linux kernel started relying on |
| dependency ordering, it was not clear that Alpha ordered dependent |
| stores, and thus \co{smp_mb()} was therefore the safe choice. |
| |
| However, given the aforementioned v5.9 changes to \co{READ_ONCE()} |
| and a few of Alpha's atomic read-modify-write operations, |
| no Linux-kernel core code need concern itself with DEC Alpha, |
| thus greatly reducing Paul E.~McKenney's incentive to remove |
| Alpha support from the kernel. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Isn't DEC Alpha significant as having the weakest possible |
| memory ordering? |
| }\QuickQuizAnswerE{ |
| Although DEC Alpha does take considerable flak, it does avoid |
| reordering reads from the same CPU to the same variable. |
| It also avoids the out-of-thin-air problem that plagues |
| the Java and C11 memory |
| models~\cite{Boehm:2014:OGA:2618128.2618134,conf/esop/BattyMNPS15,MarkBatty2013OOTA-WorkingNote,HansBoehm2020ConcurrentUB,DavidGoldblatt2019NoElegantOOTAfix,AlanJeffrey2014JavaDRF,PaulEMcKenney2020RelaxedGuideRelaxed,PaulEMcKenney2016OOTA,Sevcik:2011:SOS:1993316.1993534,Vafeiadis:2015:CCO:2775051.2676995}. |
| }\QuickQuizEndE |
| } |
| |
| For more on Alpha, see its reference manual~\cite{ALPHA2002}. |
| |
| \subsection{\ARMv7-A/R} |
| \label{sec:memorder:ARMv7-A/R} |
| |
| The \ARM\ family of CPUs is popular in deep embedded applications, |
| particularly for power-constrained microcontrollers. |
| Its \IXalthmr{memory model}{Armv7}{memory model} is similar to that of \Power{} |
| (see \cref{sec:memorder:POWER / PowerPC}), but \ARM\ uses a |
| different set of memory-barrier instructions~\cite{ARMv7A:2010}: |
| |
| \begin{description} |
| \item [\tco{DMB}] (data memory barrier) causes the specified type of |
| operations to \emph{appear} to have completed before any |
| subsequent operations of the same type. |
| The ``type'' of operations can be all operations or can be |
| restricted to only writes (similar to the Alpha \co{wmb} |
| and the \Power{} \co{eieio} instructions). |
| In addition, \ARM\ allows \IX{cache coherence} to have one of three |
| scopes: |
| Single processor, a subset of the processors |
| (``inner'') and global (``outer''). |
| \item [\tco{DSB}] (data synchronization barrier) causes the specified |
| type of operations to actually complete before any subsequent |
| operations (of any type) are executed. |
| The ``type'' of operations is the same as that of \co{DMB}. |
| The \co{DSB} instruction was called \co{DWB} (drain write buffer |
| or data write barrier, your choice) in early versions of the |
| \ARM\ architecture. |
| \item [\tco{ISB}] (instruction synchronization barrier) flushes the CPU |
| pipeline, so that all instructions following the \co{ISB} |
| are fetched only after the \co{ISB} completes. |
| For example, if you are writing a self-modifying program |
| (such as a JIT), you should execute an \co{ISB} between |
| generating the code and executing it. |
| \end{description} |
| |
| None of these instructions exactly match the semantics of Linux's |
| \co{rmb()} primitive, which must therefore be implemented as a full |
| \co{DMB}. |
| The \co{DMB} and \co{DSB} instructions have a recursive definition |
| of accesses ordered before and after the barrier, which has an effect |
| similar to that of \Power{}'s cumulativity, both of which are |
| stronger than \IXalthmr{\acr{lkmm}}{Linux kernel}{memory model}'s |
| cumulativity described in |
| \cref{sec:memorder:Cumulativity}. |
| |
| \ARM\ also implements \IXalth{control dependencies}{control}{dependency}, |
| so that if a conditional |
| branch depends on a load, then any store executed after that conditional |
| branch will be ordered after the load. |
| However, loads following the conditional branch will \emph{not} |
| be guaranteed to be ordered unless there is an \co{ISB} |
| instruction between the branch and the load. |
| Consider the following example: |
| |
| \begin{fcvlabel}[ln:memorder:ARM:load-store control dependency] |
| \begin{VerbatimN}[commandchars=\\\[\]] |
| r1 = x; \lnlbl[x] |
| if (r1 == 0) \lnlbl[if] |
| nop(); \lnlbl[nop] |
| y = 1; \lnlbl[y] |
| r2 = z; \lnlbl[z1] |
| ISB(); \lnlbl[isb] |
| r3 = z; \lnlbl[z2] |
| \end{VerbatimN} |
| \end{fcvlabel} |
| |
| \begin{fcvref}[ln:memorder:ARM:load-store control dependency] |
| In this example, load-store control dependency ordering causes |
| the load from \co{x} on \clnref{x} to be ordered before the store to |
| \co{y} on \clnref{y}. |
| However, \ARM\ does not respect load-load control dependencies, so that |
| the load on \clnref{x} might well happen \emph{after} the |
| load on \clnref{z1}. |
| On the other hand, the combination of the conditional branch on \clnref{if} |
| and the \co{ISB} instruction on \clnref{isb} ensures that |
| the load on \clnref{z2} happens after the load on \clnref{x}. |
| Note that inserting an additional \co{ISB} instruction somewhere between |
| \clnref{if,z1} would enforce ordering between \clnref{x,z1}. |
| \end{fcvref} |
| |
| \subsection{\ARMv8} |
| \label{sec:memorder:ARMv8} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/r-2014-LDLAR}} |
| \caption{Half Memory Barrier} |
| \ContributedBy{fig:memorder:Half Memory Barrier}{Melissa Brossard} |
| \end{figure} |
| |
| \ARM's \ARMv8 CPU family~\cite{ARMv8A:2017} |
| includes 64-bit capabilities, |
| in contrast to their 32-bit-only CPU described in |
| \cref{sec:memorder:ARMv7-A/R}. |
| \IXalthmr{\ARMv8's memory model}{Armv8}{memory model} |
| closely resembles its \ARMv7 counterpart, |
| but adds load-acquire (\co{LDLARB}, \co{LDLARH}, and \co{LDLAR}) |
| and store-release (\co{STLLRB}, \co{STLLRH}, and \co{STLLR}) |
| instructions. |
| These instructions act as ``half memory barriers'', so that |
| \ARMv8 CPUs can reorder previous accesses with a later \co{LDLAR} |
| instruction, but are prohibited from reordering an earlier \co{LDLAR} |
| instruction with later accesses, as fancifully depicted in |
| \cref{fig:memorder:Half Memory Barrier}. |
| Similarly, \ARMv8 CPUs can reorder an earlier \co{STLLR} instruction with |
| a subsequent access, but are prohibited from reordering |
| previous accesses with a later \co{STLLR} instruction. |
| As one might expect, this means that these instructions directly support |
| the C11 notion of load-acquire and store-release. |
| |
| However, \ARMv8 goes well beyond the C11 memory model by mandating that |
| the combination of a store-release and load-acquire act as a full |
| barrier under certain circumstances. |
| For example, in \ARMv8, given a store followed by a store-release followed |
| by a load-acquire followed by a load, all to different variables and all from
| a single CPU, all CPUs |
| would agree that the initial store preceded the final load. |
| Interestingly enough, most TSO architectures (including x86 and the |
| mainframe) do not make this guarantee, as the two loads could be |
| reordered before the two stores. |
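|
| In Linux-kernel terms, the sequence in question might be sketched
| as follows, with made-up distinct variables \co{a}, \co{b}, \co{c},
| and \co{d}:
|
| \begin{VerbatimN}
| WRITE_ONCE(*a, 1);        /* Initial store. */
| smp_store_release(b, 1);  /* Store-release. */
| r1 = smp_load_acquire(c); /* Load-acquire. */
| r2 = READ_ONCE(*d);       /* Final load. */
| \end{VerbatimN}
|
| On \ARMv8, the store-release/load-acquire pair acts as a full
| barrier here, so all CPUs agree that the store to \co{a} precedes
| the load from \co{d}.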
| |
| \ARMv8 is one of only two architectures that need the
| \co{smp_mb__after_spinlock()} primitive to be a full barrier,
| due to its relatively weak lock-acquisition implementation in |
| the Linux kernel. |
| |
| \ARMv8 also has the distinction of being the first CPU whose vendor publicly |
| defined its memory ordering with an executable formal model~\cite{ARMv8A:2017}. |
| |
| \subsection{Itanium} |
| \label{sec:memorder:Itanium} |
| |
| Itanium offers a \IXalth{weak consistency}{weak}{memory consistency} |
| \IXalthmr{model}{Itanium}{memory model}, so that in absence of explicit |
| memory-barrier instructions or dependencies, Itanium is within its rights |
| to arbitrarily reorder memory references~\cite{IntelItanium02v2}. |
| Itanium has a memory-fence instruction named \co{mf}, but also has |
| ``half-memory fence'' modifiers to loads, stores, and to some of its atomic |
| instructions~\cite{IntelItanium02v3}. |
| The \co{acq} modifier prevents subsequent memory-reference instructions |
| from being reordered before the \co{acq}, but permits |
| prior memory-reference instructions to be reordered after the \co{acq}, |
| similar to the \ARMv8 load-acquire instructions. |
| Similarly, the \co{rel} modifier prevents prior memory-reference |
| instructions from being reordered after the \co{rel}, but allows |
| subsequent memory-reference instructions to be reordered before |
| the \co{rel}. |
| |
| These half-memory fences are useful for critical sections, since |
| it is safe to push operations into a critical section, but can be |
| fatal to allow them to bleed out. |
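|
| As a sketch, with a made-up lock and made-up variables:
|
| \begin{VerbatimN}
| WRITE_ONCE(x, 1);     /* Can migrate down into critical section. */
| spin_lock(&mylock);   /* Acquire: later accesses stay below. */
| do_something();
| spin_unlock(&mylock); /* Release: earlier accesses stay above. */
| WRITE_ONCE(y, 1);     /* Can migrate up into critical section. */
| \end{VerbatimN}
|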
| However, as one of the few CPUs with this property, Itanium at one |
| time defined Linux's semantics of memory ordering associated with lock |
| acquisition and release.\footnote{ |
| PowerPC is now the architecture with this dubious privilege.} |
| Oddly enough, actual Itanium hardware is rumored to implement |
| both load-acquire and store-release instructions as full barriers. |
| Nevertheless, Itanium was the first mainstream CPU to introduce the concept |
| (if not the reality) of load-acquire and store-release into its |
| instruction set. |
| |
| \QuickQuiz{ |
| Given that hardware can have a half memory barrier, why don't |
| locking primitives allow the compiler to move memory-reference |
| instructions into lock-based critical sections? |
| }\QuickQuizAnswer{ |
| In fact, as we saw in \cref{sec:memorder:ARMv8} and will |
| see in \cref{sec:memorder:POWER / PowerPC}, hardware really does |
| implement partial memory-ordering instructions and it also turns |
| out that these really are used to construct locking primitives. |
| However, these locking primitives use full compiler barriers, |
| thus preventing the compiler from reordering memory-reference |
| instructions both out of and into the corresponding critical |
| section. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:memorder:synchronize-rcu] |
| \begin{VerbatimL}[commandchars=\@\[\]] |
| static inline int rcu_gp_ongoing(unsigned long *ctr) |
| { |
| unsigned long v; |
| |
| v = LOAD_SHARED(*ctr);@lnlbl[load] |
| return v && (v != rcu_gp_ctr); |
| } |
| |
| static void update_counter_and_wait(void) |
| { |
| struct rcu_reader *index; |
| |
| STORE_SHARED(rcu_gp_ctr, rcu_gp_ctr + RCU_GP_CTR); |
| barrier(); |
| 	list_for_each_entry(index, &registry, node) {@lnlbl[loop]
| while (rcu_gp_ongoing(&index->ctr))@lnlbl[call2] |
| msleep(10); |
| } |
| } |
| |
| void synchronize_rcu(void) |
| { |
| unsigned long was_online; |
| |
| was_online = rcu_reader.ctr; |
| smp_mb(); |
| if (was_online)@lnlbl[if] |
| STORE_SHARED(rcu_reader.ctr, 0);@lnlbl[store] |
| mutex_lock(&rcu_gp_lock);@lnlbl[acqmutex] |
| update_counter_and_wait();@lnlbl[call1] |
| mutex_unlock(&rcu_gp_lock); |
| if (was_online) |
| STORE_SHARED(rcu_reader.ctr, LOAD_SHARED(rcu_gp_ctr)); |
| smp_mb(); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Userspace RCU Code Reordering} |
| \label{lst:memorder:Userspace RCU Code Reordering} |
| \end{listing} |
| |
| To see why the compiler is forbidden from doing reordering that |
| is permitted by hardware, consider the sample code shown
| in \cref{lst:memorder:Userspace RCU Code Reordering}.
| This code is based on the userspace RCU update-side |
| code~\cite[Supplementary Materials Figure 5]{MathieuDesnoyers2012URCU}. |
| |
| \begin{fcvref}[ln:memorder:synchronize-rcu] |
| Suppose that the compiler reordered \clnref{if,store} into |
| the critical section starting at \clnref{acqmutex}. |
| Now suppose that two updaters start executing \co{synchronize_rcu()} |
| at about the same time. |
| Then consider the following sequence of events: |
| \begin{enumerate} |
| \item CPU~0 acquires the lock at \clnref{acqmutex}. |
| \item \Clnref{if} determines that CPU~0 was online, so it clears |
| its own counter at \clnref{store}. |
| (Recall that \clnref{if,store} have been reordered by the |
| compiler to follow \clnref{acqmutex}). |
| \item CPU~0 invokes \co{update_counter_and_wait()} from |
| \clnref{call1}. |
| \item CPU~0 invokes \co{rcu_gp_ongoing()} on itself at |
| \clnref{call2}, and \clnref{load} sees that CPU~0 is |
| in a quiescent state. |
| Control therefore returns to \co{update_counter_and_wait()}, |
| and \clnref{loop} advances to CPU~1. |
| \item CPU~1 invokes \co{synchronize_rcu()}, but because CPU~0 |
| already holds the lock, CPU~1 blocks waiting for this |
| lock to become available. |
| Because the compiler reordered \clnref{if,store} to follow |
| \clnref{acqmutex}, CPU~1 does not clear its own counter, |
| despite having been online. |
| \item CPU~0 invokes \co{rcu_gp_ongoing()} on CPU~1 at |
| \clnref{call2}, and \clnref{load} sees that CPU~1 is |
| not in a quiescent state. |
| The \co{while} loop at \clnref{call2} therefore never |
| exits. |
| \end{enumerate} |
| |
| So the compiler's reordering results in a deadlock. |
| In contrast, hardware reordering is temporary, so that CPU~1 |
| might undertake its first attempt to acquire the mutex on |
| \clnref{acqmutex} before executing \clnref{if,store}, but it |
| will eventually execute \clnref{if,store}. |
| Because hardware reordering only results in a short delay, it |
| can be tolerated. |
| On the other hand, because compiler reordering results in a |
| deadlock, it must be prohibited. |
| |
| Some research efforts have used hardware transactional memory |
| to allow compilers to safely reorder more aggressively, but |
| the overhead of hardware transactions has thus far made |
| such optimizations unattractive. |
| % @@@ Citation for compilers use of HTM in this manner? |
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| The Itanium \co{mf} instruction is used for the \co{smp_rmb()}, |
| \co{smp_mb()}, and \co{smp_wmb()} primitives in the Linux kernel. |
| Despite persistent rumors to the contrary, the \qco{mf} mnemonic stands |
| for ``memory fence''. |
| |
| Itanium also offers a global total order for release operations, |
| including the \co{mf} instruction. |
| This provides the notion of transitivity, where if a given code fragment |
| sees a given access as having happened, any later code fragment will |
| also see that earlier access as having happened. |
| Assuming, that is, that all the code fragments involved correctly use |
| memory barriers. |
| |
| Finally, Itanium is the only architecture supporting the Linux kernel |
| that can reorder normal loads to the same variable. |
| The Linux kernel avoids this issue because \co{READ_ONCE()} emits |
| a \co{volatile} load, which is compiled as a \co{ld,acq} instruction, |
| which forces ordering of all \co{READ_ONCE()} invocations by a given |
| CPU, including those to the same variable. |
| |
| \subsection{MIPS} |
| |
| The \IXhmr{MIPS}{memory model}~\cite[page~479]{MIPSvII-A-2016} |
| appears to resemble that of \ARM, Itanium, and \Power{}, |
| being weakly ordered by default, but respecting dependencies. |
| MIPS has a wide variety of memory-barrier instructions, but ties them |
| not to hardware considerations, but rather to the use cases provided |
| by the Linux kernel and the C++11 standard~\cite{RichardSmith2019N4800} |
| in a manner similar to the \ARMv8 additions: |
| |
| \begin{description}[style=nextline] |
| \item[\tco{SYNC}] |
| Full barrier for a number of hardware operations in addition |
| to memory references, which is used to implement the v4.13 |
| Linux kernel's \co{smp_mb()} for OCTEON systems. |
| \item[\tco{SYNC_WMB}] |
| Write memory barrier, which can be used on OCTEON systems |
| to implement the |
| \co{smp_wmb()} primitive in the v4.13 Linux kernel via the |
| \co{syncw} mnemonic. |
| Other systems use plain \co{sync}. |
| \item[\tco{SYNC_MB}] |
| Full memory barrier, but only for memory operations. |
| This may be used to implement the |
| C++ \co{atomic_thread_fence(memory_order_seq_cst)}. |
| \item[\tco{SYNC_ACQUIRE}] |
| Acquire memory barrier, which could be used to implement |
| C++'s \co{atomic_thread_fence(memory_order_acquire)}. |
| In theory, it could also be used to implement the v4.13 Linux-kernel |
| \co{smp_load_acquire()} primitive, but in practice |
| \co{sync} is used instead. |
| \item[\tco{SYNC_RELEASE}] |
| Release memory barrier, which may be used to implement |
| C++'s \co{atomic_thread_fence(memory_order_release)}. |
| In theory, it could also be used to implement the v4.13 Linux-kernel |
| \co{smp_store_release()} primitive, but in practice |
| \co{sync} is used instead. |
| \item[\tco{SYNC_RMB}] |
| Read memory barrier, which could in theory be used to implement the |
| \co{smp_rmb()} primitive in the Linux kernel, except that current |
| MIPS implementations supported by the v4.13 Linux kernel do not |
| need an explicit instruction to force ordering. |
| Therefore, \co{smp_rmb()} instead simply constrains the compiler. |
| \item[\tco{SYNCI}] |
| Instruction-cache synchronization, which is used in conjunction with |
| other instructions to allow self-modifying code, such as that produced |
| by just-in-time (JIT) compilers. |
| \end{description} |
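| |
| As an illustration, the following minimal sketch shows how the |
| flavors listed above might be emitted from C code. |
| The numeric \co{stype} encodings are assumptions based on a |
| reading of the MIPS documentation cited above, so verify them |
| against the manual for your particular implementation: |
| |
| \begin{verbatim} |
| /* Sketch: MIPS fence flavors as GCC inline assembly. |
|  * The stype encodings (4, 16, 17, 18, and 19) are |
|  * assumptions; consult your implementation's manual. */ |
| #define mips_sync(stype) \ |
| 	__asm__ __volatile__("sync " #stype : : : "memory") |
| |
| static inline void mips_wmb(void)     { mips_sync(4);  } /* SYNC_WMB     */ |
| static inline void mips_mb(void)      { mips_sync(16); } /* SYNC_MB      */ |
| static inline void mips_acquire(void) { mips_sync(17); } /* SYNC_ACQUIRE */ |
| static inline void mips_release(void) { mips_sync(18); } /* SYNC_RELEASE */ |
| static inline void mips_rmb(void)     { mips_sync(19); } /* SYNC_RMB     */ |
| \end{verbatim} |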
| |
| Informal discussions with MIPS architects indicate that MIPS has a |
| definition of transitivity or cumulativity similar to that of |
| \ARM\ and \Power{}\@. |
| However, it appears that different MIPS implementations can have |
| different memory-ordering properties, so it is important to consult |
| the documentation for the specific MIPS implementation you are using. |
| |
| \subsection{\Power{} / PowerPC} |
| \label{sec:memorder:POWER / PowerPC} |
| |
| The \Power{} and PowerPC CPU families have a wide variety of memory-barrier |
| instructions~\cite{PowerPC94,MichaelLyons05a}: |
| \begin{description} |
| \item [\tco{sync}] causes all preceding operations to {\em appear to have} |
| completed before any subsequent operations are started. |
| This instruction is therefore quite expensive. |
| \item [\tco{lwsync}] (lightweight sync) orders loads with respect to |
| subsequent loads and stores, and also orders stores. |
| However, it does {\em not} order stores with respect to subsequent |
| loads. |
| The \co{lwsync} instruction may be used to implement |
| load-acquire and store-release operations, as sketched |
| immediately after this list. |
| Interestingly enough, the \co{lwsync} instruction enforces |
| the same within-CPU ordering as do x86, z~Systems, and, |
| coincidentally, SPARC TSO\@. |
| However, placing the \co{lwsync} instruction between each |
| pair of memory-reference instructions will \emph{not} |
| result in x86, z~Systems, or SPARC TSO memory ordering. |
| On these other systems, if a pair of CPUs independently execute |
| stores to different variables, all other CPUs will agree on the |
| order of these stores. |
| Not so on PowerPC, even with an \co{lwsync} instruction between each |
| pair of memory-reference instructions, because PowerPC is |
| \IXalthy{non-multicopy atomic}{non-}{multi-copy atomic}. |
| \item [\tco{eieio}] (enforce in-order execution of I/O, in case you |
| were wondering) causes all preceding cacheable stores to appear |
| to have completed before all subsequent stores. |
| However, stores to cacheable memory are ordered separately from |
| stores to non-cacheable memory, which means that \co{eieio} |
| will not force an MMIO store to precede a spinlock release. |
| This instruction may well be unique in having a five-vowel mnemonic. |
| \item [\tco{isync}] forces all preceding instructions to appear to have |
| completed before any subsequent instructions start execution. |
| This means that the preceding instructions must have progressed |
| far enough that any traps they might generate have either happened |
| or are guaranteed not to happen, and that any side-effects of |
| these instructions (for example, page-table changes) are seen by the |
| subsequent instructions. |
| However, it does \emph{not} force all memory references to be |
| ordered, only the actual execution of the instruction itself. |
| Thus, the loads might return old still-cached values and the |
| \co{isync} instruction does not force values previously stored |
| to be flushed from the store buffers. |
| \end{description} |
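| |
| A minimal sketch of this use of \co{lwsync}, in the style of |
| (but not necessarily identical to) the Linux kernel's \Power{} |
| implementation, and assuming kernel-style \co{READ_ONCE()} and |
| \co{WRITE_ONCE()}, might look as follows: |
| |
| \begin{verbatim} |
| /* Sketch: acquire/release on Power via lwsync.  Names |
|  * prefixed with sketch_ are illustrative, not kernel APIs. */ |
| #define powerpc_lwsync() \ |
| 	__asm__ __volatile__("lwsync" : : : "memory") |
| |
| #define sketch_store_release(p, v)                        \ |
| do {                                                      \ |
| 	powerpc_lwsync();      /* Prior accesses ...    */ \ |
| 	WRITE_ONCE(*(p), (v)); /* ... before the store. */ \ |
| } while (0) |
| |
| #define sketch_load_acquire(p)                             \ |
| ({                                                         \ |
| 	__typeof__(*(p)) v___ = READ_ONCE(*(p));            \ |
| 	powerpc_lwsync(); /* Load before later accesses. */ \ |
| 	v___;                                               \ |
| }) |
| \end{verbatim} |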
| |
| Unfortunately, none of these instructions line up exactly with Linux's |
| \co{wmb()} primitive, which requires \emph{all} stores to be ordered, |
| but does not require the other high-overhead actions of the \co{sync} |
| instruction. |
| The \co{rmb()} primitive doesn't have a matching light-weight instruction |
| either. |
| But there is no choice: |
| ppc64 versions of \co{wmb()}, \co{rmb()}, and \co{mb()} are defined |
| to be the heavyweight \co{sync} instruction. |
| However, Linux's \co{smp_wmb()} primitive is never used for MMIO |
| (since a driver must carefully order MMIOs in UP as well as |
| SMP kernels, after all), so it is defined to be the lighter weight |
| \co{eieio} or \co{lwsync} instruction~\cite{PaulEMcKenney2016LinuxKernelMMIO}. |
| The \co{smp_mb()} primitive is also defined to be the \co{sync} |
| instruction, while \co{smp_rmb()} is defined to be the lighter-weight |
| \co{lwsync} instruction. |
| |
| \Power{} features ``cumulativity'', which can be used to obtain |
| transitivity. |
| When used properly, any code seeing the results of an earlier |
| code fragment will also see the accesses that this earlier code |
| fragment itself saw. |
| Much more detail is available from |
| McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}. |
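| |
| For example, consider the following three-CPU sketch, which |
| reuses the hypothetical \co{powerpc_lwsync()} helper from the |
| preceding sketch and in which \co{x} and \co{y} are both |
| initially zero. |
| Cumulativity forbids the outcome in which \co{r1} and \co{r2} |
| are both 1 while \co{r3} is 0: |
| |
| \begin{verbatim} |
| /* Sketch: cumulativity yielding transitivity on Power. */ |
| int x, y; |
| int r1, r2, r3; |
| |
| void cpu0(void) { WRITE_ONCE(x, 1); } |
| |
| void cpu1(void) |
| { |
| 	r1 = READ_ONCE(x); /* Suppose this sees cpu0's store. */ |
| 	powerpc_lwsync();  /* Cumulative: covers cpu0's store. */ |
| 	WRITE_ONCE(y, 1); |
| } |
| |
| void cpu2(void) |
| { |
| 	r2 = READ_ONCE(y); /* Suppose this sees cpu1's store. */ |
| 	powerpc_lwsync(); |
| 	r3 = READ_ONCE(x); /* Then this must see cpu0's store. */ |
| } |
| \end{verbatim} |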
| |
| \Power{} respects control dependencies in much the same way that \ARM\ |
| does, with the exception that the \Power{} \co{isync} instruction |
| is substituted for the \ARM\ \co{ISB} instruction. |
| |
| Like \ARMv8, \Power{} requires \co{smp_mb__after_spinlock()} to be |
| a full memory barrier. |
| In addition, \Power{} is the only architecture requiring |
| \co{smp_mb__after_unlock_lock()} to be a full memory barrier. |
| In both cases, this is because of the weak ordering properties |
| of \Power{}'s locking primitives, due to the use of the \co{lwsync} |
| instruction to provide ordering for both acquisition and release. |
| |
| Many members of the \Power{} architecture have incoherent instruction |
| caches, so that a store to memory will not necessarily be reflected |
| in the instruction cache. |
| Thankfully, few people write self-modifying code these days, but JITs |
| and compilers do it all the time. |
| Furthermore, recompiling a recently run program looks just like |
| self-modifying code from the CPU's viewpoint. |
| The \co{icbi} instruction (instruction cache block invalidate) |
| invalidates a specified cache line from |
| the instruction cache, and may be used in these situations. |
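| |
| For example, a JIT might publish newly generated code using |
| GCC's \co{__builtin___clear_cache()}, which emits the required |
| \co{icbi}-based sequence (or a given architecture's equivalent). |
| The function and parameter names in this sketch are |
| illustrative: |
| |
| \begin{verbatim} |
| #include <stddef.h> |
| |
| /* Sketch: making freshly written machine code executable |
|  * on architectures with incoherent instruction caches. */ |
| void publish_jitted_code(char *buf, size_t len) |
| { |
| 	/* ... buf[0 .. len-1] filled with machine code ... */ |
| 	__builtin___clear_cache(buf, buf + len); |
| 	/* It is now safe to branch into buf. */ |
| } |
| \end{verbatim} |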
| |
| \subsection{SPARC TSO} |
| |
| Although SPARC's TSO (total-store order) is used by both Linux and |
| Solaris, the architecture also defines PSO (partial store order) and RMO |
| (relaxed-memory order). |
| Any program that runs in RMO will also run in either PSO or TSO, and similarly, |
| a program that runs in PSO will also run in TSO\@. |
| Moving a shared-memory parallel program in the other direction may |
| require careful insertion of memory barriers. |
| |
| Although SPARC's PSO and RMO modes are not used much these days, they |
| did give rise to a very flexible memory-barrier instruction~\cite{SPARC94} |
| that permits fine-grained control of ordering: |
| \begin{description} |
| \item [\tco{StoreStore}] orders preceding stores before subsequent stores. |
| (This option is used by the Linux \co{smp_wmb()} primitive.) |
| \item [\tco{LoadStore}] orders preceding loads before subsequent stores. |
| \item [\tco{StoreLoad}] orders preceding stores before subsequent loads. |
| \item [\tco{LoadLoad}] orders preceding loads before subsequent loads. |
| (This option is used by the Linux \co{smp_rmb()} primitive.) |
| \item [\tco{Sync}] fully completes all preceding operations before starting |
| any subsequent operations. |
| \item [\tco{MemIssue}] completes preceding memory operations before subsequent |
| memory operations, important for some instances of memory-mapped |
| I/O. |
| \item [\tco{Lookaside}] does the same as MemIssue, |
| but only applies to preceding stores |
| and subsequent loads, and even then only for stores and loads that |
| access the same memory location. |
| \end{description} |
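| |
| These options are bits in a mask, and thus may be combined. |
| For example, the following sketch shows an acquire-style |
| barrier that might be used in RMO mode, though it is |
| illustrative only, given that the Linux kernel runs SPARC |
| in TSO mode: |
| |
| \begin{verbatim} |
| /* Sketch: combining membar options under RMO.  Placed |
|  * after a load, this orders that load before subsequent |
|  * loads and stores. */ |
| #define sparc_rmo_acquire() \ |
| 	__asm__ __volatile__("membar #LoadLoad | #LoadStore" \ |
| 			     : : : "memory") |
| \end{verbatim} |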
| |
| So, why is \qco{membar #MemIssue} needed? |
| Because a \qco{membar #StoreLoad} could permit a subsequent |
| load to get its value from a store buffer, which would be |
| disastrous if the write were to an MMIO register that induces |
| side effects on the value to be read. |
| In contrast, \qco{membar #MemIssue} would wait until the store buffers |
| were flushed before permitting the loads to execute, |
| thereby ensuring that the load actually gets its value from the MMIO register. |
| Drivers could instead use \qco{membar #Sync}, but the lighter-weight |
| \qco{membar #MemIssue} is preferred in cases where the additional |
| functionality of the more-expensive \qco{membar #Sync} is not required. |
| |
| The \qco{membar #Lookaside} is a lighter-weight version of |
| \qco{membar #MemIssue}, which is useful when writing to a given MMIO register |
| affects the value that will next be read from that register. |
| However, the heavier-weight \qco{membar #MemIssue} must be used when |
| a write to a given MMIO register affects the value that will next be |
| read from {\em some other} MMIO register. |
| |
| SPARC requires a \co{flush} instruction be used between the time that |
| the instruction stream is modified and the time that any of these |
| instructions are executed~\cite{SPARC94}. |
| This is needed to flush any prior value for that location from |
| the SPARC's instruction cache. |
| Note that \co{flush} takes an address, and will flush only that address |
| from the instruction cache. |
| On SMP systems, all CPUs' caches are flushed, but there is no |
| convenient way to determine when the off-CPU flushes complete; |
| the architecture manual instead refers to an implementation note. |
| |
| But again, the Linux kernel runs SPARC in TSO mode, so |
| all of the above \co{membar} variants are strictly of historical |
| interest. |
| In particular, the \co{smp_mb()} primitive only needs to use \co{#StoreLoad} |
| because the other three reorderings are prohibited by TSO\@. |
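| |
| A minimal sketch modeled on (but not copied from) the |
| kernel's SPARC implementation: |
| |
| \begin{verbatim} |
| /* Sketch: under TSO, only store-to-load reordering can |
|  * occur, so a full barrier needs only #StoreLoad. */ |
| #define sparc_tso_mb() \ |
| 	__asm__ __volatile__("membar #StoreLoad" : : : "memory") |
| \end{verbatim} |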
| |
| \subsection{x86} |
| |
| Historically, the x86 CPUs provided ``process ordering'' so that all CPUs |
| agreed on the order of a given CPU's writes to memory. |
| This allowed the \co{smp_wmb()} |
| primitive to be a no-op for the CPU~\cite{IntelXeonV3-96a}. |
| Of course, a compiler directive was also required to prevent optimizations |
| that would reorder across the \co{smp_wmb()} primitive. |
| In ancient times, certain x86 CPUs gave no ordering guarantees for loads, so |
| the \co{smp_mb()} and \co{smp_rmb()} primitives expanded to \co{lock;addl}. |
| This atomic instruction acts as a barrier to both loads and stores. |
| |
| But those were ancient times. |
| More recently, Intel has published a |
| \IXalthmr{memory model}{x86}{memory model} for |
| x86~\cite{Intelx86MemoryOrdering2007}. |
| It turns out that Intel's modern CPUs enforce tighter ordering than was |
| claimed in the previous specifications, so this model simply mandates |
| this modern behavior. |
| Even more recently, Intel published an updated memory model for |
| x86~\cite[Section 8.2]{Intel64IA32v3A2011}, which mandates a total global order |
| for stores, although individual CPUs are still permitted to see their |
| own stores as having happened earlier than this total global order |
| would indicate. |
| This exception to the total ordering is needed to allow important |
| hardware optimizations involving store buffers. |
| In addition, x86 provides |
| \IXalthy{other-multicopy atomicity}{other-}{multi-copy atomic}, for example, |
| so that if CPU~0 sees a store by CPU~1, then CPU~0 is guaranteed to see |
| all stores that CPU~1 saw prior to its store. |
| Software may use atomic operations to override these hardware optimizations, |
| which is one reason that atomic operations tend to be more expensive |
| than their non-atomic counterparts. |
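| |
| The following sketch illustrates other-multicopy atomicity |
| using kernel-style \co{READ_ONCE()} and \co{WRITE_ONCE()} and |
| illustrative variables \co{x} and \co{y}, both initially zero. |
| On x86, the outcome in which \co{r1} and \co{r2} are both 1 |
| while \co{r3} is 0 is forbidden even without barrier |
| instructions, although more weakly ordered architectures |
| would require barriers: |
| |
| \begin{verbatim} |
| /* Sketch: other-multicopy atomicity on x86. */ |
| int x, y; |
| int r1, r2, r3; |
| |
| void cpu2(void) { WRITE_ONCE(x, 1); } |
| |
| void cpu1(void) |
| { |
| 	r1 = READ_ONCE(x); /* Suppose this sees cpu2's store ... */ |
| 	WRITE_ONCE(y, 1);  /* ... prior to CPU 1's own store.    */ |
| } |
| |
| void cpu0(void) |
| { |
| 	r2 = READ_ONCE(y); /* Suppose CPU 0 sees CPU 1's store. */ |
| 	r3 = READ_ONCE(x); /* Then it must also see cpu2's.     */ |
| } |
| \end{verbatim} |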
| |
| It is also important to note that atomic instructions operating |
| on a given memory location should all be of the same |
| size~\cite[Section 8.1.2.2]{Intel64IA32v3A2016}. |
| For example, if you write a program where one CPU atomically increments |
| a byte while another CPU executes a 4-byte atomic increment on |
| that same location, you are on your own. |
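| |
| The following anti-pattern sketch, in which the union is |
| illustrative, shows the sort of code being warned against. |
| Note that C11 makes no more guarantees here than does the |
| hardware: |
| |
| \begin{verbatim} |
| #include <stdatomic.h> |
| |
| /* Anti-pattern sketch: mixed-size atomic read-modify-write |
|  * operations on overlapping memory.  Do not do this. */ |
| union mixed { |
| 	_Atomic unsigned char b; |
| 	_Atomic unsigned int  w; |
| }; |
| union mixed m; |
| |
| void cpu0(void) { atomic_fetch_add(&m.b, 1); /* 1-byte RMW. */ } |
| void cpu1(void) { atomic_fetch_add(&m.w, 1); /* 4-byte RMW. */ } |
| \end{verbatim} |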
| |
| Some SSE instructions are weakly ordered (\co{clflush} |
| and non-temporal move instructions~\cite{IntelXeonV2b-96a}). |
| Code that uses these non-temporal move instructions |
| can use \co{mfence} for \co{mb()}, |
| \co{lfence} for \co{rmb()}, and \co{sfence} for \co{wmb()}.\footnote{ |
| \co{smp_mb()}, \co{smp_rmb()}, and \co{smp_wmb()} don't suffice |
| for ordering non-temporal move instructions since Linux v4.15.} |
| A few older variants of the x86 CPU have a mode bit that enables out-of-order |
| stores, and for these CPUs, \co{smp_wmb()} must also be defined to |
| be \co{lock;addl}. |
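| |
| For example, a producer using non-temporal stores might |
| publish data as in the following sketch, which uses Intel's |
| \co{_mm_stream_si32()} and \co{_mm_sfence()} intrinsics. |
| The \co{buf} and \co{ready} variables are illustrative: |
| |
| \begin{verbatim} |
| #include <emmintrin.h>  /* _mm_stream_si32() */ |
| #include <xmmintrin.h>  /* _mm_sfence()      */ |
| |
| int buf; |
| int ready; |
| |
| /* Sketch: ordering a weakly ordered non-temporal store |
|  * before the store to a flag. */ |
| void producer(void) |
| { |
| 	_mm_stream_si32(&buf, 42); /* Non-temporal store. */ |
| 	_mm_sfence();              /* Order it before ... */ |
| 	__atomic_store_n(&ready, 1, __ATOMIC_RELAXED); /* ... the flag. */ |
| } |
| \end{verbatim} |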
| |
| A 2017 kernel commit by Michael S.~Tsirkin replaced \co{mfence} with |
| \co{lock;addl} in \co{smp_mb()}, achieving a 60 percent performance |
| boost~\cite{Tsirkin2017}. |
| Instead of operating directly on the memory pointed to by |
| \co{SP}, the change applied the \co{addl} to a location four |
| bytes below \co{SP}, thus avoiding slowdowns due to false |
| data dependencies. |
| \co{clflush} users still need to use \co{mfence} for ordering. |
| Therefore, they were converted to use \co{mb()}, which uses \co{mfence} |
| as before, instead of \co{smp_mb()}. |
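| |
| A sketch of the resulting approach, modeled on (but not |
| copied from) that commit: |
| |
| \begin{verbatim} |
| /* Sketch: a locked no-op add to a location four bytes |
|  * below the stack pointer acts as a full memory barrier |
|  * while avoiding false dependencies on hot stack data. */ |
| #define x86_mb_sketch() \ |
| 	__asm__ __volatile__("lock; addl $0,-4(%%rsp)" \ |
| 			     : : : "memory", "cc") |
| \end{verbatim} |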
| |
| Although newer x86 implementations accommodate self-modifying code |
| without any special instructions, to be fully compatible with |
| past and potential future x86 implementations, a given CPU must |
| execute a jump instruction or a serializing instruction (e.g., \co{cpuid}) |
| between modifying the code and executing |
| it~\cite[Section 8.1.3]{Intel64IA32v3A2011}. |
| |
| \subsection{z Systems} |
| |
| The z~Systems machines make up the IBM mainframe family, previously |
| known as the 360, 370, 390 and zSeries~\cite{IBMzSeries04a}. |
| Parallelism came late to z~Systems, but given that these mainframes first |
| shipped in the mid 1960s, this is not saying much. |
| The \qco{bcr 15,0} instruction is used for the Linux \co{smp_mb()} |
| primitive, but compiler constraints suffice for both the |
| \co{smp_rmb()} and \co{smp_wmb()} primitives. |
| The z~Systems architecture also has strong memory-ordering semantics, |
| as shown in |
| \cref{tab:memorder:Summary of Memory Ordering}. |
| In particular, all CPUs will agree on the order of unrelated stores from |
| different CPUs, that is, the z~Systems CPU family is |
| \IXalt{fully multicopy atomic}{multi-copy atomic}, |
| and is the only commercially available system with this property. |
| |
| As with most CPUs, the z~Systems architecture does not guarantee a |
| cache-coherent instruction stream, hence, |
| self-modifying code must execute a serializing instruction between updating |
| the instructions and executing them. |
| That said, many actual z~Systems machines do in fact accommodate self-modifying |
| code without serializing instructions. |
| The z~Systems instruction set provides a large set of serializing instructions, |
| including compare-and-swap, some types of branches (for example, the |
| aforementioned \qco{bcr 15,0} instruction), and test-and-set. |
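| |
| A minimal sketch of a full barrier in the style of the |
| kernel's use of \qco{bcr 15,0} described above: |
| |
| \begin{verbatim} |
| /* Sketch: full memory barrier on z Systems via a |
|  * serializing branch-on-condition to register zero. */ |
| #define s390_mb_sketch() \ |
| 	__asm__ __volatile__("bcr 15,0" : : : "memory") |
| \end{verbatim} |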
| |
| \subsection{Hardware Specifics: |
| Discussion} |
| \label{sec:memorder:Hardware Specifics: Discussion} |
| |
| There is considerable variation among these CPU families, and this section |
| only scratched the surface of a few families that are either heavily used |
| or historically significant. |
| Those wishing for more detail are invited to consult the reference manuals. |
| |
| However, perhaps the biggest benefit of the Linux-kernel memory model is |
| that you can ignore these details when writing architecture-independent |
| Linux-kernel code. |
| |
| \QuickQuizAnswersChp{qqzmemorder} |