| % advsync/advsync.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{sec:advsync:Advanced Synchronization}{Advanced Synchronization}{qqzadvsync} |
| % |
| \Epigraph{If a little knowledge is a dangerous thing, just think what |
| you could do with a lot of knowledge!}{Unknown} |
| |
| This chapter covers synchronization techniques used for lockless |
| algorithms and parallel real-time systems. |
| |
| Although lockless algorithms can be quite helpful when faced with |
| extreme requirements, they are no panacea. |
| For example, as noted at the end of \cref{chp:Counting}, |
| you should thoroughly apply partitioning, batching, and |
| well-tested packaged weak APIs |
| (see \cref{chp:Data Ownership,chp:Deferred Processing}) |
| before even thinking about lockless algorithms. |
| |
| But after doing all that, you still might find yourself needing the |
| advanced techniques described in this chapter. |
| To that end, |
| \cref{sec:advsync:Avoiding Locks} |
| summarizes techniques used thus far for avoiding locks and |
| \cref{sec:advsync:Non-Blocking Synchronization} |
| gives a brief overview of non-blocking synchronization. |
| Memory ordering is also quite important, but it warrants its own |
| \lcnamecref{chp:Advanced Synchronization: Memory Ordering}, namely |
| \cref{chp:Advanced Synchronization: Memory Ordering}. |
| |
| The second form of advanced synchronization provides the stronger |
| forward-progress guarantees needed for parallel real-time computing, |
| which is the topic of |
| \cref{sec:advsync:Parallel Real-Time Computing}. |
| |
| \section{Avoiding Locks} |
| \label{sec:advsync:Avoiding Locks} |
| % |
| \epigraph{We are confronted with insurmountable opportunities.} |
| {Walt Kelly} |
| |
| Although locking is the workhorse of parallelism in production, in |
| many situations performance, scalability, and real-time response can |
| all be greatly improved through use of lockless techniques. |
| A particularly impressive example of such a lockless technique is |
| the statistical-counters algorithm described in |
| \cref{sec:count:Statistical Counters}, |
| which avoids not only locks, but also read-modify-write atomic operations, |
| memory barriers, and even cache misses for counter increments. |
| Other examples we have covered include: |
| |
| \begin{enumerate} |
| \item The fastpaths through a number of other counting algorithms |
| in \cref{chp:Counting}. |
| \item The fastpath through resource allocator caches in |
| \cref{sec:SMPdesign:Resource Allocator Caches}. |
| \item The maze solver in \cref{sec:SMPdesign:Beyond Partitioning}. |
| \item The data-ownership techniques in \cref{chp:Data Ownership}. |
| \item The reference-counting, hazard-pointer, and RCU techniques |
| in \cref{chp:Deferred Processing}. |
| \item The lookup code paths in \cref{chp:Data Structures}. |
| \item Many of the techniques in \cref{chp:Putting It All Together}. |
| \end{enumerate} |
| |
| In short, lockless techniques are quite useful and are heavily used. |
| However, it is best if lockless techniques are hidden behind |
| well-defined APIs such as \co{inc_count()}, \co{memblock_alloc()}, |
| \co{rcu_read_lock()}, and so on. |
| The reason for this is that undisciplined use of lockless techniques |
| is a good way to create difficult bugs. |
| If you believe that finding and fixing such bugs is easier than avoiding |
| them, please re-read |
| \cref{chp:Validation,chp:Formal Verification}. |
| |
| \section{Non-Blocking Synchronization} |
| \label{sec:advsync:Non-Blocking Synchronization} |
| % |
| \epigraph{Never worry about theory as long as the machinery does what |
| it's supposed to do.} |
| {Robert A. Heinlein} |
| |
| The term \IXBacrfst{nbs}~\cite{MauriceHerlihy90a} |
| describes eight classes of \IX{linearizable} algorithms with differing |
| \emph{\IXBpl{forward-progress guarantee}}~\cite{DanAlitarh2013PracticalProgress}, |
| which are as follows: |
| |
| \begin{enumerate} |
| \item \emph{\IXalth{Bounded population-oblivious wait-free}{bounded population-oblivious}{wait free} synchronization}: |
| Every thread will make progress within a specific finite period |
| of time, where this period of time is independent of the number |
| of threads~\cite{HerlihyShavit2008Textbook}. |
| This level is widely considered to be even less achievable than |
| bounded wait-free synchronization. |
| \item \emph{\IXalth{Bounded wait-free}{bounded}{wait free} synchronization}: |
| Every thread will make progress within |
| a specific finite period of time~\cite{Herlihy91}. |
| This level is widely considered to be unachievable, which might be why |
| Alistarh et al.\ omitted it~\cite{DanAlitarh2013PracticalProgress}. |
| \item \emph{\IXalt{Wait-free}{wait free} synchronization}: |
| Every thread will make progress |
| in finite time~\cite{Herlihy93}. |
| \item \emph{\IXalt{Lock-free}{lock free} synchronization}: |
| At least one thread will |
| make progress in finite time~\cite{Herlihy93}. |
| \item \emph{\IXalt{Obstruction-free}{obstruction free} synchronization}: |
| Every thread will make progress in finite time in the absence of |
| contention~\cite{HerlihyLM03}. |
| \item \emph{\IXalt{Clash-free}{clash free} synchronization}: |
| At least one thread will make progress in finite time in the absence of |
| contention~\cite{DanAlitarh2013PracticalProgress}. |
| \item \emph{\IXalt{Starvation-free}{starvation free} synchronization}: |
| Every thread will make progress in finite time in the absence of |
| failures~\cite{DanAlitarh2013PracticalProgress}. |
| \item \emph{\IXalt{Deadlock-free}{deadlock free} synchronization}: |
| At least one thread will make progress in finite time in the absence of |
| failures~\cite{DanAlitarh2013PracticalProgress}. |
| \end{enumerate} |
| |
| NBS class~1 was formulated some time before 2015, |
| classes~2, 3, and~4 were first formulated in the early 1990s, |
| class~5 was first formulated in the early 2000s, |
| and class~6 was first formulated in 2013. |
| The final two classes have seen informal use for a great many decades, |
| but were reformulated in 2013. |
| |
| \QuickQuiz{ |
| Given that there will always be a sharply limited number of |
| CPUs available, is population obliviousness really useful? |
| }\QuickQuizAnswer{ |
| Given the surprisingly limited scalability of any number of |
| NBS algorithms, population obliviousness can be surprisingly |
| useful. |
| Nevertheless, the overall point of the question is valid. |
| It is not normally helpful for an algorithm to scale beyond the |
| size of the largest system it is ever going to run on. |
| }\QuickQuizEnd |
| |
| In theory, any parallel algorithm can be cast into wait-free form, |
| but only a relatively small subset of NBS algorithms is in |
| common use. |
| A few of these are listed in the following section. |
| |
| \subsection{Simple NBS} |
| \label{sec:advsync:Simple NBS} |
| |
| Perhaps the simplest NBS algorithm is atomic update of an integer |
| counter using fetch-and-add (\co{atomic_add_return()}) primitives. |
| This section lists a few additional commonly used NBS algorithms in |
| roughly increasing order of complexity. |
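| |
| For concreteness, here is a minimal sketch of such a counter using |
| C11 atomics in place of the Linux kernel's \co{atomic_add_return()}; |
| the \co{inc_counter()} name is illustrative rather than part of any |
| actual API: |
| |
| \begin{VerbatimL} |
| #include <stdatomic.h> |
| |
| static atomic_long counter; |
| |
| /* Atomically increment, returning the new value in the |
|  * manner of atomic_add_return(). */ |
| static long inc_counter(void) |
| { |
| 	return atomic_fetch_add(&counter, 1) + 1; |
| } |
| \end{VerbatimL} |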
| |
| \subsubsection{NBS Sets} |
| \label{sec:advsync:NBS Sets} |
| |
| One simple NBS algorithm implements a set of integers in an array. |
| Here the array index indicates a value that might be a member of the set |
| and the array element indicates whether or not that value actually is |
| a set member. |
| The linearizability criterion for NBS algorithms requires that reads from |
| and updates to the array either use atomic instructions or be accompanied |
| by memory barriers, but in the not-uncommon case where linearizability |
| is not important, simple volatile loads and stores suffice, for example, |
| using \co{READ_ONCE()} and \co{WRITE_ONCE()}. |
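| |
| For example, the following minimal sketch implements such a set for |
| the values \co{0} through \co{SET_MAX - 1}, with relaxed C11 atomic |
| accesses standing in for \co{READ_ONCE()} and \co{WRITE_ONCE()}; the |
| \co{set_} names are illustrative only: |
| |
| \begin{VerbatimL} |
| #include <stdatomic.h> |
| #include <stdbool.h> |
| |
| #define SET_MAX 1024 |
| |
| static atomic_bool set_member[SET_MAX]; |
| |
| static void set_add(int v) |
| { |
| 	atomic_store_explicit(&set_member[v], true, |
| 			      memory_order_relaxed); |
| } |
| |
| static void set_del(int v) |
| { |
| 	atomic_store_explicit(&set_member[v], false, |
| 			      memory_order_relaxed); |
| } |
| |
| static bool set_contains(int v) |
| { |
| 	return atomic_load_explicit(&set_member[v], |
| 				    memory_order_relaxed); |
| } |
| \end{VerbatimL} |
| |
| Upgrading these accesses to \co{memory_order_seq_cst} would restore |
| the linearizability demanded by the stricter NBS criteria. |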
| |
| An NBS set may also be implemented using a bitmap, where each value that |
| might be a member of the set corresponds to one bit. |
| Reads and updates must normally be carried out via atomic bit-manipulation |
| instructions, although compare-and-swap (\co{cmpxchg()} or CAS) |
| instructions can also be used. |
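| |
| A corresponding bitmap sketch might use C11 fetch-and-OR and |
| fetch-and-AND in place of atomic bit-manipulation instructions, |
| again with illustrative names: |
| |
| \begin{VerbatimL} |
| #include <limits.h> |
| #include <stdatomic.h> |
| |
| #define SET_MAX 1024 |
| #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT) |
| |
| static atomic_ulong set_bits[SET_MAX / BITS_PER_LONG]; |
| |
| static void set_add(unsigned v) |
| { |
| 	atomic_fetch_or(&set_bits[v / BITS_PER_LONG], |
| 			1UL << (v % BITS_PER_LONG)); |
| } |
| |
| static void set_del(unsigned v) |
| { |
| 	atomic_fetch_and(&set_bits[v / BITS_PER_LONG], |
| 			~(1UL << (v % BITS_PER_LONG))); |
| } |
| |
| static int set_contains(unsigned v) |
| { |
| 	return !!(atomic_load(&set_bits[v / BITS_PER_LONG]) & |
| 		  (1UL << (v % BITS_PER_LONG))); |
| } |
| \end{VerbatimL} |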
| |
| \subsubsection{NBS Counters} |
| \label{sec:advsync:NBS Counters} |
| |
| The statistical counters algorithm discussed in |
| \cref{sec:count:Statistical Counters} |
| can be considered to be bounded-wait-free, but only by using a cute |
| definitional trick in which the sum is considered to be approximate |
| rather than exact.\footnote{ |
| Citation needed. |
| I heard of this trick verbally from Mark Moir.} |
| Given sufficiently wide error bounds that are a function of the length |
| of time that the \co{read_count()} function takes to sum the counters, |
| it is not possible to prove that any non-linearizable behavior occurred. |
| This definitely (if a bit artificially) classifies the statistical-counters |
| algorithm as bounded wait-free. |
| This algorithm is probably the most heavily used NBS algorithm in the |
| Linux kernel. |
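| |
| The following greatly simplified sketch conveys the flavor of this |
| algorithm, substituting a fixed array of per-thread counters for the |
| actual implementation's per-thread variables; names and sizes are |
| illustrative only: |
| |
| \begin{VerbatimL} |
| #include <stdatomic.h> |
| |
| #define NR_THREADS 64 |
| |
| static atomic_ulong counterp[NR_THREADS]; |
| |
| /* Each thread updates only its own counter, so no |
|  * read-modify-write atomic operations are needed. */ |
| static void inc_count(int tid) |
| { |
| 	unsigned long c; |
| |
| 	c = atomic_load_explicit(&counterp[tid], |
| 				 memory_order_relaxed); |
| 	atomic_store_explicit(&counterp[tid], c + 1, |
| 			      memory_order_relaxed); |
| } |
| |
| /* Sum all threads' counters.  Concurrent increments might |
|  * or might not be picked up, hence the approximate sum. */ |
| static unsigned long read_count(void) |
| { |
| 	unsigned long sum = 0; |
| |
| 	for (int i = 0; i < NR_THREADS; i++) |
| 		sum += atomic_load_explicit(&counterp[i], |
| 					    memory_order_relaxed); |
| 	return sum; |
| } |
| \end{VerbatimL} |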
| |
| \subsubsection{Half-NBS Queue} |
| \label{sec:advsync:Half-NBS Queue} |
| |
| \begin{fcvref}[ln:advsync:NBS Enqueue Algorithm] |
| Another common NBS algorithm is the atomic queue where elements are |
| enqueued using an atomic exchange instruction~\cite{MagedMichael1993JPDC}, |
| followed by a store into the \co{->next} pointer of the new element's |
| predecessor, as shown in \cref{lst:advsync:NBS Enqueue Algorithm}, |
| which presents the userspace-RCU library |
| implementation~\cite{MathieuDesnoyers2009URCU}. |
| \Clnref{tail} updates the tail pointer to reference the new element while |
| returning a reference to its predecessor, which is stored in |
| local variable \co{old_tail}. |
| \Clnref{pred} then updates the predecessor's \co{->next} pointer to |
| reference the newly added element, and finally \clnref{ret} |
| returns an indication as to whether or not the queue was initially |
| empty. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:advsync:NBS Enqueue Algorithm] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| static inline bool |
| ___cds_wfcq_append(struct cds_wfcq_head *head, |
| struct cds_wfcq_tail *tail, |
| struct cds_wfcq_node *new_head, |
| struct cds_wfcq_node *new_tail) |
| { |
| struct cds_wfcq_node *old_tail; |
| |
| old_tail = uatomic_xchg(&tail->p, new_tail); \lnlbl[tail] |
| CMM_STORE_SHARED(old_tail->next, new_head); \lnlbl[pred] |
| return old_tail != &head->node; \lnlbl[ret] |
| } |
| |
| static inline bool |
| _cds_wfcq_enqueue(struct cds_wfcq_head *head, |
| struct cds_wfcq_tail *tail, |
| struct cds_wfcq_node *new_tail) |
| { |
| return ___cds_wfcq_append(head, tail, |
| new_tail, new_tail); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{NBS Enqueue Algorithm} |
| \label{lst:advsync:NBS Enqueue Algorithm} |
| \end{listing} |
| |
| Although mutual exclusion is required to dequeue a single element |
| (so that dequeue is blocking), it is possible to carry out a non-blocking |
| removal of the entire contents of the queue. |
| What is not possible is to dequeue any given element in a non-blocking |
| manner: |
| The enqueuer might have failed between \clnref{tail,pred} of the |
| listing, so that the element in question is only partially enqueued. |
| This results in a half-NBS algorithm where enqueues are NBS but |
| dequeues are blocking. |
| This algorithm is nevertheless heavily used in practice, in part because |
| most production software is not required to tolerate arbitrary fail-stop |
| errors. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| Wait! |
| In order to dequeue all elements, both the \co{->head} and |
| \co{->tail} pointers must be changed, which cannot be done |
| atomically on typical computer systems. |
| So how is this supposed to work??? |
| }\QuickQuizAnswer{ |
| One pointer at a time! |
| |
| First, atomically exchange the \co{->head} pointer with \co{NULL}. |
| If the return value from the atomic exchange operation is \co{NULL}, |
| the queue was empty and you are done. |
| And if someone else attempts a dequeue-all at this point, |
| they will get back a \co{NULL} pointer. |
| |
| Otherwise, atomically exchange the \co{->tail} pointer with a |
| pointer to the now-\co{NULL} \co{->head} pointer. |
| The return value from the atomic exchange operation is a pointer |
| to the \co{->next} field of the eventual last element on the list. |
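| |
| The following minimal sketch shows one possible shape of this |
| pair of exchanges, using the userspace-RCU \co{uatomic_xchg()} |
| primitive from the listing, but with hypothetical \co{dqa_} |
| structures in place of the actual library's types: |
| |
| \begin{VerbatimL} |
| #include <urcu/uatomic.h> |
| |
| struct dqa_node { |
| 	struct dqa_node *next; |
| }; |
| |
| struct dqa_queue { |
| 	struct dqa_node *head;	/* NULL if empty. */ |
| 	struct dqa_node **tail;	/* &last->next, or &head. */ |
| }; |
| |
| /* Return the old list (NULL if empty), and set *old_tailp |
|  * to the address of the eventual last element's ->next |
|  * field, which marks the end of the returned snapshot. */ |
| static struct dqa_node * |
| dqa_dequeue_all(struct dqa_queue *q, |
| 		struct dqa_node ***old_tailp) |
| { |
| 	struct dqa_node *list; |
| |
| 	list = uatomic_xchg(&q->head, NULL); |
| 	if (list == NULL) |
| 		return NULL; |
| 	*old_tailp = uatomic_xchg(&q->tail, &q->head); |
| 	return list; |
| } |
| \end{VerbatimL} |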
| |
| Turning this sketch into tested code is left as an exercise for |
| the interested and enthusiastic reader, as are strategies for |
| handling half-enqueued elements. |
| }\QuickQuizEnd |
| |
| \subsubsection{NBS Stack} |
| \label{sec:advsync:NBS Stack} |
| |
| \begin{fcvref}[ln:advsync:lifo_push:whole] |
| \Cref{lst:advsync:NBS Stack Algorithm} |
| shows the LIFO push algorithm, which boasts lock-free push and |
| bounded wait-free pop (\path{lifo-push.c}), forming an NBS stack. |
| The origins of this algorithm are unknown, but it was referred to in |
| a patent granted in 1975~\cite{PaulJBrown1975LIFOpush}. |
| This patent was filed in 1973, a few months before your editor |
| saw his first computer, which had but one CPU\@. |
| |
| \begin{listing} |
| \input{CodeSamples/advsync/lifo_push@whole.fcv} |
| \caption{NBS Stack Algorithm} |
| \label{lst:advsync:NBS Stack Algorithm} |
| \end{listing} |
| |
| \Clnrefrange{struct:b}{struct:e} show the \co{node_t} structure, |
| which contains an arbitrary value and a pointer to the next structure |
| on the stack and |
| \clnref{top} shows the top-of-stack pointer. |
| |
| The \co{list_push()} function spans \clnrefrange{push:b}{push:e}. |
| \Clnref{push:alloc} allocates a new node and |
| \clnref{push:initialize} initializes it. |
| \Clnref{push:next} initializes the newly allocated node's \co{->next} |
| pointer, and \clnref{push:cmpxchg} attempts to push it on the stack. |
| If \clnref{push:check} detects \co{cmpxchg()} failure, another pass |
| through the loop retries. |
| Otherwise, the new node has been successfully pushed, and this function |
| returns to its caller. |
| Note that \clnref{push:check} resolves races in which two concurrent |
| instances of \co{list_push()} attempt to push onto the stack. |
| The \co{cmpxchg()} will succeed for one and fail for the other, |
| causing the other to retry, thereby selecting an arbitrary order for |
| the two nodes on the stack. |
| |
| The \co{list_pop_all()} function spans \clnrefrange{popall:b}{popall:e}. |
| The \co{xchg()} statement on \clnref{popall:xchg} atomically removes |
| all nodes on the stack, placing the head of the resulting list in local |
| variable \co{p} and setting \co{top} to \co{NULL}. |
| This atomic operation serializes concurrent calls to \co{list_pop_all()}: |
| One of them will get the list, and the other a \co{NULL} pointer, at |
| least assuming that there were no concurrent calls to \co{list_push()}. |
| |
| An instance of \co{list_pop_all()} that obtains a non-empty list in |
| \co{p} processes this list in the loop spanning |
| \clnrefrange{popall:loop:b}{popall:loop:e}. |
| \Clnref{popall:next} prefetches the \co{->next} pointer, |
| \clnref{popall:foo} invokes the function referenced by \co{foo()} on the |
| current node, |
| \clnref{popall:free} frees the current node, and |
| \clnref{popall:pnext} sets up \co{p} for the next pass through the loop. |
| |
| But suppose that a pair of \co{list_push()} instances run concurrently |
| with a \co{list_pop_all()} on a list initially containing a single |
| \Node{A}. |
| Here is one way that this scenario might play out: |
| |
| \begin{enumerate} |
| \item The first \co{list_push()} instance pushes a new \Node{B}, |
| executing through \clnref{push:next}, having just stored |
| a pointer to \Node{A} into \Node{B}'s \co{->next} pointer. |
| \item The \co{list_pop_all()} instance runs to completion, |
| setting \co{top} to \co{NULL} and freeing \Node{A}. |
| \item The second \co{list_push()} instance runs to completion, |
| pushing a new \Node{C}, but happens to allocate the memory |
| that used to belong to \Node{A}. |
| \item The first \co{list_push()} instance executes the \co{cmpxchg()} |
| on \clnref{push:cmpxchg}. |
| Because new \Node{C} has the same address as the newly freed \Node{A}, |
| this \co{cmpxchg()} succeeds and this \co{list_push()} instance |
| runs to completion. |
| \end{enumerate} |
| |
| Note that both pushes and the popall all ran successfully despite the |
| reuse of \Node{A}'s memory. |
| This is an unusual property: |
| Most data structures require protection against what is often called |
| the ABA problem. |
| |
| But this property holds only for algorithms written in assembly |
| language. |
| The sad fact is that most languages (including C and C++) do not support |
| pointers to lifetime-ended objects, such as the pointer to the old \Node{A} |
| contained in \Node{B}'s \co{->next} pointer. |
| In fact, compilers are within their rights to assume that if two pointers |
| (call them \co{p} and \co{q}) were returned from two different calls to |
| \co{malloc()}, then those pointers must not be equal. |
| Real compilers really will generate the constant \co{false} in |
| response to a \co{p==q} comparison. |
| A pointer to an object that has been freed, but whose memory has been |
| reallocated for a compatibly typed object is termed a \emph{zombie pointer}. |
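| |
| The following contrived function illustrates the hazard, keeping in |
| mind that even loading the value of \co{p} after the \co{free()} is |
| undefined behavior in C: |
| |
| \begin{VerbatimL} |
| #include <stdlib.h> |
| |
| int zombie_compare(void) |
| { |
| 	void *p = malloc(8); |
| 	void *q; |
| |
| 	free(p);	/* p is now a zombie pointer. */ |
| 	q = malloc(8);	/* Might reuse p's memory... */ |
| |
| 	/* ...but the compiler may fold this comparison to |
| 	 * zero even when both pointers hold the same |
| 	 * address at runtime. */ |
| 	return p == q; |
| } |
| \end{VerbatimL} |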
| |
| Many concurrent applications avoid this problem by carefully hiding the |
| memory allocator from the compiler, thus preventing the compiler from |
| making inappropriate assumptions. |
| This obfuscatory approach currently works in practice, but might well |
| one day fall victim to increasingly aggressive optimizers. |
| There is work underway in both the C and C++ standards committees |
| to address this |
| problem~\cite{PaulEMcKenney2019PointerLifetimeEndZap,PaulEMcKenney2020PointerLifetimeEndZapCpp}. |
| In the meantime, please exercise great care when coding ABA-tolerant |
| algorithms. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| So why not ditch antique languages like C and C++ for something |
| more modern? |
| }\QuickQuizAnswer{ |
| That won't help unless the more-modern languages proponents |
| are energetic enough to write their own compiler backends. |
| The usual practice of re-using existing backends also reuses |
| charming properties such as refusal to support pointers to |
| lifetime-ended objects. |
| }\QuickQuizEnd |
| |
| \subsection{Applicability of NBS Benefits} |
| \label{sec:advsync:Applicability of NBS Benefits} |
| |
| The most heavily cited NBS benefits stem from its forward-progress |
| guarantees, its tolerance of fail-stop bugs, and from its linearizability. |
| Each of these is discussed in one of the following sections. |
| |
| \subsubsection{NBS Forward Progress Guarantees} |
| \label{sec:advsync:NBS Forward Progress Guarantees} |
| |
| NBS's forward-progress guarantees have caused many to suggest its use in |
| real-time systems, and NBS algorithms are in fact used in a great many |
| such systems. |
| However, it is important to note that forward-progress guarantees are |
| largely orthogonal to those that form the basis of real-time programming: |
| |
| \begin{enumerate} |
| \item Real-time forward-progress guarantees usually have some |
| definite time associated with them, for example, |
| ``\IXh{scheduling}{latency} must be less than 100 microseconds.'' |
| In contrast, the most popular forms of NBS guarantee only |
| that progress will be made in finite time, with no definite |
| bound. |
| \item Real-time forward-progress guarantees are often |
| probabilistic, as in the soft-real-time guarantee that |
| ``at least 99.9\pct\ of the time, scheduling latency must |
| be less than 100 microseconds.'' |
| In contrast, many of NBS's forward-progress guarantees are |
| unconditional, which can impose additional unnecessary complexity |
| and performance degradation. |
| \item Real-time forward-progress guarantees are often conditioned on |
| environmental constraints, for example, only being honored: |
| \begin{enumerate*}[(1)] |
| \item For the highest-priority tasks, |
| \item When each CPU spends at least a certain fraction of its time idle, |
| and |
| \item When I/O rates are below some specified maximum. |
| \end{enumerate*} |
| In contrast, NBS's forward-progress guarantees are often |
| unconditional, again incurring unnecessary complexity and |
| performance degradation. |
| On the other hand, some recent NBS work accommodates some |
| conditional |
| guarantees~\cite{DanAlitarh2013PracticalProgress}. |
| \item An important component of a real-time program's environment |
| is the scheduler. |
| NBS algorithms assume a worst-case \emph{demonic scheduler}, |
| though for whatever reason, not a scheduler so demonic that |
| it simply refuses to ever run the application housing the NBS |
| algorithm. |
| In contrast, real-time systems assume that the scheduler is |
| doing its level best to satisfy any scheduling constraints |
| it knows about, and, in the absence of such constraints, |
| its level best to honor process priorities and to provide |
| fair scheduling to processes of the same priority. |
| Non-demonic schedulers allow real-time programs to use simpler |
| algorithms than those required for |
| NBS~\cite{DanAlitarh2013PracticalProgress,BjoernBrandenburgPhD}. |
| \item The progress guarantees provided by non-blocking synchronization |
| are most useful in overload conditions. |
| However, overload conditions are most likely to cause queueing |
| delays that, in many real-world systems, will cause the program |
| (or even the system running that program) to be terminated |
| with prejudice. |
| As a rough rule of thumb, latency increases as the reciprocal |
| of the idle time. |
| In other words, if a nearly idle system has |
| latency $T$, then a 50\pct\ idle system will have latency $2 T$, a |
| 10\pct\ idle (90\pct\ utilized) system will have latency $10 T$, a 1\pct\ |
| idle system (99\pct\ utilized) will have latency $100 T$, and so on. |
| This situation means that many latency-sensitive systems will |
| actively limit load, for example, to 50\pct. |
| \item In the not-uncommon case where a given computed result is nice |
| to have rather than critically important, use of timeouts can |
| cause a blocking operation to have non-blocking properties that |
| are often sufficient in practice. |
| \item NBS forward-progress guarantee classes assume that a number of |
| underlying operations (such as system calls) are lock-free or |
| even wait-free, when in fact these operations are blocking on |
| common-case computer systems. |
| \item NBS forward-progress guarantees are often achieved by subdividing |
| operations. |
| For example, in order to avoid a blocking dequeue operation, |
| an NBS algorithm might substitute a non-blocking polling |
| operation. |
| This is fine in theory, but not helpful in practice to real-world |
| programs that require an element to propagate through the queue |
| in a timely fashion. |
| \item Real-time forward-progress guarantees usually apply only |
| in the absence of software bugs. |
| In contrast, many classes of NBS guarantees apply even in the |
| face of fail-stop bugs, though obviously not in the face of |
| software bugs in the general arbitrarily misbehaving case. |
| \item NBS forward-progress guarantee classes imply linearizability. |
| In contrast, real-time forward progress guarantees usually do |
| not require ordering constraints such as linearizability. |
| Once again, providing linearizability when it is not needed |
| can unnecessarily increase complexity and decrease performance. |
| \item Although the simpler NBS algorithms combine exemplary simplicity, |
| forward-progress guarantees, and performance, other |
| NBS algorithms are quite complex. |
| This added complexity can result in increased development |
| and maintenance costs, decreased performance, and decreased |
| reliability. |
| \end{enumerate} |
| |
| \QuickQuiz{ |
| Why does anyone care about demonic schedulers? |
| }\QuickQuizAnswer{ |
| A demonic scheduler is one way to model an insanely overloaded |
| system. |
| After all, if you have an algorithm that you can prove runs |
| reasonably given a demonic scheduler, mere overload should |
| be no problem, right? |
| |
| On the other hand, it is only reasonable to ask if a demonic |
| scheduler is really the best way to model overload conditions. |
| And perhaps it is time for more accurate models. |
| For one thing, a system might be overloaded in any of a number |
| of ways. |
| After all, an NBS algorithm that works fine on a demonic scheduler |
| might or might not do well in out-of-memory conditions, when |
| mass storage fills, or when the network is congested. |
| |
| Except that systems' core counts have been increasing, which |
| means that an overloaded system is quite likely to be running |
| more than one concurrent program.\footnote{ |
| As a point of reference, back in the mid-1990s, Paul |
| witnessed a 16-CPU system running about 20 instances |
| of a certain high-end proprietary database.} |
| In that case, even if a demonic scheduler is not so demonic |
| as to inject idle cycles while there are runnable tasks, |
| it is easy to imagine such a scheduler consistently favoring |
| the other program over yours. |
| If both programs could consume all available CPU, then this |
| scheduler might not run your program at all. |
| |
| One way to avoid these issues is to simply avoid overload |
| conditions. |
| This is often the preferred approach in production, where load |
| balancers direct traffic away from overloaded systems. |
| And if all systems are overloaded, it is not unheard of to simply |
| \emph{shed load}, that is, to drop the low-priority incoming requests. |
| Nor is this approach limited to computing, as those who have |
| suffered through a rolling blackout can attest. |
| But load-shedding is often considered a bad thing by those |
| whose load is being shed. |
| |
| As always, choose wisely! |
| }\QuickQuizEnd |
| |
| To reiterate, despite these differences, workarounds, and objections, |
| a number of NBS algorithms are extremely useful and are heavily used |
| in practice. |
| As always, use the right tool for the job! |
| |
| \subsubsection{NBS Underlying Operations} |
| \label{sec:advsync:NBS Underlying Operations} |
| |
| An NBS algorithm can be truly non-blocking only if the underlying |
| operations that it uses are also non-blocking. |
| In a surprising number of cases, this is not the case in practice. |
| |
| For example, non-blocking algorithms often allocate memory. |
| In theory, this is fine, given the existence of lock-free |
| memory allocators~\cite{MagedMichael04NBSmalloc}. |
| But in practice, most environments must eventually obtain memory from |
| operating-system kernels, which commonly use locking. |
| Therefore, unless all the memory that will ever be needed is somehow |
| preallocated, a ``non-blocking'' algorithm that allocates memory will not |
| be non-blocking when running on common-case real-world computer systems. |
| |
| This same point clearly also applies to algorithms performing I/O |
| operations or otherwise interacting with their environment. |
| |
| Perhaps surprisingly, this point also applies to ostensibly non-blocking |
| algorithms that do only plain loads and stores, such as the counters |
| discussed in \cref{sec:advsync:NBS Counters}. |
| And at first glance, those loads and stores that can be compiled into |
| single load and store instructions, respectively, would seem to be |
| not just non-blocking, but bounded population-oblivious wait-free. |
| |
| Except that load and store instructions are not necessarily either |
| fast or deterministic. |
| For example, as noted in \cref{chp:Hardware and its Habits}, cache misses |
| can consume thousands of CPU cycles. |
| Worse yet, the measured cache-miss latencies can be a function of the |
| number of CPUs, as illustrated in |
| \cref{fig:count:Atomic Increment Scalability on x86}. |
| It is only reasonable to assume that these latencies also depend on |
| the details of the system's interconnect. |
| In addition, given that hardware vendors generally do not publish |
| upper bounds for cache-miss latencies, it seems brave to assume that |
| memory-reference instructions are in fact wait-free in modern computer |
| systems. |
| And the antique systems for which such bounds are available suffer from |
| profound overall slowness. |
| |
| Furthermore, hardware is not the only source of slowness for |
| memory-reference instructions. |
| For example, when running on typical computer systems, both loads and |
| stores can result in page faults. |
| Which cause in-kernel page-fault handlers to be invoked. |
| Which might acquire locks, or even do I/O, potentially even using |
| something like network file system (NFS)\@. |
| All of which are most emphatically blocking operations. |
| |
| Nor are page faults the only kernel-induced hazard. |
| A given CPU might be interrupted at any time, and the interrupt |
| handler might run for some time. |
| During this time, the user-mode ostensibly non-blocking algorithm |
| will not be running at all. |
| This situation raises interesting questions about the forward-progress |
| guarantees provided by system calls relying on interrupts, for example, |
| the \co{membarrier()} system call. |
| |
| Things do look bleak, but the non-blocking nature of such algorithms |
| can be at least partially redeemed using a number of approaches: |
| |
| \begin{enumerate} |
| \item Run on bare metal, with paging disabled. |
| If you are both brave and confident that you can write code that |
| is free of wild-pointer bugs, this approach might be for you. |
| \item Run on a non-blocking operating-system kernel~\cite{Cheriton96a}. |
| Such kernels are quite rare, in part because they have |
| traditionally completely failed to provide the hoped-for |
| performance and scalability advantages over lock-based kernels. |
| But perhaps you should write one. |
| \item Use facilities such as \co{mlockall()} to avoid page faults |
| (see the sketch following this list), while also ensuring that |
| your program preallocates all the memory it will ever need at |
| boot time. |
| This can work well, but at the expense of severe common-case |
| underutilization of memory. |
| In environments that are cost-constrained or power-limited, |
| this approach is not likely to be feasible. |
| \item Use facilities such as the Linux kernel's |
| \co{NO_HZ_FULL} tickless mode~\cite{JonCorbet2013NO-HZ-FULL}. |
| In recent versions of the Linux kernel, this mode directs |
| interrupts away from a designated set of CPUs. |
| However, this can sharply limit throughput for applications that |
| are I/O bound during even part of their operation. |
| \end{enumerate} |
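| |
| As an illustration of the third approach, a real-time application |
| might pin its memory at startup along the following lines, a minimal |
| sketch that omits the required preallocation: |
| |
| \begin{VerbatimL} |
| #include <stdio.h> |
| #include <stdlib.h> |
| #include <sys/mman.h> |
| |
| static void pin_all_memory(void) |
| { |
| 	/* Lock current and future pages into RAM, preventing |
| 	 * page faults on later accesses to this memory. */ |
| 	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) { |
| 		perror("mlockall"); |
| 		exit(EXIT_FAILURE); |
| 	} |
| } |
| \end{VerbatimL} |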
| |
| Given these considerations, it is no surprise that non-blocking |
| synchronization is far more important in theory than it is in practice. |
| |
| \subsubsection{NBS Subdivided Operations} |
| \label{sec:advsync:NBS Subdivided Operations} |
| |
| One common trick that provides a given algorithm a loftier place on |
| the NBS ranking is to replace blocking operations with a polling API\@. |
| For example, instead of having a reliable dequeue operation that might be |
| merely lock-free or even blocking, instead provide a dequeue operation |
| that will spuriously fail in a wait-free manner rather than exhibiting |
| dreaded lock-free or blocking behaviors. |
| |
| This can work well in theory, but a common effect in practice is to |
| merely move the lock-free or blocking behavior out of that specific |
| algorithm and into the hapless code making use of that algorithm. |
| In such cases, not only has nothing been gained by this trick, but |
| this trick has also increased the complexity of all users of this |
| algorithm. |
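| |
| For example, given a hypothetical \co{dqa_try_dequeue()} that fails |
| spuriously rather than blocking, the hapless caller often ends up |
| writing something like the following, at which point the blocking |
| has simply changed address: |
| |
| \begin{VerbatimL} |
| struct dqa_node; |
| struct dqa_queue; |
| |
| /* Hypothetical wait-free polling operation: returns NULL |
|  * when the queue is empty or on spurious failure. */ |
| struct dqa_node *dqa_try_dequeue(struct dqa_queue *q); |
| |
| struct dqa_node *dqa_dequeue(struct dqa_queue *q) |
| { |
| 	struct dqa_node *p; |
| |
| 	/* The retry loop that the "wait-free" API pushed onto |
| 	 * its caller: this is where the blocking now lives. */ |
| 	while ((p = dqa_try_dequeue(q)) == NULL) |
| 		continue; |
| 	return p; |
| } |
| \end{VerbatimL} |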
| |
| With concurrent algorithms as elsewhere, maximizing a specific metric |
| is no substitute for thinking carefully about the needs of one's users. |
| |
| \subsubsection{NBS Fail-Stop Tolerance} |
| \label{sec:advsync:NBS Fail-Stop Tolerance} |
| |
| Of the classes of NBS algorithms, wait-free synchronization (bounded or |
| otherwise), lock-free synchronization, obstruction-free synchronization, |
| and clash-free synchronization guarantee forward progress even in the |
| presence of fail-stop bugs. |
| An example fail-stop bug might cause some thread to be preempted indefinitely. |
| As we will see, this fail-stop-tolerant property can be useful, but the |
| fact is that composing a set of fail-stop-tolerant mechanisms does not |
| necessarily result in a fail-stop-tolerant system. |
| To see this, consider a system made up of a series of wait-free queues, |
| where an element is removed from one queue in the series, processed, |
| and then added to the next queue. |
| |
| If a thread is preempted in the midst of a queuing operation, in theory |
| all is well because the wait-free nature of the queue will guarantee |
| forward progress. |
| But in practice, the element being processed is lost because the |
| fail-stop-tolerant nature of the wait-free queues does not extend to |
| the code using those queues. |
| |
| Nevertheless, there are a few applications where NBS's rather limited |
| fail-stop-tolerance is useful. |
| For example, in some network-based or web applications, a fail-stop |
| event will eventually result in a retransmission, which will restart |
| any work that was lost due to the fail-stop event. |
| Systems running such applications can therefore be heavily loaded, even |
| to the point where the scheduler can no longer provide any reasonable |
| fairness guarantee. |
| In contrast, if a thread fail-stops while holding a lock, the application |
| might need to be restarted. |
| Nevertheless, NBS is not a panacea even within this restricted area, |
| due to the possibility of spurious retransmissions due to pure scheduling |
| delays. |
| In some cases, it may be more efficient to reduce the load to avoid |
| queueing delays, which will also improve the scheduler's ability to |
| provide fair access, reducing or even eliminating the fail-stop events, |
| thus reducing the number of retry operations, in turn further reducing |
| the load. |
| |
| \subsubsection{NBS Linearizability} |
| \label{sec:advsync:NBS Linearizability} |
| |
| It is important to note that linearizability can be quite useful, |
| especially when analyzing concurrent code made up of strict locking |
| and fully ordered atomic operations.\footnote{ |
| For example, the Linux kernel's value-returning atomic operations.} |
| Furthermore, this handling of fully ordered atomic operations |
| automatically covers simple NBS algorithms. |
| |
| However, the linearization points of a complex NBS algorithm are often |
| buried deep within that algorithm, and thus not visible to users of |
| a library function implementing a part of such an algorithm. |
| Therefore, any claims that users benefit from the linearizability properties |
| of complex NBS algorithms should be regarded with deep |
| suspicion~\cite{AndreasHaas2012FIFOisnt}. |
| |
| It is sometimes asserted that linearizability is necessary for developers |
| to produce proofs of correctness for their concurrent code. |
| However, such proofs are the exception rather than the rule, and modern |
| developers who do produce proofs often use modern proof techniques that |
| do not depend on linearizability. |
| Furthermore, developers frequently use modern proof techniques that do |
| not require a full specification, given that developers often learn |
| their specification after the fact, one bug at a time. |
| A few such proof techniques were discussed in |
| \cref{chp:Formal Verification}.\footnote{ |
| A memorable verbal discussion with an advocate of linearizability |
| resulted in the question: |
| ``So the reason linearizability is important is to rescue 1980s |
| proof techniques?'' |
| The advocate immediately replied in the affirmative, then spent |
| some time disparaging a particular modern proof technique. |
| Oddly enough, that technique was one of those successfully |
| applied to Linux-kernel RCU\@.} |
| |
| It is often asserted that linearizability maps well to sequential |
| specifications, which are said to be more natural than are concurrent |
| specifications~\cite{SergioRajsbaum2020HistoryLinearizability}. |
| But this assertion fails to account for our highly concurrent objective |
| universe. |
| This universe can only be expected to select for ability to cope with |
| concurrency, especially for those participating in team sports or |
| overseeing small children. |
| In addition, given that the teaching of sequential |
| computing is still believed to be somewhat of a black |
| art~\cite{ElizabethPatitsas2020GradesNotBimodal}, it is reasonable |
| to expect that teaching of concurrent computing is in a similar state |
| of disarray. |
| Therefore, focusing on only one proof technique is unlikely to be a |
| good way forward. |
| |
| Again, please understand that linearizability is quite useful in many |
| situations. |
| Then again, so is that venerable tool, the hammer. |
| But there comes a point in the field of computing where one should put |
| down the hammer and pick up a keyboard. |
| Similarly, it appears that there are times when linearizability is not |
| the best tool for the job. |
| |
| To their credit, there are some linearizability advocates who are aware |
| of some of its shortcomings~\cite{SergioRajsbaum2020HistoryLinearizability}. |
| There are also proposals to extend linearizability, for example, |
| interval-linearizability, which is intended to handle the common case |
| of operations that require non-zero time to |
| complete~\cite{10.1145/3266457}. |
| It remains to be seen whether these proposals will result in theories |
| able to handle modern concurrent software artifacts, especially given |
| that several of the proof techniques discussed in \cref{chp:Formal |
| Verification} already handle many modern concurrent software artifacts. |
| |
| \subsection{NBS Discussion} |
| \label{sec:advsync:NBS Discussion} |
| |
| It is possible to create fully non-blocking queues~\cite{MichaelScott96}; |
| however, such queues are much more complex than the half-NBS algorithm |
| outlined above. |
| The lesson here is to carefully consider your actual requirements. |
| Relaxing irrelevant requirements can often result in great |
| improvements in simplicity, performance, and scalability. |
| |
| Recent research points to another important way to relax requirements. |
| It turns out that systems providing fair scheduling can enjoy most |
| of the benefits of wait-free synchronization even when running |
| algorithms that provide only non-blocking |
| synchronization, both in theory~\cite{DanAlitarh2013PracticalProgress} |
| and in practice~\cite{SamyAlBahra2013NBS}. |
| Because most schedulers used in production do in fact provide fairness, |
| the more-complex algorithms providing wait-free synchronization usually |
| provide no practical advantages over simpler and faster non-wait-free |
| algorithms. |
| |
| Interestingly enough, fair scheduling is but one beneficial |
| constraint that is often respected in practice. |
| Other sets of constraints can permit blocking algorithms to |
| achieve deterministic real-time response. |
| For example, given: |
| \begin{enumerate*}[(1)] |
| \item Fair locks granted in FIFO order within a given priority level, |
| \item Priority inversion avoidance (for example, priority |
| inheritance~\cite{Takada:1995:RSN:527074.828566,Cai-DongWang1996PrioInherLock} |
| or priority ceiling), |
| \item A bounded number of threads, |
| \item Bounded critical section durations, |
| \item Bounded load, |
| and |
| \item Absence of fail-stop bugs, |
| \end{enumerate*} |
| lock-based applications can provide deterministic |
| response times~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}. |
| This approach of course blurs the distinction between blocking and wait-free |
| synchronization, which is all to the good. |
| Hopefully theoretical frameworks will continue to improve their ability |
| to describe software actually used in practice. |
| |
| Those who feel that theory should lead the way are referred to the |
| inimitable Peter Denning, who said of operating systems: |
| ``Theory follows practice''~\cite{Denning:2015:POF:2830903.2830904}, |
| or to the eminent Tony Hoare, who said of the whole of engineering: |
| ``In all branches of engineering science, the engineering starts before |
| the science; indeed, without the early products of engineering, there |
| would be nothing for the scientist to |
| study!''~\cite{RichardMorris2007TonyHoareInterview}. |
| Of course, once an appropriate body of theory becomes available, it is |
| wise to make use of it. |
| However, note well that the first \emph{appropriate} body of theory |
| is often one thing and the first \emph{proposed} body of theory quite |
| another. |
| |
| \QuickQuiz{ |
| It seems like the various members of the NBS hierarchy are |
| rather useless. |
| So why bother with them at all??? |
| }\QuickQuizAnswer{ |
| One advantage of the members of the NBS hierarchy is that they |
| are reasonably simple to define and use from a theoretical viewpoint. |
| We can hope that work done in the NBS arena will help lay the |
| groundwork for analysis of real-world forward-progress guarantees |
| for concurrent real-time programs. |
| However, as of 2022 it appears that trace-based methodologies |
| are in the lead~\cite{DanielBristot2019RTtrace}. |
| |
| So why bother learning about NBS at all? |
| |
| Because a great many people know of it, and are vaguely aware that |
| it is somehow related to real-time computing. |
| Their response to your carefully designed real-time constraints |
| might well be of the form ``Bah, just use wait-free algorithms!''. |
| In the all-too-common case where they are very convincing to your |
| management, you will need to understand NBS in order to bring |
| the discussion back to reality. |
| I hope that this section has provided you with the required |
| depth of understanding. |
| |
| Another thing to note is that learning about the NBS hierarchy |
| is probably no more harmful than learning about transfinite |
| numbers of the computational-complexity hierarchy. |
| In all three cases, it is important to avoid over-applying |
| the theory. |
| Which is in and of itself good practice! |
| }\QuickQuizEnd |
| |
| Proponents of NBS algorithms sometimes call out real-time computing as |
| an important NBS beneficiary. |
| The next section looks more deeply at the forward-progress needs of |
| real-time systems. |
| |
| \input{advsync/rt} |
| |
| \QuickQuizAnswersChp{qqzadvsync} |