% advsync/advsync.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{sec:advsync:Advanced Synchronization}{Advanced Synchronization}{qqzadvsync}
%
\Epigraph{If a little knowledge is a dangerous thing, just think what
you could do with a lot of knowledge!}{Unknown}
This chapter covers synchronization techniques used for lockless
algorithms and parallel real-time systems.
Although lockless algorithms can be quite helpful when faced with
extreme requirements, they are no panacea.
For example, as noted at the end of \cref{chp:Counting},
you should thoroughly apply partitioning, batching, and
well-tested packaged weak APIs
(see \cref{chp:Data Ownership,chp:Deferred Processing})
before even thinking about lockless algorithms.
But after doing all that, you still might find yourself needing the
advanced techniques described in this chapter.
To that end,
\cref{sec:advsync:Avoiding Locks}
summarizes techniques used thus far for avoiding locks and
\cref{sec:advsync:Non-Blocking Synchronization}
gives a brief overview of non-blocking synchronization.
Memory ordering is also quite important, but it warrants its own
\lcnamecref{chp:Advanced Synchronization: Memory Ordering}, namely
\cref{chp:Advanced Synchronization: Memory Ordering}.
The second form of advanced synchronization provides the stronger
forward-progress guarantees needed for parallel real-time computing,
which is the topic of
\cref{sec:advsync:Parallel Real-Time Computing}.
\section{Avoiding Locks}
\label{sec:advsync:Avoiding Locks}
%
\epigraph{We are confronted with insurmountable opportunities.}
{Walt Kelly}
Although locking is the workhorse of parallelism in production, in
many situations performance, scalability, and real-time response can
all be greatly improved through use of lockless techniques.
A particularly impressive example of such a lockless technique is
the statistical counters described in
\cref{sec:count:Statistical Counters},
which avoid not only locks, but also read-modify-write atomic operations,
memory barriers, and even cache misses for counter increments.
Other examples we have covered include:
\begin{enumerate}
\item The fastpaths through a number of other counting algorithms
in \cref{chp:Counting}.
\item The fastpath through resource allocator caches in
\cref{sec:SMPdesign:Resource Allocator Caches}.
\item The maze solver in \cref{sec:SMPdesign:Beyond Partitioning}.
\item The data-ownership techniques in \cref{chp:Data Ownership}.
\item The reference-counting, hazard-pointer, and RCU techniques
in \cref{chp:Deferred Processing}.
\item The lookup code paths in \cref{chp:Data Structures}.
\item Many of the techniques in \cref{chp:Putting It All Together}.
\end{enumerate}
In short, lockless techniques are quite useful and are heavily used.
However, it is best if lockless techniques are hidden behind a
well-defined API, such as \co{inc_count()}, \co{memblock_alloc()},
\co{rcu_read_lock()}, and so on.
The reason for this is that undisciplined use of lockless techniques
is a good way to create difficult bugs.
If you believe that finding and fixing such bugs is easier than avoiding
them, please re-read
\cref{chp:Validation,chp:Formal Verification}.
\section{Non-Blocking Synchronization}
\label{sec:advsync:Non-Blocking Synchronization}
%
\epigraph{Never worry about theory as long as the machinery does what
it's supposed to do.}
{Robert A. Heinlein}
The term \IXBacrfst{nbs}~\cite{MauriceHerlihy90a}
describes eight classes of \IX{linearizable} algorithms with differing
\emph{\IXBpl{forward-progress guarantee}}~\cite{DanAlitarh2013PracticalProgress},
which are as follows:
\begin{enumerate}
\item \emph{\IXalth{Bounded population-oblivious wait-free}{bounded population-oblivious}{wait free} synchronization}:
Every thread will make progress within a specific finite period
of time, where this period of time is independent of the number
of threads~\cite{HerlihyShavit2008Textbook}.
This level is widely considered to be even less achievable than
bounded wait-free synchronization.
\item \emph{\IXalth{Bounded wait-free}{bounded}{wait free} synchronization}:
Every thread will make progress within
a specific finite period of time~\cite{Herlihy91}.
This level is widely considered to be unachievable, which might be why
Alistarh et al.\ omitted it~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Wait-free}{wait free} synchronization}:
Every thread will make progress
in finite time~\cite{Herlihy93}.
\item \emph{\IXalt{Lock-free}{lock free} synchronization}:
At least one thread will
make progress in finite time~\cite{Herlihy93}.
\item \emph{\IXalt{Obstruction-free}{obstruction free} synchronization}:
Every thread will make progress in finite time in the absence of
contention~\cite{HerlihyLM03}.
\item \emph{\IXalt{Clash-free}{clash free} synchronization}:
At least one thread will make progress in finite time in the absence of
contention~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Starvation-free}{starvation free} synchronization}:
Every thread will make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Deadlock-free}{deadlock free} synchronization}:
At least one thread will make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\end{enumerate}
NBS class~1 was formulated some time before 2015,
classes~2, 3, and~4 were first formulated in the early 1990s,
class~5 was first formulated in the early 2000s,
and class~6 was first formulated in 2013.
The final two classes have seen informal use for a great many decades,
but were reformulated in 2013.
\QuickQuiz{
Given that there will always be a sharply limited number of
CPUs available, is population obliviousness really useful?
}\QuickQuizAnswer{
Given the surprisingly limited scalability of any number of
NBS algorithms, population obliviousness can be surprisingly
useful.
Nevertheless, the overall point of the question is valid.
It is not normally helpful for an algorithm to scale beyond the
size of the largest system it is ever going to run on.
}\QuickQuizEnd
In theory, any parallel algorithm can be cast into wait-free form,
but only a relatively small subset of NBS algorithms is in
common use.
A few of these are listed in the following section.
\subsection{Simple NBS}
\label{sec:advsync:Simple NBS}
Perhaps the simplest NBS algorithm is atomic update of an integer
counter using fetch-and-add (\co{atomic_add_return()}) primitives.
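
For example, a minimal userspace sketch of such a counter, using C11
atomics in place of the Linux kernel's \co{atomic_add_return()}, might
look like this:

\begin{VerbatimL}
#include <stdatomic.h>

static atomic_long counter;

/* Compiles to a single fetch-and-add instruction
 * on most modern systems. */
static inline long inc_counter(void)
{
	return atomic_fetch_add(&counter, 1) + 1;
}
\end{VerbatimL}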
This section lists a few additional commonly used NBS algorithms in
roughly increasing order of complexity.
\subsubsection{NBS Sets}
\label{sec:advsync:NBS Sets}
One simple NBS algorithm implements a set of integers in an array.
Here the array index indicates a value that might be a member of the set
and the array element indicates whether or not that value actually is
a set member.
The linearizability criterion for NBS algorithms requires that reads from
and updates to the array either use atomic instructions or be accompanied
by memory barriers, but in the not-uncommon case where linearizability
is not important, simple volatile loads and stores suffice, for example,
using \co{READ_ONCE()} and \co{WRITE_ONCE()}.
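
For example, here is a minimal sketch of such an array-based set for
the non-linearizable case, assuming the \co{READ_ONCE()} and
\co{WRITE_ONCE()} wrappers used throughout this book (the names are
otherwise hypothetical):

\begin{VerbatimL}
#define SET_MAX 1024
static char is_member[SET_MAX];

static inline void set_add(int v)
{
	WRITE_ONCE(is_member[v], 1);
}

static inline void set_del(int v)
{
	WRITE_ONCE(is_member[v], 0);
}

static inline int set_contains(int v)
{
	return READ_ONCE(is_member[v]);
}
\end{VerbatimL}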
An NBS set may also be implemented using a bitmap, where each value that
might be a member of the set corresponds to one bit.
Reads and updates must normally be carried out via atomic bit-manipulation
instructions, although compare-and-swap (\co{cmpxchg()} or CAS)
instructions can also be used.
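
A sketch of the bitmap variant using C11 atomic operations follows;
the names are hypothetical, and note that these C11 operations default
to sequentially consistent ordering, which may be stronger than a given
use case requires:

\begin{VerbatimL}
#include <stdatomic.h>
#include <limits.h>

#define SET_MAX 1024
#define BPW (sizeof(unsigned long) * CHAR_BIT)
static _Atomic unsigned long bits[SET_MAX / BPW];

static inline void bitset_add(int v)
{
	atomic_fetch_or(&bits[v / BPW], 1UL << (v % BPW));
}

static inline void bitset_del(int v)
{
	atomic_fetch_and(&bits[v / BPW],
	                 ~(1UL << (v % BPW)));
}

static inline int bitset_contains(int v)
{
	return !!(atomic_load(&bits[v / BPW]) &
	          (1UL << (v % BPW)));
}
\end{VerbatimL}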
\subsubsection{NBS Counters}
\label{sec:advsync:NBS Counters}
The statistical counters algorithm discussed in
\cref{sec:count:Statistical Counters}
can be considered to be bounded-wait-free, but only by using a cute
definitional trick in which the sum is considered to be approximate
rather than exact.\footnote{
Citation needed.
I heard of this trick verbally from Mark Moir.}
Given sufficiently wide error bounds that are a function of the length
of time that the \co{read_count()} function takes to sum the counters,
it is not possible to prove that any non-linearizable behavior occurred.
This definitely (if a bit artificially) classifies the statistical-counters
algorithm as bounded wait-free.
This algorithm is probably the most heavily used NBS algorithm in the
Linux kernel.
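
For reference, here is a sketch of the read side along the lines of the
\path{count_stat.c} implementation from \cref{chp:Counting}, assuming
this book's per-thread-variable API and that file's \co{counter}
definition:

\begin{VerbatimL}
static unsigned long read_count(void)
{
	int t;
	unsigned long sum = 0;

	/* Sum the per-thread counters with plain loads. */
	for_each_thread(t)
		sum += READ_ONCE(per_thread(counter, t));
	return sum; /* Exact only within the error bounds. */
}
\end{VerbatimL}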
\subsubsection{Half-NBS Queue}
\label{sec:advsync:Half-NBS Queue}
\begin{fcvref}[ln:advsync:NBS Enqueue Algorithm]
Another common NBS algorithm is the atomic queue where elements are
enqueued using an atomic exchange instruction~\cite{MagedMichael1993JPDC},
followed by a store into the \co{->next} pointer of the new element's
predecessor, as shown in \cref{lst:advsync:NBS Enqueue Algorithm},
which shows the userspace-RCU library
implementation~\cite{MathieuDesnoyers2009URCU}.
\Clnref{tail} updates the tail pointer to reference the new element while
returning a reference to its predecessor, which is stored in
local variable \co{old_tail}.
\Clnref{pred} then updates the predecessor's \co{->next} pointer to
reference the newly added element, and finally \clnref{ret}
returns an indication as to whether or not the queue was initially
empty.
\begin{listing}
\begin{fcvlabel}[ln:advsync:NBS Enqueue Algorithm]
\begin{VerbatimL}[commandchars=\\\[\]]
static inline bool
___cds_wfcq_append(struct cds_wfcq_head *head,
                   struct cds_wfcq_tail *tail,
                   struct cds_wfcq_node *new_head,
                   struct cds_wfcq_node *new_tail)
{
        struct cds_wfcq_node *old_tail;

        old_tail = uatomic_xchg(&tail->p, new_tail); \lnlbl[tail]
        CMM_STORE_SHARED(old_tail->next, new_head);  \lnlbl[pred]
        return old_tail != &head->node;              \lnlbl[ret]
}

static inline bool
_cds_wfcq_enqueue(struct cds_wfcq_head *head,
                  struct cds_wfcq_tail *tail,
                  struct cds_wfcq_node *new_tail)
{
        return ___cds_wfcq_append(head, tail,
                                  new_tail, new_tail);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{NBS Enqueue Algorithm}
\label{lst:advsync:NBS Enqueue Algorithm}
\end{listing}
Although mutual exclusion is required to dequeue a single element
(so that dequeue is blocking), it is possible to carry out a non-blocking
removal of the entire contents of the queue.
What is not possible is to dequeue any given element in a non-blocking
manner:
The enqueuer might have failed between \clnref{tail,pred} of the
listing, so that the element in question is only partially enqueued.
This results in a half-NBS algorithm where enqueues are NBS but
dequeues are blocking.
This algorithm is nevertheless heavily used in practice, in part because
most production software is not required to tolerate arbitrary fail-stop
errors.
\end{fcvref}
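
For those wishing to experiment, here is a hedged sketch of how the
corresponding public liburcu API might be invoked.
The \co{myelem} type and function names are hypothetical, and exact
signatures should be checked against your version of the library:

\begin{VerbatimL}
#include <urcu/wfcqueue.h>

struct myelem {	/* Hypothetical element type. */
	struct cds_wfcq_node node;
	int data;
};

struct cds_wfcq_head myhead;
struct cds_wfcq_tail mytail;

void myqueue_init(void)
{
	cds_wfcq_init(&myhead, &mytail);
}

void myelem_enqueue(struct myelem *ep)
{
	cds_wfcq_node_init(&ep->node);
	/* Non-blocking enqueue; returns false if the
	 * queue was empty beforehand. */
	(void)cds_wfcq_enqueue(&myhead, &mytail,
	                       &ep->node);
}
\end{VerbatimL}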
\QuickQuiz{
Wait!
In order to dequeue all elements, both the \co{->head} and
\co{->tail} pointers must be changed, which cannot be done
atomically on typical computer systems.
So how is this supposed to work???
}\QuickQuizAnswer{
One pointer at a time!
First, atomically exchange the \co{->head} pointer with \co{NULL}.
If the return value from the atomic exchange operation is \co{NULL},
the queue was empty and you are done.
And if someone else attempts a dequeue-all at this point,
they will get back a \co{NULL} pointer.
Otherwise, atomically exchange the \co{->tail} pointer with a
pointer to the now-\co{NULL} \co{->head} pointer.
The return value from the atomic exchange operation is a pointer
to the \co{->next} field of the eventual last element on the list.
Producing and testing actual code is left as an exercise for the
interested and enthusiastic reader, as are strategies for handling
half-enqueued elements.
}\QuickQuizEnd
\subsubsection{NBS Stack}
\label{sec:advsync:NBS Stack}
\begin{fcvref}[ln:advsync:lifo_push:whole]
\Cref{lst:advsync:NBS Stack Algorithm}
shows the LIFO push algorithm, which boasts lock-free push and
bounded wait-free pop (\path{lifo-push.c}), forming an NBS stack.
The origins of this algorithm are unknown, but it was referred to in
a patent granted in 1975~\cite{PaulJBrown1975LIFOpush}.
This patent was filed in 1973, a few months before your editor
saw his first computer, which had but one CPU\@.
\begin{listing}
\input{CodeSamples/advsync/lifo_push@whole.fcv}
\caption{NBS Stack Algorithm}
\label{lst:advsync:NBS Stack Algorithm}
\end{listing}
\Clnrefrange{struct:b}{struct:e} show the \co{node_t} structure,
which contains an arbitrary value and a pointer to the next structure
on the stack and
\clnref{top} shows the top-of-stack pointer.
The \co{list_push()} function spans \clnrefrange{push:b}{push:e}.
\Clnref{push:alloc} allocates a new node and
\clnref{push:initialize} initializes it.
\Clnref{push:next} initializes the newly allocated node's \co{->next}
pointer, and \clnref{push:cmpxchg} attempts to push it on the stack.
If \clnref{push:check} detects \co{cmpxchg()} failure, another pass
through the loop retries.
Otherwise, the new node has been successfully pushed, and this function
returns to its caller.
Note that \clnref{push:check} resolves races in which two concurrent
instances of \co{list_push()} attempt to push onto the stack.
The \co{cmpxchg()} will succeed for one and fail for the other,
causing the other to retry, thereby selecting an arbitrary order for
the two nodes on the stack.
The \co{list_pop_all()} function spans \clnrefrange{popall:b}{popall:e}.
The \co{xchg()} statement on \clnref{popall:xchg} atomically removes
all nodes on the stack, placing the head of the resulting list in local
variable \co{p} and setting \co{top} to \co{NULL}.
This atomic operation serializes concurrent calls to \co{list_pop_all()}:
One of them will get the list, and the other a \co{NULL} pointer, at
least assuming that there were no concurrent calls to \co{list_push()}.
An instance of \co{list_pop_all()} that obtains a non-empty list in
\co{p} processes this list in the loop spanning
\clnrefrange{popall:loop:b}{popall:loop:e}.
\Clnref{popall:next} prefetches the \co{->next} pointer,
\clnref{popall:foo} invokes the function referenced by \co{foo()} on the
current node,
\clnref{popall:free} frees the current node, and
\clnref{popall:pnext} sets up \co{p} for the next pass through the loop.
But suppose that a pair of \co{list_push()} instances run concurrently
with a \co{list_pop_all()} on a list initially containing a single
\Node{A}.
Here is one way that this scenario might play out:
\begin{enumerate}
\item The first \co{list_push()} instance pushes a new \Node{B},
executing through \clnref{push:next}, having just stored
a pointer to \Node{A} into \Node{B}'s \co{->next} pointer.
\item The \co{list_pop_all()} instance runs to completion,
setting \co{top} to \co{NULL} and freeing \Node{A}.
\item The second \co{list_push()} instance runs to completion,
pushing a new \Node{C}, but happens to allocate the memory
that used to belong to \Node{A}.
\item The first \co{list_push()} instance executes the \co{cmpxchg()}
on \clnref{push:cmpxchg}.
Because new \Node{C} has the same address as the newly freed \Node{A},
this \co{cmpxchg()} succeeds and this \co{list_push()} instance
runs to completion.
\end{enumerate}
Note that both pushes and the popall all ran successfully despite the
reuse of \Node{A}'s memory.
This is an unusual property:
Most data structures require protection against what is often called
the ABA problem.
But this property holds only for algorithms written in assembly
language.
The sad fact is that most languages (including C and C++) do not support
pointers to lifetime-ended objects, such as the pointer to the old \Node{A}
contained in \Node{B}'s \co{->next} pointer.
In fact, compilers are within their rights to assume that if two pointers
(call them \co{p} and \co{q}) were returned from two different calls to
\co{malloc()}, then those pointers must not be equal.
Real compilers really will generate the constant \co{false} in
response to a \co{p==q} comparison.
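
The following hypothetical program (not from this book's code samples)
illustrates the hazard, which stems from the fact that comparing a
pointer to a lifetime-ended object is undefined behavior in C:

\begin{VerbatimL}
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int *p = malloc(sizeof(*p));
	int *q;

	free(p);
	q = malloc(sizeof(*q)); /* Might reuse p's memory. */
	/* Undefined behavior: p's lifetime has ended, so
	 * the compiler may fold this comparison to false
	 * even if the two addresses are bitwise equal. */
	if (p == q)
		printf("Allocator reused the memory.\n");
	free(q);
	return 0;
}
\end{VerbatimL}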
A pointer to an object that has been freed, but whose memory has been
reallocated for a compatibly typed object is termed a \emph{zombie pointer}.
Many concurrent applications avoid this problem by carefully hiding the
memory allocator from the compiler, thus preventing the compiler from
making inappropriate assumptions.
This obfuscatory approach currently works in practice, but might well
one day fall victim to increasingly aggressive optimizers.
There is work underway in both the C and C++ standards committees
to address this
problem~\cite{PaulEMcKenney2019PointerLifetimeEndZap,PaulEMcKenney2020PointerLifetimeEndZapCpp}.
In the meantime, please exercise great care when coding ABA-tolerant
algorithms.
\end{fcvref}
\QuickQuiz{
So why not ditch antique languages like C and C++ for something
more modern?
}\QuickQuizAnswer{
That won't help unless the proponents of more-modern languages
are energetic enough to write their own compiler backends.
The usual practice of re-using existing backends also reuses
charming properties such as refusal to support pointers to
lifetime-ended objects.
}\QuickQuizEnd
\subsection{Applicability of NBS Benefits}
\label{sec:advsync:Applicability of NBS Benefits}
The most heavily cited NBS benefits stem from its forward-progress
guarantees, its tolerance of fail-stop bugs, and from its linearizability.
Each of these is discussed in one of the following sections.
\subsubsection{NBS Forward Progress Guarantees}
\label{sec:advsync:NBS Forward Progress Guarantees}
NBS's forward-progress guarantees have caused many to suggest its use in
real-time systems, and NBS algorithms are in fact used in a great many
such systems.
However, it is important to note that forward-progress guarantees are
largely orthogonal to those that form the basis of real-time programming:
\begin{enumerate}
\item Real-time forward-progress guarantees usually have some
definite time associated with them, for example,
``\IXh{scheduling}{latency} must be less than 100 microseconds.''
In contrast, the most popular forms of NBS guarantee only
that progress will be made in finite time, with no definite
bound.
\item Real-time forward-progress guarantees are often
probabilistic, as in the soft-real-time guarantee that
``at least 99.9\pct\ of the time, scheduling latency must
be less than 100 microseconds.''
In contrast, many of NBS's forward-progress guarantees are
unconditional, which can impose additional unnecessary complexity
and performance degradation.
\item Real-time forward-progress guarantees are often conditioned on
environmental constraints, for example, only being honored:
\begin{enumerate*}[(1)]
\item For the highest-priority tasks,
\item When each CPU spends at least a certain fraction of its time idle,
and
\item When I/O rates are below some specified maximum.
\end{enumerate*}
In contrast, NBS's forward-progress guarantees are often
unconditional, again incurring unnecessary complexity and
performance degradation.
On the other hand, some recent NBS work accommodates some
conditional
guarantees~\cite{DanAlitarh2013PracticalProgress}.
\item An important component of a real-time program's environment
is the scheduler.
NBS algorithms assume a worst-case \emph{demonic scheduler},
though for whatever reason, not a scheduler so demonic that
it simply refuses to ever run the application housing the NBS
algorithm.
In contrast, real-time systems assume that the scheduler is
doing its level best to satisfy any scheduling constraints
it knows about, and, in the absence of such constraints,
its level best to honor process priorities and to provide
fair scheduling to processes of the same priority.
Non-demonic schedulers allow real-time programs to use simpler
algorithms than those required for
NBS~\cite{DanAlitarh2013PracticalProgress,BjoernBrandenburgPhD}.
\item The progress guarantees provided by non-blocking synchronization
are most useful in overload conditions.
However, overload conditions are most likely to cause queueing
delays that, in many real-world systems, will cause the program
(or even the system running that program) to be terminated
with prejudice.
As a rough rule of thumb, latency increases as the reciprocal
of the idle time (see the formula following this list).
In other words, if a nearly idle system has
latency $T$, then a 50\pct\ idle system will have latency $2 T$, a
10\pct\ idle (90\pct\ utilized) system will have latency $10 T$, a 1\pct\
idle system (99\pct\ utilized) will have latency $100 T$, and so on.
This situation means that many latency-sensitive systems will
actively limit load, for example, to 50\pct.
\item In the not-uncommon case where a given computed result is nice
to have rather than critically important, use of timeouts can
cause a blocking operation to have non-blocking properties that
are often sufficient in practice.
\item NBS forward-progress guarantee classes assume that a number of
underlying operations (such as system calls) are lock-free or
even wait-free, when in fact these operations are blocking on
common-case computer systems.
\item NBS forward-progress guarantees are often achieved by subdividing
operations.
For example, in order to avoid a blocking dequeue operation,
an NBS algorithm might substitute a non-blocking polling
operation.
This is fine in theory, but not helpful in practice to real-world
programs that require an element to propagate through the queue
in a timely fashion.
\item Real-time forward-progress guarantees usually apply only
in the absence of software bugs.
In contrast, many classes of NBS guarantees apply even in the
face of fail-stop bugs, though obviously not in the face of
software bugs in the general arbitrarily misbehaving case.
\item NBS forward-progress guarantee classes imply linearizability.
In contrast, real-time forward progress guarantees usually do
not require ordering constraints such as linearizability.
Once again, providing linearizability when it is not needed
can unnecessarily increase complexity and decrease performance.
\item Although the simpler NBS algorithms combine exemplary simplicity,
forward-progress guarantees, and performance, other
NBS algorithms are quite complex.
This added complexity can result in increased development
and maintenance costs, decreased performance, and decreased
reliability.
\end{enumerate}
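
To make the latency rule of thumb from the list above concrete, let $T$
be the latency of a nearly idle system and let $f$ be the fraction of
time that the system is idle.
Then the rule of thumb can be written as:

\[
\mbox{latency} \approx \frac{T}{f}
\]

For example, $f = 0.5$ gives latency $2T$ and $f = 0.01$ gives latency
$100T$, in accordance with the examples above.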
\QuickQuiz{
Why does anyone care about demonic schedulers?
}\QuickQuizAnswer{
A demonic scheduler is one way to model an insanely overloaded
system.
After all, if you have an algorithm that you can prove runs
reasonably given a demonic scheduler, mere overload should
be no problem, right?
On the other hand, it is only reasonable to ask if a demonic
scheduler is really the best way to model overload conditions.
And perhaps it is time for more accurate models.
For one thing, a system might be overloaded in any of a number
of ways.
After all, an NBS algorithm that works fine on a demonic scheduler
might or might not do well in out-of-memory conditions, when
mass storage fills, or when the network is congested.
Except that systems' core counts have been increasing, which
means that an overloaded system is quite likely to be running
more than one concurrent program.\footnote{
As a point of reference, back in the mid-1990s, Paul
witnessed a 16-CPU system running about 20 instances
of a certain high-end proprietary database.}
In that case, even if a demonic scheduler is not so demonic
as to inject idle cycles while there are runnable tasks,
it is easy to imagine such a scheduler consistently favoring
the other program over yours.
If both programs could consume all available CPU, then this
scheduler might not run your program at all.
One way to avoid these issues is to simply avoid overload
conditions.
This is often the preferred approach in production, where load
balancers direct traffic away from overloaded systems.
And if all systems are overloaded, it is not unheard of to simply
\emph{shed load}, that is, to drop the low-priority incoming requests.
Nor is this approach limited to computing, as those who have
suffered through a rolling blackout can attest.
But load-shedding is often considered a bad thing by those
whose load is being shed.
As always, choose wisely!
}\QuickQuizEnd
To reiterate, despite these differences, workarounds, and objections,
a number of NBS algorithms are extremely useful and are heavily used
in practice.
As always, use the right tool for the job!
\subsubsection{NBS Underlying Operations}
\label{sec:advsync:NBS Underlying Operations}
An NBS algorithm can be truly non-blocking only if the underlying
operations that it uses are also non-blocking.
Surprisingly often, this turns out not to be so in practice.
For example, non-blocking algorithms often allocate memory.
In theory, this is fine, given the existence of lock-free
memory allocators~\cite{MagedMichael04NBSmalloc}.
But in practice, most environments must eventually obtain memory from
operating-system kernels, which commonly use locking.
Therefore, unless all the memory that will ever be needed is somehow
preallocated, a ``non-blocking'' algorithm that allocates memory will not
be non-blocking when running on common-case real-world computer systems.
This same point clearly also applies to algorithms performing I/O
operations or otherwise interacting with their environment.
Perhaps surprisingly, this point also applies to ostensibly non-blocking
algorithms that do only plain loads and stores, such as the counters
discussed in \cref{sec:advsync:NBS Counters}.
And at first glance, those loads and stores that can be compiled into
single load and store instructions, respectively, would seem to be
not just non-blocking, but bounded population-oblivious wait free.
Except that load and store instructions are not necessarily either
fast or deterministic.
For example, as noted in \cref{chp:Hardware and its Habits}, cache misses
can consume thousands of CPU cycles.
Worse yet, the measured cache-miss latencies can be a function of the
number of CPUs, as illustrated in
\cref{fig:count:Atomic Increment Scalability on x86}.
It is only reasonable to assume that these latencies also depend on
the details of the system's interconnect.
In addition, given that hardware vendors generally do not publish
upper bounds for cache-miss latencies, it seems brave to assume that
memory-reference instructions are in fact wait-free in modern computer
systems.
And the antique systems for which such bounds are available suffer from
profound overall slowness.
Furthermore, hardware is not the only source of slowness for
memory-reference instructions.
For example, when running on typical computer systems, both loads and
stores can result in page faults.
Which cause in-kernel page-fault handlers to be invoked.
Which might acquire locks, or even do I/O, potentially even using
something like network file system (NFS)\@.
All of which are most emphatically blocking operations.
Nor are page faults the only kernel-induced hazard.
A given CPU might be interrupted at any time, and the interrupt
handler might run for some time.
During this time, the user-mode ostensibly non-blocking algorithm
will not be running at all.
This situation raises interesting questions about the forward-progress
guarantees provided by system calls relying on interrupts, for example,
the \co{membarrier()} system call.
Things do look bleak, but the non-blocking nature of such algorithms
can be at least partially redeemed using a number of approaches:
\begin{enumerate}
\item Run on bare metal, with paging disabled.
If you are both brave and confident that you can write code that
is free of wild-pointer bugs, this approach might be for you.
\item Run on a non-blocking operating-system kernel~\cite{Cheriton96a}.
Such kernels are quite rare, in part because they have
traditionally completely failed to provide the hoped-for
performance and scalability advantages over lock-based kernels.
But perhaps you should write one.
\item Use facilities such as \co{mlockall()} to avoid page faults
(see the sketch following this list), while also ensuring that
your program preallocates all the memory it will ever need at
boot time.
This can work well, but at the expense of severe common-case
underutilization of memory.
In environments that are cost-constrained or power-limited,
this approach is not likely to be feasible.
\item Use facilities such as the Linux kernel's
\co{NO_HZ_FULL} tickless mode~\cite{JonCorbet2013NO-HZ-FULL}.
In recent versions of the Linux kernel, this mode directs
interrupts away from a designated set of CPUs.
However, this can sharply limit throughput for applications that
are I/O bound during even part of their operation.
\end{enumerate}
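
As an illustration of the \co{mlockall()} approach called out above,
here is a minimal sketch using that POSIX system call:

\begin{VerbatimL}
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	/* Lock all current and future pages into RAM so
	 * that neither loads nor stores take page faults. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		exit(EXIT_FAILURE);
	}
	/* ... preallocate and touch all needed memory ... */
	return 0;
}
\end{VerbatimL}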
Given these considerations, it is no surprise that non-blocking
synchronization is far more important in theory than it is in practice.
\subsubsection{NBS Subdivided Operations}
\label{sec:advsync:NBS Subdivided Operations}
One common trick that provides a given algorithm a loftier place on
the NBS ranking is to replace blocking operations with a polling API\@.
For example, instead of providing a reliable dequeue operation that might be
merely lock-free or even blocking, provide a dequeue operation
that will spuriously fail in a wait-free manner rather than exhibiting
dreaded lock-free or blocking behaviors.
This can work well in theory, but a common effect in practice is to
merely move the lock-free or blocking behavior out of that specific
algorithm and into the hapless code making use of that algorithm.
In such cases, not only has nothing been gained by this trick, but
the complexity of all users of this algorithm has increased.
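
A hypothetical sketch (all names invented for illustration) of such a
polling interface and of the caller into which the blocking behavior
has been moved:

\begin{VerbatimL}
#include <pthread.h>
#include <stddef.h>

struct elem { struct elem *next; };

struct queue {	/* Hypothetical lock-based queue. */
	pthread_mutex_t lock;
	struct elem *head;
};

/* Wait-free: fails immediately rather than blocking. */
struct elem *try_dequeue(struct queue *qp)
{
	struct elem *ep;

	if (pthread_mutex_trylock(&qp->lock) != 0)
		return NULL; /* Spurious failure. */
	ep = qp->head;
	if (ep)
		qp->head = ep->next;
	pthread_mutex_unlock(&qp->lock);
	return ep;
}

/* Caller's retry loop: blocking behavior reappears. */
struct elem *dequeue(struct queue *qp)
{
	struct elem *ep;

	while ((ep = try_dequeue(qp)) == NULL)
		continue; /* Poll until an element arrives. */
	return ep;
}
\end{VerbatimL}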
With concurrent algorithms as elsewhere, maximizing a specific metric
is no substitute for thinking carefully about the needs of one's users.
\subsubsection{NBS Fail-Stop Tolerance}
\label{sec:advsync:NBS Fail-Stop Tolerance}
Of the classes of NBS algorithms, wait-free synchronization (bounded or
otherwise), lock-free synchronization, obstruction-free synchronization,
and clash-free synchronization guarantee forward progress even in the
presence of fail-stop bugs.
An example fail-stop bug might cause some thread to be preempted indefinitely.
As we will see, this fail-stop-tolerant property can be useful, but the
fact is that composing a set of fail-stop-tolerant mechanisms does not
necessarily result in a fail-stop-tolerant system.
To see this, consider a system made up of a series of wait-free queues,
where an element is removed from one queue in the series, processed,
and then added to the next queue.
If a thread is preempted in the midst of a queuing operation, in theory
all is well because the wait-free nature of the queue will guarantee
forward progress.
But in practice, the element being processed is lost because the
fail-stop-tolerant nature of the wait-free queues does not extend to
the code using those queues.
Nevertheless, there are a few applications where NBS's rather limited
fail-stop-tolerance is useful.
For example, in some network-based or web applications, a fail-stop
event will eventually result in a retransmission, which will restart
any work that was lost due to the fail-stop event.
Systems running such applications can therefore be heavily loaded, even
to the point where the scheduler can no longer provide any reasonable
fairness guarantee.
In contrast, if a thread fail-stops while holding a lock, the application
might need to be restarted.
Nevertheless, NBS is not a panacea even within this restricted area,
due to the possibility of spurious retransmissions due to pure scheduling
delays.
In some cases, it may be more efficient to reduce the load to avoid
queueing delays, which will also improve the scheduler's ability to
provide fair access, reducing or even eliminating the fail-stop events,
thus reducing the number of retry operations, in turn further reducing
the load.
\subsubsection{NBS Linearizability}
\label{sec:advsync:NBS Linearizability}
It is important to note that linearizability can be quite useful,
especially when analyzing concurrent code made up of strict locking
and fully ordered atomic operations.\footnote{
For example, the Linux kernel's value-returning atomic operations.}
Furthermore, this handling of fully ordered atomic operations
automatically covers simple NBS algorithms.
However, the linearization points of a complex NBS algorithm are often
buried deep within that algorithm, and are thus not visible to users of
a library function implementing a part of such an algorithm.
Therefore, any claims that users benefit from the linearizability properties
of complex NBS algorithms should be regarded with deep
suspicion~\cite{AndreasHaas2012FIFOisnt}.
It is sometimes asserted that linearizability is necessary for developers
to produce proofs of correctness for their concurrent code.
However, such proofs are the exception rather than the rule, and modern
developers who do produce proofs often use modern proof techniques that
do not depend on linearizability.
Furthermore, developers frequently use modern proof techniques that do
not require a full specification, given that developers often learn
their specification after the fact, one bug at a time.
A few such proof techniques were discussed in
\cref{chp:Formal Verification}.\footnote{
A memorable verbal discussion with an advocate of linearizability
resulted in the question:
``So the reason linearizability is important is to rescue 1980s
proof techniques?''
The advocate immediately replied in the affirmative, then spent
some time disparaging a particular modern proof technique.
Oddly enough, that technique was one of those successfully
applied to Linux-kernel RCU\@.}
It is often asserted that linearizability maps well to sequential
specifications, which are said to be more natural than are concurrent
specifications~\cite{SergioRajsbaum2020HistoryLinearizability}.
But this assertion fails to account for our highly concurrent objective
universe.
This universe can only be expected to select for ability to cope with
concurrency, especially for those participating in team sports or
overseeing small children.
In addition, given that the teaching of sequential
computing is still believed to be somewhat of a black
art~\cite{ElizabethPatitsas2020GradesNotBimodal}, it is reasonable
to expect that teaching of concurrent computing is in a similar state
of disarray.
Therefore, focusing on only one proof technique is unlikely to be a
good way forward.
Again, please understand that linearizability is quite useful in many
situations.
Then again, so is that venerable tool, the hammer.
But there comes a point in the field of computing where one should put
down the hammer and pick up a keyboard.
Similarly, it appears that there are times when linearizability is not
the best tool for the job.
To their credit, there are some linearizability advocates who are aware
of some of its shortcomings~\cite{SergioRajsbaum2020HistoryLinearizability}.
There are also proposals to extend linearizability, for example,
interval-linearizability, which is intended to handle the common case
of operations that require non-zero time to
complete~\cite{10.1145/3266457}.
It remains to be seen whether these proposals will result in theories
able to handle modern concurrent software artifacts, especially given
that several of the proof techniques discussed in \cref{chp:Formal
Verification} already handle many modern concurrent software artifacts.
\subsection{NBS Discussion}
\label{sec:advsync:NBS Discussion}
It is possible to create fully non-blocking queues~\cite{MichaelScott96};
however, such queues are much more complex than the half-NBS algorithm
outlined above.
The lesson here is to carefully consider your actual requirements.
Relaxing irrelevant requirements can often result in great
improvements in simplicity, performance, and scalability.
Recent research points to another important way to relax requirements.
It turns out that systems providing fair scheduling can enjoy most
of the benefits of wait-free synchronization even when running
algorithms that provide only non-blocking
synchronization, both in theory~\cite{DanAlitarh2013PracticalProgress}
and in practice~\cite{SamyAlBahra2013NBS}.
Because most schedulers used in production do in fact provide fairness,
the more-complex algorithms providing wait-free synchronization usually
provide no practical advantages over simpler and faster non-wait-free
algorithms.
Interestingly enough, fair scheduling is but one beneficial
constraint that is often respected in practice.
Other sets of constraints can permit blocking algorithms to
achieve deterministic real-time response.
For example, given:
\begin{enumerate*}[(1)]
\item Fair locks granted in FIFO order within a given priority level,
\item Priority inversion avoidance (for example, priority
inheritance~\cite{Takada:1995:RSN:527074.828566,Cai-DongWang1996PrioInherLock}
or priority ceiling),
\item A bounded number of threads,
\item Bounded critical section durations,
\item Bounded load,
and
\item Absence of fail-stop bugs,
\end{enumerate*}
lock-based applications can provide deterministic
response times~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
This approach of course blurs the distinction between blocking and wait-free
synchronization, which is all to the good.
Hopefully theoretical frameworks will continue to improve their ability
to describe software actually used in practice.
Those who feel that theory should lead the way are referred to the
inimitable Peter Denning, who said of operating systems:
``Theory follows practice''~\cite{Denning:2015:POF:2830903.2830904},
or to the eminent Tony Hoare, who said of the whole of engineering:
``In all branches of engineering science, the engineering starts before
the science; indeed, without the early products of engineering, there
would be nothing for the scientist to
study!''~\cite{RichardMorris2007TonyHoareInterview}.
Of course, once an appropriate body of theory becomes available, it is
wise to make use of it.
However, note well that the first \emph{appropriate} body of theory
is often one thing and the first \emph{proposed} body of theory quite
another.
\QuickQuiz{
It seems like the various members of the NBS hierarchy are
rather useless.
So why bother with them at all???
}\QuickQuizAnswer{
One advantage of the members of the NBS hierarchy is that they
are reasonably simple to define and use from a theoretical viewpoint.
We can hope that work done in the NBS arena will help lay the
groundwork for analysis of real-world forward-progress guarantees
for concurrent real-time programs.
However, as of 2022 it appears that trace-based methodologies
are in the lead~\cite{DanielBristot2019RTtrace}.
So why bother learning about NBS at all?
Because a great many people know of it, and are vaguely aware that
it is somehow related to real-time computing.
Their response to your carefully designed real-time constraints
might well be of the form ``Bah, just use wait-free algorithms!''.
In the all-too-common case where they are very convincing to your
management, you will need to understand NBS in order to bring
the discussion back to reality.
I hope that this section has provided you with the required
depth of understanding.
Another thing to note is that learning about the NBS hierarchy
is probably no more harmful than learning about transfinite
numbers or the computational-complexity hierarchy.
In all three cases, it is important to avoid over-applying
the theory.
Which is in and of itself good practice!
}\QuickQuizEnd
Proponents of NBS algorithms sometimes call out real-time computing as
an important NBS beneficiary.
The next section looks more deeply at the forward-progress needs of
real-time systems.
\input{advsync/rt}
\QuickQuizAnswersChp{qqzadvsync}