% advsync/advsync.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{sec:advsync:Advanced Synchronization}{Advanced Synchronization}{qqzadvsync}
%
\Epigraph{If a little knowledge is a dangerous thing, just think what
you could do with a lot of knowledge!}{Unknown}
This chapter covers synchronization techniques used for lockless
algorithms and parallel real-time systems.
Although lockless algorithms can be quite helpful when faced with
extreme requirements, they are no panacea.
For example, as noted at the end of \cref{chp:Counting},
you should thoroughly apply partitioning, batching, and
well-tested packaged weak APIs
(see \cref{chp:Data Ownership,chp:Deferred Processing})
before even thinking about lockless algorithms.
But after doing all that, you still might find yourself needing the
advanced techniques described in this chapter.
To that end,
\cref{sec:advsync:Avoiding Locks}
summarizes techniques used thus far for avoiding locks and
\cref{sec:advsync:Non-Blocking Synchronization}
gives a brief overview of non-blocking synchronization.
Memory ordering is also quite important, but it warrants its own
\lcnamecref{chp:Advanced Synchronization: Memory Ordering}, namely
\cref{chp:Advanced Synchronization: Memory Ordering}.
The second form of advanced synchronization provides the stronger
forward-progress guarantees needed for parallel real-time computing,
which is the topic of
\cref{sec:advsync:Parallel Real-Time Computing}.
\section{Avoiding Locks}
\label{sec:advsync:Avoiding Locks}
%
\epigraph{We are confronted with insurmountable opportunities.}
{Walt Kelly}
Although locking is the workhorse of parallelism in production, in
many situations performance, scalability, and real-time response can
all be greatly improved through use of lockless techniques.
A particularly impressive example of such a lockless technique is
the statistical counters described in
\cref{sec:count:Statistical Counters},
which avoid not only locks, but also read-modify-write atomic operations,
memory barriers, and even cache misses for counter increments.
Other examples we have covered include:
\begin{enumerate}
\item The fastpaths through a number of other counting algorithms
in \cref{chp:Counting}.
\item The fastpath through resource allocator caches in
\cref{sec:SMPdesign:Resource Allocator Caches}.
\item The maze solver in \cref{sec:SMPdesign:Beyond Partitioning}.
\item The data-ownership techniques in \cref{chp:Data Ownership}.
\item The reference-counting, hazard-pointer, and RCU techniques
in \cref{chp:Deferred Processing}.
\item The lookup code paths in \cref{chp:Data Structures}.
\item Many of the techniques in \cref{chp:Putting It All Together}.
\end{enumerate}
In short, lockless techniques are quite useful and are heavily used.
However, it is best if lockless techniques are hidden behind a
well-defined API, such as \co{inc_count()}, \co{memblock_alloc()},
\co{rcu_read_lock()}, and so on.
The reason for this is that undisciplined use of lockless techniques
is a good way to create difficult bugs.
If you believe that finding and fixing such bugs is easier than avoiding
them, please re-read
\cref{chp:Validation,chp:Formal Verification}.
\section{Non-Blocking Synchronization}
\label{sec:advsync:Non-Blocking Synchronization}
%
\epigraph{Never worry about theory as long as the machinery does what
it's supposed to do.}
{Robert A. Heinlein}
The term \IXBacrfst{nbs}~\cite{MauriceHerlihy90a}
describes eight classes of \IX{linearizable} algorithms with differing
\emph{\IXBpl{forward-progress guarantee}}~\cite{DanAlitarh2013PracticalProgress},
which are as follows:
\begin{enumerate}
\item \emph{\IXalth{Bounded population-oblivious wait-free}{bounded population-oblivious}{wait free} synchronization}:
Every thread will make progress within a specific finite period
of time, where this period of time is independent of the number
of threads~\cite{HerlihyShavit2008Textbook}.
This level is widely considered to be even less achievable than
bounded wait-free synchronization.
\item \emph{\IXalth{Bounded wait-free}{bounded}{wait free} synchronization}:
Every thread will make progress within
a specific finite period of time~\cite{Herlihy91}.
This level is widely considered to be unachievable, which might be why
Alistarh et al.\ omitted it~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Wait-free}{wait free} synchronization}:
Every thread will make progress
in finite time~\cite{Herlihy93}.
\item \emph{\IXalt{Lock-free}{lock free} synchronization}:
At least one thread will
make progress in finite time~\cite{Herlihy93}.
\item \emph{\IXalt{Obstruction-free}{obstruction free} synchronization}:
Every thread will make progress in finite time in the absence of
contention~\cite{HerlihyLM03}.
\item \emph{\IXalt{Clash-free}{clash free} synchronization}:
At least one thread will make progress in finite time in the absence of
contention~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Starvation-free}{starvation free} synchronization}:
Every thread will make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{\IXalt{Deadlock-free}{deadlock free} synchronization}:
At least one thread will make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\end{enumerate}
NBS class~1 was formulated some time before 2015,
classes~2, 3, and~4 were first formulated in the early 1990s,
class~5 was first formulated in the early 2000s,
and class~6 was first formulated in 2013.
The final two classes have seen informal use for a great many decades,
but were reformulated in 2013.
\QuickQuiz{
Given that there will always be a sharply limited number of
CPUs available, is population obliviousness really useful?
}\QuickQuizAnswer{
Given the surprisingly limited scalability of any number of
NBS algorithms, population obliviousness can be surprisingly
useful.
Nevertheless, the overall point of the question is valid.
It is not normally helpful for an algorithm to scale beyond the
size of the largest system it is ever going to run on.
}\QuickQuizEnd
In theory, any parallel algorithm can be cast into wait-free form,
but only a relatively small subset of NBS algorithms is in
common use.
A few of these are listed in the following section.
\subsection{Simple NBS}
\label{sec:advsync:Simple NBS}
Perhaps the simplest NBS algorithm is atomic update of an integer
counter using fetch-and-add (\co{atomic_add_return()}) primitives.
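
For example, a minimal userspace sketch of such a counter, using C11
atomics in place of the Linux kernel's \co{atomic_add_return()}, might
look like this:

\begin{VerbatimL}
#include <stdatomic.h>

static atomic_long counter;

/* Compiles to a single fetch-and-add instruction
 * on most modern systems. */
static inline long inc_counter(void)
{
	return atomic_fetch_add(&counter, 1) + 1;
}
\end{VerbatimL}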
This section lists a few additional commonly used NBS algorithms in
roughly increasing order of complexity.
\subsubsection{NBS Sets}
\label{sec:advsync:NBS Sets}
One simple NBS algorithm implements a set of integers in an array.
Here the array index indicates a value that might be a member of the set
and the array element indicates whether or not that value actually is
a set member.
The linearizability criterion for NBS algorithms requires that reads from
and updates to the array either use atomic instructions or be accompanied
by memory barriers, but in the not-uncommon case where linearizability
is not important, simple volatile loads and stores suffice, for example,
using \co{READ_ONCE()} and \co{WRITE_ONCE()}.
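
For example, here is a minimal sketch of such an array-based set for
the non-linearizable case, assuming the \co{READ_ONCE()} and
\co{WRITE_ONCE()} wrappers used throughout this book (the names are
otherwise hypothetical):

\begin{VerbatimL}
#define SET_MAX 1024
static char is_member[SET_MAX];

static inline void set_add(int v)
{
	WRITE_ONCE(is_member[v], 1);
}

static inline void set_del(int v)
{
	WRITE_ONCE(is_member[v], 0);
}

static inline int set_contains(int v)
{
	return READ_ONCE(is_member[v]);
}
\end{VerbatimL}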
An NBS set may also be implemented using a bitmap, where each value that
might be a member of the set corresponds to one bit.
Reads and updates must normally be carried out via atomic bit-manipulation
instructions, although compare-and-swap (\co{cmpxchg()} or CAS)
instructions can also be used.
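
A sketch of the bitmap variant using C11 atomic operations follows;
the names are hypothetical, and note that these C11 operations default
to sequentially consistent ordering, which may be stronger than a given
use case requires:

\begin{VerbatimL}
#include <stdatomic.h>
#include <limits.h>

#define SET_MAX 1024
#define BPW (sizeof(unsigned long) * CHAR_BIT)
static _Atomic unsigned long bits[SET_MAX / BPW];

static inline void bitset_add(int v)
{
	atomic_fetch_or(&bits[v / BPW], 1UL << (v % BPW));
}

static inline void bitset_del(int v)
{
	atomic_fetch_and(&bits[v / BPW],
	                 ~(1UL << (v % BPW)));
}

static inline int bitset_contains(int v)
{
	return !!(atomic_load(&bits[v / BPW]) &
	          (1UL << (v % BPW)));
}
\end{VerbatimL}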
\subsubsection{NBS Counters}
\label{sec:advsync:NBS Counters}
The statistical counters algorithm discussed in
\cref{sec:count:Statistical Counters}
can be considered to be bounded-wait-free, but only by using a cute
definitional trick in which the sum is considered to be approximate
rather than exact.\footnote{
Citation needed.
I heard of this trick verbally from Mark Moir.}
Given sufficiently wide error bounds that are a function of the length
of time that the \co{read_count()} function takes to sum the counters,
it is not possible to prove that any non-linearizable behavior occurred.
This definitely (if a bit artificially) classifies the statistical-counters
algorithm as bounded wait-free.
This algorithm is probably the most heavily used NBS algorithm in the
Linux kernel.
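
For reference, here is a sketch of the read side along the lines of the
\path{count_stat.c} implementation from \cref{chp:Counting}, assuming
this book's per-thread-variable API and that file's \co{counter}
definition:

\begin{VerbatimL}
static unsigned long read_count(void)
{
	int t;
	unsigned long sum = 0;

	/* Sum the per-thread counters with plain loads. */
	for_each_thread(t)
		sum += READ_ONCE(per_thread(counter, t));
	return sum; /* Exact only within the error bounds. */
}
\end{VerbatimL}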
\subsubsection{Half-NBS Queue}
\label{sec:advsync:Half-NBS Queue}
\begin{fcvref}[ln:advsync:NBS Enqueue Algorithm]
Another common NBS algorithm is the atomic queue where elements are
enqueued using an atomic exchange instruction~\cite{MagedMichael1993JPDC},
followed by a store into the \co{->next} pointer of the new element's
predecessor, as shown in \cref{lst:advsync:NBS Enqueue Algorithm},
which shows the userspace-RCU library
implementation~\cite{MathieuDesnoyers2009URCU}.
\Clnref{tail} updates the tail pointer to reference the new element while
returning a reference to its predecessor, which is stored in
local variable \co{old_tail}.
\Clnref{pred} then updates the predecessor's \co{->next} pointer to
reference the newly added element, and finally \clnref{ret}
returns an indication as to whether or not the queue was initially
empty.
\begin{listing}
\begin{fcvlabel}[ln:advsync:NBS Enqueue Algorithm]
\begin{VerbatimL}[commandchars=\\\[\]]
static inline bool
___cds_wfcq_append(struct cds_wfcq_head *head,
                   struct cds_wfcq_tail *tail,
                   struct cds_wfcq_node *new_head,
                   struct cds_wfcq_node *new_tail)
{
        struct cds_wfcq_node *old_tail;

        old_tail = uatomic_xchg(&tail->p, new_tail); \lnlbl[tail]
        CMM_STORE_SHARED(old_tail->next, new_head);  \lnlbl[pred]
        return old_tail != &head->node;              \lnlbl[ret]
}

static inline bool
_cds_wfcq_enqueue(struct cds_wfcq_head *head,
                  struct cds_wfcq_tail *tail,
                  struct cds_wfcq_node *new_tail)
{
        return ___cds_wfcq_append(head, tail,
                                  new_tail, new_tail);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{NBS Enqueue Algorithm}
\label{lst:advsync:NBS Enqueue Algorithm}
\end{listing}
Although mutual exclusion is required to dequeue a single element
(so that dequeue is blocking), it is possible to carry out a non-blocking
removal of the entire contents of the queue.
What is not possible is to dequeue any given element in a non-blocking
manner:
The enqueuer might have failed between \clnref{tail,pred} of the
listing, so that the element in question is only partially enqueued.
This results in a half-NBS algorithm where enqueues are NBS but
dequeues are blocking.
This algorithm is nevertheless heavily used in practice, in part because
most production software is not required to tolerate arbitrary fail-stop
errors.
\end{fcvref}
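
For those wishing to experiment, here is a hedged sketch of how the
corresponding public liburcu API might be invoked.
The \co{myelem} type and function names are hypothetical, and exact
signatures should be checked against your version of the library:

\begin{VerbatimL}
#include <urcu/wfcqueue.h>

struct myelem {	/* Hypothetical element type. */
	struct cds_wfcq_node node;
	int data;
};

struct cds_wfcq_head myhead;
struct cds_wfcq_tail mytail;

void myqueue_init(void)
{
	cds_wfcq_init(&myhead, &mytail);
}

void myelem_enqueue(struct myelem *ep)
{
	cds_wfcq_node_init(&ep->node);
	/* Non-blocking enqueue; returns false if the
	 * queue was empty beforehand. */
	(void)cds_wfcq_enqueue(&myhead, &mytail,
	                       &ep->node);
}
\end{VerbatimL}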
\QuickQuiz{
Wait!
In order to dequeue all elements, both the \co{->head} and
\co{->tail} pointers must be changed, which cannot be done
atomically on typical computer systems.
So how is this supposed to work???
}\QuickQuizAnswer{
One pointer at a time!
First, atomically exchange the \co{->head} pointer with \co{NULL}.
If the return value from the atomic exchange operation is \co{NULL},
the queue was empty and you are done.
And if someone else attempts a dequeue-all at this point,
they will get back a \co{NULL} pointer.
Otherwise, atomically exchange the \co{->tail} pointer with a
pointer to the now-\co{NULL} \co{->head} pointer.
The return value from the atomic exchange operation is a pointer
to the \co{->next} field of the eventual last element on the list.
Producing and testing actual code is left as an exercise for the
interested and enthusiastic reader, as are strategies for handling
half-enqueued elements.
}\QuickQuizEnd
\subsubsection{NBS Stack}
\label{sec:advsync:NBS Stack}
\begin{fcvref}[ln:advsync:lifo_push:whole]
\Cref{lst:advsync:NBS Stack Algorithm}
shows the LIFO push algorithm, which boasts lock-free push and
bounded wait-free pop (\path{lifo-push.c}), forming an NBS stack.
The origins of this algorithm are unknown, but it was referred to in
a patent granted in 1975~\cite{PaulJBrown1975LIFOpush}.
This patent was filed in 1973, a few months before your editor
saw his first computer, which had but one CPU\@.
\begin{listing}
\input{CodeSamples/advsync/lifo_push@whole.fcv}
\caption{NBS Stack Algorithm}
\label{lst:advsync:NBS Stack Algorithm}
\end{listing}
\Clnrefrange{struct:b}{struct:e} show the \co{node_t} structure,
which contains an arbitrary value and a pointer to the next structure
on the stack and
\clnref{top} shows the top-of-stack pointer.
The \co{list_push()} function spans \clnrefrange{push:b}{push:e}.
\Clnref{push:alloc} allocates a new node and
\clnref{push:initialize} initializes it.
\Clnref{push:next} initializes the newly allocated node's \co{->next}
pointer, and \clnref{push:cmpxchg} attempts to push it on the stack.
If \clnref{push:check} detects \co{cmpxchg()} failure, another pass
through the loop retries.
Otherwise, the new node has been successfully pushed, and this function
returns to its caller.
Note that \clnref{push:check} resolves races in which two concurrent
instances of \co{list_push()} attempt to push onto the stack.
The \co{cmpxchg()} will succeed for one and fail for the other,
causing the other to retry, thereby selecting an arbitrary order for
the two nodes on the stack.
The \co{list_pop_all()} function spans \clnrefrange{popall:b}{popall:e}.
The \co{xchg()} statement on \clnref{popall:xchg} atomically removes
all nodes on the stack, placing the head of the resulting list in local
variable \co{p} and setting \co{top} to \co{NULL}.
This atomic operation serializes concurrent calls to \co{list_pop_all()}:
One of them will get the list, and the other a \co{NULL} pointer, at
least assuming that there were no concurrent calls to \co{list_push()}.
An instance of \co{list_pop_all()} that obtains a non-empty list in
\co{p} processes this list in the loop spanning
\clnrefrange{popall:loop:b}{popall:loop:e}.
\Clnref{popall:next} prefetches the \co{->next} pointer,
\clnref{popall:foo} invokes the function referenced by \co{foo()} on the
current node,
\clnref{popall:free} frees the current node, and
\clnref{popall:pnext} sets up \co{p} for the next pass through the loop.
But suppose that a pair of \co{list_push()} instances run concurrently
with a \co{list_pop_all()} on a list initially containing a single
\Node{A}.
Here is one way that this scenario might play out:
\begin{enumerate}
\item The first \co{list_push()} instance pushes a new \Node{B},
executing through \clnref{push:next}, having just stored
a pointer to \Node{A} into \Node{B}'s \co{->next} pointer.
\item The \co{list_pop_all()} instance runs to completion,
setting \co{top} to \co{NULL} and freeing \Node{A}.
\item The second \co{list_push()} instance runs to completion,
pushing a new \Node{C}, but happens to allocate the memory
that used to belong to \Node{A}.
\item The first \co{list_push()} instance executes the \co{cmpxchg()}
on \clnref{push:cmpxchg}.
Because new \Node{C} has the same address as the newly freed \Node{A},
this \co{cmpxchg()} succeeds and this \co{list_push()} instance
runs to completion.
\end{enumerate}
Note that both pushes and the popall all ran successfully despite the
reuse of \Node{A}'s memory.
This is an unusual property:
Most data structures require protection against what is often called
the ABA problem.
But this property holds only for algorithms written in assembly
language.
The sad fact is that most languages (including C and C++) do not support
pointers to lifetime-ended objects, such as the pointer to the old \Node{A}
contained in \Node{B}'s \co{->next} pointer.
In fact, compilers are within their rights to assume that if two pointers
(call them \co{p} and \co{q}) were returned from two different calls to
\co{malloc()}, then those pointers must not be equal.
Real compilers really will generate the constant \co{false} in
response to a \co{p==q} comparison.
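
The following hypothetical program (not from this book's code samples)
illustrates the hazard, which stems from the fact that comparing a
pointer to a lifetime-ended object is undefined behavior in C:

\begin{VerbatimL}
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	int *p = malloc(sizeof(*p));
	int *q;

	free(p);
	q = malloc(sizeof(*q)); /* Might reuse p's memory. */
	/* Undefined behavior: p's lifetime has ended, so
	 * the compiler may fold this comparison to false
	 * even if the two addresses are bitwise equal. */
	if (p == q)
		printf("Allocator reused the memory.\n");
	free(q);
	return 0;
}
\end{VerbatimL}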
A pointer to an object that has been freed, but whose memory has been
reallocated for a compatibly typed object is termed a \emph{zombie pointer}.
Many concurrent applications avoid this problem by carefully hiding the
memory allocator from the compiler, thus preventing the compiler from
making inappropriate assumptions.
This obfuscatory approach currently works in practice, but might well
one day fall victim to increasingly aggressive optimizers.
There is work underway in both the C and C++ standards committees
to address this
problem~\cite{PaulEMcKenney2019PointerLifetimeEndZap,PaulEMcKenney2020PointerLifetimeEndZapCpp}.
In the meantime, please exercise great care when coding ABA-tolerant
algorithms.
\end{fcvref}
\QuickQuiz{
So why not ditch antique languages like C and C++ for something
more modern?
}\QuickQuizAnswer{
That won't help unless the proponents of more-modern languages
are energetic enough to write their own compiler backends.
The usual practice of re-using existing backends also reuses
charming properties such as refusal to support pointers to
lifetime-ended objects.
}\QuickQuizEnd
\subsection{Applicability of NBS Benefits}
\label{sec:advsync:Applicability of NBS Benefits}
The most heavily cited NBS benefits stem from its forward-progress
guarantees, its tolerance of fail-stop bugs, and from its linearizability.
Each of these is discussed in one of the following sections.
\subsubsection{NBS Forward Progress Guarantees}
\label{sec:advsync:NBS Forward Progress Guarantees}
NBS's forward-progress guarantees have caused many to suggest its use in
real-time systems, and NBS algorithms are in fact used in a great many
such systems.
However, it is important to note that forward-progress guarantees are
largely orthogonal to those that form the basis of real-time programming:
\begin{enumerate}
\item Real-time forward-progress guarantees usually have some
definite time associated with them, for example,
``\IXh{scheduling}{latency} must be less than 100 microseconds.''
In contrast, the most popular forms of NBS guarantee only
that progress will be made in finite time, with no definite
bound.
\item Real-time forward-progress guarantees are often
probabilistic, as in the soft-real-time guarantee that
``at least 99.9\pct\ of the time, scheduling latency must
be less than 100 microseconds.''
In contrast, many of NBS's forward-progress guarantees are
unconditional, which can impose additional unnecessary complexity
and performance degradation.
\item Real-time forward-progress guarantees are often conditioned on
environmental constraints, for example, only being honored:
\begin{enumerate*}[(1)]
\item For the highest-priority tasks,
\item When each CPU spends at least a certain fraction of its time idle,
and
\item When I/O rates are below some specified maximum.
\end{enumerate*}
In contrast, NBS's forward-progress guarantees are often
unconditional, again incurring unnecessary complexity and
performance degradation.
On the other hand, some recent NBS work accommodates some
conditional
guarantees~\cite{DanAlitarh2013PracticalProgress}.
\item An important component of a real-time program's environment
is the scheduler.
NBS algorithms assume a worst-case \emph{demonic scheduler},
though for whatever reason, not a scheduler so demonic that
it simply refuses to ever run the application housing the NBS
algorithm.
In contrast, real-time systems assume that the scheduler is
doing its level best to satisfy any scheduling constraints
it knows about, and, in the absence of such constraints,
its level best to honor process priorities and to provide
fair scheduling to processes of the same priority.
Non-demonic schedulers allow real-time programs to use simpler
algorithms than those required for
NBS~\cite{DanAlitarh2013PracticalProgress,BjoernBrandenburgPhD}.
\item The progress guarantees provided by non-blocking synchronization
are most useful in overload conditions.
However, overload conditions are most likely to cause queueing
delays that, in many real-world systems, will cause the program
(or even the system running that program) to be terminated
with prejudice.
As a rough rule of thumb, latency increases as the reciprocal
of the idle time (see the formula following this list).
In other words, if a nearly idle system has
latency $T$, then a 50\pct\ idle system will have latency $2 T$, a
10\pct\ idle (90\pct\ utilized) system will have latency $10 T$, a 1\pct\
idle system (99\pct\ utilized) will have latency $100 T$, and so on.
This situation means that many latency-sensitive systems will
actively limit load, for example, to 50\pct.
\item In the not-uncommon case where a given computed result is nice
to have rather than critically important, use of timeouts can
cause a blocking operation to have non-blocking properties that
are often sufficient in practice.
\item NBS forward-progress guarantee classes assume that a number of
underlying operations (such as system calls) are lock-free or
even wait-free, when in fact these operations are blocking on
common-case computer systems.
\item NBS forward-progress guarantees are often achieved by subdividing
operations.
For example, in order to avoid a blocking dequeue operation,
an NBS algorithm might substitute a non-blocking polling
operation.
This is fine in theory, but not helpful in practice to real-world
programs that require an element to propagate through the queue
in a timely fashion.
\item Real-time forward-progress guarantees usually apply only
in the absence of software bugs.
In contrast, many classes of NBS guarantees apply even in the
face of fail-stop bugs, though obviously not in the face of
software bugs in the general arbitrarily misbehaving case.
\item NBS forward-progress guarantee classes imply linearizability.
In contrast, real-time forward progress guarantees usually do
not require ordering constraints such as linearizability.
Once again, providing linearizability when it is not needed
can unnecessarily increase complexity and decrease performance.
\item Although the simpler NBS algorithms combine exemplary simplicity,
forward-progress guarantees, and performance, other
NBS algorithms are quite complex.
This added complexity can result in increased development
and maintenance costs, decreased performance, and decreased
reliability.
\end{enumerate}
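
To make the latency rule of thumb from the list above concrete, let $T$
be the latency of a nearly idle system and let $f$ be the fraction of
time that the system is idle.
Then the rule of thumb can be written as:

\[
\mbox{latency} \approx \frac{T}{f}
\]

For example, $f = 0.5$ gives latency $2T$ and $f = 0.01$ gives latency
$100T$, in accordance with the examples above.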
\QuickQuiz{
Why does anyone care about demonic schedulers?
}\QuickQuizAnswer{
A demonic scheduler is one way to model an insanely overloaded
system.
After all, if you have an algorithm that you can prove runs
reasonably given a demonic scheduler, mere overload should
be no problem, right?
On the other hand, it is only reasonable to ask if a demonic
scheduler is really the best way to model overload conditions.
And perhaps it is time for more accurate models.
For one thing, a system might be overloaded in any of a number
of ways.
After all, an NBS algorithm that works fine on a demonic scheduler
might or might not do well in out-of-memory conditions, when
mass storage fills, or when the network is congested.
Except that systems' core counts have been increasing, which
means that an overloaded system is quite likely to be running
more than one concurrent program.\footnote{
As a point of reference, back in the mid-1990s, Paul
witnessed a 16-CPU system running about 20 instances
of a certain high-end proprietary database.}
In that case, even if a demonic scheduler is not so demonic
as to inject idle cycles while there are runnable tasks,
it is easy to imagine such a scheduler consistently favoring
the other program over yours.
If both programs could consume all available CPU, then this
scheduler might not run your program at all.
One way to avoid these issues is to simply avoid overload
conditions.
This is often the preferred approach in production, where load
balancers direct traffic away from overloaded systems.
And if all systems are overloaded, it is not unheard of to simply
\emph{shed load}, that is, to drop the low-priority incoming requests.
Nor is this approach limited to computing, as those who have
suffered through a rolling blackout can attest.
But load-shedding is often considered a bad thing by those
whose load is being shed.
As always, choose wisely!
}\QuickQuizEnd
To reiterate, despite these differences, workarounds, and objections,
a number of NBS algorithms are extremely useful and are heavily used
in practice.
As always, use the right tool for the job!
\subsubsection{NBS Underlying Operations}
\label{sec:advsync:NBS Underlying Operations}
An NBS algorithm can be truly non-blocking only if the underlying
operations that it uses are also non-blocking.
Surprisingly often, this turns out not to be so in practice.
For example, non-blocking algorithms often allocate memory.
In theory, this is fine, given the existence of lock-free
memory allocators~\cite{MagedMichael04NBSmalloc}.
But in practice, most environments must eventually obtain memory from
operating-system kernels, which commonly use locking.
Therefore, unless all the memory that will ever be needed is somehow
preallocated, a ``non-blocking'' algorithm that allocates memory will not
be non-blocking when running on common-case real-world computer systems.
This same point clearly also applies to algorithms performing I/O
operations or otherwise interacting with their environment.
Perhaps surprisingly, this point also applies to ostensibly non-blocking
algorithms that do only plain loads and stores, such as the counters
discussed in \cref{sec:advsync:NBS Counters}.
And at first glance, those loads and stores that can be compiled into
single load and store instructions, respectively, would seem to be
not just non-blocking, but bounded population-oblivious wait free.
Except that load and store instructions are not necessarily either
fast or deterministic.
For example, as noted in \cref{chp:Hardware and its Habits}, cache misses
can consume thousands of CPU cycles.
Worse yet, the measured cache-miss latencies can be a function of the
number of CPUs, as illustrated in
\cref{fig:count:Atomic Increment Scalability on x86}.
It is only reasonable to assume that these latencies also depend on
the details of the system's interconnect.
In addition, given that hardware vendors generally do not publish
upper bounds for cache-miss latencies, it seems brave to assume that
memory-reference instructions are in fact wait-free in modern computer
systems.
And the antique systems for which such bounds are available suffer from
profound overall slowness.
Furthermore, hardware is not the only source of slowness for
memory-reference instructions.
For example, when running on typical computer systems, both loads and
stores can result in page faults.
Which cause in-kernel page-fault handlers to be invoked.
Which might acquire locks, or even do I/O, potentially even using
something like network file system (NFS)\@.
All of which are most emphatically blocking operations.
Nor are page faults the only kernel-induced hazard.
A given CPU might be interrupted at any time, and the interrupt
handler might run for some time.
During this time, the user-mode ostensibly non-blocking algorithm
will not be running at all.
This situation raises interesting questions about the forward-progress
guarantees provided by system calls relying on interrupts, for example,
the \co{membarrier()} system call.
Things do look bleak, but the non-blocking nature of such algorithms
can be at least partially redeemed using a number of approaches:
\begin{enumerate}
\item Run on bare metal, with paging disabled.
If you are both brave and confident that you can write code that
is free of wild-pointer bugs, this approach might be for you.
\item Run on a non-blocking operating-system kernel~\cite{Cheriton96a}.
Such kernels are quite rare, in part because they have
traditionally completely failed to provide the hoped-for
performance and scalability advantages over lock-based kernels.
But perhaps you should write one.
\item Use facilities such as \co{mlockall()} to avoid page faults
(see the sketch following this list), while also ensuring that
your program preallocates all the memory it will ever need at
boot time.
This can work well, but at the expense of severe common-case
underutilization of memory.
In environments that are cost-constrained or power-limited,
this approach is not likely to be feasible.
\item Use facilities such as the Linux kernel's
\co{NO_HZ_FULL} tickless mode~\cite{JonCorbet2013NO-HZ-FULL}.
In recent versions of the Linux kernel, this mode directs
interrupts away from a designated set of CPUs.
However, this can sharply limit throughput for applications that
are I/O bound during even part of their operation.
\end{enumerate}
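
As an illustration of the \co{mlockall()} approach called out above,
here is a minimal sketch using that POSIX system call:

\begin{VerbatimL}
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	/* Lock all current and future pages into RAM so
	 * that neither loads nor stores take page faults. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		exit(EXIT_FAILURE);
	}
	/* ... preallocate and touch all needed memory ... */
	return 0;
}
\end{VerbatimL}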
Given these considerations, it is no surprise that non-blocking
synchronization is far more important in theory than it is in practice.
\subsubsection{NBS Subdivided Operations}
\label{sec:advsync:NBS Subdivided Operations}
One common trick that provides a given algorithm a loftier place on
the NBS ranking is to replace blocking operations with a polling API\@.
For example, instead of providing a reliable dequeue operation that might be
merely lock-free or even blocking, provide a dequeue operation
that will spuriously fail in a wait-free manner rather than exhibiting
dreaded lock-free or blocking behaviors.
This can work well in theory, but a common effect in practice is to
merely move the lock-free or blocking behavior out of that specific
algorithm and into the hapless code making use of that algorithm.
In such cases, not only has nothing been gained by this trick, but
the complexity of all users of this algorithm has increased.
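
A hypothetical sketch (all names invented for illustration) of such a
polling interface and of the caller into which the blocking behavior
has been moved:

\begin{VerbatimL}
#include <pthread.h>
#include <stddef.h>

struct elem { struct elem *next; };

struct queue {	/* Hypothetical lock-based queue. */
	pthread_mutex_t lock;
	struct elem *head;
};

/* Wait-free: fails immediately rather than blocking. */
struct elem *try_dequeue(struct queue *qp)
{
	struct elem *ep;

	if (pthread_mutex_trylock(&qp->lock) != 0)
		return NULL; /* Spurious failure. */
	ep = qp->head;
	if (ep)
		qp->head = ep->next;
	pthread_mutex_unlock(&qp->lock);
	return ep;
}

/* Caller's retry loop: blocking behavior reappears. */
struct elem *dequeue(struct queue *qp)
{
	struct elem *ep;

	while ((ep = try_dequeue(qp)) == NULL)
		continue; /* Poll until an element arrives. */
	return ep;
}
\end{VerbatimL}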
With concurrent algorithms as elsewhere, maximizing a specific metric
is no substitute for thinking carefully about the needs of one's users.
\subsubsection{NBS Fail-Stop Tolerance}
\label{sec:advsync:NBS Fail-Stop Tolerance}
Of the classes of NBS algorithms, wait-free synchronization (bounded or
otherwise), lock-free synchronization, obstruction-free synchronization,
and clash-free synchronization guarantee forward progress even in the
presence of fail-stop bugs.
An example fail-stop bug might cause some thread to be preempted indefinitely.
As we will see, this fail-stop-tolerant property can be useful, but the
fact is that composing a set of fail-stop-tolerant mechanisms does not
necessarily result in a fail-stop-tolerant system.
To see this, consider a system made up of a series of wait-free queues,
where an element is removed from one queue in the series, processed,
and then added to the next queue.
If a thread is preempted in the midst of a queuing operation, in theory
all is well because the wait-free nature of the queue will guarantee
forward progress.
But in practice, the element being processed is lost because the
fail-stop-tolerant nature of the wait-free queues does not extend to
the code using those queues.
Nevertheless, there are a few applications where NBS's rather limited
fail-stop-tolerance is useful.
For example, in some network-based or web applications, a fail-stop
event will eventually result in a retransmission, which will restart
any work that was lost due to the fail-stop event.
Systems running such applications can therefore be heavily loaded, even
to the point where the scheduler can no longer provide any reasonable
fairness guarantee.
In contrast, if a thread fail-stops while holding a lock, the application
might need to be restarted.
Nevertheless, NBS is not a panacea even within this restricted area,
due to the possibility of spurious retransmissions due to pure scheduling
delays.
In some cases, it may be more efficient to reduce the load to avoid
queueing delays, which will also improve the scheduler's ability to
provide fair access, reducing or even eliminating the fail-stop events,
thus reducing the number of retry operations, in turn further reducing
the load.
\subsubsection{NBS Linearizability}
\label{sec:advsync:NBS Linearizability}
It is important to note that linearizability can be quite useful,
especially when analyzing concurrent code made up of strict locking
and fully ordered atomic operations.\footnote{
For example, the Linux kernel's value-returning atomic operations.}
Furthermore, this handling of fully ordered atomic operations
automatically covers simple NBS algorithms.
However, the linearization points of a complex NBS algorithm are often
buried deep within that algorithm, and are thus not visible to users of
a library function implementing a part of such an algorithm.
Therefore, any claims that users benefit from the linearizability properties
of complex NBS algorithms should be regarded with deep
suspicion~\cite{AndreasHaas2012FIFOisnt}.
It is sometimes asserted that linearizability is necessary for developers
to produce proofs of correctness for their concurrent code.
However, such proofs are the exception rather than the rule, and modern
developers who do produce proofs often use modern proof techniques that
do not depend on linearizability.
Furthermore, developers frequently use modern proof techniques that do
not require a full specification, given that developers often learn
their specification after the fact, one bug at a time.
A few such proof techniques were discussed in
\cref{chp:Formal Verification}.\footnote{
A memorable verbal discussion with an advocate of linearizability
resulted in the question:
``So the reason linearizability is important is to rescue 1980s
proof techniques?''
The advocate immediately replied in the affirmative, then spent
some time disparaging a particular modern proof technique.
Oddly enough, that technique was one of those successfully
applied to Linux-kernel RCU\@.}
It is often asserted that linearizability maps well to sequential
specifications, which are said to be more natural than are concurrent
specifications~\cite{SergioRajsbaum2020HistoryLinearizability}.
But this assertion fails to account for our highly concurrent objective
universe.
This universe can only be expected to select for ability to cope with
concurrency, especially for those participating in team sports or
overseeing small children.
In addition, given that the teaching of sequential
computing is still believed to be somewhat of a black
art~\cite{ElizabethPatitsas2020GradesNotBimodal}, it is reasonable
to expect that teaching of concurrent computing is in a similar state
of disarray.
Therefore, focusing on only one proof technique is unlikely to be a
good way forward.
Again, please understand that linearizability is quite useful in many
situations.
Then again, so is that venerable tool, the hammer.
But there comes a point in the field of computing where one should put
down the hammer and pick up a keyboard.
Similarly, it appears that there are times when linearizability is not
the best tool for the job.
To their credit, there are some linearizability advocates who are aware
of some of its shortcomings~\cite{SergioRajsbaum2020HistoryLinearizability}.
There are also proposals to extend linearizability, for example,
interval-linearizability, which is intended to handle the common case
of operations that require non-zero time to
complete~\cite{10.1145/3266457}.
It remains to be seen whether these proposals will result in theories
able to handle modern concurrent software artifacts, especially given
that several of the proof techniques discussed in \cref{chp:Formal
Verification} already handle many modern concurrent software artifacts.
\subsection{NBS Discussion}
\label{sec:advsync:NBS Discussion}
It is possible to create fully non-blocking queues~\cite{MichaelScott96};
however, such queues are much more complex than the half-NBS algorithm
outlined above.
The lesson here is to carefully consider your actual requirements.
Relaxing irrelevant requirements can often result in great
improvements in simplicity, performance, and scalability.
Recent research points to another important way to relax requirements.
It turns out that systems providing fair scheduling can enjoy most
of the benefits of wait-free synchronization even when running
algorithms that provide only non-blocking
synchronization, both in theory~\cite{DanAlitarh2013PracticalProgress}
and in practice~\cite{SamyAlBahra2013NBS}.
Because most schedulers used in production do in fact provide fairness,
the more-complex algorithms providing wait-free synchronization usually
provide no practical advantages over simpler and faster non-wait-free
algorithms.
Interestingly enough, fair scheduling is but one beneficial
constraint that is often respected in practice.
Other sets of constraints can permit blocking algorithms to
achieve deterministic real-time response.
For example, given:
\begin{enumerate*}[(1)]
\item Fair locks granted in FIFO order within a given priority level,
\item Priority inversion avoidance (for example, priority
inheritance~\cite{Takada:1995:RSN:527074.828566,Cai-DongWang1996PrioInherLock}
or priority ceiling),
\item A bounded number of threads,
\item Bounded critical section durations,
\item Bounded load,
and
\item Absence of fail-stop bugs,
\end{enumerate*}
lock-based applications can provide deterministic
response times~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
This approach of course blurs the distinction between blocking and wait-free
synchronization, which is all to the good.
Hopefully theoretical frameworks will continue to improve their ability
to describe software actually used in practice.
Those who feel that theory should lead the way are referred to the
inimitable Peter Denning, who said of operating systems:
``Theory follows practice''~\cite{Denning:2015:POF:2830903.2830904},
or to the eminent Tony Hoare, who said of the whole of engineering:
``In all branches of engineering science, the engineering starts before
the science; indeed, without the early products of engineering, there
would be nothing for the scientist to
study!''~\cite{RichardMorris2007TonyHoareInterview}.
Of course, once an appropriate body of theory becomes available, it is
wise to make use of it.
However, note well that the first \emph{appropriate} body of theory
is often one thing and the first \emph{proposed} body of theory quite
another.
\QuickQuiz{
It seems like the various members of the NBS hierarchy are
rather useless.
So why bother with them at all???
}\QuickQuizAnswer{
One advantage of the members of the NBS hierarchy is that they
are reasonably simple to define and use from a theoretical viewpoint.
We can hope that work done in the NBS arena will help lay the
groundwork for analysis of real-world forward-progress guarantees
for concurrent real-time programs.
However, as of 2022 it appears that trace-based methodologies
are in the lead~\cite{DanielBristot2019RTtrace}.
So why bother learning about NBS at all?
Because a great many people know of it, and are vaguely aware that
it is somehow related to real-time computing.
Their response to your carefully designed real-time constraints
might well be of the form ``Bah, just use wait-free algorithms!''.
In the all-too-common case where they are very convincing to your
management, you will need to understand NBS in order to bring
the discussion back to reality.
I hope that this section has provided you with the required
depth of understanding.
Another thing to note is that learning about the NBS hierarchy
is probably no more harmful than learning about transfinite
numbers or the computational-complexity hierarchy.
In all three cases, it is important to avoid over-applying
the theory.
Which is in and of itself good practice!
}\QuickQuizEnd
Proponents of NBS algorithms sometimes call out real-time computing as
an important NBS beneficiary.
The next section looks more deeply at the forward-progress needs of
real-time systems.
\input{advsync/rt}
\QuickQuizAnswersChp{qqzadvsync}