% advsync/advsync.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{sec:advsync:Advanced Synchronization}{Advanced Synchronization}{qqzadvsync}
%
\Epigraph{If a little knowledge is a dangerous thing, just think what
you could do with a lot of knowledge!}{\emph{Unknown}}
This chapter covers synchronization techniques used for lockless
algorithms and parallel real-time systems.
Although lockless algorithms can be quite helpful when faced with
extreme requirements, they are no panacea.
For example, as noted at the end of \cref{chp:Counting},
you should thoroughly apply partitioning, batching, and
well-tested packaged weak APIs
(see \cref{chp:Data Ownership,chp:Deferred Processing})
before even thinking about lockless algorithms.
But after doing all that, you still might find yourself needing the
advanced techniques described in this chapter.
To that end,
\cref{sec:advsync:Avoiding Locks}
summarizes techniques used thus far for avoiding locks and
\cref{sec:advsync:Non-Blocking Synchronization}
gives a brief overview of non-blocking synchronization.
Memory ordering is also quite important, but it warrants its own
\cref{chp:Advanced Synchronization: Memory Ordering}.
The second form of advanced synchronization provides the stronger
forward-progress guarantees needed for parallel real-time computing,
which is the topic of
\cref{sec:advsync:Parallel Real-Time Computing}.
\section{Avoiding Locks}
\label{sec:advsync:Avoiding Locks}
%
\epigraph{We are confronted with insurmountable opportunities.}
{\emph{Walt Kelly}}
Although locking is the workhorse of parallelism in production, in
many situations performance, scalability, and real-time response can
all be greatly improved through use of lockless techniques.
A particularly impressive example of such lockless techniques is
the statistical-counters algorithm described in
\cref{sec:count:Statistical Counters},
which avoids not only locks, but also atomic operations, memory barriers,
and even cache misses for counter increments.
Other examples we have covered include:
\begin{enumerate}
\item The fastpaths through a number of other counting algorithms
in \cref{chp:Counting}.
\item The fastpath through resource allocator caches in
\cref{sec:SMPdesign:Resource Allocator Caches}.
\item The maze solver in \cref{sec:SMPdesign:Beyond Partitioning}.
\item The data-ownership techniques in \cref{chp:Data Ownership}.
\item The reference-counting, hazard-pointer, and RCU techniques
in \cref{chp:Deferred Processing}.
\item The lookup code paths in \cref{chp:Data Structures}.
\item Many of the techniques in \cref{chp:Putting It All Together}.
\end{enumerate}
In short, lockless techniques are quite useful and are heavily used.
However, it is best if lockless techniques are hidden behind
well-defined APIs, such as \co{inc_count()}, \co{memblock_alloc()},
and \co{rcu_read_lock()}.
The reason for this is that undisciplined use of lockless techniques
is a good way to create difficult bugs.
If you believe that finding and fixing such bugs is easier than avoiding
them, please re-read
\cref{chp:Validation,chp:Formal Verification}.
\section{Non-Blocking Synchronization}
\label{sec:advsync:Non-Blocking Synchronization}
%
\epigraph{Never worry about theory as long as the machinery does what
it's supposed to do.}
{\emph{Robert A. Heinlein}}
The term \emph{non-blocking synchronization} (NBS)~\cite{MauriceHerlihy90a}
describes seven classes of linearizable algorithms with differing
\emph{forward-progress guarantees}~\cite{DanAlitarh2013PracticalProgress},
which are as follows:
\begin{enumerate}
\item \emph{Bounded wait-free synchronization}: Every thread
will make progress within
a specific finite period of time~\cite{Herlihy91}.
This level is widely considered to be unachievable, which might be why
Alistarh et al.\ omitted it~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{Wait-free synchronization}: Every thread will make progress
in finite time~\cite{Herlihy93}.
\item \emph{Lock-free synchronization}: At least one thread will
make progress in finite time~\cite{Herlihy93}.
\item \emph{Obstruction-free synchronization}: Every thread will
make progress in finite time in the absence of
contention~\cite{HerlihyLM03}.
\item \emph{Clash-free synchronization}: At least one thread will
make progress in finite time in the absence of
contention~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{Starvation-free synchronization}: Every thread will
make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\item \emph{Deadlock-free synchronization}: At least one thread will
make progress in finite time in the absence of
failures~\cite{DanAlitarh2013PracticalProgress}.
\end{enumerate}
NBS classes~1, 2, and~3 were first formulated in the early 1990s,
class~4 was first formulated in the early 2000s,
and class~5 was first formulated in 2013.
The final two classes have seen informal use for a great many decades,
but were reformulated in 2013.
In theory, any parallel algorithm can be cast into wait-free form,
but only a relatively small subset of NBS algorithms is
in common use.
A few of these are listed in the following section.
\subsection{Simple NBS}
\label{sec:advsync:Simple NBS}
Perhaps the simplest NBS algorithm is atomic update of an integer
counter using fetch-and-add (\co{atomic_add_return()}) primitives.
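For example, here is a minimal sketch of such a counter, assuming
GCC-style \co{__atomic} builtins in place of the Linux kernel's
\co{atomic_add_return()}:

\begin{VerbatimL}
static unsigned long counter;

/* One fetch-and-add, no retry loop. */
static inline unsigned long inc_counter(void)
{
	return __atomic_add_fetch(&counter, 1,
	                          __ATOMIC_RELAXED);
}
\end{VerbatimL}

On architectures providing a true fetch-and-add instruction, every
invocation completes in a bounded number of instructions, making this
counter wait-free; on LL/SC architectures, the underlying retry loop
instead provides lock-free behavior.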
\subsubsection{NBS Sets}
\label{sec:advsync:NBS Sets}
Another simple NBS algorithm implements a set of integers in an array.
Here the array index indicates a value that might be a member of the set
and the array element indicates whether or not that value actually is
a set member.
The linearizability criterion for NBS algorithms requires that reads from
and updates to the array either use atomic instructions or be accompanied
by memory barriers, but in the not-uncommon case where linearizability
is not important, simple volatile loads and stores suffice, for example,
using \co{READ_ONCE()} and \co{WRITE_ONCE()}.
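For example, here is a minimal sketch of such an array-based set,
assuming Linux-kernel-style \co{READ_ONCE()} and \co{WRITE_ONCE()}
and forgoing linearizability:

\begin{VerbatimL}
#define SET_SIZE 1024

/* Index is the candidate value. */
static bool set_member[SET_SIZE];

static inline bool set_contains(int v)
{
	return READ_ONCE(set_member[v]);
}

static inline void set_add(int v)
{
	WRITE_ONCE(set_member[v], true);
}

static inline void set_del(int v)
{
	WRITE_ONCE(set_member[v], false);
}
\end{VerbatimL}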
An NBS set may also be implemented using a bitmap, where each value that
might be a member of the set corresponds to one bit.
Reads and updates must normally be carried out via atomic bit-manipulation
instructions, although compare-and-swap (\co{cmpxchg()} or CAS)
instructions can also be used.
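A corresponding bitmap sketch, again assuming GCC-style \co{__atomic}
builtins standing in for the kernel's atomic bit-manipulation
primitives, might look like this:

\begin{VerbatimL}
#define SET_BITS 1024
#define BITS_PER_WORD (8 * sizeof(unsigned long))

static unsigned long set_bm[SET_BITS / BITS_PER_WORD];

static inline void bm_add(int v)
{
	/* Atomic OR sets the bit for value v. */
	__atomic_fetch_or(&set_bm[v / BITS_PER_WORD],
	                  1UL << (v % BITS_PER_WORD),
	                  __ATOMIC_RELAXED);
}

static inline void bm_del(int v)
{
	/* Atomic AND clears the bit for value v. */
	__atomic_fetch_and(&set_bm[v / BITS_PER_WORD],
	                   ~(1UL << (v % BITS_PER_WORD)),
	                   __ATOMIC_RELAXED);
}

static inline bool bm_contains(int v)
{
	return !!(__atomic_load_n(&set_bm[v / BITS_PER_WORD],
	                          __ATOMIC_RELAXED) &
	          (1UL << (v % BITS_PER_WORD)));
}
\end{VerbatimL}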
\subsubsection{NBS Counters}
\label{sec:advsync:NBS Counters}
The statistical counters algorithm discussed in
\cref{sec:count:Statistical Counters}
can be considered to be bounded-wait-free, but only by using a cute
definitional trick in which the sum is considered to be approximate
rather than exact.\footnote{
Citation needed.
I heard of this trick verbally from Mark Moir.}
Given sufficiently wide error bounds that are a function of the length
of time that the \co{read_count()} function takes to sum the counters,
it is not possible to prove that any non-linearizable behavior occurred.
This definitely (if a bit artificially) classifies the statistical-counters
algorithm as bounded wait-free.
This algorithm is probably the most heavily used NBS algorithm in the
Linux kernel.
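The overall structure of this algorithm can be sketched as follows,
assuming a simple fixed array of per-thread counters in place of the
\co{DEFINE_PER_THREAD()} machinery used in
\cref{sec:count:Statistical Counters}:

\begin{VerbatimL}
#define NR_THREADS 128

/* Real implementations pad to avoid false sharing. */
static unsigned long counters[NR_THREADS];

static inline void inc_count(int tid)
{
	/* Only thread tid writes counters[tid]. */
	WRITE_ONCE(counters[tid],
	           READ_ONCE(counters[tid]) + 1);
}

/* Approximate sum: counters move while being read. */
static unsigned long read_count(void)
{
	int i;
	unsigned long sum = 0;

	for (i = 0; i < NR_THREADS; i++)
		sum += READ_ONCE(counters[i]);
	return sum;
}
\end{VerbatimL}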
\subsubsection{Half-NBS Queue}
\label{sec:advsync:Half-NBS Queue}
\begin{fcvref}[ln:count:NBS Enqueue Algorithm]
Another common NBS algorithm is the atomic queue where elements are
enqueued using an atomic exchange instruction~\cite{MagedMichael1993JPDC},
followed by a store into the \co{->next} pointer of the new element's
predecessor, as shown in \cref{lst:count:NBS Enqueue Algorithm},
which presents the userspace-RCU library
implementation~\cite{MathieuDesnoyers2009URCU}.
\Clnref{tail} updates the tail pointer to reference the new element while
returning a reference to its predecessor, which is stored in
local variable \co{old_tail}.
\Clnref{pred} then updates the predecessor's \co{->next} pointer to
reference the newly added element, and finally \clnref{ret}
returns an indication as to whether or not the queue was initially
empty.
\begin{listing}[tbp]
\begin{fcvlabel}[ln:count:NBS Enqueue Algorithm]
\begin{VerbatimL}[commandchars=\\\[\]]
static inline bool
___cds_wfcq_append(struct cds_wfcq_head *head,
struct cds_wfcq_tail *tail,
struct cds_wfcq_node *new_head,
struct cds_wfcq_node *new_tail)
{
struct cds_wfcq_node *old_tail;
old_tail = uatomic_xchg(&tail->p, new_tail); \lnlbl[tail]
CMM_STORE_SHARED(old_tail->next, new_head); \lnlbl[pred]
return old_tail != &head->node; \lnlbl[ret]
}
static inline bool
_cds_wfcq_enqueue(struct cds_wfcq_head *head,
struct cds_wfcq_tail *tail,
struct cds_wfcq_node *new_tail)
{
return ___cds_wfcq_append(head, tail,
new_tail, new_tail);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{NBS Enqueue Algorithm}
\label{lst:count:NBS Enqueue Algorithm}
\end{listing}
Although mutual exclusion is required to dequeue a single element
(so that dequeue is blocking), it is possible to carry out a non-blocking
removal of the entire contents of the queue.
What is not possible is to dequeue any given element in a non-blocking
manner: The enqueuer might have failed between \clnref{tail,pred} of the
listing, so that the element in question is only partially enqueued.
This results in a half-NBS algorithm where enqueues are NBS but
dequeues are blocking.
This algorithm is nevertheless heavily used in practice, in part because
most production software is not required to tolerate arbitrary fail-stop
errors.
\end{fcvref}
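For those wishing to experiment, here is a usage sketch based on the
userspace-RCU \path{urcu/wfcqueue.h} API\@; exact names and
initialization details might vary across library versions:

\begin{VerbatimL}
#include <urcu/wfcqueue.h>

struct my_elem {
	struct cds_wfcq_node node;
	int data;
};

static struct cds_wfcq_head qhead;
static struct cds_wfcq_tail qtail;

/* Call cds_wfcq_init(&qhead, &qtail) once at start. */
static void example_enqueue(struct my_elem *e)
{
	cds_wfcq_node_init(&e->node);
	/* Returns false if the queue was empty. */
	(void)cds_wfcq_enqueue(&qhead, &qtail, &e->node);
}
\end{VerbatimL}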
\subsubsection{NBS Stack}
\label{sec:advsync:NBS Stack}
\begin{fcvref}[ln:advsync:lifo_push:whole]
\Cref{lst:advsync:NBS Stack Algorithm}
shows the LIFO push algorithm, which boasts lock-free push and
bounded wait-free pop (\path{lifo-push.c}), forming an NBS stack.
The origins of this algorithm are unknown, but it was referred to in
a patent granted in 1975~\cite{PaulJBrown1975LIFOpush}.
This patent was filed in 1973, a few months before your editor
saw his first computer, which had but one CPU.
\begin{listing}[tbp]
\input{CodeSamples/advsync/lifo_push@whole.fcv}
\caption{NBS Stack Algorithm}
\label{lst:advsync:NBS Stack Algorithm}
\end{listing}
\Clnrefrange{struct:b}{struct:e} shows the \co{node_t} structure,
which contains an arbitrary value and a pointer to the next structure
on the stack, and
\clnref{top} shows the top-of-stack pointer.
The \co{list_push()} function spans \clnrefrange{push:b}{push:e}.
\Clnref{push:alloc} allocates a new node and
\clnref{push:initialize} initializes it.
\Clnref{push:next} initializes the newly allocated node's \co{->next}
pointer, and \clnref{push:cmpxchg} attempts to push it on the stack.
If \clnref{push:check} detects \co{cmpxchg()} failure, another pass
through the loop retries.
Otherwise, the new node has been successfully pushed, and this function
returns to its caller.
Note that \clnref{push:check} resolves races in which two concurrent
instances of \co{list_push()} attempt to push onto the stack.
The \co{cmpxchg()} will succeed for one and fail for the other,
causing the other to retry, thereby selecting an arbitrary order for
the two nodes on the stack.
The \co{list_pop_all()} function spans \clnrefrange{popall:b}{popall:e}.
The \co{xchg()} statement on \clnref{popall:xchg} atomically removes
all nodes on the stack, placing the head of the resulting list in local
variable \co{p} and setting \co{top} to \co{NULL}.
This atomic operation serializes concurrent calls to \co{list_pop_all()}:
One of them will get the list, and the other a \co{NULL} pointer, at
least assuming that there were no concurrent calls to \co{list_push()}.
An instance of \co{list_pop_all()} that obtains a non-empty list in
\co{p} processes this list in the loop spanning
\clnrefrange{popall:loop:b}{popall:loop:e}.
\Clnref{popall:next} prefetches the \co{->next} pointer,
\clnref{popall:foo} invokes the function referenced by \co{foo()} on the
current node,
\clnref{popall:free} frees the current node, and
\clnref{popall:pnext} sets up \co{p} for the next pass through the loop.
But suppose that a pair of \co{list_push()} instances run concurrently
with a \co{list_pop_all()} on a list initially containing a single
node~A.
Here is one way that this scenario might play out:
\begin{enumerate}
\item The first \co{list_push()} instance pushes a new node~B,
executing through \clnref{push:next}, having just stored
a pointer to node A into node~B's \co{->next} pointer.
\item The \co{list_pop_all()} instance runs to completion,
setting \co{top} to \co{NULL} and freeing node~A.
\item The second \co{list_push()} instance runs to completion,
pushing a new node~C, but happens to allocate the memory
that used to belong to node~A.
\item The first \co{list_push()} instance executes the \co{cmpxchg()}
on \clnref{push:cmpxchg}.
Because new node~C has the same address as the newly freed node~A,
this \co{cmpxchg()} succeeds and this \co{list_push()} instance
runs to completion.
\end{enumerate}
Note that both pushes and the popall all ran successfully despite the
reuse of node~A's memory.
This is an unusual property: Most data structures require protection
against what is often called the ABA problem.
But this property holds only for algorithms written in assembly
language.
The sad fact is that most languages (including C and C++) do not support
pointers to lifetime-ended objects, such as the pointer to the old node~A
contained in node~B's \co{->next} pointer.
In fact, compilers are within their rights to assume that if two pointers
(call them \co{p} and \co{q}) were returned from two different calls to
\co{malloc()}, then those pointers must not be equal.
Real compilers really will generate the constant \co{false} in
response to a \co{p==q} comparison.
This profoundly counter-intuitive behavior is termed \emph{lifetime-end
pointer zap}.
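The following fragment illustrates the hazard:
Because it uses \co{p} after \co{free()}, its behavior is undefined,
so the compiler may legitimately fold the comparison to constant
\co{false} even when both allocations occupy the same address:

\begin{VerbatimL}
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	void *p = malloc(16);

	free(p);              /* p's lifetime ends here.  */
	void *q = malloc(16); /* Might reuse p's memory.  */
	if (p == q)           /* May be folded to false!  */
		printf("Same address!\n");
	free(q);
	return 0;
}
\end{VerbatimL}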
Many concurrent applications avoid this problem by carefully hiding the
memory allocator from the compiler, thus preventing the compiler from
making inappropriate assumptions.
This obfuscatory approach currently works in practice, but might well
one day fall victim to increasingly aggressive optimizers.
There is work underway in both the C and C++ standards committees
to address this
problem~\cite{PaulEMcKenney2019PointerLifetimeEndZap,PaulEMcKenney2020PointerLifetimeEndZapCpp}.
In the meantime, please exercise great care when coding ABA-tolerant
algorithms.
\end{fcvref}
\QuickQuiz{
So why not ditch antique languages like C and C++ for something
more modern?
}\QuickQuizAnswer{
That won't help unless the proponents of those more-modern
languages are energetic enough to write their own compiler backends.
The usual practice of re-using existing backends also reuses
charming properties such as lifetime-end pointer zap.
}\QuickQuizEnd
\subsection{Applicability of NBS Benefits}
\label{sec:advsync:Applicability of NBS Benefits}
The most heavily cited NBS benefits stem from its forward-progress
guarantees, its tolerance of fail-stop bugs, and its linearizability.
Each of these is discussed in one of the following sections.
\subsubsection{NBS Forward Progress Guarantees}
\label{sec:advsync:NBS Forward Progress Guarantees}
NBS's forward-progress guarantees have caused many to suggest its use in
real-time systems, and NBS algorithms are in fact used in a great many
such systems.
However, it is important to note that forward-progress guarantees are
largely orthogonal to those that form the basis of real-time programming:
\begin{enumerate}
\item Real-time forward-progress guarantees usually have some
definite time associated with them, for example,
``scheduling latency must be less than 100 microseconds.''
In contrast, the most popular forms of NBS guarantee only
that progress will be made in finite time, with no definite
bound.
\item Real-time forward-progress guarantees are often
probabilistic, as in the soft-real-time guarantee that
``at least 99.9\,\% of the time, scheduling latency must
be less than 100 microseconds.''
In contrast, NBS's forward-progress
guarantees have traditionally been unconditional.
\item Real-time forward-progress guarantees are often conditioned on
environmental constraints, for example, only being honored:
(1)~for the highest-priority tasks,
(2)~when each CPU spends at least a certain fraction of its time idle,
or (3)~when I/O rates are below some specified maximum.
In contrast, NBS's forward-progress
guarantees are usually unconditional.\footnote{
As we will see below, some recent NBS work relaxes
this guarantee.}
\item An important component of a real-time program's environment
is the scheduler.
NBS algorithms assume a worst-case \emph{demonic scheduler}.
In contrast, real-time systems assume that the scheduler is
doing its level best to satisfy any scheduling constraints
it knows about, and, in the absence of such constraints,
its level best to honor process priorities and to provide
fair scheduling to processes of the same priority.
This assumption of a non-demonic scheduler allows real-time
programs to use simpler algorithms than those required for
NBS~\cite{DanAlitarh2013PracticalProgress,BjoernBrandenburgPhD}.
\item Real-time forward-progress guarantees usually apply only
in the absence of software bugs.
In contrast, most NBS guarantees apply even in the face of
fail-stop bugs.\footnote{
Again, some recent NBS work relaxes this guarantee.}
\item NBS forward-progress guarantee classes imply linearizability.
In contrast, real-time forward-progress guarantees are often
independent of ordering constraints such as linearizability.
\end{enumerate}
To reiterate, despite these differences, a number of NBS algorithms are
extremely useful in real-time programs.
\subsubsection{NBS Fail-Stop Tolerance}
\label{sec:advsync:NBS Fail-Stop Tolerance}
Of the classes of NBS algorithms, wait-free synchronization (bounded or
otherwise), lock-free synchronization, obstruction-free synchronization,
and clash-free synchronization guarantee forward progress even in the
presence of fail-stop bugs.
An example fail-stop bug would be a scheduler bug that causes a random
thread to be preempted indefinitely.
As we will see, this fail-stop-tolerant property can be useful, but the
fact is that composing a set of fail-stop-tolerant mechanisms does not
necessarily result in a fail-stop-tolerant system.
To see this, consider a system made up of a series of wait-free queues,
where an element is removed from one queue in the series, processed,
and then added to the next queue.
If a thread is preempted in the midst of a queuing operation, all is well
because the wait-free nature of the queue will guarantee forward progress.
But suppose that preemption occurs not during the queuing, but instead
during the between-queues processing.
In this case, the element being processed is lost because
the fail-stop-tolerant nature of the wait-free queues does not
extend to the code using those queues.
Nevertheless, there are a few applications where NBS's rather limited
fail-stop tolerance is useful.
For example, in some network-based or web applications, a fail-stop
event will eventually result in a retransmission, which will restart
any work that was lost due to the fail-stop event.
Systems running such applications can therefore be heavily loaded, even
to the point where the scheduler can no longer provide any reasonable
fairness guarantee.
In contrast, if a thread fail-stops while holding a lock, the application
will normally need to be restarted.
Nevertheless, NBS is not a panacea even within this restricted area,
due to the possibility of spurious retransmissions caused by pure
scheduling delays.
In some cases, it may be more efficient to reduce the load in order
to avoid queueing delays.
Reducing the load also improves the scheduler's ability to
provide fair access, which might in turn reduce or even eliminate the
apparent fail-stop events, thus also shedding the load imposed by retry
operations initiated by retransmissions.
\subsubsection{NBS Linearizability}
\label{sec:advsync:NBS Linearizability}
It is important to note that linearizability can be quite useful,
especially when analyzing concurrent code made up of strict locking
and fully ordered atomic operations.\footnote{
For example, the Linux kernel's value-returning atomic operations.}
Furthermore, this handling of fully ordered atomic operations
automatically covers simple NBS algorithms.
However, the linearization points of a complex NBS algorithm are often
buried deep within that algorithm, and thus not visible to users of
a library function implementing a part of such an algorithm.
Therefore, any claims that users benefit from the linearizability properties
of complex NBS algorithms should be regarded with deep
suspicion~\cite{AndreasHaas2012FIFOisnt}.
It is sometimes asserted that linearizability is necessary for developers
to produce proofs of correctness for their concurrent code.
However, such proofs are the exception rather than the rule, and
developers who do produce proofs often use modern proof techniques that
do not depend on linearizability.
Furthermore, these techniques frequently do not require a full
specification, given that developers often learn their specification
after the fact, one bug at a time.
A few such proof techniques were discussed in
\cref{chp:Formal Verification}.\footnote{
A memorable verbal discussion with an advocate of linearizability
resulted in this advocate being asked:
``So the reason linearizability is important is to rescue 1980s
proof techniques?''
The advocate immediately replied in the affirmative, then spent
some time disparaging a particular modern proof technique.
Oddly enough, that technique was one of those successfully
applied to Linux-kernel RCU.}
It is often asserted that linearizability maps well to sequential
specifications, which are said to be more natural than are concurrent
specifications~\cite{SergioRajsbaum2020HistoryLinearizability}.
But this assertion flies in the face of reality in the form of a highly
concurrent objective universe.
Therefore, nature would be expected to select for beings able to cope
with at least some level of concurrency, and anyone watching a team sport
or a person successfully overseeing a number of small children would be
hard-pressed to argue that nature has not done exactly that.
In addition, given that the teaching of sequential
computing is still believed to be somewhat of a black
art~\cite{ElizabethPatitsas2020GradesNotBimodal}, it is reasonable
to expect that teaching of concurrent computing is in a similar state
of disarray.
Therefore, focusing on only one proof technique is unlikely to be a
good way forward.
Again, please understand that linearizability is quite useful in many
situations.
Then again, so is that venerable tool, the hammer.
But there comes a point in the field of computing where one should put
down the hammer and pick up a keyboard.
Similarly, it appears that there are times when linearizability is not
the best tool for the job.
To their credit, there are some linearizability advocates who are aware
of some of its shortcomings~\cite{SergioRajsbaum2020HistoryLinearizability}.
There are also proposals to extend linearizability, for example,
interval-linearizability, which is intended to handle the common case
of operations that require non-zero time to
complete~\cite{10.1145/3266457}.
It remains to be seen whether these proposals will result in theories
able to handle modern concurrent software artifacts, especially given
that several of the proof techniques discussed in \cref{chp:Formal
Verification} already handle many modern concurrent software artifacts.
\subsection{NBS Discussion}
\label{sec:advsync:NBS Discussion}
It is possible to create fully non-blocking queues~\cite{MichaelScott96};
however, such queues are much more complex than the half-NBS algorithm
outlined above.
The lesson here is to carefully consider your actual requirements.
Relaxing irrelevant requirements can often result in great
improvements in simplicity, performance, and scalability.
Recent research points to another important way to relax requirements.
It turns out that systems providing fair scheduling can enjoy most
of the benefits of wait-free synchronization even when running
algorithms that provide only non-blocking
synchronization, both in theory~\cite{DanAlitarh2013PracticalProgress}
and in practice~\cite{SamyAlBahra2013NBS}.
Because a great many schedulers used in production do in fact
provide fairness,
the more-complex algorithms providing wait-free synchronization
usually provide no practical advantages over their simpler
and often faster non-blocking-synchronization counterparts.
Interestingly enough, fair scheduling is but one beneficial
constraint that is often respected in practice.
Other sets of constraints can permit blocking algorithms to
achieve deterministic real-time response.
For example, given fair locks that are granted to requesters in FIFO order at
a given priority level,
a method of avoiding priority inversion (such as priority
inheritance~\cite{Takada:1995:RSN:527074.828566,Cai-DongWang1996PrioInherLock}
or priority ceiling), a bounded number of threads,
bounded critical sections,
bounded load,
and avoidance of fail-stop bugs,
lock-based applications can provide deterministic
response times~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
This approach of course blurs the distinction between blocking and wait-free
synchronization, which is all to the good.
Hopefully, theoretical frameworks will continue to grow in their
ability to
describe how software is actually constructed in practice.
Those who feel that theory should lead the way are referred to the
inimitable Peter Denning, who said of operating systems:
``Theory follows practice''~\cite{Denning:2015:POF:2830903.2830904},
or to the eminent Tony Hoare, who said of the whole of engineering:
``In all branches of engineering science, the engineering starts before
the science; indeed, without the early products of engineering, there
would be nothing for the scientist to
study!''~\cite{RichardMorris2007TonyHoareInterview}
However, once an appropriate body of theory becomes available,\footnote{
Admittedly often much later than the first \emph{proposed}
body of theory.}
it is wise to make use of it.
\input{advsync/rt}
\QuickQuizAnswersChp{qqzadvsync}