blob: adcddcb53e65210297902c402564e887d321b321 [file] [log] [blame]
% defer/whichtochoose.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Which to Choose?}
\label{sec:defer:Which to Choose?}
%
\epigraph{Choose always the way that seems the best, however rough it
may be; custom will soon render it easy and agreeable.}
{Pythagoras}
\Cref{sec:defer:Which to Choose? (Overview)}
provides a high-level overview and then
\cref{sec:defer:Which to Choose? (Details)}
provides a more detailed view
of the differences between the deferred-processing techniques presented
in this chapter.
This discussion assumes a linked data structure that is large enough
that readers do not hold references from one traversal to another,
and where elements might be added to and removed from the structure
at any location and at any time.
\Cref{sec:defer:Which to Choose? (Production Use)}
then points out a few publicly visible production uses of
\IXpl{hazard pointer}, sequence locking, and RCU\@.
This discussion should help you to make an informed choice between
these techniques.
\subsection{Which to Choose?
(Overview)}
\label{sec:defer:Which to Choose? (Overview)}
\begin{table*}
\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.25}
\footnotesize
\centering\OneColumnHSpace{-.3in}
\ebresizewidth{
\begin{tabularx}{5.3in}{>{\raggedright\arraybackslash}p{1.1in}
>{\raggedright\arraybackslash}p{1.0in}
>{\raggedright\arraybackslash}X
>{\raggedright\arraybackslash}X
>{\raggedright\arraybackslash}p{.9in}}
\toprule
Property
& Reference Counting
& Hazard Pointers
& Sequence Locks
& RCU \\
% RC HP SL RCU \\
\midrule
Readers
& Slow and unscalable
& Fast and scalable
& Fast and scalable
& Fast and scalable \\
Memory Overhead
& Counter per object
& Pointer per reader per object
& No protection
& None \\
Duration of Protection
& Can be long
& Can be long
& No protection
& User must bound duration \\
Need for Traversal Retries
& If object deleted
& If object deleted
& If any update
& Never \\
Reclamation Timing
& Immediate
& Batching delays
& N/A
& Pre-existing readers done plus
batching delays \\
\bottomrule
\end{tabularx}
}
\caption{Which Deferred Technique to Choose?
(Overview)}
\label{tab:defer:Which Deferred Technique to Choose? (Overview)}
\end{table*}
\Cref{tab:defer:Which Deferred Technique to Choose? (Overview)}
shows a few high-level properties that distinguish the deferred-reclamation
techniques from one another.
The ``Readers'' row summarizes the results presented in
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
which shows that all but \IXalt{reference counting}{reference count}
enjoy reasonably fast and scalable readers.
The ``Memory Overhead'' row evaluates each technique's need
for external storage with which to record reader protection.
RCU relies on quiescent states, and thus needs no storage to represent
readers, whether within or outside of the object.
Reference counting can use a single integer within each object in the
structure, and no additional storage is required.
Hazard pointers require external-to-object pointers be provisioned,
and that there be sufficient pointers for each CPU or thread to
track all the objects being referenced at any given time.
Given that most hazard-pointer-based traversals require only a few
hazard pointers, this is not normally a problem in practice.
Of course, sequence locks provides no pointer-traversal protection,
which is why it is normally used on static data.
\QuickQuiz{
Why can't users dynamically allocate the hazard pointers as they
are needed?
}\QuickQuizAnswer{
They can, but at the expense of additional reader-traversal
overhead and, in some environments, the need to handle
memory-allocation failure.
}\QuickQuizEnd
The ``Duration of Protection'' describes constraints (if any) on how
long a period of time a user may protect a given object.
Reference counting and hazard pointers can both protect objects for
extended time periods with no untoward side effects, but
maintaining an RCU reference to even one object prevents all other RCU
from being freed.
RCU readers must therefore be relatively short in order to avoid running
the system out of memory, with special-purpose implementations such
as SRCU, Tasks RCU, and Tasks Trace RCU being exceptions to this rule.
Again, sequence locks provide no pointer-traversal protection,
which is why it is normally used on static data.
The ``Need for Traversal Retries'' row tells whether a new reference to
a given object may be acquired unconditionally, as it can with RCU, or
whether the reference acquisition can fail, resulting in a retry
operation, which is the case for reference counting, hazard pointers,
and sequence locks.
In the case of reference counting and hazard pointers, retries are only
required if an attempt to acquire a reference to a given object while
that object is in the process of being deleted, a topic covered in more
detail in the next section.
Sequence locking must of course retry its critical section should it
run concurrently with any update.
\QuickQuiz{
But don't Linux-kernel \co{kref} reference counters allow
guaranteed unconditional reference acquisition?
}\QuickQuizAnswer{
Yes they do, but the guarantee only applies unconditionally
in cases where a reference is already held.
With this in mind, please review the paragraph at the beginning of
\cref{sec:defer:Which to Choose?}, especially the part
saying ``large enough that readers do not hold references from
one traversal to another''.
}\QuickQuizEnd
The ``Reclamation Timing'' gives the minimum delay from the time that
the last reader finishes with a removed object to the time that this
removed object may be freed.
Reference counting is the only technique capable of freeing immediately
after the last reader finishes.
This advantage is of course the flip side of the big reference-counting
disadvantage, that its readers are slow and unscalable.
In theory, hazard pointers could also reclaim immediately, but in practice
production-quality hazard-pointers implementations use batching to amortize
the overhead of scanning the hazard pointers over many updates, and this
batching incurs further delays.
RCU must wait for all pre-existing readers to complete, regardless
of whether or not those readers are in any way related to the newly
removed object, and then, as with hazard pointers, production-quality
RCU implementations incur additional delays due to batching.
Of course, different rows will have different levels of importance in
different situations.
For example, if your current code is having read-side scalability problems
with hazard pointers, then it does not matter that hazard pointers can require
retrying reference acquisition because your current code already handles
this.
Similarly, if response-time considerations already limit the duration
of reader traversals, as is often the case in kernels and low-level
applications, then it does not matter that RCU has duration-limit
requirements because your code already meets them.
In the same vein, if readers must already write to the objects that they
are traversing, the read-side overhead of reference counters might
not be so important.
Of course, if the data to be protected is in statically allocated variables,
then sequence locking's inability to protect pointers is irrelevant.
Finally, there is some work on dynamically switching between hazard
pointers and RCU based on dynamic sampling of
delays~\cite{Balmau:2016:FRM:2935764.2935790}.
This defers the choice between hazard pointers and RCU to runtime,
and delegates responsibility for the decision to the software.
Nevertheless, this table should be of great help when choosing between
these techniques.
But those wishing more detail should continue on to the next section.
\subsection{Which to Choose?
(Details)}
\label{sec:defer:Which to Choose? (Details)}
\begin{table*}
\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.25}
\footnotesize
\centering\OneColumnHSpace{-.3in}
\ebresizewidth{
\begin{tabularx}{5.3in}{>{\raggedright\arraybackslash}p{1.1in}
>{\raggedright\arraybackslash}p{1.2in}
>{\raggedright\arraybackslash}X
>{\raggedright\arraybackslash}X
>{\raggedright\arraybackslash}p{.9in}}
\toprule
Property
& Reference Counting
& Hazard Pointers
& Sequence Locks
& RCU \\
% RC HP SL RCU \\
\midrule
Existence Guarantees
& Complex
& Yes
& No
& Yes \\
Updates and Readers Progress Concurrently
& Yes
& Yes
& No
& Yes \\
Contention Among Readers
& High
& None
& None
& None \\
Reader Per\-/Critical\-/Section Overhead
& N/A
& N/A
& Two \tco{smp_mb()}
& Ranges from none to two
\tco{smp_mb()} \\
Reader Per-Object Traversal Overhead
& Read-modify-write atomic operations, memory\-/barrier
instructions, and cache misses
& \tco{smp_mb()}\parnote[*]{This \co{smp_mb()} can be
downgraded to a compiler \co{barrier()} by using
the Linux-kernel \co{membarrier()} system call.}
& None, but unsafe
& None (volatile accesses) \\
Reader Forward Progress Guarantee
& Lock free
& Lock free
& Blocking
& Bounded wait free \\
Reader Reference Acquisition
& Can fail (conditional)
& Can fail (conditional)
& Unsafe
& Cannot fail (unconditional) \\
Memory Footprint
& Bounded
& Bounded
& Bounded
& Unbounded \\
Reclamation Forward Progress
& Lock free
& Lock free
& N/A
& Blocking \\
Automatic Reclamation
& Yes
& Use Case
& N/A
& Use Case \\
Lines of Code
& 94
& 79
& 79
& 73 \\
\bottomrule
\end{tabularx}
}
\begin{minipage}{\onecolumntextwidth}
\vspace*{1ex}
\parnotes
\end{minipage}
\caption{Which Deferred Technique to Choose?
(Details)}
\label{tab:defer:Which Deferred Technique to Choose? (Details)}
\end{table*}
\Cref{tab:defer:Which Deferred Technique to Choose? (Details)}
provides more-detailed rules of thumb that can help you choose among the
four deferred-processing techniques presented in this chapter.
As shown in the ``Existence Guarantee'' row,
if you need \IXpl{existence guarantee} for linked
data elements, you must use reference counting, hazard pointers, or RCU\@.
Sequence locks do not provide existence guarantees, instead providing
detection of updates, retrying any read-side critical sections
that do encounter an update.
Of course, as shown in the ``Updates and Readers Progress Concurrently''
row, this detection of updates implies
that sequence locking does not permit updaters and readers to make forward
progress concurrently.
After all, preventing such forward progress is the whole point of using
sequence locking in the first place!
This situation points the way to using sequence locking in conjunction
with reference counting, hazard pointers, or RCU in order to provide
both existence guarantees and update detection.
In fact, the Linux kernel combines RCU and sequence locking in
this manner during pathname lookup.
The ``Contention Among Readers'', ``Reader Per-Critical-Section Overhead'',
and ``Reader Per-Object Traversal Overhead'' rows give a rough sense of
the read-side overhead of these techniques.
The overhead of reference counting can be quite large, with
contention among readers along with a fully ordered read-modify-write
atomic operation required for each and every object traversed.
Hazard pointers incur the overhead of a \IX{memory barrier}
for each data element
traversed, and sequence locks incur the overhead of a pair of memory barriers
for each attempt to execute the critical section.
The overhead of RCU implementations vary from nothing to that of a pair of
memory barriers for each read-side critical section, thus providing RCU
with the best performance, particularly for read-side critical sections
that traverse many data elements.
Of course, the read-side overhead of all deferred-processing variants can
be reduced by batching, so that each read-side operation covers more data.
\QuickQuiz{
But didn't the answer to one of the quick quizzes in
\cref{sec:defer:Hazard Pointers}
say that pairwise asymmetric barriers could eliminate the
read-side \co{smp_mb()} from hazard pointers?
}\QuickQuizAnswer{
Yes, it did.
However, doing this could be argued to change hazard-pointers
``Reclamation Forward Progress'' row (discussed later) from
lock-free to blocking because a CPU spinning with interrupts
disabled in the kernel would prevent the update-side portion of
the asymmetric barrier from completing.
In the Linux kernel, such blocking could in theory be prevented
by building the kernel with \co{CONFIG_NO_HZ_FULL}, designating
the relevant CPUs as \co{nohz_full} at boot time, ensuring that
only one thread was ever runnable on a given CPU at a given
time, and avoiding ever calling into the kernel.
Alternatively, you could ensure that the kernel was free of any
bugs that might cause CPUs to spin with interrupts disabled.
Given that CPUs spinning in the Linux kernel with interrupts
disabled seems to be rather rare, one might counter-argue that
asymmetric-barrier hazard-pointer updates are non-blocking
in practice, if not in theory.
}\QuickQuizEnd
The ``Reader Forward Progress Guarantee'' row shows that only RCU
has a \IXalth{bounded wait-free}{bounded}{wait free}
\IX{forward-progress guarantee}, which means that
it can carry out a finite traversal by executing a bounded number of
instructions.
The ``Reader Reference Acquisition'' row indicates that only RCU is
capable of unconditionally acquiring references.
The entry for sequence locks is ``Unsafe'' because, again, sequence locks
detect updates rather than acquiring references.
Reference counting and hazard pointers both require that traversals be
restarted from the beginning if a given acquisition fails.
To see this, consider a linked list containing objects~A, B, C, and~D,
in that order, and the following series of events:
\begin{enumerate}
\item A reader acquires a reference to object~B.
\item An updater removes object~B, but refrains from freeing it because
the reader holds a reference.
The list now contains objects~A, C, and~D, and
object~B's \co{->next} pointer is set to \co{HAZPTR_POISON}.
\item The updater removes object~C, so that the list now contains
objects~A and~D\@.
Because there is no reference to object~C, it is immediately freed.
\item The reader tries to advance to the successor of the object
following the now-removed object~B, but the poisoned
\co{->next} pointer prevents this.
Which is a good thing, because object~B's \co{->next} pointer
would otherwise point to the freelist.
\item The reader must therefore restart its traversal from the head
of the list.
\end{enumerate}
Thus, when failing to acquire a reference, a hazard-pointer or
reference-counter traversal must restart that traversal from the
beginning.
In the case of nested linked data structures, for example, a
tree containing linked lists, the traversal must be restarted from
the outermost data structure.
This situation gives RCU a significant ease-of-use advantage.
However, RCU's ease-of-use advantage does not come
for free, as can be seen in the ``Memory Footprint'' row.
RCU's support of unconditional reference acquisition means that
it must avoid freeing any object reachable by a given
RCU reader until that reader completes.
RCU therefore has an unbounded memory footprint, at least unless updates
are throttled.
In contrast, reference counting and hazard pointers need to retain only
those data elements actually referenced by concurrent readers.
This tension between memory footprint and acquisition
failures is sometimes resolved within the Linux kernel by combining use
of RCU and reference counters.
RCU is used for short-lived references, which means that RCU read-side
critical sections can be short.
These short RCU read-side critical sections in turn mean that the corresponding
RCU \IXpl{grace period} can also be short, which limits the memory footprint.
For the few data elements that need longer-lived references, reference
counting is used.
This means that the complexity of reference-acquisition failure only
needs to be dealt with for those few data elements:
The bulk of the reference acquisitions are unconditional, courtesy of RCU\@.
See \cref{sec:together:Refurbish Reference Counting}
for more information on combining reference counting with other
synchronization mechanisms.
The ``Reclamation Forward Progress'' row shows that hazard pointers
can provide non-blocking updates~\cite{MagedMichael04a,HerlihyLM02}.
Reference counting might or might not, depending on the implementation.
However, sequence locking cannot provide non-blocking updates, courtesy
of its update-side lock.
RCU updaters must wait on readers, which also rules out fully non-blocking
updates.
However, there are situations in which the only blocking operation is
a wait to free memory, which results in a situation that, for many
purposes, is as good as non-blocking~\cite{MathieuDesnoyers2012URCU}.
As shown in the ``Automatic Reclamation'' row, only reference
counting can automate freeing of memory, and even then only
for non-cyclic data structures.
Certain use cases for hazard pointers and RCU can provide automatic
reclamation using \emph{link counts}, which can be thought of as
reference counts, but applying only to incoming links from other
parts of the data structure~\cite{MagedMichael2018FollyHazptr}.
Finally, the ``Lines of Code'' row shows the size of the Pre-BSD
Routing Table implementations, giving a rough idea of relative ease of use.
That said, it is important to note that the reference-counting and
sequence-locking implementations are buggy, and that a correct
reference-counting implementation is considerably
more complex~\cite{Valois95a,MagedMichael95a}.
For its part, a correct sequence-locking implementation requires
the addition of some other synchronization mechanism, for example,
hazard pointers or RCU, so that sequence locking detects concurrent
updates and the other mechanism provides safe reference acquisition.
As more experience is gained using these techniques, both separately
and in combination, the rules of thumb laid out in this section will
need to be refined.
However, this section does reflect the current state of the art.
\subsection{Which to Choose?
(Production Use)}
\label{sec:defer:Which to Choose? (Production Use)}
This section points out a few publicly visible production uses of
hazard pointers, sequence locking, and RCU\@.
Reference counting is omitted, not because it is unimportant, but rather
because it is not only used pervasively, but heavily documented in textbooks
going back a half century.
One of the hoped-for benefits of listing production uses of these other
techniques is to provide examples to study---or to find bugs in, as
the case may be.\footnote{
Kudos to Mathias Stearn, Matt Wilson, David Goldblatt, LiveJournal
user fanf, Nadav Har'El, Avi Kivity, Dmitry Vyukov, Raul Guitterez
S., Twitter user @peo3, Paolo Bonzini, and Thomas Monjalon for
locating a great many of these use cases.}
\subsubsection{Production Uses of Hazard Pointers}
\label{sec:defer:Production Uses of Hazard Pointers}
In 2010, Keith Bostic added a variant of hazard pointers to
WiredTiger~\cite{KeithBostic2010WiredTigerhazptr}.
MongoDB 3.0, released in 2015, included WiredTiger and thus hazard pointers.
In 2011, Samy Al Bahra added hazard pointers to the Concurrency Kit
library~\cite{SamyAlBahra2011ckhp}.
In 2014, Maxim Khizhinsky added hazard pointers to
libcds~\cite{MaximKhizhinsky2014libcdsHazptr}.
In 2015, David Gwynne introduced shared reference pointers, a form
of hazard pointers, to OpenBSD~\cite{DavidGwynne2015srp}.
In 2017--2018, the Rust-language
\co{arc-swap}~\cite{MichalVaner2018arc-swapHazptr} and
\co{conc}~\cite{crates.io.user.ticki2017concHazptr}
crates rolled their own implementations of hazard pointers.
In 2018, Maged Michael added hazard pointers to Facebook's Folly
library~\cite{MagedMichael2018FollyHazptr}, where it is used heavily.
\subsubsection{Production Uses of Sequence Locking}
\label{sec:defer:Production Uses of Sequence Locking}
The Linux kernel added sequence locking to v2.5.60 in
2003~\cite{JonathanCorbet2003seqlock}, having been generalized from
an ad-hoc technique used in x86's implementation of the
\co{gettimeofday()} system call.
In 2011, Samy Al Bahra added sequence locking to the Concurrency Kit
library~\cite{SamyAlBahra2011ckseqlock}.
Paolo Bonzini added a simple sequence-lock to the QEMU emulator in
2013~\cite{PaoloBonzini2013QEMUseqlock}.
Alexis Menard abstracted a sequence-lock implementation in Chromium
in 2016~\cite{AlexisMenard2016ChromiumSeqLock}.
A simple sequence locking implementation was added to \co{jemalloc()}
in 2018~\cite{DavidGoldblatt2018seqlock}.
The eigen library also has a special-purpose queue that is managed by
a mechanism resembling sequence locking.
\subsubsection{Production Uses of RCU}
\label{sec:defer:Production Uses of RCU}
IBM's VM/XA is adopted passive serialization, a mechanism similar to
RCU, some time in the 1980s~\cite{Hennessy89}.
DYNIX/ptx adopted RCU in 1993~\cite{McKenney98,Slingwine95}.
The Linux kernel adopted Dipankar Sarma's implementation of RCU in
2002~\cite{Torvalds2.5.43}.
The userspace RCU project started in 2009~\cite{MathieuDesnoyers2009URCU}.
The Knot DNS project started using the userspace RCU
library in 2010~\cite{LubosSlovak2010KnotDNSRCU}.
That same year, the OSv kernel added an RCU
implementation~\cite{AviKivity2013OSvRCU},
later adding an RCU-protected linked list~\cite{AviKivity2013OSvRCUlist}
and an RCU-protected hash table~\cite{AviKivity2013OSvRCUhash}.
In 2011, Samy Al Bahra added epochs
(a form of RCU~\cite{UCAM-CL-TR-579,KeirFraser2007withoutLocks})
to the Concurrency Kit
library~\cite{SamyAlBahra2011ckEpoch}.
NetBSD began using the aforementioned passive serialization with v6.0 in
2012~\cite{NetBSD2012pserialize}.
Among other things, passive serialization is used in
NetBSD packet filter (NPF)~\cite{MindaugasRasiukevicius2014NPFRCU}.
Paolo Bonzini added RCU support to the QEMU emulator in 2015 via a
friendly fork of the userspace RCU
library~\cite{MikeDay2013RCUqemu,PaoloBonzini2013QEMURCU}.
In 2015, Maxim Khizhinsky added RCU to
libcds~\cite{MaxKhiszinsky2015C++RCU}.
Mindaugas Rasiukevicius implemented libqsbr in 2016, which features
\IXacr{qsbr} and \IXacrf{ebr}~\cite{MindaugasRasiukevicius2016libqsbr},
both of which are types of implementations of RCU\@.
Sheth et al.~\cite{HarshalSheth2016goRCU}
demonstrated the value of leveraging Go's garbage
collector to provide RCU-like functionality, and
the Go programming language provides a \co{Value} type that can
provide this functionality.\footnote{
See \url{https://golang.org/pkg/sync/atomic/\#Value}, particularly
the ``Example (ReadMostly)''.}
Matt Klein describes an RCU-like mechanism that is used in the Envoy
Proxy~\cite{MattKlein2017EnvoyRCU}.
Honnappa Nagarahalli added an RCU library to the Data Plane Development
Kit (DPDK) in 2018~\cite{HonnappaNagarahalli2018dpdkRCU}.
Stjepan Glavina merged an epoch-based RCU implementation into the
\co{crossbeam} set of concurrency-support ``crates'' for the Rust
language~\cite{StjepanGlavina2018RustRCU}.
Jason Donenfeld produced an RCU implementations as part of his port of
WireGuard to Windows~NT kernel~\cite{JasonDonenfeld2021:WindowsNTwireguardRCU}.
Finally, any garbage-collected concurrent language (not just Go!\@) gets
the update side of an RCU implementation at zero incremental cost.
\subsubsection{Summary of Production Uses}
\label{sec:defer:Summary of Production Uses}
Perhaps the time will come when sequence locking, hazard pointers, and
RCU are all as heavily used and as well known as are reference counters.
Until that time comes, the current production uses of these mechanisms
should help guide the choice of mechanism as well as showing how best
to apply each of them.
And with that, we have uncovered the last of the mysteries put forth on
\cpageref{sec:defer:Mysteries Which to choose}.
The next section discusses updates, a ticklish issue for many of the
read-mostly mechanisms described in this chapter.