defer/whichtochoose.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % defer/whichtochoose.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \section{Which to Choose?}
 \label{sec:defer:Which to Choose?}
 %
 \epigraph{Choose always the way that seems the best, however rough it
 	  may be; custom will soon render it easy and agreeable.}
 	  {Pythagoras}

 \Cref{sec:defer:Which to Choose? (Overview)}
 provides a high-level overview and then
 \cref{sec:defer:Which to Choose? (Details)}
 provides a more detailed view
 of the differences between the deferred-processing techniques presented
 in this chapter.
 This discussion assumes a linked data structure that is large enough
 that readers do not hold references from one traversal to another,
 and where elements might be added to and removed from the structure
 at any location and at any time.
 \Cref{sec:defer:Which to Choose? (Production Use)}
 then points out a few publicly visible production uses of
 \IXpl{hazard pointer}, sequence locking, and RCU\@.
 This discussion should help you to make an informed choice between
 these techniques.

 \subsection{Which to Choose?
 			      (Overview)}
 \label{sec:defer:Which to Choose? (Overview)}

 \begin{table*}
 \rowcolors{1}{}{lightgray}
 \renewcommand*{\arraystretch}{1.25}
 \footnotesize
 \centering\OneColumnHSpace{-.3in}
 \ebresizewidth{
 \begin{tabularx}{5.3in}{>{\raggedright\arraybackslash}p{1.1in}
     >{\raggedright\arraybackslash}p{1.0in}
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}p{.9in}}
 	\toprule
 	Property
 		& Reference Counting
 			& Hazard Pointers
 				& Sequence Locks
 					& RCU \\
 %		  RC	  HP	  SL	  RCU \\
 	\midrule
 	Readers
 		& Slow and unscalable
 			& Fast and scalable
 				& Fast and scalable
 					& Fast and scalable \\
 	Memory Overhead
 		& Counter per object
 			& Pointer per reader per object
 				& No protection
 					& None \\
 	Duration of Protection
 		& Can be long
 			& Can be long
 				& No protection
 					& User must bound duration \\
 	Need for Traversal Retries
 		& If object deleted
 			& If object deleted
 				& If any update
 					& Never \\
 	Reclamation Timing
 		& Immediate
 			& Batching delays
 				& N/A
 					& Pre-existing readers done plus
 					  batching delays \\
 	\bottomrule
 \end{tabularx}
 }
 \caption{Which Deferred Technique to Choose?
 					     (Overview)}
 \label{tab:defer:Which Deferred Technique to Choose? (Overview)}
 \end{table*}

 \Cref{tab:defer:Which Deferred Technique to Choose? (Overview)}
 shows a few high-level properties that distinguish the deferred-reclamation
 techniques from one another.

 The ``Readers'' row summarizes the results presented in
 \cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
 which shows that all but \IXalt{reference counting}{reference count}
 enjoy reasonably fast and scalable readers.

 The ``Memory Overhead'' row evaluates each technique's need
 for external storage with which to record reader protection.
 RCU relies on quiescent states, and thus needs no storage to represent
 readers, whether within or outside of the object.
 Reference counting can use a single integer within each object in the
 structure, and no additional storage is required.
 Hazard pointers require external-to-object pointers be provisioned,
 and that there be sufficient pointers for each CPU or thread to
 track all the objects being referenced at any given time.
 Given that most hazard-pointer-based traversals require only a few
 hazard pointers, this is not normally a problem in practice.
 Of course, sequence locks provides no pointer-traversal protection,
 which is why it is normally used on static data.

 \QuickQuiz{
 	Why can't users dynamically allocate the hazard pointers as they
 	are needed?
 }\QuickQuizAnswer{
 	They can, but at the expense of additional reader-traversal
 	overhead and, in some environments, the need to handle
 	memory-allocation failure.
 }\QuickQuizEnd

 The ``Duration of Protection'' describes constraints (if any) on how
 long a period of time a user may protect a given object.
 Reference counting and hazard pointers can both protect objects for
 extended time periods with no untoward side effects, but
 maintaining an RCU reference to even one object prevents all other RCU
 from being freed.
 RCU readers must therefore be relatively short in order to avoid running
 the system out of memory, with special-purpose implementations such
 as SRCU, Tasks RCU, and Tasks Trace RCU being exceptions to this rule.
 Again, sequence locks provide no pointer-traversal protection,
 which is why it is normally used on static data.

 The ``Need for Traversal Retries'' row tells whether a new reference to
 a given object may be acquired unconditionally, as it can with RCU, or
 whether the reference acquisition can fail, resulting in a retry
 operation, which is the case for reference counting, hazard pointers,
 and sequence locks.
 In the case of reference counting and hazard pointers, retries are only
 required if an attempt to acquire a reference to a given object while
 that object is in the process of being deleted, a topic covered in more
 detail in the next section.
 Sequence locking must of course retry its critical section should it
 run concurrently with any update.

 \QuickQuiz{
 	But don't Linux-kernel \co{kref} reference counters allow
 	guaranteed unconditional reference acquisition?
 }\QuickQuizAnswer{
 	Yes they do, but the guarantee only applies unconditionally
 	in cases where a reference is already held.
 	With this in mind, please review the paragraph at the beginning of
 	\cref{sec:defer:Which to Choose?}, especially the part
 	saying ``large enough that readers do not hold references from
 	one traversal to another''.
 }\QuickQuizEnd

 The ``Reclamation Timing'' gives the minimum delay from the time that
 the last reader finishes with a removed object to the time that this
 removed object may be freed.
 Reference counting is the only technique capable of freeing immediately
 after the last reader finishes.
 This advantage is of course the flip side of the big reference-counting
 disadvantage, that its readers are slow and unscalable.
 In theory, hazard pointers could also reclaim immediately, but in practice
 production-quality hazard-pointers implementations use batching to amortize
 the overhead of scanning the hazard pointers over many updates, and this
 batching incurs further delays.
 RCU must wait for all pre-existing readers to complete, regardless
 of whether or not those readers are in any way related to the newly
 removed object, and then, as with hazard pointers, production-quality
 RCU implementations incur additional delays due to batching.

 Of course, different rows will have different levels of importance in
 different situations.
 For example, if your current code is having read-side scalability problems
 with hazard pointers, then it does not matter that hazard pointers can require
 retrying reference acquisition because your current code already handles
 this.
 Similarly, if response-time considerations already limit the duration
 of reader traversals, as is often the case in kernels and low-level
 applications, then it does not matter that RCU has duration-limit
 requirements because your code already meets them.
 In the same vein, if readers must already write to the objects that they
 are traversing, the read-side overhead of reference counters might
 not be so important.
 Of course, if the data to be protected is in statically allocated variables,
 then sequence locking's inability to protect pointers is irrelevant.

 Finally, there is some work on dynamically switching between hazard
 pointers and RCU based on dynamic sampling of
 delays~\cite{Balmau:2016:FRM:2935764.2935790}.
 This defers the choice between hazard pointers and RCU to runtime,
 and delegates responsibility for the decision to the software.

 Nevertheless, this table should be of great help when choosing between
 these techniques.
 But those wishing more detail should continue on to the next section.

 \subsection{Which to Choose?
 			     (Details)}
 \label{sec:defer:Which to Choose? (Details)}

 \begin{table*}
 \rowcolors{1}{}{lightgray}
 \renewcommand*{\arraystretch}{1.25}
 \footnotesize
 \centering\OneColumnHSpace{-.3in}
 \ebresizewidth{
 \begin{tabularx}{5.3in}{>{\raggedright\arraybackslash}p{1.1in}
     >{\raggedright\arraybackslash}p{1.2in}
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}p{.9in}}
 	\toprule
 	Property
 		& Reference Counting
 			& Hazard Pointers
 				& Sequence Locks
 					& RCU \\
 %		  RC	  HP	  SL	  RCU \\
 	\midrule
 	Existence Guarantees
 		& Complex
 			& Yes
 				& No
 					& Yes \\
 	Updates and Readers Progress Concurrently
 		& Yes
 			& Yes
 				& No
 					& Yes \\
 	Contention Among Readers
 		& High
 			& None
 				& None
 					& None \\
 	Reader Per\-/Critical\-/Section Overhead
 		& N/A
 			& N/A
 				& Two \tco{smp_mb()}
 					& Ranges from none to two
 					  \tco{smp_mb()} \\
 	Reader Per-Object Traversal Overhead
 		& Read-modify-write atomic operations, memory\-/barrier
 		  instructions, and cache misses
 			& \tco{smp_mb()}\parnote[*]{This \co{smp_mb()} can be
 				downgraded to a compiler \co{barrier()} by using
 				the Linux-kernel \co{membarrier()} system call.}
 				& None, but unsafe
 					& None (volatile accesses) \\
 	Reader Forward Progress Guarantee
 		& Lock free
 			& Lock free
 				& Blocking
 					& Bounded wait free \\
 	Reader Reference Acquisition
 		& Can fail (conditional)
 			& Can fail (conditional)
 				& Unsafe
 					& Cannot fail (unconditional) \\
 	Memory Footprint
 		& Bounded
 			& Bounded
 				& Bounded
 					& Unbounded \\
 	Reclamation Forward Progress
 		& Lock free
 			& Lock free
 				& N/A
 					& Blocking \\
 	Automatic Reclamation
 		& Yes
 			& Use Case
 				& N/A
 					& Use Case \\
 	Lines of Code
 		& 94
 			& 79
 				& 79
 					& 73 \\
 	\bottomrule
 \end{tabularx}
 }
 \begin{minipage}{\onecolumntextwidth}
 	\vspace*{1ex}
 	\parnotes
 \end{minipage}
 \caption{Which Deferred Technique to Choose?
 					     (Details)}
 \label{tab:defer:Which Deferred Technique to Choose?  (Details)}
 \end{table*}

 \Cref{tab:defer:Which Deferred Technique to Choose? (Details)}
 provides more-detailed rules of thumb that can help you choose among the
 four deferred-processing techniques presented in this chapter.

 As shown in the ``Existence Guarantee'' row,
 if you need \IXpl{existence guarantee} for linked
 data elements, you must use reference counting, hazard pointers, or RCU\@.
 Sequence locks do not provide existence guarantees, instead providing
 detection of updates, retrying any read-side critical sections
 that do encounter an update.

 Of course, as shown in the ``Updates and Readers Progress Concurrently''
 row, this detection of updates implies
 that sequence locking does not permit updaters and readers to make forward
 progress concurrently.
 After all, preventing such forward progress is the whole point of using
 sequence locking in the first place!
 This situation points the way to using sequence locking in conjunction
 with reference counting, hazard pointers, or RCU in order to provide
 both existence guarantees and update detection.
 In fact, the Linux kernel combines RCU and sequence locking in
 this manner during pathname lookup.

 The ``Contention Among Readers'', ``Reader Per-Critical-Section Overhead'',
 and ``Reader Per-Object Traversal Overhead'' rows give a rough sense of
 the read-side overhead of these techniques.
 The overhead of reference counting can be quite large, with
 contention among readers along with a fully ordered read-modify-write
 atomic operation required for each and every object traversed.
 Hazard pointers incur the overhead of a \IX{memory barrier}
 for each data element
 traversed, and sequence locks incur the overhead of a pair of memory barriers
 for each attempt to execute the critical section.
 The overhead of RCU implementations vary from nothing to that of a pair of
 memory barriers for each read-side critical section, thus providing RCU
 with the best performance, particularly for read-side critical sections
 that traverse many data elements.
 Of course, the read-side overhead of all deferred-processing variants can
 be reduced by batching, so that each read-side operation covers more data.

 \QuickQuiz{
 	But didn't the answer to one of the quick quizzes in
 	\cref{sec:defer:Hazard Pointers}
 	say that pairwise asymmetric barriers could eliminate the
 	read-side \co{smp_mb()} from hazard pointers?
 }\QuickQuizAnswer{
 	Yes, it did.
 	However, doing this could be argued to change hazard-pointers
 	``Reclamation Forward Progress'' row (discussed later) from
 	lock-free to blocking because a CPU spinning with interrupts
 	disabled in the kernel would prevent the update-side portion of
 	the asymmetric barrier from completing.
 	In the Linux kernel, such blocking could in theory be prevented
 	by building the kernel with \co{CONFIG_NO_HZ_FULL}, designating
 	the relevant CPUs as \co{nohz_full} at boot time, ensuring that
 	only one thread was ever runnable on a given CPU at a given
 	time, and avoiding ever calling into the kernel.
 	Alternatively, you could ensure that the kernel was free of any
 	bugs that might cause CPUs to spin with interrupts disabled.

 	Given that CPUs spinning in the Linux kernel with interrupts
 	disabled seems to be rather rare, one might counter-argue that
 	asymmetric-barrier hazard-pointer updates are non-blocking
 	in practice, if not in theory.
 }\QuickQuizEnd

 The ``Reader Forward Progress Guarantee'' row shows that only RCU
 has a \IXalth{bounded wait-free}{bounded}{wait free}
 \IX{forward-progress guarantee}, which means that
 it can carry out a finite traversal by executing a bounded number of
 instructions.

 The ``Reader Reference Acquisition'' row indicates that only RCU is
 capable of unconditionally acquiring references.
 The entry for sequence locks is ``Unsafe'' because, again, sequence locks
 detect updates rather than acquiring references.
 Reference counting and hazard pointers both require that traversals be
 restarted from the beginning if a given acquisition fails.
 To see this, consider a linked list containing objects~A, B, C, and~D,
 in that order, and the following series of events:

 \begin{enumerate}
 \item	A reader acquires a reference to object~B.
 \item	An updater removes object~B, but refrains from freeing it because
 	the reader holds a reference.
 	The list now contains objects~A, C, and~D, and
 	object~B's \co{->next} pointer is set to \co{HAZPTR_POISON}.
 \item	The updater removes object~C, so that the list now contains
 	objects~A and~D\@.
 	Because there is no reference to object~C, it is immediately freed.
 \item	The reader tries to advance to the successor of the object
 	following the now-removed object~B, but the poisoned
 	\co{->next} pointer prevents this.
 	Which is a good thing, because object~B's \co{->next} pointer
 	would otherwise point to the freelist.
 \item	The reader must therefore restart its traversal from the head
 	of the list.
 \end{enumerate}

 Thus, when failing to acquire a reference, a hazard-pointer or
 reference-counter traversal must restart that traversal from the
 beginning.
 In the case of nested linked data structures, for example, a
 tree containing linked lists, the traversal must be restarted from
 the outermost data structure.
 This situation gives RCU a significant ease-of-use advantage.

 However, RCU's ease-of-use advantage does not come
 for free, as can be seen in the ``Memory Footprint'' row.
 RCU's support of unconditional reference acquisition means that
 it must avoid freeing any object reachable by a given
 RCU reader until that reader completes.
 RCU therefore has an unbounded memory footprint, at least unless updates
 are throttled.
 In contrast, reference counting and hazard pointers need to retain only
 those data elements actually referenced by concurrent readers.

 This tension between memory footprint and acquisition
 failures is sometimes resolved within the Linux kernel by combining use
 of RCU and reference counters.
 RCU is used for short-lived references, which means that RCU read-side
 critical sections can be short.
 These short RCU read-side critical sections in turn mean that the corresponding
 RCU \IXpl{grace period} can also be short, which limits the memory footprint.
 For the few data elements that need longer-lived references, reference
 counting is used.
 This means that the complexity of reference-acquisition failure only
 needs to be dealt with for those few data elements:
 The bulk of the reference acquisitions are unconditional, courtesy of RCU\@.
 See \cref{sec:together:Refurbish Reference Counting}
 for more information on combining reference counting with other
 synchronization mechanisms.

 The ``Reclamation Forward Progress'' row shows that hazard pointers
 can provide non-blocking updates~\cite{MagedMichael04a,HerlihyLM02}.
 Reference counting might or might not, depending on the implementation.
 However, sequence locking cannot provide non-blocking updates, courtesy
 of its update-side lock.
 RCU updaters must wait on readers, which also rules out fully non-blocking
 updates.
 However, there are situations in which the only blocking operation is
 a wait to free memory, which results in a situation that, for many
 purposes, is as good as non-blocking~\cite{MathieuDesnoyers2012URCU}.

 As shown in the ``Automatic Reclamation'' row, only reference
 counting can automate freeing of memory, and even then only
 for non-cyclic data structures.
 Certain use cases for hazard pointers and RCU can provide automatic
 reclamation using \emph{link counts}, which can be thought of as
 reference counts, but applying only to incoming links from other
 parts of the data structure~\cite{MagedMichael2018FollyHazptr}.

 Finally, the ``Lines of Code'' row shows the size of the Pre-BSD
 Routing Table implementations, giving a rough idea of relative ease of use.
 That said, it is important to note that the reference-counting and
 sequence-locking implementations are buggy, and that a correct
 reference-counting implementation is considerably
 more complex~\cite{Valois95a,MagedMichael95a}.
 For its part, a correct sequence-locking implementation requires
 the addition of some other synchronization mechanism, for example,
 hazard pointers or RCU, so that sequence locking detects concurrent
 updates and the other mechanism provides safe reference acquisition.

 As more experience is gained using these techniques, both separately
 and in combination, the rules of thumb laid out in this section will
 need to be refined.
 However, this section does reflect the current state of the art.

 \subsection{Which to Choose?
 			     (Production Use)}
 \label{sec:defer:Which to Choose? (Production Use)}

 This section points out a few publicly visible production uses of
 hazard pointers, sequence locking, and RCU\@.
 Reference counting is omitted, not because it is unimportant, but rather
 because it is not only used pervasively, but heavily documented in textbooks
 going back a half century.
 One of the hoped-for benefits of listing production uses of these other
 techniques is to provide examples to study---or to find bugs in, as
 the case may be.\footnote{
 	Kudos to Mathias Stearn, Matt Wilson, David Goldblatt, LiveJournal
 	user fanf, Nadav Har'El, Avi Kivity, Dmitry Vyukov, Raul Guitterez
 	S., Twitter user @peo3, Paolo Bonzini, and Thomas Monjalon for
 	locating a great many of these use cases.}

 \subsubsection{Production Uses of Hazard Pointers}
 \label{sec:defer:Production Uses of Hazard Pointers}

 In 2010, Keith Bostic added a variant of hazard pointers to
 WiredTiger~\cite{KeithBostic2010WiredTigerhazptr}.
 MongoDB 3.0, released in 2015, included WiredTiger and thus hazard pointers.

 In 2011, Samy Al Bahra added hazard pointers to the Concurrency Kit
 library~\cite{SamyAlBahra2011ckhp}.

 In 2014, Maxim Khizhinsky added hazard pointers to
 libcds~\cite{MaximKhizhinsky2014libcdsHazptr}.

 In 2015, David Gwynne introduced shared reference pointers, a form
 of hazard pointers, to OpenBSD~\cite{DavidGwynne2015srp}.

 In 2017--2018, the Rust-language
 \co{arc-swap}~\cite{MichalVaner2018arc-swapHazptr} and
 \co{conc}~\cite{crates.io.user.ticki2017concHazptr}
 crates rolled their own implementations of hazard pointers.

 In 2018, Maged Michael added hazard pointers to Facebook's Folly
 library~\cite{MagedMichael2018FollyHazptr}, where it is used heavily.

 \subsubsection{Production Uses of Sequence Locking}
 \label{sec:defer:Production Uses of Sequence Locking}

 The Linux kernel added sequence locking to v2.5.60 in
 2003~\cite{JonathanCorbet2003seqlock}, having been generalized from
 an ad-hoc technique used in x86's implementation of the
 \co{gettimeofday()} system call.

 In 2011, Samy Al Bahra added sequence locking to the Concurrency Kit
 library~\cite{SamyAlBahra2011ckseqlock}.

 Paolo Bonzini added a simple sequence-lock to the QEMU emulator in
 2013~\cite{PaoloBonzini2013QEMUseqlock}.

 Alexis Menard abstracted a sequence-lock implementation in Chromium
 in 2016~\cite{AlexisMenard2016ChromiumSeqLock}.

 A simple sequence locking implementation was added to \co{jemalloc()}
 in 2018~\cite{DavidGoldblatt2018seqlock}.
 The eigen library also has a special-purpose queue that is managed by
 a mechanism resembling sequence locking.

 \subsubsection{Production Uses of RCU}
 \label{sec:defer:Production Uses of RCU}

 IBM's VM/XA is adopted passive serialization, a mechanism similar to
 RCU, some time in the 1980s~\cite{Hennessy89}.

 DYNIX/ptx adopted RCU in 1993~\cite{McKenney98,Slingwine95}.

 The Linux kernel adopted Dipankar Sarma's implementation of RCU in
 2002~\cite{Torvalds2.5.43}.

 The userspace RCU project started in 2009~\cite{MathieuDesnoyers2009URCU}.

 The Knot DNS project started using the userspace RCU
 library in 2010~\cite{LubosSlovak2010KnotDNSRCU}.
 That same year, the OSv kernel added an RCU
 implementation~\cite{AviKivity2013OSvRCU},
 later adding an RCU-protected linked list~\cite{AviKivity2013OSvRCUlist}
 and an RCU-protected hash table~\cite{AviKivity2013OSvRCUhash}.

 In 2011, Samy Al Bahra added epochs
 (a form of RCU~\cite{UCAM-CL-TR-579,KeirFraser2007withoutLocks})
 to the Concurrency Kit
 library~\cite{SamyAlBahra2011ckEpoch}.

 NetBSD began using the aforementioned passive serialization with v6.0 in
 2012~\cite{NetBSD2012pserialize}.
 Among other things, passive serialization is used in
 NetBSD packet filter (NPF)~\cite{MindaugasRasiukevicius2014NPFRCU}.

 Paolo Bonzini added RCU support to the QEMU emulator in 2015 via a
 friendly fork of the userspace RCU
 library~\cite{MikeDay2013RCUqemu,PaoloBonzini2013QEMURCU}.

 In 2015, Maxim Khizhinsky added RCU to
 libcds~\cite{MaxKhiszinsky2015C++RCU}.

 Mindaugas Rasiukevicius implemented libqsbr in 2016, which features
 \IXacr{qsbr} and \IXacrf{ebr}~\cite{MindaugasRasiukevicius2016libqsbr},
 both of which are types of implementations of RCU\@.

 Sheth et al.~\cite{HarshalSheth2016goRCU}
 demonstrated the value of leveraging Go's garbage
 collector to provide RCU-like functionality, and
 the Go programming language provides a \co{Value} type that can
 provide this functionality.\footnote{
 	See \url{https://golang.org/pkg/sync/atomic/\#Value}, particularly
 	the ``Example (ReadMostly)''.}

 Matt Klein describes an RCU-like mechanism that is used in the Envoy
 Proxy~\cite{MattKlein2017EnvoyRCU}.

 Honnappa Nagarahalli added an RCU library to the Data Plane Development
 Kit (DPDK) in 2018~\cite{HonnappaNagarahalli2018dpdkRCU}.

 Stjepan Glavina merged an epoch-based RCU implementation into the
 \co{crossbeam} set of concurrency-support ``crates'' for the Rust
 language~\cite{StjepanGlavina2018RustRCU}.

 Jason Donenfeld produced an RCU implementations as part of his port of
 WireGuard to Windows~NT kernel~\cite{JasonDonenfeld2021:WindowsNTwireguardRCU}.

 Finally, any garbage-collected concurrent language (not just Go!\@) gets
 the update side of an RCU implementation at zero incremental cost.

 \subsubsection{Summary of Production Uses}
 \label{sec:defer:Summary of Production Uses}

 Perhaps the time will come when sequence locking, hazard pointers, and
 RCU are all as heavily used and as well known as are reference counters.
 Until that time comes, the current production uses of these mechanisms
 should help guide the choice of mechanism as well as showing how best
 to apply each of them.
 And with that, we have uncovered the last of the mysteries put forth on
 \cpageref{sec:defer:Mysteries Which to choose}.

 The next section discusses updates, a ticklish issue for many of the
 read-mostly mechanisms described in this chapter.