defer/rcuusage.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % defer/rcuusage.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \subsection{RCU Usage}
 \label{sec:defer:RCU Usage}
 \OriginallyPublished{sec:defer:RCU Usage}{RCU Usage}{Linux Weekly News}{PaulEMcKenney2008WhatIsRCUUsage}

 This section answers the question ``What is RCU?'' from the viewpoint
 of the uses to which RCU can be put.
 Because RCU is most frequently used to replace some existing mechanism,
 we look at it primarily in terms of its relationship to such mechanisms,
 as listed in \cref{tab:defer:RCU Usage}
 and as displayed in \cref{fig:defer:Relationships Between RCU Use Cases}.
 Following the sections listed in this table,
 \cref{sec:defer:RCU Usage Summary} provides a summary,
 which is expanded on in the following sections and
 elsewhere~\cite{PaulEMcKenney2021RCUMysteriesFundamental,PaulEMcKenney2022RCUMysteriesUseCases}.

 \begin{table}
 \renewcommand*{\arraystretch}{1.2}
 \centering
 \small
 \begin{tabular}{ll}
 \toprule
 Mechanism RCU Replaces & Page \\
 \midrule
 RCU for pre-BSD routing &
 	\pageref{sec:defer:RCU for Pre-BSD Routing} \\
 Wait for pre-existing things to finish &
 	\pageref{sec:defer:Wait for Pre-Existing Things to Finish} \\
 Phased state change &
 	\pageref{sec:defer:Phased State Change} \\
 Add-only list (publish/subscribe) &
 	\pageref{sec:defer:Add-Only List} \\
 Type-safe memory &
 	\pageref{sec:defer:Type-Safe Memory} \\
 Existence Guarantee &
 	\pageref{sec:defer:Existence Guarantee} \\
 Light-weight garbage collector &
 	\pageref{sec:defer:Light-Weight Garbage Collector} \\
 Delete-only list &
 	\pageref{sec:defer:Delete-Only List} \\
 Quasi reader-writer lock &
 	\pageref{sec:defer:Quasi Reader-Writer Lock} \\
 Quasi reference count &
 	\pageref{sec:defer:Quasi Reference Count} \\
 Quasi multi-version concurrency control (MVCC) &
 	\pageref{sec:defer:Quasi Multi-Version Concurrency Control} \\
 \bottomrule
 \end{tabular}
 \caption{RCU Usage}
 \label{tab:defer:RCU Usage}
 \end{table}

 \subsubsection{RCU for Pre-BSD Routing}
 \label{sec:defer:RCU for Pre-BSD Routing}

 In contrast to the later sections, this section focuses on a very
 specific use case for the purpose of comparison with other mechanisms.

 \Cref{lst:defer:RCU Pre-BSD Routing Table Lookup,%
 lst:defer:RCU Pre-BSD Routing Table Add/Delete}
 show code for an RCU-protected Pre-BSD routing table
 (\path{route_rcu.c}).
 The former shows data structures and \co{route_lookup()},
 and the latter shows \co{route_add()} and \co{route_del()}.

 \begin{listing}
 \input{CodeSamples/defer/route_rcu@lookup.fcv}
 \caption{RCU Pre-BSD Routing Table Lookup}
 \label{lst:defer:RCU Pre-BSD Routing Table Lookup}
 \end{listing}

 \begin{listing}
 \input{CodeSamples/defer/route_rcu@add_del.fcv}
 \caption{RCU Pre-BSD Routing Table Add/Delete}
 \label{lst:defer:RCU Pre-BSD Routing Table Add/Delete}
 \end{listing}

 \begin{fcvref}[ln:defer:route_rcu:lookup]
 In \cref{lst:defer:RCU Pre-BSD Routing Table Lookup},
 \clnref{rh} adds the \co{->rh} field used by RCU reclamation,
 \clnref{re_freed} adds the \co{->re_freed} use-after-free-check field,
 \clnref{lock,unlock1,unlock2}
 add RCU read-side protection,
 and \clnref{chk_freed,abort} add the use-after-free check.
 \end{fcvref}
 \begin{fcvref}[ln:defer:route_rcu:add_del]
 In \cref{lst:defer:RCU Pre-BSD Routing Table Add/Delete},
 \clnref{add:lock,add:unlock,del:lock,%
 del:unlock1,del:unlock2} add update-side locking,
 \clnref{add:add_rcu,del:del_rcu} add RCU update-side protection,
 \clnref{del:call_rcu} causes \co{route_cb()} to be invoked after
 a grace period elapses,
 and \clnrefrange{cb:b}{cb:e} define \co{route_cb()}.
 This is minimal added code for a working concurrent implementation.
 \end{fcvref}

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.17a/perf-rcu}}
 \caption{Pre-BSD Routing Table Protected by RCU}
 \label{fig:defer:Pre-BSD Routing Table Protected by RCU}
 \end{figure}

 \Cref{fig:defer:Pre-BSD Routing Table Protected by RCU}
 shows the performance on the read-only workload.
 RCU scales quite well, and offers nearly ideal performance.
 However, this data was generated using the \co{RCU_SIGNAL}
 flavor of userspace
 RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU},
 for which \co{rcu_read_lock()} and \co{rcu_read_unlock()}
 generate a small amount of code.
 What happens for the \IXacr{qsbr} flavor of RCU, which generates no code at all
 for \co{rcu_read_lock()} and \co{rcu_read_unlock()}?
 (See \cref{sec:defer:Introduction to RCU},
 and especially
 \cref{fig:defer:QSBR: Waiting for Pre-Existing Readers},
 for a discussion of RCU QSBR\@.)

 The answer to this is shown in
 \cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
 which shows that RCU QSBR's performance and scalability actually exceeds
 that of the ideal synchronization-free workload.

 \QuickQuizSeries{%
 \QuickQuizB{
 	Wait, what???
 	How can RCU QSBR possibly be better than ideal?
 	Just what rubbish definition of ideal would fail to be the best
 	of all possible results???
 }\QuickQuizAnswerB{
 	This is an excellent question, and the answer is that modern
 	CPUs and compilers are extremely complex.
 	But before getting into that, it is well worth noting that
 	RCU QSBR's performance advantage appears only in the
 	one-hardware-thread-per-core regime.
 	Once the system is fully loaded, RCU QSBR's performance drops
 	back to ideal.

 	The RCU variant of the \co{route_lookup()} search loop actually
 	has one more x86 instruction than does the sequential version,
 	namely the \co{lea} in the sequence
 	\co{cmp}, \co{je}, \co{mov}, \co{cmp}, \co{lea}, and \co{jne}.
 	This extra instruction is due to the \co{rcu_head} structure
 	at the beginning of the RCU variant's \co{route_entry} structure,
 	so that, unlike the sequential variant, the RCU variant's
 	\co{->re_next.next} pointer has a non-zero offset.
 	Back in the 1980s, this additional \co{lea} instruction might
 	have reliably resulted in the RCU variant being slower, but we
 	are now in the 21\textsuperscript{st} century, and the 1980s
 	are long gone.

 	But those of you who read
 	\cref{sec:cpu:Pipelined CPUs}
 	carefully already knew all of this!

 	These counter-intuitive results of course means that any
 	performance result on modern microprocessors must be subject to
 	some skepticism.
 	In theory, it really does not make sense to obtain performance
 	results that are better than ideal, but it really can happen
 	on modern microprocessors.
 	Such results can be thought of as similar to the celebrated
 	super-linear speedups (see
 	\cref{sec:SMPdesign:Beyond Partitioning}
 	for one such example), that is, of interest but also of limited
 	practical importance.
 	Nevertheless, one of the strengths of RCU is that its read-side
 	overhead is so low that tiny effects such as this one are visible
 	in real performance measurements.

 \begin{figure}
 \centering
 % Run with initial rcu_head structure in route_entry moved down.
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.17a/perf-rcu-qsbr}}
 \caption{Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial \tco{rcu_head}}
 \label{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head}
 \end{figure}

 	This raises the question as to what would happen if the
 	\co{rcu_head} structure were to be moved so that RCU's
 	\co{->re_next.next} pointer also had zero offset, just the
 	same as the sequential variant.
 	And the answer, as can be seen in
 	\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head},
 	is that this causes RCU QSBR's performance to decrease to where
 	it is still very nearly ideal, but no longer super-ideal.
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	Given RCU QSBR's read-side performance, why bother with any
 	other flavor of userspace RCU?
 }\QuickQuizAnswerE{
 	Because RCU QSBR places constraints on the overall application
 	that might not be tolerable,
 	for example, requiring that each and every thread in the
 	application regularly pass through a quiescent state.
 	Among other things, this means that RCU QSBR is not helpful
 	to library writers, who might be better served by other
 	flavors of userspace RCU~\cite{PaulMcKenney2013LWNURCU}.
 }\QuickQuizEndE
 }

 \begin{figure}
 \centering
 % Run with initial rcu_head structure at beginning of route_entry structure.
 % This need to run twice is the reason for the oddball directory.
 % @@@ Make this generate both files with a single run?
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.02a/perf-rcu-qsbr}}
 \caption{Pre-BSD Routing Table Protected by RCU QSBR}
 \label{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR}
 \end{figure}

 \begin{figure*}
 \centering
 \IfTwoColumn{
   \resizebox{5.5in}{!}{\includegraphics{defer/RCUusecases}}
 }{
   \resizebox{.96\textwidth}{!}{\includegraphics{defer/RCUusecases}}
   % eb builds require .96
 }
 \caption{Relationships Between RCU Use Cases}
 \label{fig:defer:Relationships Between RCU Use Cases}
 \end{figure*}

 Although Pre-BSD routing is an excellent RCU use case, it is worthwhile
 looking at the relationships betweeen the wider spectrum of use cases
 shown in
 \cref{fig:defer:Relationships Between RCU Use Cases}.
 This task is taken up by the following sections.

 While reading these sections, please ask yourself which of these
 use cases best describes Pre-BSD routing.

 \subsubsection{Wait for Pre-Existing Things to Finish}
 \label{sec:defer:Wait for Pre-Existing Things to Finish}

 As noted in \cref{sec:defer:RCU Fundamentals}
 an important component
 of RCU is a way of waiting for RCU readers to finish.
 One of
 RCU's great strength is that it allows you to wait for each of
 thousands of different things to finish without having to explicitly
 track each and every one of them, and without incurring
 the performance degradation, scalability limitations, complex deadlock
 scenarios, and memory-leak hazards that are inherent in schemes that
 use explicit tracking.

 In this section, we will show how \co{synchronize_sched()}'s
 read-side counterparts (which include anything that disables preemption,
 along with hardware operations and
 primitives that disable interrupts) permit you to interaction with
 non-maskable interrupt
 (NMI) handlers, which is quite difficult using locking.
 This approach has been called ``Pure RCU''~\cite{PaulEdwardMcKenneyPhD},
 and it is used in a few places in the Linux kernel.

 The basic form of such ``Pure RCU'' designs is as follows:

 \begin{enumerate}
 \item	Make a change, for example, to the way that the OS reacts to an NMI\@.
 \item	Wait for all pre-existing read-side critical sections to
 	completely finish (for example, by using the
 	\co{synchronize_sched()} primitive).\footnote{
 		In Linux kernel v5.1 and later, \co{synchronize_sched()} has
 		been subsumed into \co{synchronize_rcu()}.}
 	The key observation here is that subsequent RCU read-side critical
 	sections are guaranteed to see whatever change was made.
 \item	Clean up, for example, return status indicating that the
 	change was successfully made.
 \end{enumerate}

 The remainder of this section presents example code adapted from
 the Linux kernel.
 In this example, the \co{nmi_stop()} function in the now-defunct
 oprofile facility uses \co{synchronize_sched()} to ensure that all
 in-flight NMI notifications have completed before freeing the associated
 resources.
 A simplified version of this code is shown in
 \cref{lst:defer:Using RCU to Wait for NMIs to Finish}.

 \begin{listing}
 \begin{fcvlabel}[ln:defer:Using RCU to Wait for NMIs to Finish]
 \begin{VerbatimL}[commandchars=\\\@\$]
 struct profile_buffer {				\lnlbl@struct:b$
 	long size;
 	atomic_t entry[0];
 };						\lnlbl@struct:e$
 static struct profile_buffer *buf = NULL;	\lnlbl@struct:buf$

 void nmi_profile(unsigned long pcvalue)		\lnlbl@nmi_profile:b$
 {
 	struct profile_buffer *p = rcu_dereference(buf);\lnlbl@nmi_profile:rcu_deref$

 	if (p == NULL)				\lnlbl@nmi_profile:if_NULL$
 		return;				\lnlbl@nmi_profile:ret:a$
 	if (pcvalue >= p->size)			\lnlbl@nmi_profile:if_oor$
 		return;				\lnlbl@nmi_profile:ret:b$
 	atomic_inc(&p->entry[pcvalue]);		\lnlbl@nmi_profile:inc$
 }						\lnlbl@nmi_profile:e$

 void nmi_stop(void)				\lnlbl@nmi_stop:b$
 {
 	struct profile_buffer *p = buf;		\lnlbl@nmi_stop:fetch$

 	if (p == NULL)				\lnlbl@nmi_stop:if_NULL$
 		return;				\lnlbl@nmi_stop:ret$
 	rcu_assign_pointer(buf, NULL);		\lnlbl@nmi_stop:NULL$
 	synchronize_sched();			\lnlbl@nmi_stop:sync_sched$
 	kfree(p);				\lnlbl@nmi_stop:kfree$
 }						\lnlbl@nmi_stop:e$
 \end{VerbatimL}
 \end{fcvlabel}
 \caption{Using RCU to Wait for NMIs to Finish}
 \label{lst:defer:Using RCU to Wait for NMIs to Finish}
 \end{listing}

 \begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:struct]
 \Clnrefrange{b}{e} define a \co{profile_buffer} structure, containing a
 size and an indefinite array of entries.
 \Clnref{buf} defines a pointer to a profile buffer, which is
 presumably initialized elsewhere to point to a dynamically allocated
 region of memory.
 \end{fcvref}

 \begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:nmi_profile]
 \Clnrefrange{b}{e} define the \co{nmi_profile()} function,
 which is called from within an NMI handler.
 As such, it cannot be preempted, nor can it be interrupted by a normal
 interrupt handler, however, it is still subject to delays due to cache misses,
 ECC errors, and cycle stealing by other hardware threads within the same
 core.
 \Clnref{rcu_deref} gets a local pointer to the profile buffer using the
 \co{rcu_dereference()} primitive to ensure memory ordering on
 DEC Alpha, and
 \clnref{if_NULL,ret:a} exit from this function if there is no
 profile buffer currently allocated, while \clnref{if_oor,ret:b}
 exit from this function if the \co{pcvalue} argument
 is out of range.
 Otherwise, \clnref{inc} increments the profile-buffer entry indexed
 by the \co{pcvalue} argument.
 Note that storing the size with the buffer guarantees that the
 range check matches the buffer, even if a large buffer is suddenly
 replaced by a smaller one.
 \end{fcvref}

 \begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:nmi_stop]
 \Clnrefrange{b}{e} define the \co{nmi_stop()} function,
 where the caller is responsible for mutual exclusion (for example,
 holding the correct lock).
 \Clnref{fetch} fetches a pointer to the profile buffer, and
 \clnref{if_NULL,ret} exit the function if there is no buffer.
 Otherwise, \clnref{NULL} \co{NULL}s out the profile-buffer pointer
 (using the \co{rcu_assign_pointer()} primitive to maintain
 memory ordering on weakly ordered machines),
 and \clnref{sync_sched} waits for an RCU Sched grace period to elapse,
 in particular, waiting for all non-preemptible regions of code,
 including NMI handlers, to complete.
 Once execution continues at \clnref{kfree}, we are guaranteed that
 any instance of \co{nmi_profile()} that obtained a
 pointer to the old buffer has returned.
 It is therefore safe to free the buffer, in this case using the
 \co{kfree()} primitive.
 \end{fcvref}

 \QuickQuiz{
 	Suppose that the \co{nmi_profile()} function was preemptible.
 	What would need to change to make this example work correctly?
 }\QuickQuizAnswer{
 	One approach would be to use
 	\co{rcu_read_lock()} and \co{rcu_read_unlock()}
 	in \co{nmi_profile()}, and to replace the
 	\co{synchronize_sched()} with \co{synchronize_rcu()},
 	perhaps as shown in
 	\cref{lst:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
 %
 \begin{listing}
 \begin{VerbatimL}
 struct profile_buffer {
 	long size;
 	atomic_t entry[0];
 };
 static struct profile_buffer *buf = NULL;

 void nmi_profile(unsigned long pcvalue)
 {
 	struct profile_buffer *p;

 	rcu_read_lock();
 	p = rcu_dereference(buf);
 	if (p == NULL) {
 		rcu_read_unlock();
 		return;
 	}
 	if (pcvalue >= p->size) {
 		rcu_read_unlock();
 		return;
 	}
 	atomic_inc(&p->entry[pcvalue]);
 	rcu_read_unlock();
 }

 void nmi_stop(void)
 {
 	struct profile_buffer *p = buf;

 	if (p == NULL)
 		return;
 	rcu_assign_pointer(buf, NULL);
 	synchronize_rcu();
 	kfree(p);
 }
 \end{VerbatimL}
 \caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
 \label{lst:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
 \end{listing}

 	But why on earth would an NMI handler be preemptible???
 }\QuickQuizEnd

 In short, RCU makes it easy to dynamically switch among profile
 buffers (you just \emph{try} doing this efficiently with atomic
 operations, or at all with locking!).
 This is a rare use of RCU in its pure form.
 RCU is normally used at higher levels of abstraction, as
 will be shown in the following sections.

 \subsubsection{Phased State Change}
 \label{sec:defer:Phased State Change}

 \Cref{fig:defer:Phased State Change for Maintenance Operation}
 shows a timeline for an example phased state change to efficiently
 handle maintenance operations.
 If there is no maintenance operation in progress, common-case operations
 must proceed quickly, for example, without acquiring a reader-writer lock.
 However, if there is a maintenance operation in progress, the common-case
 operations must be undertaken carefully, taking into account added
 complexities due to their running concurrently with that maintenance
 operation.
 This means that common-case operations will incur higher overhead during
 maintenance operations, which is one reason that maintenance operations
 are normally scheduled to take place during times of low load.

 \begin{figure}
 \centering
 \resizebox{\twocolumnwidth}{!}{\includegraphics{defer/RCUphasedstatechange}}
 \caption{Phased State Change for Maintenance Operation}
 \label{fig:defer:Phased State Change for Maintenance Operation}
 \end{figure}

 In the figure, these apparently conflicting requirements are resolved by
 having a prepare phase prior to the maintenance operation and a cleanup
 phase after it, during which the common-case operations can proceed
 either quickly or carefully.

 \begin{listing}
 \begin{fcvlabel}[ln:defer:Phased State Change for Maintenance Operations]
 \begin{VerbatimL}[commandchars=\\\@\$]
 bool be_careful;

 void cco(void)
 {
 	rcu_read_lock();			\lnlbl@rrl$
 	if (READ_ONCE(be_careful))		\lnlbl@if$
 		cco_carefully();		\lnlbl@careful$
 	else
 		cco_quickly();			\lnlbl@quick$
 	rcu_read_unlock();			\lnlbl@rul$
 }

 void maint(void)
 {
 	WRITE_ONCE(be_careful, true);		\lnlbl@tocareful$
 	synchronize_rcu();			\lnlbl@sr1$
 	do_maint();				\lnlbl@maint$
 	synchronize_rcu();			\lnlbl@sr2$
 	WRITE_ONCE(be_careful, false);		\lnlbl@toquick$
 }
 \end{VerbatimL}
 \end{fcvlabel}
 \caption{Phased State Change for Maintenance Operations}
 \label{lst:defer:Phased State Change for Maintenance Operations}
 \end{listing}

 \begin{fcvref}[ln:defer:Phased State Change for Maintenance Operations]
 Example pseudo-code for this phased state change is shown in
 \cref{lst:defer:Phased State Change for Maintenance Operations}.
 The common-case operations are carried out by \co{cco()} within an RCU
 read-side critical section extending from \clnref{rrl} to \clnref{rul}.
 Here, \clnref{if} checks a global \co{be_careful} flag, invoking
 \co{cco_carefully()} or \co{cco_quickly()}, as indicated.
 \end{fcvref}

 \begin{fcvref}[ln:defer:Phased State Change for Maintenance Operations]
 This allows the \co{maint()} function to set the \co{be_careful} flag
 on \clnref{tocareful} and wait for an RCU grace period on \clnref{sr1}.
 When control reaches \clnref{maint}, all \co{cco()} functions that saw a
 \co{false} value of \co{be_careful} (and thus which might invoke
 the \co{cco_quickly()} function) will have completed their operations,
 so that all currently executing \co{cco()} functions will be invoking
 \co{cco_carefully()}.
 This means that it is safe for the \co{do_maint()} function to be
 invoked.
 \Clnref{sr2} then waits for all \co{cco()} functions that might have
 run concurrently with \co{do_maint()} to complete, and finally
 \clnref{toquick} sets the \co{be_careful} flag back to \co{false}.
 \end{fcvref}

 \QuickQuizSeries{%
 \QuickQuizB{
 	What is the point of the second call to \co{synchronize_rcu()}
 	in function
 	\co{maint()} in \cref{lst:defer:Phased State Change for Maintenance Operations}?
 	Isn't it OK for any \co{cco()} invocations in the clean-up
 	phase to invoke either \co{cco_carefully()} or \co{cco_quickly()}?
 }\QuickQuizAnswerB{
 	The problem is that there is no ordering between the \co{cco()}
 	function's load from \co{be_careful} and any memory loads
 	executed by the \co{cco_quickly()} function.
 	Because there is no ordering, without that second call to
 	\co{syncrhonize_rcu()}, memory ordering could cause loads
 	in \co{cco_quickly()} to overlap with stores by \co{do_maint()}.

 	Another alternative would be to compensate for the removal of
 	that second call to \co{synchronize_rcu()} by changing the
 	\co{READ_ONCE()} to \co{smp_load_acquire()} and the
 	\co{WRITE_ONCE()} to \co{smp_store_release()}, thus restoring
 	the needed ordering.
 }\QuickQuizEndB

 \QuickQuizE{
 	How can you be sure that the code shown in
 	\co{maint()} in \cref{lst:defer:Phased State Change for Maintenance Operations}
 	really works?\@
 }\QuickQuizAnswerE{
 	By one popular school of thought, you cannot.

 	But in this case, those willing to jump ahead to
 	\cref{chp:Formal Verification}
 	and
 	\cref{chp:Advanced Synchronization: Memory Ordering}
 	might find a couple of LKMM litmus tests to be interesting
 	(\path{C-RCU-phased-state-change-1.litmus} and
 	\path{C-RCU-phased-state-change-2.litmus}).
 	These tests could be argued to demonstrate that this code
 	and a variant of it really do work.
 }\QuickQuizEndE
 }% End of \QuickQuizSeries

 Phased state change allows frequent operations to use light-weight
 checks, without the need for expensive lock acquisitions or atomic
 read-modify-write operations, and is used in the Linux kernel in the
 guise of \co{rcu_sync}~\cite{OlegNesterov2013rcusync} to implement a
 variant of reader-writer semaphores with lightweight readers.
 Phased state change adds only a checked state variable
 to the wait-to-finish use case
 (\cref{sec:defer:Wait for Pre-Existing Things to Finish}),
 thus also residing at a rather low level of abstraction.

 \subsubsection{Add-Only List}
 \label{sec:defer:Add-Only List}

 Add-only data structures, exemplified by the add-only list, can be used
 for a surprisingly common set of use cases, perhaps most commonly the
 logging of changes.
 Add-only data structures are a pure use of RCU's underlying
 publish/subscribe mechanism.

 An add-only variant of a pre-BSD routing table can be derived from
 \cref{lst:defer:RCU Pre-BSD Routing Table Lookup,lst:defer:RCU Pre-BSD Routing Table Add/Delete}.
 Because there is no deletion, the \co{route_del()} and \co{route_cb()}
 functions may be dispensed with, along with the \co{->rh}
 and \co{->re_freed} fields of the \co{route_entry} structure, the
 \co{rcu_read_lock()}, the \co{rcu_read_unlock()} invocations in the
 \co{route_lookup()} function, and all uses of the \co{->re_freed} field
 in all remaining functions.

 Of course, if there are many concurrent invocations of the \co{route_add()}
 function, there will be heavy contention on \co{routelock}, and if lockless
 techniques are used, heavy memory contention on \co{routelist}.
 The usual way to avoid this contention is to use a concurrency-friendly
 data structure such as a hash table (see \cref{chp:Data Structures}).
 Alternatively, per-CPU data structures might be periodically merged
 into a single global data structure.

 On the other hand, if there is never any deletion, extended time periods
 featuring many concurrent invocations of \co{route_add()} will eventually
 consume all available memory.
 Therefore, most RCU-protected data structures also implement deletion.

 \subsubsection{Type-Safe Memory}
 \label{sec:defer:Type-Safe Memory}

 A number of lockless algorithms do not require that a given data
 element keep the same identity through a given RCU read-side critical
 section referencing it---but only if that data element retains the
 same type.
 In other words, these lockless algorithms can tolerate a given data
 element being freed and reallocated as the same type of structure
 while they are referencing it, but must prohibit a change in type.
 This guarantee, called ``\IX{type-safe memory}'' in
 academic literature~\cite{Cheriton96a},
 is weaker than the \IXpl{existence guarantee} discussed
 in \cref{sec:defer:Existence Guarantee},
 and is therefore quite a bit harder to work with.
 Type-safe memory algorithms in the Linux kernel make use of slab caches,
 specially marking these caches with \co{SLAB_TYPESAFE_BY_RCU}
 so that RCU is used when returning a freed-up
 slab to system memory.
 This use of RCU guarantees that any in-use element of
 such a slab will remain in that slab, thus retaining its type,
 for the duration of any pre-existing RCU read-side critical sections.

 \QuickQuiz{
 	But what if there is an arbitrarily long series of RCU
 	read-side critical sections in multiple threads, so that at
 	any point in time there is at least one thread in the system
 	executing in an RCU read-side critical section?
 	Wouldn't that prevent any data from a \co{SLAB_TYPESAFE_BY_RCU}
 	slab ever being returned to the system, possibly resulting
 	in OOM events?
 }\QuickQuizAnswer{
 	There could certainly be an arbitrarily long period of time
 	during which at least one thread is always in an RCU read-side
 	critical section.
 	However, the key words in the description in
 	\cref{sec:defer:Type-Safe Memory}
 	are ``in-use'' and ``pre-existing''.
 	Keep in mind that a given RCU read-side critical section is
 	conceptually only permitted to gain references to data elements
 	that were visible to readers during that critical section.
 	Furthermore, remember that a slab cannot be returned to the
 	system until all of its data elements have been freed, in fact,
 	the RCU grace period cannot start until after they have all been
 	freed.

 	Therefore, the slab cache need only wait for those RCU read-side
 	critical sections that started before the freeing of the last element
 	of the slab.
 	This in turn means that any RCU grace period that begins after
 	the freeing of the last element will do---the slab may be returned
 	to the system after that grace period ends.
 }\QuickQuizEnd

 It is important to note that \co{SLAB_TYPESAFE_BY_RCU} will
 \emph{in no way}
 prevent \co{kmem_cache_alloc()} from immediately reallocating
 memory that was just now freed via \co{kmem_cache_free()}!
 In fact, the \co{SLAB_TYPESAFE_BY_RCU}-protected data structure
 just returned by \co{rcu_dereference()} might be freed and reallocated
 an arbitrarily large number of times, even when under the protection
 of \co{rcu_read_lock()}.
 Instead, \co{SLAB_TYPESAFE_BY_RCU} operates by preventing
 \co{kmem_cache_free()}
 from returning a completely freed-up slab of data structures
 to the system until after an RCU grace period elapses.
 In short, although a given RCU read-side critical section might see a
 given \co{SLAB_TYPESAFE_BY_RCU} data element being freed and reallocated
 arbitrarily often, the element's type is guaranteed not to change until
 that critical section has completed.

 These algorithms therefore typically use a validation step that checks
 to make sure that the newly referenced data structure really is the one
 that was requested~\cite[Section~2.5]{LaninShasha1986TSM}.
 These validation checks require that portions of the data structure
 remain untouched by the free-reallocate process.
 Such validation checks are usually very hard to get right, and can
 hide subtle and difficult bugs.

 Therefore, although type-safety-based lockless algorithms can be extremely
 helpful in a very few difficult situations, you should instead use existence
 guarantees where possible.
 Simpler is after all almost always better!
 On the other hand, type-safety-based lockless algorithms can
 provide improved cache locality, and thus improved performance.
 This improved cache locality is provided by the fact that such
 algorithms can immediately reallocate a newly freed block of memory.
 In contrast, algorithms based on existence guarantees must wait for
 all pre-existing readers before reallocating memory, by which time
 that memory may have been ejected from CPU caches.

 As can be seen in \cref{fig:defer:Relationships Between RCU Use Cases},
 RCU's type-safe-memory use case combines both the wait-to-finish
 and publish-subscribe components, but in the Linux kernel also includes
 the slab allocator's deferred reclamation specified by the
 \co{SLAB_TYPESAFE_BY_RCU} flag.

 \subsubsection{Existence Guarantee}
 \label{sec:defer:Existence Guarantee}

 Gamsa et al.~\cite{Gamsa99}
 discuss \IXpl{existence guarantee} and describe how a mechanism
 resembling RCU can be used to provide these existence guarantees
 (see Section~5 on page~7 of the PDF), and
 \cref{sec:locking:Lock-Based Existence Guarantees}
 discusses how to guarantee existence via locking, along with the
 ensuing disadvantages of doing so.
 The effect is that if any RCU-protected data element is accessed
 within an RCU read-side critical section, that data element is
 guaranteed to remain in existence for the duration of that RCU
 read-side critical section.

 \begin{listing}
 \begin{fcvlabel}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 \begin{VerbatimL}[commandchars=\\\@\$]
 int delete(int key)
 {
 	struct element *p;
 	int b;

 	b = hashfunction(key);			\lnlbl@hash$
 	rcu_read_lock();			\lnlbl@rdlock$
 	p = rcu_dereference(hashtable[b]);
 	if (p == NULL || p->key != key) {	\lnlbl@chkkey$
 		rcu_read_unlock();		\lnlbl@rdunlock1$
 		return 0;			\lnlbl@ret_0:a$
 	}
 	spin_lock(&p->lock);			\lnlbl@acq$
 	if (hashtable[b] == p && p->key == key) {\lnlbl@chkkey2$
 		rcu_read_unlock();		\lnlbl@rdunlock2$
 		rcu_assign_pointer(hashtable[b], NULL);\lnlbl@remove$
 		spin_unlock(&p->lock);		\lnlbl@rel1$
 		synchronize_rcu();		\lnlbl@sync_rcu$
 		kfree(p);			\lnlbl@kfree$
 		return 1;			\lnlbl@ret_1$
 	}
 	spin_unlock(&p->lock);			\lnlbl@rel2$
 	rcu_read_unlock();			\lnlbl@rdunlock3$
 	return 0;				\lnlbl@ret_0:b$
 }
 \end{VerbatimL}
 \end{fcvlabel}
 \caption{Existence Guarantees Enable Per-Element Locking}
 \label{lst:defer:Existence Guarantees Enable Per-Element Locking}
 \end{listing}

 \begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 \Cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
 demonstrates how RCU-based existence guarantees can enable
 per-element locking via a function that deletes an element from
 a hash table.
 \Clnref{hash} computes a hash function, and \clnref{rdlock} enters an RCU
 read-side critical section.
 If \clnref{chkkey} finds that the corresponding bucket of the hash table is
 empty or that the element present is not the one we wish to delete,
 then \clnref{rdunlock1} exits the RCU read-side critical section and
 \clnref{ret_0:a}
 indicates failure.
 \end{fcvref}

 \QuickQuiz{
 	What if the element we need to delete is not the first element
 	of the list on
 	\clnrefr{ln:defer:Existence Guarantees Enable Per-Element Locking:chkkey} of
 	\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}?
 }\QuickQuizAnswer{
 	As with the (bug-ridden)
 	\cref{lst:locking:Per-Element Locking Without Existence Guarantees (Buggy!)},
 	this is a very simple hash table with no chaining, so the only
 	element in a given bucket is the first element.
 	The reader is again invited to adapt this example to a hash table with
 	full chaining.
 	Less energetic readers might wish to refer to
 	\cref{chp:Data Structures}.
 }\QuickQuizEnd

 \begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 Otherwise, \clnref{acq} acquires the update-side spinlock, and
 \clnref{chkkey2} then checks that the element is still the one that we want.
 If so, \clnref{rdunlock2} leaves the RCU read-side critical section,
 \clnref{remove} removes it from the table, \clnref{rel1} releases
 the lock, \clnref{sync_rcu} waits for all pre-existing RCU read-side critical
 sections to complete, \clnref{kfree} frees the newly removed element,
 and \clnref{ret_1} indicates success.
 If the element is no longer the one we want, \clnref{rel2} releases
 the lock, \clnref{rdunlock3} leaves the RCU read-side critical section,
 and \clnref{ret_0:b} indicates failure to delete the specified key.
 \end{fcvref}

 \QuickQuizSeries{%
 \QuickQuizB{
 	\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 	Why is it OK to exit the RCU read-side critical section on
 	\clnref{rdunlock2} of
 	\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
 	before releasing the lock on \clnref{rel1}?
 	\end{fcvref}
 }\QuickQuizAnswerB{
 	\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 	First, please note that the second check on \clnref{chkkey2} is
 	necessary because some other
 	CPU might have removed this element while we were waiting
 	to acquire the lock.
 	However, the fact that we were in an RCU read-side critical section
 	while acquiring the lock guarantees that this element could not
 	possibly have been re-allocated and re-inserted into this
 	hash table.
 	Furthermore, once we acquire the lock, the lock itself guarantees
 	the element's existence, so we no longer need to be in an
 	RCU read-side critical section.
 	% The question as to whether it is necessary to re-check the
 	% element's key is left as an exercise to the reader.
 	% A re-check is necessary if the key can mutate or if it is
 	% necessary to reject deleted entries (in cases where deletion
 	% is recorded by mutating the key.
 	\end{fcvref}
 }\QuickQuizEndB
 %
 \QuickQuizM{
 	\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 	Why not exit the RCU read-side critical section on
 	\clnref{rdunlock3} of
 	\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
 	before releasing the lock on \clnref{rel2}?
 	\end{fcvref}
 }\QuickQuizAnswerM{
 	Suppose we reverse the order of these two lines.
 	Then this code is vulnerable to the following sequence of
 	events:
 	\begin{enumerate}
 	\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
 	\item	CPU~0 invokes \co{delete()}, and finds the element
 		to be deleted, executing through \clnref{rdunlock2}.
 		It has not yet actually deleted the element, but
 		is about to do so.
 	\item	CPU~1 concurrently invokes \co{delete()}, attempting
 		to delete this same element.
 		However, CPU~0 still holds the lock, so CPU~1 waits
 		for it at \clnref{acq}.
 	\item	CPU~0 executes \clnref{remove,rel1},
 		and blocks at \clnref{sync_rcu} waiting for CPU~1
 		to exit its RCU read-side critical section.
 	\item	CPU~1 now acquires the lock, but the test on \clnref{chkkey2}
 		fails because CPU~0 has already removed the element.
 		CPU~1 now executes \clnref{rel2}
 		(which we switched with \clnref{rdunlock3}
 		for the purposes of this Quick Quiz)
 		and exits its RCU read-side critical section.
 	\item	CPU~0 can now return from \co{synchronize_rcu()},
 		and thus executes \clnref{kfree}, sending the element to
 		the freelist.
 	\item	CPU~1 now attempts to release a lock for an element
 		that has been freed, and, worse yet, possibly
 		reallocated as some other type of data structure.
 		This is a fatal memory-corruption error.
 	\end{fcvref}
 	\end{enumerate}
 }\QuickQuizEndM
 %
 \QuickQuizE{
 	The RCU-based algorithm shown in
 	\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
 	locks very similar to that in
 	\cref{lst:locking:Per-Element Locking With Lock-Based Existence Guarantees},
 	so why should the RCU-based approach be any better?
 }\QuickQuizAnswerE{
 	\Cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
 	replaces the per-element \co{spin_lock()} and \co{spin_unlock()}
 	shown in
 	\cref{lst:locking:Per-Element Locking With Lock-Based Existence Guarantees}
 	with a much cheaper \co{rcu_read_lock()} and \co{rcu_read_unlock()},
 	thus greatly improving both performance and scalability.
 	For more detail, please see
 	\cref{sec:datastruct:RCU-Protected Hash Table Performance}.
 }\QuickQuizEndE
 }

 Alert readers will recognize this as only a slight variation on
 the original wait-to-finish theme
 (\cref{sec:defer:Wait for Pre-Existing Things to Finish}),
 adding publish/subscribe, linked structures, a heap allocator
 (typically), and deferred reclamation, as shown in
 \cref{fig:defer:Relationships Between RCU Use Cases}.
 They might also note the deadlock-immunity advantages over the lock-based
 existence guarantees discussed in
 \cref{sec:locking:Lock-Based Existence Guarantees}.

 \subsubsection{Light-Weight Garbage Collector}
 \label{sec:defer:Light-Weight Garbage Collector}

 A not-uncommon exclamation made by people first learning about
 RCU is ``RCU is sort of like a garbage
 collector!''~\cite{MattKline2023LockBasedLocklessFresh}.
 This exclamation has a large grain of truth, especially when using RCU
 to implement non-blocking algorithms that rely on garbage collection,
 but it can sometimes be misleading.

 Perhaps the best way to think of the relationship between RCU
 and automatic garbage collectors (GCs) is that RCU resembles
 a GC in that the \emph{timing} of collection is automatically
 determined, but that RCU differs from a GC in that:
 \begin{enumerate*}[(1)]
 \item The programmer must manually indicate when a given data
 structure is eligible to be collected and
 \item The programmer must manually mark the RCU read-side critical
 sections where references might be held.
 \end{enumerate*}

 Despite these differences, the resemblance does go quite deep.
 In fact, the first RCU-like mechanism I am aware of used a
 reference-count-based garbage collector to handle the grace
 periods~\cite{Kung80}, and the connection between RCU and
 garbage collection has been noted more
 recently~\cite{HarshalSheth2016goRCU}.

 The light-weight garbage collector use case is very similar to
 the existence-guarantee use case, adding only the desired non-blocking
 algorithm to the mix.
 This light-weight garbage collector use case can also be used in
 conjunction with the existence guarantees described in the next section.

 \subsubsection{Delete-Only List}
 \label{sec:defer:Delete-Only List}

 The delete-only list is the less-popular counterpart to the add-only list
 covered in \cref{sec:defer:Add-Only List}, and can be thought of as the
 existence-guarantee use case, but without the publish/subscribe component,
 as shown in \cref{fig:defer:Relationships Between RCU Use Cases}.
 A delete-only list can be used when the universe of possible members of
 the list is known at initialization, and where members can be removed.
 For example, elements of the list might represent hardware elements of
 the system that are subject to failure, but cannot be repaired or
 replaced without a reboot.

 An delete-only variant of a pre-BSD routing table can be derived from
 \cref{lst:defer:RCU Pre-BSD Routing Table Lookup,lst:defer:RCU Pre-BSD Routing Table Add/Delete}.
 Because there is no addition, the \co{route_add()} function may be
 dispensed with, or, alternatively, its use might be restricted to
 initialization time.
 In theory, the \co{route_lookup()} function can use a non-RCU iterator,
 though in the Linux kernel this will result in complaints from debug code.
 In addition, the incremental cost of an RCU iterator is usually
 negligible.

 As a result, delete-only situations typically use algorithms and data
 structures that are designed for addition as well as deletion.

 \subsubsection{Quasi Reader-Writer Lock}
 \label{sec:defer:Quasi Reader-Writer Lock}

 Perhaps the most common use of RCU within the Linux kernel is as
 a replacement for reader-writer locking in read-intensive situations.
 Nevertheless, this use of RCU was not immediately apparent to me
 at the outset.
 In fact, I chose to implement a lightweight reader-writer
 lock~\cite{WilsonCHsieh92a}\footnote{
 	Similar to \co{brlock} in the 2.4 Linux kernel and to
 	\co{lglock} in more recent Linux kernels.}
 before implementing a general-purpose RCU implementation
 back in the early 1990s.
 Each and every one of the uses I envisioned for the lightweight reader-writer
 lock was instead implemented using RCU\@.
 In fact, it was more than
 three years before the lightweight reader-writer lock saw its first use.
 Boy, did I feel foolish!

 The key similarity between RCU and reader-writer locking is that
 both have read-side critical sections that can execute concurrently.
 In fact, in some cases, it is possible to mechanically substitute RCU API
 members for the corresponding reader-writer lock API members.
 But first, why bother?

 Advantages of RCU include performance,
 deadlock immunity, and realtime latency.
 There are, of course, limitations to RCU, including the fact that
 readers and updaters run concurrently, that low-priority RCU readers
 can block high-priority threads waiting for a grace period to elapse,
 and that grace-period latencies can extend for many milliseconds.
 These advantages and limitations are discussed in the following paragraphs.

 \paragraph{Performance}

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperf}}
 \caption{Performance Advantage of RCU Over Reader-Writer Locking}
 \label{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}
 \end{figure}

 The read-side performance advantages of Linux-kernel RCU over
 reader-writer locking are shown in
 \cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking},
 which was generated on a 448-CPU 2.10\,GHz Intel x86 system.

 \QuickQuizSeries{%
 \QuickQuizB{
 	WTF\@?
 	How the heck do you expect me to believe that RCU can have less
 	than a 300-picosecond overhead when the clock period at 2.10\,GHz
 	is almost 500\,picoseconds?
 }\QuickQuizAnswerB{
 	First, consider that the inner loop used to
 	take this measurement is as follows:

 \begin{VerbatimN}
 	for (i = nloops; i >= 0; i--) {
 		rcu_read_lock();
 		rcu_read_unlock();
 	}
 \end{VerbatimN}

 	Next, consider the effective definitions of \co{rcu_read_lock()}
 	and \co{rcu_read_unlock()}:

 \begin{VerbatimN}
 #define rcu_read_lock()   barrier()
 #define rcu_read_unlock() barrier()
 \end{VerbatimN}

 	These definitions constrain compiler code-movement optimizations
 	involving memory references, but emit no instructions in and
 	of themselves.
 	However, if the loop variable is maintained in a register,
 	the accesses to \co{i} will not count as memory references.
 	Furthermore, the compiler can do loop unrolling,
 	allowing the resulting code to ``execute'' multiple passes
 	through the loop body simply by incrementing \co{i} by
 	some value larger than the value 1.

 	So the ``measurement'' of 267 picoseconds is simply the fixed
 	overhead of the timing measurements divided by the number of
 	passes through the inner loop containing the calls
 	to \co{rcu_read_lock()} and \co{rcu_read_unlock()}, plus
 	the code to manipulate \co{i} divided by the loop-unrolling
 	factor.
 	And therefore, this measurement really is in error, in fact,
 	it exaggerates the overhead by an arbitrary number of orders
 	of magnitude.
 	After all, in terms of machine instructions emitted, the actual
 	overheads of \co{rcu_read_lock()} and of \co{rcu_read_unlock()}
 	are each precisely zero.

 	It is not just every day that a timing measurement of 267
 	picoseconds turns out to be an overestimate!
 }\QuickQuizEndB

 \QuickQuizM{
 	Didn't an earlier edition of this book show RCU read-side
 	overhead way down in the sub-picosecond range?
 	What happened???
 }\QuickQuizAnswerM{
 	Excellent memory!!!
 	The overhead in some early releases was in fact roughly
 	100~femtoseconds.

 	What happened was that RCU usage spread more broadly through the
 	Linux kernel, including into code that takes page faults.
 	Back at that time, \co{rcu_read_lock()} and \co{rcu_read_unlock()}
 	were complete no-ops in \co{CONFIG_PREEMPT=n} kernels.
 	Unfortunately, that situation allowed the compiler to reorder
 	page-faulting memory accesses into RCU read-side critical
 	sections.
 	Of course, page faults can block, which destroys those critical
 	sections.

 	Nor was this a theoretical problem:
 	A failure actually manifested in 2019.
 	\ppl{Herbert}{Xu} tracked down this failure down and
 	\ppl{Linus}{Torvalds}
 	therefore queued a commit to upgrade \co{rcu_read_lock()} and
 	\co{rcu_read_unlock()} to unconditionally include a call to
 	\co{barrier()}~\cite{LinusTorvalds2019:RCUreader.barrier}.
 	And although \co{barrier()} emits no code, it does constrain
 	compiler optimizations.
 	And so the price of widespread RCU usage is slightly higher
 	\co{rcu_read_lock()} and \co{rcu_read_unlock()} overhead.
 	As such, Linux-kernel RCU has proven to be a victim of its
 	own success.

 	Of course, it is also the case that the older results were obtained
 	on a different system than were those shown in
 	\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}.
 	So which change had the most effect, Linus's commit or the change in
 	the system?
 	This question is left as an exercise to the reader.
 }\QuickQuizEndM

 \QuickQuizE{
 	Why is there such large variation for the \co{RCU} trace in
 	\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}?
 }\QuickQuizAnswerE{
 	Keep in mind that this is a log-log plot, so those large-seeming
 	\co{RCU} variances in reality span only a few hundred picoseconds.
 	And that is such a short time that anything could cause it.
 	However, given that the variance decreases with both small and
 	large numbers of CPUs, one hypothesis is that the variation is
 	due to migrations from one CPU to another.

 	Yes, these measurements were taken with interrupts disabled, but
 	they were also taken within a guest OS, so that preemption was
 	still possible at the hypervisor level.
 	In addition, the system featured hyperthreading and a single
 	hardware thread running this RCU workload is able to consume
 	more than half of the core's resources.
 	Therefore, the overall throughput varies depending on how many
 	of a given guest OS's CPUs share cores.
 	Attempting to reduce these variations by running the guest OSes
 	at real-time priority (as suggested by Joel Fernandes) is left
 	as an exercise for the reader.
 }\QuickQuizEndE
 }                 % End of \QuickQuizSeries

 Note that reader-writer locking is more than an order of magnitude slower
 than RCU on a single CPU, and is more than \emph{four} orders of magnitude
 slower on 192~CPUs.
 In contrast, RCU scales quite well.
 In both cases, the error bars cover the full range of the measurements
 from 30~runs, with the line being the median.

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperfPREEMPT}}
 \caption{Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
 \label{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
 \end{figure}

 A more moderate view may be obtained from a \co{CONFIG_PREEMPT} kernel,
 though RCU still beats reader-writer locking by between a factor of seven
 on a single CPU and by three orders of magnitude on 192~CPUs, as shown in
 \cref{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking},
 which was generated on the same 448-CPU 2.10\,GHz x86 system.
 Note the high variability of reader-writer locking at larger numbers of CPUs.
 The error bars span the full range of data.

 \QuickQuiz{
 	Given that the system had no fewer than 448~hardware threads,
 	why only 192~CPUs?
 }\QuickQuizAnswer{
 	Because the script (\path{rcuscale.sh}) that generates this data
 	spawns a guest operating system for each set of points gathered,
 	and on this particular system, both \co{qemu} and KVM limit the
 	number of CPUs that may be configured into a given guest OS\@.
 	Yes, it would have been possible to run a few more CPUs, but
 	192 is a nice round number from a binary perspective, given
 	that 256 is infeasible.
 }\QuickQuizEnd

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperfwt}}
 \caption{Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration, 192 CPUs}
 \label{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}
 \end{figure}

 Of course, the low performance of reader-writer locking in
 \cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking,%
 fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
 is exaggerated by the unrealistic zero-length critical sections.
 The performance advantages of RCU decrease as the overhead of the critical
 sections increase, as shown in
 \cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration},
 which was run on the same system as the previous plots.
 Here, the y-axis represents the sum of the overhead of the read-side
 primitives and that of the critical section and the x-axis represents
 the critical-section overhead in nanoseconds.
 But please note the logscale y~axis, which means that the small
 separations between the traces still represent significant differences.
 This figure shows non-preemptible RCU, but given that preemptible RCU's
 read-side overhead is only about three nanoseconds, its plot would be
 nearly identical to
 \cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}.

 \QuickQuiz{
 	Why the larger error ranges for the submicrosecond durations in
 	\cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}?
 }\QuickQuizAnswer{
 	Because smaller disturbances result in greater relative errors
 	for smaller measurements.
 	Also, the Linux kernel's \co{ndelay()} nanosecond-scale primitive
 	is (as of 2020) less accurate than is the \co{udelay()} primitive
 	used for the data for durations of a microsecond or more.
 	It is instructive to compare to the zero-length case shown in
 	\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}.
 }\QuickQuizEnd

 There are three traces for reader-writer locking, with the upper trace
 being for 100~CPUs, the next for 10~CPUs, and the lowest for 1~CPU\@.
 The greater the number of CPUs and the shorter the critical sections,
 the greater is RCU's performance advantage.
 These performance advantages are underscored by the fact that 100-CPU
 systems are no longer uncommon and that a number of system calls (and
 thus any RCU read-side critical sections that they contain) complete
 within microseconds.

 In addition, as is discussed in the next paragraph,
 RCU read-side primitives are almost entirely deadlock-immune.


 \paragraph{Deadlock Immunity}

 Although RCU offers significant performance advantages for
 read-mostly workloads, one of the primary reasons for creating
 RCU in the first place was in fact its immunity to read-side
 deadlocks.
 This immunity stems from the fact that
 RCU read-side primitives do not block, spin, or even
 do backwards branches, so that their execution time is deterministic.
 It is therefore impossible for them to participate in a deadlock
 cycle.

 \QuickQuiz{
 	Is there an exception to this deadlock immunity, and if so,
 	what sequence of events could lead to deadlock?
 }\QuickQuizAnswer{
 	One way to cause a deadlock cycle involving
 	RCU read-side primitives is via the following (illegal) sequence
 	of statements:

 \begin{VerbatimU}
 rcu_read_lock();
 synchronize_rcu();
 rcu_read_unlock();
 \end{VerbatimU}

 	The \co{synchronize_rcu()} cannot return until all
 	pre-existing RCU read-side critical sections complete, but
 	is enclosed in an RCU read-side critical section that cannot
 	complete until the \co{synchronize_rcu()} returns.
 	The result is a classic self-deadlock---you get the same
 	effect when attempting to write-acquire a reader-writer lock
 	while read-holding it.

 	Note that this self-deadlock scenario does not apply to
 	RCU QSBR, because the context switch performed by the
 	\co{synchronize_rcu()} would act as a quiescent state
 	for this CPU, allowing a grace period to complete.
 	However, this is if anything even worse, because data used
 	by the RCU read-side critical section might be freed as a
 	result of the grace period completing.
 	Plus Linux kernel's lockdep facility will yell at you.

 	In short, do not invoke synchronous RCU update-side primitives, which
 	are listed in
 	\cref{tab:defer:RCU Wait-to-Finish APIs},
 	from within an RCU read-side critical section.

 	In addition, within the Linux kernel, RCU uses the scheduler
 	and the scheduler uses RCU\@.
 	In some cases, both RCU and the scheduler must take care to
 	avoid deadlock.
 }\QuickQuizEnd

 An interesting consequence of RCU's read-side deadlock immunity is
 that it is possible to unconditionally upgrade an RCU
 reader to an RCU updater.
 Attempting to do such an upgrade with reader-writer locking results
 in deadlock.
 A sample code fragment that does an RCU read-to-update upgrade follows:

 \begin{VerbatimN}[samepage=true]
 rcu_read_lock();
 list_for_each_entry_rcu(p, &head, list_field) {
 	do_something_with(p);
 	if (need_update(p)) {
 		spin_lock(my_lock);
 		do_update(p);
 		spin_unlock(&my_lock);
 	}
 }
 rcu_read_unlock();
 \end{VerbatimN}

 Note that \co{do_update()} is executed under
 the protection of the lock \emph{and} under RCU read-side protection.

 Another interesting consequence of RCU's deadlock immunity is its
 immunity to a large class of priority inversion problems.
 For example, low-priority RCU readers cannot prevent a high-priority
 RCU updater from acquiring the update-side lock.
 Similarly, a low-priority RCU updater cannot prevent high-priority
 RCU readers from entering an RCU read-side critical section.

 \QuickQuiz{
 	Immunity to both deadlock and priority inversion???
 	Sounds too good to be true.
 	Why should I believe that this is even possible?
 }\QuickQuizAnswer{
 	It really does work.
 	After all, if it didn't work, the Linux kernel would not run.
 }\QuickQuizEnd

 \paragraph{Realtime Latency}

 Because RCU read-side primitives neither spin nor block, they offer
 excellent realtime latencies.
 In addition, as noted earlier, this means that they are
 immune to priority inversion
 involving the RCU read-side primitives and locks.

 However, RCU is susceptible to more subtle priority-inversion scenarios,
 for example, a high-priority process blocked waiting for an RCU
 grace period to elapse can be blocked by low-priority RCU readers
 in \rt\ kernels.
 This can be solved by using RCU priority
 boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}.

 However, use of RCU priority boosting requires that \co{rcu_read_unlock()}
 do deboosting, which entails acquiring scheduler locks.
 Some care is therefore required within the scheduler and RCU to avoid
 deadlocks, which as of the v5.15 Linux kernel requires RCU to avoid
 invoking the scheduler while holding any of RCU's locks.

 This in turn means that \co{rcu_read_unlock()} is not always lockless
 when RCU priority boosting is enabled.
 However, \co{rcu_read_unlock()} will still be lockless if its
 critical section was not priority-boosted.
 Furthermore, critical sections will not be priority boosted unless they
 are preempted, or, in -rt kernels, they acquire non-raw spinlocks.
 This means that \co{rcu_read_unlock()} will normally be lockless from the
 perspective of the highest priority task running on any given CPU.

 \paragraph{RCU Readers and Updaters Run Concurrently}

 Because RCU readers never spin nor block, and because updaters are not
 subject to any sort of rollback or abort semantics, RCU readers and
 updaters really can run concurrently.
 This means that RCU readers might access stale data, and might even
 see inconsistencies, either of which can render conversion from reader-writer
 locking to RCU non-trivial.

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/rwlockRCUupdate}}
 \caption{Response Time of RCU vs.\@ Reader-Writer Locking}
 \label{fig:defer:Response Time of RCU vs. Reader-Writer Locking}
 \end{figure}

 However, in a surprisingly large number of situations, inconsistencies and
 stale data are not problems.
 The classic example is the networking routing table.
 Because routing updates can take considerable time to reach a given
 system (seconds or even minutes), the system will have been sending
 packets the wrong way for quite some time when the update arrives.
 It is usually not a problem to continue sending updates the wrong
 way for a few additional milliseconds.
 Furthermore, because RCU updaters can make changes without waiting for
 RCU readers to finish,
 the RCU readers might well see the change more quickly than would
 batch-fair
 reader-writer-locking readers, as shown in
 \cref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}.
 This faster RCU response time has since been corroborated (if perhaps
 a bit unrealistically) by Markov-model
 analysis~\cite[Figures 3 and 5]{VishakhaRamani2023LockBasedLocklessFresh}.

 \QuickQuiz{
 	But how many other algorithms really tolerate stale and
 	inconsistent data?
 }\QuickQuizAnswer{
 	Quite a few!

 	Please keep in mind that the finite speed of light means that
 	data reaching a given computer system is at least slightly stale
 	at the time that it arrives, and extremely stale in the case
 	of astronomical data.
 	The finite speed of light also places a sharp limit on the
 	consistency of data arriving from different sources of via
 	different paths.

 	You might as well face the fact that the laws of physics
 	are incompatible with naive notions of perfect freshness and
 	consistency.
 }\QuickQuizEnd

 Once the update is received, the rwlock writer cannot proceed until the
 last reader completes, and subsequent readers cannot proceed until the
 writer completes.
 However, these subsequent readers are guaranteed to see the new value,
 as indicated by the green shading of the rightmost boxes.
 In contrast, RCU readers and updaters do not block each other, which permits
 the RCU readers to see the updated values sooner.
 Of course, because their execution overlaps that of the RCU updater,
 \emph{all} of the RCU readers might well see updated values, including
 the three readers that started before the update.
 Nevertheless only the green-shaded rightmost RCU readers
 are \emph{guaranteed} to see the updated values.

 Reader-writer locking and RCU simply provide different guarantees.
 With reader-writer locking, any reader that begins after the writer begins
 is guaranteed to see new values, and any reader that attempts to
 begin while the writer is spinning might or might not see new values,
 depending on the reader/writer preference of the rwlock implementation in
 question.
 In contrast, with RCU, any reader that begins after the updater completes
 is guaranteed to see new values, and any reader that completes after the
 updater begins might or might not see new values, depending on timing.

 The key point here is that, although reader-writer locking does
 indeed guarantee consistency within the confines of the computer system,
 there are situations where this consistency comes at the price of
 increased \emph{inconsistency} with the outside world, courtesy of
 the finite speed of light and the non-zero size of atoms.
 In other words, reader-writer locking obtains internal consistency at the
 price of silently stale data with respect to the outside world.

 Note that if a value is computed while read-holding a reader-writer
 lock, and then that value is used after that lock is released, then
 this reader-writer-locking use case is using stale data.
 After all, the quantities that this value is based on could change
 at any time after that lock is released.
 This sort of reader-writer-locking use case is often easy to
 convert to RCU, as will be shown in
 \cref{lst:defer:Converting Reader-Writer Locking to RCU: Data,%
 lst:defer:Converting Reader-Writer Locking to RCU: Search,%
 lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
 and the accompanying text.

 \paragraph{Low-Priority RCU Readers Can Block High-Priority Reclaimers}

 In Realtime RCU~\cite{DinakarGuniguntala2008IBMSysJ} or
 SRCU~\cite{PaulEMcKenney2006c},
 a preempted reader will prevent a grace period from completing, even if
 a high-priority task is blocked waiting for that grace period to complete.
 Realtime RCU can avoid this problem by substituting \co{call_rcu()}
 for \co{synchronize_rcu()} or by using RCU priority
 boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}.
 It might someday be necessary to augment SRCU and RCU Tasks Trace with
 priority boosting, but not before a clear real-world need is demonstrated.

 \QuickQuiz{
 	If Tasks RCU Trace might someday be priority boosted, why
 	not also Tasks RCU and Tasks RCU Rude?
 }\QuickQuizAnswer{
 	Maybe, but these are less likely.

 	In the case of Tasks RCU, recall that the quiescent state is
 	a voluntary context switch.
 	Thus, all tasks not blocked after a voluntary context switch
 	might need to be boosted, and the mechanics of deboosting would
 	not likely be at all pretty.

 	In the case of Tasks RCU Rude, as was the case with the old
 	RCU Sched, any preemptible region of code is a quiescent state.
 	Thus, the only tasks that might need boosting are those currently
 	running with preemption disabled.
 	But boosting the priority of a preemption-disabled task has no
 	effect.
 	It therefore seems doubly unlikely that priority boosting will
 	ever be introduced to Tasks RCU Rude, at least in its current
 	form.
 }\QuickQuizEnd

 \paragraph{RCU Grace Periods Extend for Many Milliseconds}

 With the exception of userspace
 RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU},
 expedited grace periods, and several of the ``toy''
 RCU implementations described in
 \cref{chp:app:``Toy'' RCU Implementations},
 RCU grace periods extend milliseconds.
 Although there are a number of techniques to render such long
 delays harmless, including use of the asynchronous interfaces
 (\co{call_rcu()} and \co{call_rcu_bh()}) or of the polling interfaces
 (\co{get_state_synchronize_rcu()}, \co{start_poll_synchronize_rcu()},
 and \co{poll_state_synchronize_rcu()}), this situation is a major reason
 for the rule of thumb that RCU be used in read-mostly situations.

 As noted in \cref{sec:defer:RCU Linux-Kernel API}, within the Linux
 kernel, shorter grace periods may be obtained via expedited grace
 periods, for example, by invoking \co{synchronize_rcu_expedited()}
 instead of \co{synchronize_rcu()}.
 Expedited grace periods can reduce delays to as little as a few tens of
 microseconds, albeit at the expense of higher CPU utilization and IPIs.
 The added IPIs can be especially unwelcome in some real-time workloads.

 \paragraph{Code:
 		 Reader-Writer Locking vs.\@ RCU}

 In the best case, the conversion from reader-writer locking to RCU
 is quite simple, as shown in
 \cref{lst:defer:Converting Reader-Writer Locking to RCU: Data,%
 lst:defer:Converting Reader-Writer Locking to RCU: Search,%
 lst:defer:Converting Reader-Writer Locking to RCU: Deletion},
 all taken from
 Wikipedia~\cite{WikipediaRCU}.

 \begin{listing*}
 { \scriptsize
 \begin{verbbox}
  1 struct el {                           1 struct el {
  2   struct list_head lp;                2   struct list_head lp;
  3   long key;                           3   long key;
  4   spinlock_t mutex;                   4   spinlock_t mutex;
  5   int data;                           5   int data;
  6   /* Other data fields */             6   /* Other data fields */
  7 };                                    7 };
  8 DEFINE_RWLOCK(listmutex);             8 DEFINE_SPINLOCK(listmutex);
  9 LIST_HEAD(head);                      9 LIST_HEAD(head);
 \end{verbbox}
 }
 \hspace*{0.9in}\OneColumnHSpace{-0.5in}
 \IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
 \caption{Converting Reader-Writer Locking to RCU\@:
 						    Data}
 \label{lst:defer:Converting Reader-Writer Locking to RCU: Data}
 \end{listing*}

 \begin{listing*}
 { \scriptsize
 \begin{verbbox}
  1 int search(long key, int *result)     1 int search(long key, int *result)
  2 {                                     2 {
  3   struct el *p;                       3   struct el *p;
  4                                       4
  5   read_lock(&listmutex);              5   rcu_read_lock();
  6   list_for_each_entry(p, &head, lp) { 6   list_for_each_entry_rcu(p, &head, lp) {
  7     if (p->key == key) {              7     if (p->key == key) {
  8       *result = p->data;              8       *result = p->data;
  9       read_unlock(&listmutex);        9       rcu_read_unlock();
 10       return 1;                      10       return 1;
 11     }                                11     }
 12   }                                  12   }
 13   read_unlock(&listmutex);           13   rcu_read_unlock();
 14   return 0;                          14   return 0;
 15 }                                    15 }
 \end{verbbox}
 }
 \hspace*{0.9in}\OneColumnHSpace{-0.5in}
 \IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
 \caption{Converting Reader-Writer Locking to RCU\@:
 						    Search}
 \label{lst:defer:Converting Reader-Writer Locking to RCU: Search}
 \end{listing*}

 \begin{listing*}
 { \scriptsize
 \begin{verbbox}
  1 int delete(long key)                  1 int delete(long key)
  2 {                                     2 {
  3   struct el *p;                       3   struct el *p;
  4                                       4
  5   write_lock(&listmutex);             5   spin_lock(&listmutex);
  6   list_for_each_entry(p, &head, lp) { 6   list_for_each_entry(p, &head, lp) {
  7     if (p->key == key) {              7     if (p->key == key) {
  8       list_del(&p->lp);               8       list_del_rcu(&p->lp);
  9       write_unlock(&listmutex);       9       spin_unlock(&listmutex);
                                         10       synchronize_rcu();
 10       kfree(p);                      11       kfree(p);
 11       return 1;                      12       return 1;
 12     }                                13     }
 13   }                                  14   }
 14   write_unlock(&listmutex);          15   spin_unlock(&listmutex);
 15   return 0;                          16   return 0;
 16 }                                    17 }
 \end{verbbox}
 }
 \hspace*{0.9in}\OneColumnHSpace{-0.5in}
 \IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
 \caption{Converting Reader-Writer Locking to RCU\@:
 						    Deletion}
 \label{lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
 \end{listing*}

 However, the transformation is not always this straightforward.
 This is because neither the \co{spin_lock()} nor the
 \co{synchronize_rcu()} in
 \cref{lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
 exclude the readers in
 \cref{lst:defer:Converting Reader-Writer Locking to RCU: Search}.
 First, the \co{spin_lock()} does not interact in any way with
 \co{rcu_read_lock()} and \co{rcu_read_unlock()}, thus not excluding them.
 Second, although both \co{write_lock()} and \co{synchronize_rcu()}
 wait for pre-existing readers, only \co{write_lock()} prevents
 subsequent readers from commencing.\footnote{
 	Kudos to whoever pointed this out to Paul.}
 Thus, \co{synchronize_rcu()} cannot exclude readers.
 Nevertheless, a great many situations using reader-writer locking can
 be converted to RCU\@.

 More-elaborate cases of replacing reader-writer locking with RCU
 may be found
 elsewhere~\cite{NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk}.

 \paragraph{Semantics:
 		      Reader-Writer Locking vs.\@ RCU}

 Expanding on the previous section, reader-writer locking semantics can
 be roughly and informally summarized by the following three temporal
 constraints:

 \begin{enumerate}
 \item	Write-side acquisitions wait for any read-holders to release
 	the lock.
 \item	Writer-side acquisitions wait for any write-holder to release
 	the lock.
 \item	Read-side acquisitions wait for any write-holder to release
 	the lock.
 \end{enumerate}

 RCU dispenses entirely with constraint~\#3 and weakens the other two
 as follows:

 \begin{enumerate}
 \item	Writers wait for any pre-existing read-holders before progressing
 	to the destructive phase of their update (usually the freeing of
 	memory).
 \item	Writers synchronize with each other as needed.
 \end{enumerate}

 It is of course this weakening that permits RCU implementations to attain
 excellent performance and scalability.
 It also allows RCU to implement the aforementioned unconditional
 read-to-write upgrade that is so attractive and so deadlock-prone in
 reader-writer locking.
 Code using RCU can compensate for this weakening in a surprisingly large
 number of ways, but most commonly by imposing spatial constraints:

 \begin{enumerate}
 \item	New data is placed in newly allocated memory.
 \item	Old data is freed, but only after:
 	\begin{enumerate}
 	\item	That data has been unlinked so as to be inaccessible
 		to later readers, and
 	\item	A subsequent RCU grace period has elapsed.
 	\end{enumerate}
 \end{enumerate}

 Of course, there are some reader-writer-locking use cases for which
 RCU's weakened semantics are inappropriate, but experience in the Linux
 kernel indicates that more than 80\pct\ of reader-writer locks can in fact
 be replaced by RCU\@.
 For example, a common reader-writer-locking use case computes some value
 while holding the lock and then uses that value after releasing that lock.
 This use case results in stale data, and therefore often accommodates
 RCU's weaker semantics.

 \begin{listing}
 \input{CodeSamples/defer/singleton@get.fcv}
 \caption{RCU Singleton Get}
 \label{lst:defer:Singleton Get}
 \end{listing}

 \begin{listing}
 \input{CodeSamples/defer/singleton@set.fcv}
 \caption{RCU Singleton Set}
 \label{lst:defer:Singleton Set}
 \end{listing}

 \begin{fcvref}[ln:defer:singleton:get]
 This interaction of temporal and spatial constraints is illustrated
 by the RCU singleton data structure illustrated in
 \cref{fig:defer:Insertion With Concurrent Readers,fig:defer:Deletion With Concurrent Readers}.
 This structure is defined on \clnrefrange{myconfig.b}{myconfig.e} of
 \cref{lst:defer:Singleton Get}, and contains two integer fields,
 \co{->a} and \co{->b} (\path{singleton.c}).
 The current instance of this structure is referenced by the \co{curconfig}
 pointer defined on \clnref{myconfig.e}.
 \end{fcvref}

 \begin{fcvref}[ln:defer:singleton:get]
 The fields of the current structure are passed back through the
 \co{cur_a} and \co{cur_b} parameters to the \co{get_config()} function
 defined on \clnrefrange{get_config.b}{get_config.e}.
 These two fields can be slightly out of date, but they absolutely must
 be consistent with each other.
 The \co{get_config()} function provides this consistency
 within the RCU read-side critical section starting on
 \clnref{rrl} and ending on either \clnref{rrul1} or \clnref{rrul2},
 which provides the needed temporal synchronization.
 \Clnref{rderef} fetches the pointer to the current \co{myconfig} structure.
 This structure will be used regardless of any concurrent changes due
 to calls to the \co{set_config()} function, thus providing the needed
 spatial synchronization.
 If \clnref{nullchk} determines that the \co{curconfig} pointer was
 \co{NULL}, \clnref{retfail} returns failure.
 Otherwise, \clnref{copya,copyb} copy out the \co{->a} and \co{->b} fields
 and \clnref{retsuccess} returns success.
 These \co{->a} and \co{->b} fields are from the same \co{myconfig}
 structure, and the RCU read-side critical section prevents this structure
 from being freed, thus guaranteeing that these two fields are consistent
 with each other.
 \end{fcvref}

 \begin{fcvref}[ln:defer:singleton:set]
 The structure is updated by the \co{set_config()} function shown in
 \cref{lst:defer:Singleton Set}.
 \Clnrefrange{allocinit.b}{allocinit.e} allocate and initialize a
 new \co{myconfig} structure.
 \Clnref{xchg} atomically exchanges a pointer to this new structure with
 the pointer to the old structure in \co{curconfig}, while also providing
 full memory ordering both before and after the \co{xchg()} operation,
 thus providing the needed updater/reader spatial synchronization on
 the one hand and the needed updater/updater synchronization on the other.
 If \clnref{if} determines that the pointer to the old structure was
 in fact non-\co{NULL}, \clnref{sr} waits for a grace period (thus providing
 the needed reader/updater temporal synchronization) and \clnref{free}
 frees the old structure, safe in the knowledge that there are no
 longer any readers still referencing it.
 \end{fcvref}

 \begin{figure*}
 \centering
 \resizebox{\onecolumntextwidth}{!}{\includegraphics{defer/RCUspacetime}}
 \caption{RCU Spatial/Temporal Synchronization}
 \label{fig:defer:RCU Spatial/Temporal Synchronization}
 \end{figure*}

 \Cref{fig:defer:RCU Spatial/Temporal Synchronization} shows an abbreviated
 representation of \co{get_config()} on the left and right and a similarly
 abbreviated representation of \co{set_config()} in the middle.
 Time advances from top to bottom, and the address space of the objects
 referenced by \co{curconfig} advances from left to right.
 The boxes with comma-separated numbers each represent a \co{myconfig}
 structure, with the constraint that \co{->b} is the square of \co{->a}.
 Each blue dash-dotted arrow represents an interaction with the old structure
 (on the left, containing ``5,25'') and each green dashed arrow represents
 an interaction with the new structure (on the right, containing ``9,81'').

 The black dotted arrows represent temporal relationships between RCU readers
 on the left and right and the RCU grace period at center, with each
 arrow pointing from an older event to a newer event.
 The call to \co{synchronize_rcu()} followed the leftmost \co{rcu_read_lock()},
 and therefore that \co{synchronize_rcu()} invocation must not return
 until after the corresponding \co{rcu_read_unlock()}.
 In contrast, the call to \co{synchronize_rcu()} precedes the
 rightmost \co{rcu_read_lock()}, which allows the return from that same
 \co{synchronize_rcu()} to ignore the corresponding \co{rcu_read_unlock()}.
 These temporal relationships prevent the \co{myconfig} structures from
 being freed while RCU readers are still accessing them.

 The two horizontal grey dashed lines represent the period of time during
 which different readers get different results, however, each reader
 will see one and only one of the two objects.
 All readers that end before the first horizontal line will see the
 leftmost \co{myconfig} structure, and all readers that start after the
 second horizontal line will see the rightmost structure.
 Between the two lines, that is, during the grace period, different
 readers might see different objects, but as long as each reader
 loads the \co{curconfig} pointer only once, each reader will see
 a consistent view of its \co{myconfig} structure.

 \QuickQuiz{
 	But doesn't the RCU grace period start sometime after the
 	call to \co{synchronize_rcu()} rather than in the middle
 	of that \co{xchg()} statement?
 }\QuickQuizAnswer{
 	Which grace period, exactly?

 	The updater is required to wait for at least one grace
 	period that starts at or some time after the removal,
 	in this case, the \co{xchg()}.
 	So in
 	\cref{fig:defer:RCU Spatial/Temporal Synchronization},
 	the indicated grace period starts as early as theoretically
 	possible and extends to the return from \co{synchronize_rcu()}.
 	This is a perfectly legal grace period corresponding to the
 	change carried out by that \co{xchg()} statement.
 }\QuickQuizEnd

 In short, when operating on a suitable linked data structure, RCU
 combines temporal and spatial synchronization in order to approximate
 reader-writer locking, with RCU read-side critical sections acting as
 the reader-writer-locking reader, as shown in
 \cref{fig:defer:Relationships Between RCU Use Cases,fig:defer:RCU Spatial/Temporal Synchronization}.
 RCU's temporal synchronization is provided by the read-side markers,
 for example, \co{rcu_read_lock()} and \co{rcu_read_unlock()}, as well as
 the update-side grace-period primitives, for example, \co{synchronize_rcu()}
 or \co{call_rcu()}.
 The spatial synchronization is provided by the read-side
 \co{rcu_dereference()} family of primitives, each of which subscribes
 to a version published by \co{rcu_assign_pointer()}.\footnote{
 	Preferably with both \co{rcu_dereference()} and
 	\co{rcu_assign_pointer()} being embedded in higher-level APIs.}
 RCU's combining of temporal and spatial synchronization contrasts to
 the schemes presented in
 \cref{sec:SMPdesign:Code Locking,sec:SMPdesign:Data Locking,sec:locking:Inefficiency},
 in which temporal and spatial synchronization are provided separately by
 locking and by static data-structure layout, respectively.

 \QuickQuiz{
 	Is RCU the only synchronization mechanism that combines temporal
 	and spatial synchronization in this way?
 }\QuickQuizAnswer{
 	Not at all.

 	Hazard pointers can be considered to combine temporal and spatial
 	synchronization in a similar manner.
 	Referring to
 	\cref{lst:defer:Hazard-Pointer Recording and Clearing},
 	the \co{hp_record()} function's acquisition of a reference
 	provides both spatial and temporal synchronization, subscribing
 	to a version and marking the start of a reference, respectively.
 	This function therefore combines the effects of RCU's
 	\co{rcu_read_lock()} and \co{rcu_dereference()}.
 	Referring now to
 	\cref{lst:defer:Hazard-Pointer Scanning and Freeing},
 	the \co{hp_clear()} function's release of a reference provides
 	temporal synchronization marking the end of a reference, and is
 	thus similar to RCU's \co{rcu_read_unlock()}.
 	The \co{hazptr_free_later()} function's retiring of a
 	hazard-pointer-protected object provides temporal synchronization,
 	similar to RCU's \co{call_rcu()}.
 	The primitives used to mutate a hazard-pointer-protected
 	structure provide spatial synchronization, similar to RCU's
 	\co{rcu_assign_pointer()}.

 	Alternatively, one could instead come at hazard pointers by
 	analogy with reference counting.
 	And, by doing so, learn that reference counting also combines
 	temporal and spatial synchronization as described above for
 	hazard pointers.
 }\QuickQuizEnd

 \subsubsection{Quasi Reference Count}
 \label{sec:defer:Quasi Reference Count}

 Because grace periods are not allowed to complete while
 there is an RCU read-side critical section in progress,
 the RCU read-side primitives may be used as a restricted
 reference-counting mechanism.
 For example, consider the following code fragment:

 \begin{VerbatimN}
 rcu_read_lock();  /* acquire reference. */
 p = rcu_dereference(head);
 /* do something with p. */
 rcu_read_unlock();  /* release reference. */
 \end{VerbatimN}

 The combination of the \co{rcu_read_lock()} and \co{rcu_dereference()}
 primitives can be thought of as acquiring a reference to \co{p},
 because a grace period starting after the \co{rcu_dereference()}
 assignment to \co{p} cannot possibly end until after we reach the matching
 \co{rcu_read_unlock()}.
 This reference-counting scheme is restricted in that it is forbidden
 to wait for RCU grace periods within RCU read-side critical sections,
 and also forbidden to hand off an RCU read-side critical section's
 references from one task to another.

 Regardless of these restrictions,
 the following code can safely delete \co{p}:

 \begin{VerbatimN}
 spin_lock(&mylock);
 p = head;
 rcu_assign_pointer(head, NULL);
 spin_unlock(&mylock);
 /* Wait for all references to be released. */
 synchronize_rcu();
 kfree(p);
 \end{VerbatimN}

 The assignment to \co{head} prevents any future references
 to \co{p} from being acquired, and the \co{synchronize_rcu()}
 waits for any previously acquired references to be released.

 \QuickQuiz{
 	But wait!
 	This is exactly the same code that might be used when thinking
 	of RCU as a replacement for reader-writer locking!
 	What gives?
 }\QuickQuizAnswer{
 	This is an effect of the Law of Toy Examples:
 	Beyond a certain point, the code fragments look the same.
 	The only difference is in how we think about the code.
 	For example, what does an \co{atomic_inc()} operation do?
 	It might be acquiring another explicit reference to an object
 	to which we already have a reference, it might be incrementing
 	an often-read/seldom-updated statistical counter, it might
 	be checking into an HPC-style barrier, or any of a number of
 	other things.

 	However, these differences can be extremely important.
 	For but one example of the importance, consider that if we think
 	of RCU as a restricted reference counting scheme, we would never
 	be fooled into thinking that the updates would exclude the RCU
 	read-side critical sections.

 	It nevertheless is often useful to think of RCU as a replacement
 	for reader-writer locking, for example, when you are replacing
 	reader-writer locking with RCU\@.
 }\QuickQuizEnd

 Of course, RCU can also be combined with traditional reference counting,
 as discussed in
 \cref{sec:together:Refurbish Reference Counting}.

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refcntRCUperf}}
 \caption{Performance of RCU vs.\@ Reference Counting}
 \label{fig:defer:Performance of RCU vs. Reference Counting}
 \end{figure}

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refRCUperfPREEMPT}}
 \caption{Performance of Preemptible RCU vs.\@ Reference Counting}
 \label{fig:defer:Performance of Preemptible RCU vs. Reference Counting}
 \end{figure}

 But why bother?
 Again, part of the answer is performance, as shown in
 \cref{fig:defer:Performance of RCU vs. Reference Counting,%
 fig:defer:Performance of Preemptible RCU vs. Reference Counting},
 again showing data taken on a 448-CPU 2.1\,GHz Intel x86 system
 for non-preemptible and preemptible Linux-kernel RCU, respectively.
 Non-preemptible RCU's advantage over
 \IXalt{reference counting}{reference count} ranges from
 more than an order of magnitude at one CPU up to about four orders of
 magnitude at 192~CPUs.
 Preemptible RCU's advantage ranges from about a factor of three at
 one CPU up to about three orders of magnitude at 192~CPUs.

 \begin{figure}
 \centering
 \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refRCUperfwt}}
 \caption{Response Time of RCU vs.\@ Reference Counting, 192 CPUs}
 \label{fig:defer:Response Time of RCU vs. Reference Counting}
 \end{figure}

 However, as with reader-writer locking, the performance advantages of
 RCU are most pronounced for short-duration critical sections and for
 large numbers of CPUs, as shown in
 \cref{fig:defer:Response Time of RCU vs. Reference Counting}
 for the same system.
 In addition, as with reader-writer locking, many system calls (and thus
 any RCU read-side critical sections that they contain) complete in
 a few microseconds.

 Although traditional reference counters are usually associated with a
 specific data structure, or perhaps a specific group of data structures,
 this approach does have some disadvantages.
 For example, maintaining a single global reference counter for a large
 variety of data structures typically results in bouncing the cache line
 containing the reference count.
 As we saw in
 \crefrange{fig:defer:Performance of RCU vs. Reference Counting}{fig:defer:Response Time of RCU vs. Reference Counting},
 such cache-line bouncing can severely degrade performance.

 In contrast, RCU's lightweight \co{rcu_read_lock()},
 \co{rcu_dereference()}, and \co{rcu_read_unlock()} read-side primitives
 permit extremely frequent read-side usage with negligible performance
 degradation.
 Except that the calls to \co{rcu_dereference()} are not doing anything
 specific to acquire a reference to the pointed-to object.
 The heavy lifting is instead done by the \co{rcu_read_lock()} and
 \co{rcu_read_unlock()} primitives and their interactions with RCU
 grace periods.

 And ignoring those calls to \co{rcu_dereference()} permits RCU to be
 thought of as a ``bulk reference-counting'' mechanism, where each call
 to \co{rcu_read_lock()} obtains a reference on each and every RCU-protected
 object, and with little or no overhead.
 However, the restrictions that go with RCU can be quite onerous.
 For example, in many cases, the Linux-kernel prohibition against
 sleeping while in an RCU read-side critical section would defeat the
 entire purpose.
 Such cases might be better served by the hazard pointers mechanism
 described in \cref{sec:defer:Hazard Pointers}.
 Cases where code rarely sleeps have been handled by using RCU as a
 reference count in the common non-sleeping case and by bridging
 to an explicit reference counter when sleeping is necessary.

 Alternatively, situations where a reference must be held by a single task
 across a section of code that sleeps may be accommodated with Sleepable
 RCU (SRCU)~\cite{PaulEMcKenney2006c}.
 This fails to cover the not-uncommon situation where a reference is ``passed''
 from one task to another, for example, when a reference is acquired
 when starting an I/O and released in the corresponding completion
 interrupt handler.
 Again, such cases might be better handled by explicit reference counters
 or by hazard pointers.

 Of course, SRCU brings restrictions of its own, namely that the
 return value from \co{srcu_read_lock()} be passed into the
 corresponding \co{srcu_read_unlock()}, and that no SRCU primitives
 be invoked from hardware interrupt handlers or from \IXacrf{nmi}
 handlers.
 The jury is still out as to how much of a problem is presented by
 this restriction, and as to how it can best be handled.

 However, in the common case where references are held within the confines
 of a single CPU or task, RCU can be used as high-performance and highly
 scalable reference-counting mechanism.

 As shown in \cref{fig:defer:Relationships Between RCU Use Cases},
 quasi reference counts add RCU readers as individual or bulk
 reference counts, possibly also bridging to reference counters
 in corner cases.

 \subsubsection{Quasi Multi-Version Concurrency Control}
 \label{sec:defer:Quasi Multi-Version Concurrency Control}

 RCU can also be thought of as a simplified multi-version concurrency
 control (MVCC) mechanism with weak consistency criteria.
 The multi-version aspects were touched upon in
 \cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}.
 However, in its native form, RCU provides version consistency only
 within a given RCU-protected data element.

 Nevertheless, there are situations where consistency and fresh data are
 required across multiple data elements.
 Fortunately, there are a number of approaches that avoid inconsistency
 and stale data, including the following:

 \begin{enumerate}
 \item	Enclose RCU readers within sequence-locking readers, forcing
 	the RCU readers to be retried should an update occur,
 	as described in
 	\cref{sec:together:Correlated Data Elements}
 	and
 	\cref{sec:together:Atomic Move}.
 \item	Place the data that must be consistent into a single element
 	of a linked data structure, and refrain from updating those
 	fields within any element visible to RCU readers.
 	RCU readers gaining a reference to any such element are then
 	guaranteed to see consistent values.
 	See \cref{sec:together:Correlated Fields} for additional details.
 \item	Use a per-element lock that guards a ``deleted'' flag to allow
 	RCU readers to reject stale
 	data~\cite{PaulEdwardMcKenneyPhD,Arcangeli03}.
 \item	Provide an existence flag that is referenced by all data elements
 	whose update is to appear atomic to RCU
 	readers~\cite{PaulEMcKennneyAtomicTreeN4037,PaulEMcKennneyAtomicTreeCPPCON2014,PaulEMcKenneyIssaquahUpdate2015,PaulEMcKenney2016IssaquahACMApp,PaulEMcKenney2016IssaquahCPPCON}.
 \item	Use one of a wide range of counter-based
 	methods~\cite{PaulEMcKenney2008cyclicRCU,PaulEMcKenney2010cyclicRCU,PaulEMcKenney2011cyclicparallelRCU,PaulEMcKenney2014cyclicRCU,Matveev:2015:RLS:2815400.2815406,Kim:2019:MSR:3297858.3304040}.
 	In these approaches, updaters maintain a version number and maintain links
 	to old versions of a given piece of data.
 	Readers take a snapshot of the current version number, and, if necessary,
 	traverse the links to find a version consistent with that snapshot.
 \end{enumerate}

 In short, when using RCU to approximate multi-version concurrency control,
 you only pay for the level of consistency that you actually need.

 As shown in \cref{fig:defer:Relationships Between RCU Use Cases},
 quasi multi-version concurrency control is based on existence guarantees,
 adding read-side snapshot operations and constraints on readers and
 writers, the exact form of the constraint being dictated by the
 consistency requirements, as summarized above.

 \subsubsection{RCU Usage Summary}
 \label{sec:defer:RCU Usage Summary}

 At its core, RCU is nothing more nor less than an API that provides:

 \begin{enumerate}
 \item	A publish-subscribe mechanism for adding new data,
 \item	A way of waiting for pre-existing RCU readers to finish, and
 \item	A discipline of maintaining multiple versions to permit change
 	without harming or unduly delaying concurrent RCU readers.
 \end{enumerate}

 That said, it is possible to build higher-level constructs on top of RCU,
 including the various use cases described in the earlier sections.
 Furthermore, I have no doubt that new use cases will continue to be
 found for RCU, as well as for any of a number of other synchronization
 primitives.
 And so it is that RCU's use cases are conceptually more complex than
 is RCU itself, as hinted on
 \cpageref{sec:defer:Mysteries RCU Use Cases}.

 \QuickQuiz{
 	Which of these use cases best describes the Pre-BSD routing
 	example in
 	\cref{sec:defer:RCU for Pre-BSD Routing}?
 }\QuickQuizAnswer{
 	Pre-BSD routing could be argued to fit into either
 	quasi reader-writer lock, quasi reference count, or
 	quasi multi-version concurrency control.
 	The code is the same either way.
 	This is similar to things like \co{atomic_inc()}, another tool
 	that can be put to a great many uses.
 }\QuickQuizEnd

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/RCUApplicability}}
 \caption{RCU Areas of Applicability}
 \label{fig:defer:RCU Areas of Applicability}
 \end{figure}

 In the meantime,
 \cref{fig:defer:RCU Areas of Applicability}
 shows some rough rules of thumb on where RCU is most helpful.

 As shown in the blue box in the upper-right corner of the figure, RCU
 works best if you have read-mostly data where stale and inconsistent
 data is permissible (but see below for more information on stale and
 inconsistent data).
 The canonical example of this case in the Linux kernel is routing tables.
 Because it may have taken many seconds or even minutes for the
 routing updates to propagate across the Internet, the system
 has been sending packets the wrong way for quite some time.
 Having some small probability of continuing to send some of them the wrong
 way for a few more milliseconds is almost never a problem.

 If you have a read-mostly workload where consistent data is required,
 RCU works well, as shown by the green ``read-mostly, need consistent data''
 box.
 One example of this case is the Linux kernel's mapping from user-level
 System-V semaphore IDs to the corresponding in-kernel data structures.
 Semaphores tend to be used far more frequently than they are created
 and destroyed, so this mapping is read-mostly.
 However, it would be erroneous to perform a semaphore operation on
 a semaphore that has already been deleted.
 This need for consistency is handled by using the lock in the
 in-kernel semaphore data structure, along with a ``deleted''
 flag that is set when deleting a semaphore.
 If a user ID maps to an in-kernel data structure with the
 ``deleted'' flag set, the data structure is ignored, so that
 the user ID is flagged as invalid.

 Although this requires that the readers acquire a lock for the
 data structure representing the semaphore itself,
 it allows them to dispense with locking for the
 mapping data structure.
 The readers therefore locklessly
 traverse the tree used to map from ID to data structure,
 which in turn greatly improves performance, scalability, and
 real-time response.

 As indicated by the yellow ``read-write'' box, RCU can also be useful
 for read-write
 workloads where consistent data is required, although usually in
 conjunction with a number of other synchronization primitives.
 For example, the directory-entry cache in recent Linux kernels uses RCU in
 conjunction with sequence locks, per-CPU locks, and per-data-structure
 locks to allow lockless traversal of pathnames in the common case.
 Although RCU can be very beneficial in this read-write case, the
 corresponding code is often more complex than that of the read-mostly
 cases.

 Finally, as indicated by the red box in the lower-left corner
 of the figure, update-mostly workloads requiring consistent
 data are rarely good places to use RCU, though there are some
 exceptions~\cite{MathieuDesnoyers2012URCU}.
 For example, as noted in
 \cref{sec:defer:Type-Safe Memory},
 within the Linux kernel, the \co{SLAB_TYPESAFE_BY_RCU}
 slab-allocator flag provides type-safe memory to RCU readers, which can
 greatly simplify \IXacrl{nbs} and other lockless
 algorithms.
 In addition, if the rare readers are on critical code paths on real-time
 systems, use of RCU for those readers might provide real-time response
 benefits that more than make up for the increased update-side overhead,
 as discussed in \cref{sec:advsync:The Role of RCU}.

 In short, RCU is an API that includes a publish-subscribe mechanism for
 adding new data, a way of waiting for pre-existing RCU readers to finish,
 and a discipline of maintaining multiple versions to allow updates to
 avoid harming or unduly delaying concurrent RCU readers.
 This RCU API is best suited for read-mostly situations, especially if
 stale and inconsistent data can be tolerated by the application.