defer/rcuapi.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % defer/rcuAPI.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \subsection{RCU Linux-Kernel API}
 \label{sec:defer:RCU Linux-Kernel API}
 \OriginallyPublished{sec:defer:RCU Linux-Kernel API}{RCU Linux-Kernel API}{Linux Weekly News}{PaulEMcKenney2008WhatIsRCUAPI}

 This section looks at RCU from the viewpoint of its Linux-kernel API\@.\footnote{
 	Userspace RCU's API is documented
 	elsewhere~\cite{PaulMcKenney2013LWNURCU}.}
 \Cref{sec:defer:RCU has a Family of Wait-to-Finish APIs}
 presents RCU's wait-to-finish APIs,
 \cref{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs}
 presents RCU's publish-subscribe and version-maintenance APIs,
 \cref{sec:defer:RCU has List-Processing APIs}
 presents RCU's list-processing APIs,
 \cref{sec:defer:RCU Has Diagnostic APIs}
 presents RCU's diagnostic APIs, and
 \cref{sec:defer:Where Can RCU's APIs Be Used?}
 describes in which contexts RCU's various APIs may be used.
 Finally,
 \cref{sec:defer:So; What is RCU Really?}
 presents concluding remarks.

 Readers who are not excited about kernel internals may wish to skip
 ahead to \cref{sec:defer:RCU Usage}
 on \cpageref{sec:defer:RCU Usage},
 but preferably after reviewing the next section covering software-engineering
 considerations.

 \subsubsection{RCU API and Software Engineering}
 \label{sec:defer:RCU API and Software Engineering}

 Readers who have looked ahead to
 \cref{tab:defer:RCU Wait-to-Finish APIs,%
 tab:defer:RCU Publish-Subscribe and Version Maintenance APIs,%
 tab:defer:RCU-Protected List APIs,%
 tab:defer:RCU Diagnostic APIs}
 might have noted that the full list of Linux-kernel APIs sports more
 than 100 members.
 This is in sharp (and perhaps dismaying) contrast to the mere six
 API members shown in
 \cref{tab:defer:Core RCU API}.
 This situation clearly raises the question ``Why so many???''

 This question is answered more thoroughly in the following sections,
 but in the meantime the rest of this section summarizes the motivations.

 There is a wise old saying to the effect of ``To err is human.''
 This means that purpose of a significant fraction of the RCU API is to
 provide diagnostics, most notably in \cref{tab:defer:RCU Diagnostic APIs},
 but elsewhere as well.

 Important causes of human error are the limits of the human brain,
 for example, the limited capacity of short-term memory.
 The toy examples shown in this book do not stress these limits.
 This is out of necessity:
 Many readers push their cognitive limits while learning new material,
 so the examples need to be kept simple.

 These examples therefore keep \co{rcu_dereference()} invocations
 in the same function as the enclosing \co{rcu_read_lock()} and
 \co{rcu_read_unlock()} calls.
 In contrast, real-world software must frequently invoke these API members
 from different functions, and even from different translation units.
 The Linux kernel RCU API has therefore expanded to accommodate lockdep,
 which allows \co{rcu_dereference()} and friends to complain if it is
 not protected by \co{rcu_read_lock()}.
 Linux-kernel RCU also checks for some double-free errors, infinite
 loops in RCU read-side critical sections, and attempts to invoke
 quiescent states within RCU read-side critical sections.

 Another way that real-world software accommodates the limits of human
 cognition is through abstraction.
 The Linux-kernel API therefore includes members that operate on lists
 in addition to the pointer-oriented core API of
 \cref{tab:defer:Core RCU API}.
 The Linux kernel itself also provides RCU-protected hash tables and
 search trees.

 Operating-systems kernels such as Linux operate near the bottom of the
 ``iron triangle'' of the software stack shown in
 \cref{fig:intro:Software Layers and Performance; Productivity; and Generality},
 where performance is critically important.
 There are thus specialized variants of a number of RCU APIs for use on
 fastpaths, for example, as discussed in
 \cref{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs},
 \co{RCU_INIT_POINTER()} may be used in
 place of \co{rcu_assign_pointer()} in cases where the RCU-protected pointer
 is being assigned to \co{NULL} or when that pointer is not yet accessible
 by readers.
 Use of \co{RCU_INIT_POINTER()} allows the compiler more leeway in
 selecting instructions and carrying out optimizations, thus increasing
 performance.

 On the other hand, when used incorrectly \co{RCU_INIT_POINTER()} can
 result in silent memory corruption, so please be careful!
 Yes, in some cases, the kernel can check for inappropriate use of
 RCU API members from a given kernel context, but the constraints of
 \co{RCU_INIT_POINTER()} use are not yet checkable.

 Finally, within the Linux kernel, the aforementioned limits of human
 cognition are compounded by the variety and severity of workloads running
 on Linux.
 As of v5.16, this has given rise to no fewer than five flavors of
 RCU, each designed to provide different performance, scalability,
 response-time, and energy efficiency tradeoffs to RCU readers and writers.
 These RCU flavors are the subject of the next section.

 \subsubsection{RCU has a Family of Wait-to-Finish APIs}
 \label{sec:defer:RCU has a Family of Wait-to-Finish APIs}

 The most straightforward answer to ``what is RCU'' is that RCU is
 an API\@.
 For example, the RCU implementation used in the Linux kernel is
 summarized by
 \cref{tab:defer:RCU Wait-to-Finish APIs},
 which shows the wait-for-readers portions of the RCU, ``sleepable'' RCU
 (SRCU), Tasks RCU, and generic APIs, respectively,
 and by
 \cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs},
 which shows the publish-subscribe portions of the
 API~\cite{PaulEMcKenney2024RCUAPI}.\footnote{
 	This citation covers v6.10 and later.
 	Documetation for earlier versions of the Linux-kernel RCU API may
 	be found elsewhere~\cite{PaulEMcKenney2008WhatIsRCUAPI,PaulEMcKenney2014RCUAPI,PaulEMcKenney2019RCUAPI}.}

 \begin{sidewaystable*}[tbp]
 \rowcolors{1}{}{lightgray}
 \renewcommand*{\arraystretch}{1.3}
 \centering
 \caption{RCU Wait-to-Finish APIs}
 \label{tab:defer:RCU Wait-to-Finish APIs}
 \scriptsize\IfEbookSize{\hspace*{-1.8in}}{\hspace*{-.125in}}
 \ebresizeverb{0.7}{
 \begin{tabularx}{8.5in}{>{\raggedright\arraybackslash}p{0.94in}
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}X
     >{\raggedright\arraybackslash}p{1.1in}
     >{\raggedright\arraybackslash}p{1.35in}
     >{\raggedright\arraybackslash}p{1.45in}}
 \toprule
 &
     {\bf RCU}: Original &
 	{\bf SRCU}: Sleeping readers &
 	    {\bf Tasks RCU}: Free tracing trampolines &
 		{\bf Tasks RCU Rude}: Free idle-task tracing trampolines &
 		    {\bf Tasks RCU Trace}: Protect sleepable BPF programs \\
 \midrule
 {\bf Initialization and Cleanup} &
     &
 	\tco{DEFINE_SRCU()} \tco{DEFINE_STATIC_SRCU()}
 	\tco{init_srcu_struct()} \tco{cleanup_srcu_struct()} &
 	    &
 		&
 		    \\
 {\bf Read-side critical-section markers} &
     \tco{rcu_read_lock()}~! \tco{rcu_read_unlock()}~!
     \tco{rcu_read_lock_bh()} \tco{rcu_read_unlock_bh()}
     \tco{rcu_read_lock_sched()} \tco{rcu_read_unlock_sched()}
     (Plus anything disabing bottom halves, preemption, or interrupts.) &
 	\tco{srcu_read_lock()} \tco{srcu_read_unlock()} &
 	    Voluntary context switch &
 		Voluntary context switch and preempt-enable regions of code &
 		    \tco{rcu_read_lock_trace()} \tco{rcu_read_unlock_trace()} \\
 {\bf Update-side primitives (synchronous) } &
     { \tco{synchronize_rcu()}
       \tco{synchronize_net()}
       \tco{synchronize_rcu_expedited()} } &
 	\tco{synchronize_srcu()} \tco{synchronize_srcu_expedited()} &
 	    \tco{synchronize_rcu_tasks()} &
 		\tco{synchronize_rcu_tasks_rude()} &
 		    \tco{synchronize_rcu_tasks_trace()} \\
 {\bf Update-side primitives (asynchronous / callback) } &
     \tco{call_rcu()} ! &
 	\tco{call_srcu()} &
 	    \tco{call_rcu_tasks()} &
 		\tco{call_rcu_tasks_rude()} &
 		    \tco{call_rcu_tasks_trace()} \\
 {\bf Update-side primitives (wait for callbacks) } &
     \tco{rcu_barrier()} &
 	\tco{srcu_barrier()} &
 	    \tco{rcu_barrier_tasks()} &
 		\tco{rcu_barrier_tasks_rude()} &
 		    \tco{rcu_barrier_tasks_trace()} \\
 {\bf Update-side primitives (initiate / wait)} &
     \tco{get_state_synchronize_rcu()}
     \tco{cond_synchronize_rcu()} &
 	&
 	    &
 		&
 		    \\
 {\bf Update-side primitives (free memory) } &
     \tco{kfree_rcu()} &
 	&
 	    &
 		&
 		    \\
 {\bf Type-safe memory } &
     \tco{SLAB_TYPESAFE_BY_RCU} &
 	&
 	    &
 		&
 		    \\
 {\bf Read side constraints } &
     No blocking (only preemption) &
 	No \tco{synchronize_srcu()} with same \tco{srcu_struct} &
 	    No voluntary context switch &
 		Neither blocking nor preemption &
 			No RCU tasks trace grace period \\
 {\bf Read side overhead } &
     CPU-local accesses (\tco{barrier()} on \tco{PREEMPT=n}) &
 	Simple instructions, memory barriers &
 	    Free &
 		CPU-local accesses (free on \tco{PREEMPT=n}) &
 		    CPU-local accesses \\
 {\bf Asynchronous update-side overhead } &
     sub-microsecond &
 	sub-microsecond &
 	    sub-microsecond &
 		sub-microsecond &
 		    sub-microsecond \\
 {\bf Grace-period latency } &
     10s of milliseconds &
         Milliseconds &
 	    Seconds &
 		Milliseconds &
 		    10s of milliseconds \\
 {\bf Expedited grace-period latency } &
     10s of microseconds &
         Microseconds &
 	    N/A &
 		N/A &
 		    N/A \\
 \bottomrule
 \end{tabularx}
 }
 \end{sidewaystable*}

 If you are new to RCU, you might consider focusing on just one
 of the columns in
 \cref{tab:defer:RCU Wait-to-Finish APIs},
 each of which summarizes one member of the Linux kernel's RCU API family.
 For example, if you are primarily interested in understanding how RCU
 is used in the Linux kernel, ``RCU'' would be the place to start,
 as it is used most frequently.
 On the other hand, if you want to understand RCU for its own sake,
 ``Tasks RCU'' has the simplest API\@.
 You can always come back for the other columns later.

 If you are already familiar with RCU, these tables can
 serve as a useful reference.

 \QuickQuiz{
 	Why do some of the cells in
 	\cref{tab:defer:RCU Wait-to-Finish APIs}
 	have exclamation marks (``!'')?
 }\QuickQuizAnswer{
 	The API members with exclamation marks (\co{rcu_read_lock()},
 	\co{rcu_read_unlock()}, and \co{call_rcu()}) were the
 	only members of the Linux RCU API that Paul E. McKenney was aware
 	of back in the mid-90s.
 	During this timeframe, he was under the mistaken impression that
 	he knew all that there is to know about RCU\@.
 }\QuickQuizEnd

 The ``RCU'' column corresponds to the consolidation of the
 three Linux-kernel RCU
 implementations~\cite{PaulEMcKenney2019RCUCVE,McKenney:2019:CRS:3319647.3325836},
 in which RCU read-side critical sections start with
 \apik{rcu_read_lock()}, \apik{rcu_read_lock_bh()}, or \apik{rcu_read_lock_sched()}
 and end with \apik{rcu_read_unlock()}, \apik{rcu_read_unlock_bh()},
 or \apik{rcu_read_unlock_sched()}, respectively.
 Any region of code that disables bottom halves, interrupts, or preemption
 also acts as an RCU read-side critical section.
 RCU read-side critical sections may be nested.
 The corresponding synchronous update-side primitives,
 \apik{synchronize_rcu()} and \apik{synchronize_rcu_expedited()}, along with
 their synonym \apik{synchronize_net()}, wait for any type of currently
 executing RCU read-side critical sections to complete.
 The length of this wait is known as a ``\IX{grace period}'', and
 \apik{synchronize_rcu_expedited()} is designed to reduce \IXh{grace-period}
 {latency} at the expense of increased CPU overhead and IPIs.
 The asynchronous update-side primitive, \apik{call_rcu()},
 invokes a specified function with a specified argument after a
 subsequent grace period.
 For example, \co{call_rcu(p,f);} will result in
 the ``RCU callback'' \co{f(p)}
 being invoked after a subsequent grace period.
 There are situations,
 such as when unloading a Linux-kernel module that uses \co{call_rcu()},
 when it is necessary to wait for all
 outstanding RCU callbacks to complete~\cite{PaulEMcKenney2007rcubarrier}.
 The \apik{rcu_barrier()} primitive does this job.

 \QuickQuizSeries{%
 \QuickQuizB{
 	How do you prevent a huge number of RCU read-side critical
 	sections from indefinitely blocking a \co{synchronize_rcu()}
 	invocation?
 }\QuickQuizAnswerB{
 	There is no need to do anything to prevent RCU read-side
 	critical sections from indefinitely blocking a
 	\co{synchronize_rcu()} invocation, because the
 	\co{synchronize_rcu()} invocation need wait only for
 	\emph{pre-existing} RCU read-side critical sections.
 	So as long as each RCU read-side critical section is
 	of finite duration, RCU grace periods will also remain
 	finite.
 }\QuickQuizEndB
 %
 \QuickQuizM{
 	The \co{synchronize_rcu()} API waits for all pre-existing
 	interrupt handlers to complete, right?
 }\QuickQuizAnswerM{
 	In v4.20 and later Linux kernels,
 	yes~\cite{PaulEMcKenney2019RCUCVE,McKenney:2019:CRS:3319647.3325836}.

 	But not in earlier kernels, and especially not when using
 	preemptible RCU\@!
 	You instead want \apik{synchronize_irq()}.
 	Alternatively, you can place calls to \co{rcu_read_lock()}
 	and \co{rcu_read_unlock()} in the specific interrupt handlers that
 	you want \co{synchronize_rcu()} to wait for.
 	But even then, be careful, as preemptible RCU will not be guaranteed
 	to wait for that portion of the interrupt handler preceding the
 	\co{rcu_read_lock()} or following the \co{rcu_read_unlock()}.
 }\QuickQuizEndM
 %
 \QuickQuizM{
 	Given that any region of code that disables interrupts acts as
 	an RCU read-side critical section, do atomic operations and
 	single machine instructions also act as tiny RCU read-side
 	critical sections?
 }\QuickQuizAnswerM{
 	Yes, they do.

 	But if you are tempted to use this trick, think again, and harder
 	this time.

 	And if you have absolutely no alternative to using this trick,
 	then carefully document your use of it.
 	Otherwise, everyone reading the code (yourself included) will
 	hate you for it.

 	Why so much hate?

 	To see why, keep in mind that the big advantage of Jack
 	Slingwine's and my invention of RCU was that we marked the
 	RCU readers.
 	Take it from all the people who independently invented something
 	kind of like RCU, but failed to mark their read-side critical
 	sections:
 	You really do need to mark your RCU readers!
 	Including your single-instruction RCU readers.
 }\QuickQuizEndM
 %
 \QuickQuizE{
 	What is the difference between \co{synchronize_rcu()} and
 	\co{rcu_barrier()}?
 }\QuickQuizAnswerE{
 	They wait on different things.
 	While \co{synchronize_rcu()} waits for pre-existing RCU read-side
 	critical sections to complete, \co{rcu_barrier()} instead waits
 	for callbacks from prior calls to \co{call_rcu()} to be invoked.

 \begin{table}
 \renewcommand*{\arraystretch}{1.2}
 \centering
 \small
 \tcresizewidth{
 \begin{fcvref}[ln:defer:synchonize-rcu vs rcu-barrier]
 \begin{tabular}{lll}
 \toprule
             & \multicolumn{2}{c}{Must Wait Until (in the snippet):} \\
 \cmidrule{2-3}
 \multicolumn{1}{c}{Invoked at:} & \multicolumn{1}{c}{\tco{synchronize_rcu()}}
 					& \multicolumn{1}{c}{\tco{rcu_barrier()}} \\
 \cmidrule(r){1-1} \cmidrule{2-3}
 \tco{do_something_1()} & 			 & \\
 \tco{do_something_2()} & \tco{rcu_read_unlock()} (\clnref{rrul}) & \\
 \tco{do_something_3()} & \tco{rcu_read_unlock()} (\clnref{rrul})
 						 & \tco{f(&p->rh)} (\clnref{cb}) \\
 \tco{do_something_4()} &			 & \tco{f(&p->rh)} (\clnref{cb}) \\
 \tco{do_something_5()} & 			 & \\
 \bottomrule
 \end{tabular}
 \end{fcvref}
 }

 \begin{fcvlabel}[ln:defer:synchonize-rcu vs rcu-barrier]
 \begin{VerbatimN}[commandchars=\\\@\$]
 	do_something_1();			\lnlbl@ds1$
 	rcu_read_lock();			\lnlbl@rrl$
 	do_something_2();			\lnlbl@ds2$
 	call_rcu(&p->rh, f);			\lnlbl@cr$
 	do_something_3();			\lnlbl@ds3$
 	rcu_read_unlock();			\lnlbl@rrul$
 	do_something_4();			\lnlbl@ds4$
 	// f(&p->rh) invoked			\lnlbl@cb$
 	do_something_5();			\lnlbl@ds5$
 \end{VerbatimN}
 \end{fcvlabel}
 \caption{\tco{synchronize_rcu()} vs. \tco{rcu_barrier()}}
 \label{tab:defer:synchonize-rcu vs rcu-barrier}
 \end{table}

 	This distinction is illustrated by
 	\cref{tab:defer:synchonize-rcu vs rcu-barrier}, in which
 	the snippet at the bottom shows code being executed by a given CPU\@.
 	For simplicity, assume that no other CPU is executing
 	\co{rcu_read_lock()}, \co{rcu_read_unlock()}, or
 	\co{call_rcu()}.

 	The table shows how long each primitive must wait if invoked
 	concurrently with each of the \co{do_something_*()}
 	functions, with empty cells indicating that no
 	waiting is necessary.
 	As you can see, \co{synchronize_rcu()} need not wait unless
 	it is in an RCU read-side critical section, in which case
 	it must wait for the \co{rcu_read_unlock()} that ends that
 	critical section.
 	In contrast, RCU read-side critical sections have no effect
 	on \co{rcu_barrier()}.
 	However, when \co{rcu_barrier()} executes after a
 	\co{call_rcu()} invocation, it must wait until the
 	corresponding RCU callback is invoked.

 	All that said, there is a special case where each call to
 	\co{rcu_barrier()} can be replaced by a direct call to
 	\co{synchronize_rcu()}, and that is where \co{synchronize_rcu()}
 	is implemented in terms of \co{call_rcu()} and where there is
 	a single global list of callbacks.
 	But please do not do this in portable code!!!
 }\QuickQuizEndE
 }

 Finally, RCU may be used to provide
 \IX{type-safe memory}~\cite{Cheriton96a}, as described in
 \cref{sec:defer:Type-Safe Memory}.
 In the context of RCU, type-safe memory guarantees that a given
 data element will not change type during any RCU read-side critical section
 that accesses it.
 To make use of RCU-based type-safe memory, pass
 \apik{SLAB_TYPESAFE_BY_RCU} to \apik{kmem_cache_create()}.

 The ``SRCU'' column in
 \cref{tab:defer:RCU Wait-to-Finish APIs}
 displays a specialized RCU API that permits general sleeping in SRCU
 read-side critical
 sections~\cite{PaulEMcKenney2006c}
 delimited by \apik{srcu_read_lock()} and \apik{srcu_read_unlock()}.
 However, unlike RCU, SRCU's \apik{srcu_read_lock()} returns a value that
 must be passed into the corresponding \apik{srcu_read_unlock()}.
 This difference is due to the fact that the SRCU user allocates an
 \apik{srcu_struct} for each distinct SRCU usage, so that there is no
 convenient place to store a per-task reader-nesting count.
 (Keep in mind that although the Linux kernel provides dynamically
 allocated per-CPU storage, there is not yet dynamically allocated
 per-task storage.)

 A given \co{srcu_struct} structure may be defined as a global
 variable with \co{DEFINE_SRCU()} if the structure must be used in
 multiple translation units, or with \co{DEFINE_STATIC_SRCU()} otherwise.
 For example, \co{DEFINE_SRCU(my_srcu)} would create a global variable
 named \co{my_srcu} that could be used by any file in the program.
 Alternatively, an \co{srcu_struct} structure may be either an on-stack
 variable or a dynamically allocated region of memory.
 In both of these non-global-variable cases, the memory must be initialized
 using \co{init_srcu_struct()} prior to its first use and cleaned up
 using \co{cleanup_srcu_struct()} after its last use (but before the
 underlying storage disappears).

 However they are created, these distinct \co{srcu_struct} structures
 prevent SRCU read-side critical sections from blocking unrelated
 \apik{synchronize_srcu()} and \apik{synchronize_srcu_expedited()}
 invocations.
 Of course, use of either \apik{synchronize_srcu()} or
 \apik{synchronize_srcu_expedited()} within an SRCU read-side critical
 section can result in self-deadlock, so should be avoided.
 As with RCU, SRCU's \co{synchronize_srcu_expedited()} decreases
 grace-period latency compared to \co{synchronize_srcu()}, but at
 the expense of increased CPU overhead.

 \QuickQuiz{
 	Under what conditions can \co{synchronize_srcu()} be safely
 	used within an SRCU read-side critical section?
 }\QuickQuizAnswer{
 	In principle, you can use either \co{synchronize_srcu()} or
 	\co{synchronize_srcu_expedited()} with a given \co{srcu_struct}
 	within an SRCU read-side critical section that uses some other
 	\co{srcu_struct}.
 	In practice, however, doing this is almost certainly a bad idea.
 	In particular, the code shown in
 	\cref{lst:defer:Multistage SRCU Deadlocks}
 	could still result in deadlock.

 \begin{listing}
 \begin{VerbatimL}
 idx = srcu_read_lock(&ssa);
 synchronize_srcu(&ssb);
 srcu_read_unlock(&ssa, idx);

 /* . . . */

 idx = srcu_read_lock(&ssb);
 synchronize_srcu(&ssa);
 srcu_read_unlock(&ssb, idx);
 \end{VerbatimL}
 \caption{Multistage SRCU Deadlocks}
 \label{lst:defer:Multistage SRCU Deadlocks}
 \end{listing}
 }\QuickQuizEnd

 Similar to normal RCU, self-deadlock can be avoided using the
 asynchronous \apik{call_srcu()} function.
 However, special care must be taken when using \co{call_srcu()} because
 a single task could register SRCU callbacks very quickly.
 Given that SRCU allows readers to block for arbitrary periods of
 time, this could consume an arbitrarily large quantity of memory.
 In contrast, given the synchronous \co{synchronize_srcu()}
 interface, a given task must finish waiting for a given grace period
 before it can start waiting for the next one.

 Also similar to RCU, there is an \apik{srcu_barrier()} function that waits
 for all prior \co{call_srcu()} callbacks to be invoked.

 In other words, SRCU compensates for its extremely weak
 forward-progress guarantees by permitting the developer to restrict
 its scope.

 The ``Tasks RCU'' column in
 \cref{tab:defer:RCU Wait-to-Finish APIs} displays a specialized
 RCU API that mediates freeing of the trampolines used in Linux-kernel
 tracing.
 These trampolines are used to transfer control from a point in the
 code being traced to the code doing the actual tracing.
 It is of course necessary to ensure that all code executing within
 a given trampoline has finished before freeing that trampoline.

 Changes to the code being traced are typically limited to a single jump
 or call instruction, and thus cannot accommodate the sequence of code
 required to implement \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
 Nor can the trampoline contain these calls to \co{rcu_read_lock()} and
 \co{rcu_read_unlock()}.
 To see this, consider a CPU that is just about to start executing a
 given trampoline.
 Because it has not yet executed the \co{rcu_read_lock()}, that
 trampoline could be freed at any time, which would come as a fatal
 surprise to this CPU\@.
 Therefore, trampolines cannot be protected by synchronization primitives
 executed in either the traced code or in the trampoline itself.
 Which does raise the question of exactly how the trampoline is to be
 protected.

 The key to answering this question is to note that trampoline code
 never contains code that either directly or indirectly does a
 voluntary context switch.
 This code might be preempted, but it will never directly or indirectly
 invoke \apik{schedule()}.
 This suggests a variant of RCU having voluntary context switches and
 idle execution as its only quiescent states.
 This variant is Tasks RCU\@.

 Tasks RCU is unusual in having no read-side marking functions, which
 is good given that its main use case has nowhere to put such markings.
 Instead, calls to \co{schedule()} serve directly as quiescent states.
 Updates can use \apik{synchronize_rcu_tasks()} to wait for all pre-existing
 trampoline execution to complete, or they can use its asynchronous
 counterpart, \apik{call_rcu_tasks()}.
 There is also an \apik{rcu_barrier_tasks()} that waits for completion
 of callbacks corresponding to all prior invocations of \co{call_rcu_tasks()}.
 There is no \co{synchronize_rcu_tasks_expedited()} because there has
 not yet been a request for it, though implementing a useful variant of
 it would not be free of challenges.

 \QuickQuiz{
 	In a kernel built with \co{CONFIG_PREEMPT_NONE=y}, won't
 	\co{synchronize_rcu()} wait for all trampolines, given
 	that preemption is disabled and that trampolines never
 	directly or indirectly invoke \co{schedule()}?
 }\QuickQuizAnswer{
 	You are quite right!

 	In fact, in nonpreemptible kernels, \co{synchronize_rcu_tasks()}
 	is a wrapper around \co{synchronize_rcu()}.
 }\QuickQuizEnd

 The ``Tasks RCU Rude'' column provides a more effective variant
 of the toy implementation presented in
 \cref{sec:defer:Toy Implementation}.
 This variant causes each CPU to execute a context switch,
 so that any voluntary context switch or any preemptible region of
 code can serve as a quiescent state.
 The Tasks RCU Rude variant uses the Linux-kernel workqueues facility to
 force concurrent context switches, in contrast to the serial
 CPU-by-CPU approach taken by the toy implementation.
 The API mirrors that of Tasks RCU, including the lack of explicit
 read-side markers.

 Finally, the ``Tasks RCU Trace'' column provides an RCU implementation
 with functionality similar to that of SRCU, except with much faster
 read-side markers.\footnote{
 	And thus is unusual for the Tasks RCU family for having
 	explicit read-side markers!}
 However, this speed is a consequence of the fact that these markers
 do not execute memory-barrier instructions, which means that Tasks RCU
 Trace grace periods must often send IPIs to all CPUs and must always
 scan the entire task list, thus degrading real-time response and
 consuming considerable CPU time.
 Nevertheless, in the absence of readers, the resulting grace-period
 latency is reasonably short, rivaling that of RCU\@.

 \subsubsection{RCU has Publish-Subscribe and Version-Maintenance APIs}
 \label{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs}

 Fortunately, the RCU publish-subscribe and version-maintenance
 primitives shown in
 \cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs}
 apply to all of the variants of RCU discussed above.
 This commonality can allow more code to be shared, and reduces API
 proliferation.
 The original purpose of the RCU publish-subscribe APIs was to
 bury memory barriers into these APIs, so that Linux kernel
 programmers could use RCU without needing to become expert on
 the memory-ordering models of each of the 20+ CPU families
 that Linux supports~\cite{Spraul01}.

 \begin{table*}
 \renewcommand*{\arraystretch}{1.15}
 \footnotesize
 \centering\OneColumnHSpace{-.4in}
 \ebresizewidth{
 \begin{tabular}{llp{2.2in}}
 \toprule
 Category &
 	Primitives &
 		Overhead \\
 \midrule
 Pointer publish &
 	\tco{rcu_assign_pointer()} &
 		Memory barrier \\
 &
 	\tco{rcu_replace_pointer()} &
 		Memory barrier (two of them on Alpha) \\
 &
 	\tco{rcu_pointer_handoff()} &
 		Simple instructions \\
 &
 	\tco{RCU_INIT_POINTER()} &
 		Simple instructions \\
 &
 	\tco{RCU_POINTER_INITIALIZER()} &
 		Compile-time constant \\
 \midrule
 Pointer subscribe (traversal) &
 	\tco{rcu_access_pointer()} &
 		Simple instructions \\
 &
 	\tco{rcu_dereference()} &
 		Simple instructions (memory barrier on Alpha) \\
 &
 	\tco{rcu_dereference_check()} &
 		Simple instructions (memory barrier on Alpha) \\
 &
 	\tco{rcu_dereference_protected()} &
 		Simple instructions \\
 &
 	\tco{rcu_dereference_raw()} &
 		Simple instructions (memory barrier on Alpha) \\
 &
 	\tco{rcu_dereference_raw_notrace()} &
 		Simple instructions (memory barrier on Alpha) \\
 \bottomrule
 \end{tabular}
 }
 \caption{RCU Publish-Subscribe and Version Maintenance APIs}
 \label{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs}
 \end{table*}

 These primitives operate directly on pointers, and are useful for
 creating RCU-protected linked data structures, such as RCU-protected
 arrays and trees.
 The special case of linked lists is handled by a separate set of
 APIs described in
 \cref{sec:defer:RCU has List-Processing APIs}.

 The first category publishes pointers to new data items.
 The \apik{rcu_assign_pointer()} primitive ensures that any
 prior initialization remains ordered before the assignment to the
 pointer on weakly ordered machines.
 The \apik{rcu_replace_pointer()} primitive updates the pointer just like
 \co{rcu_assign_pointer()} does, but also returns the previous value,
 just like \co{rcu_dereference_protected()} (see below) would, including
 the lockdep expression.
 This replacement is convenient when the updater must both publish a new
 pointer and free the structure referenced by the old pointer.

 \QuickQuizSeries{%
 \QuickQuizB{
 	Normally, any pointer subject to \co{rcu_dereference()} \emph{must}
 	always be updated using one of the pointer-publish functions in
 	\cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs},
 	for example, \co{rcu_assign_pointer()}.

 	What is an exception to this rule?
 }\QuickQuizAnswerB{
 	One such exception is when a multi-element linked
 	data structure is initialized as a unit while inaccessible to other
 	CPUs, and then a single \co{rcu_assign_pointer()} is used
 	to plant a global pointer to this data structure.
 	The initialization-time pointer assignments need not use
 	\co{rcu_assign_pointer()}, though any such assignments that
 	happen after the structure is globally visible \emph{must} use
 	\co{rcu_assign_pointer()}.

 	However, unless this initialization code is on an impressively hot
 	code-path, it is probably wise to use \co{rcu_assign_pointer()}
 	anyway, even though it is in theory unnecessary.
 	It is all too easy for a ``minor'' change to invalidate your cherished
 	assumptions about the initialization happening privately.
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	Are there any downsides to the fact that these traversal and update
 	primitives can be used with any of the RCU API family members?
 }\QuickQuizAnswerE{
 	It can sometimes be difficult for automated
 	code checkers such as ``sparse'' (or indeed for human beings) to
 	work out which type of RCU read-side critical section a given
 	RCU traversal primitive corresponds to.
 	For example, consider the code shown in
 	\cref{lst:defer:Diverse RCU Read-Side Nesting}.

 \begin{listing}
 \begin{VerbatimL}
 rcu_read_lock();
 preempt_disable();
 p = rcu_dereference(global_pointer);

 /* . . . */

 preempt_enable();
 rcu_read_unlock();
 \end{VerbatimL}
 \caption{Diverse RCU Read-Side Nesting}
 \label{lst:defer:Diverse RCU Read-Side Nesting}
 \end{listing}

 	Is the \co{rcu_dereference()} primitive in a vanilla RCU critical
 	section or an RCU Sched critical section?
 	What would you have to do to figure this out?

 	But perhaps after the consolidation of the RCU flavors in
 	the v4.20 Linux kernel we no longer need to care!
 }\QuickQuizEndE
 }

 The \apik{rcu_pointer_handoff()} primitive simply returns its sole argument,
 but is useful to tooling checking for pointers being leaked from
 RCU read-side critical sections.
 Use of \co{rcu_pointer_handoff()} indicates to such tooling that protection
 of the structure in question has been handed off from RCU to some other
 mechanism, such as locking or reference counting.

 The \apik{RCU_INIT_POINTER()} macro can be used to initialize RCU-protected
 pointers that have not yet been exposed to readers, or alternatively,
 to set RCU-protected pointers to \co{NULL}.
 In these restricted cases, the memory-barrier instructions provided by
 \co{rcu_assign_pointer()} are not needed.
 Similarly, \apik{RCU_POINTER_INITIALIZER()} provides a \GCC-style
 structure initializer to allow easy initialization of RCU-protected
 pointers in structures.

 The second category subscribes to pointers to data items, or,
 alternatively, safely traverses RCU-protected pointers.
 Again, simply loading these pointers using C-language accesses
 could result in seeing pre-initialization garbage in the pointed-to data.
 Similarly, loading these pointer by any means outside of an RCU
 read-side critical section could result in the pointed-to object being
 freed at any time.
 However, if the pointer is merely to be tested and not dereferenced,
 the freeing of the pointed-to object is not necessarily a problem.
 In this case, \apik{rcu_access_pointer()} may be used.
 Normally, however, RCU read-side protection is required, and so the
 \apik{rcu_dereference()} primitive uses the Linux kernel's \co{lockdep}
 facility~\cite{JonathanCorbet2006lockdep} to verify that this
 \co{rcu_dereference()} invocation is under the protection of
 \co{rcu_read_lock()}, \co{srcu_read_lock()}, or some other RCU read-side
 marker.
 In contrast, the \co{rcu_access_pointer()} primitive does not involve
 \co{lockdep}, and thus will not provoke \co{lockdep} complaints when
 used outside of an RCU read-side critical section.

 Another situation where protection is not required is when update-side code
 accesses the RCU-protected pointer while holding the update-side lock.
 The \apik{rcu_dereference_protected()} API member is provided for this
 situation.
 Its first parameter is the RCU-protected pointer, and the second
 parameter takes a lockdep expression describing which locks must be
 held in order for the access to be safe.
 Code invoked both from readers and updaters can use
 \apik{rcu_dereference_check()}, which also takes a lockdep expression, but
 which may also be invoked from read-side code not holding the locks.
 In some cases, the lockdep expressions can be very complex, for example,
 when using fine-grained locking, any of a very large number of locks
 might be held, and it might be quite difficult to work out which applies.
 In these (hopefully rare) cases, \apik{rcu_dereference_raw()} provides
 protection but does not check for being invoked within a reader or with
 any particular lock being held.
 The \apik{rcu_dereference_raw_notrace()} API member acts similarly, but
 cannot be traced, and may therefore be safely used by tracing code.

 Although pretty much any linked structure can be accessed by manipulating
 pointers, higher-level structures can be quite helpful.
 The next section therefore looks at various sorts of RCU-protected
 linked lists used by the Linux kernel.

 \subsubsection{RCU has List-Processing APIs}
 \label{sec:defer:RCU has List-Processing APIs}

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/Linux_list}}
 \caption{Linux Circular Linked List (\tco{list})}
 \label{fig:defer:Linux Circular Linked List (list)}
 \end{figure}

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/Linux_list_abbr}}
 \caption{Linux Linked List Abbreviated}
 \label{fig:defer:Linux Linked List Abbreviated}
 \end{figure}

 Although \co{rcu_assign_pointer()} and
 \co{rcu_dereference()} can in theory be used to construct any
 conceivable RCU-protected data structure, in practice it is often better
 to use higher-level constructs.
 Therefore, the \co{rcu_assign_pointer()} and
 \co{rcu_dereference()}
 primitives have been embedded in special RCU variants of Linux's
 list-manipulation API\@.
 Linux has four variants of doubly linked list, the circular
 \co{struct list_head} and the linear
 \co{struct hlist_head}/\co{struct hlist_node},
 \co{struct hlist_nulls_head}/\co{struct hlist_nulls_node}, and
 \co{struct hlist_bl_head}/\co{struct hlist_bl_node}
 pairs.
 The former is laid out as shown in
 \cref{fig:defer:Linux Circular Linked List (list)},
 where the green (leftmost) boxes represent the list header and the blue
 (rightmost three) boxes represent the elements in the list.
 This notation is cumbersome, and will therefore be abbreviated as shown in
 \cref{fig:defer:Linux Linked List Abbreviated},
 which shows only the non-header (blue) elements.

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/Linux_hlist}}
 \caption{Linux Linear Linked List (\tco{hlist})}
 \label{fig:defer:Linux Linear Linked List (hlist)}
 \end{figure}

 Linux's \co{hlist}\footnote{
 	The ``h'' stands for hashtable, in which it reduces memory
 	use by half compared to Linux's double-pointer circular
 	linked list.}
 is a linear list, which means that
 it needs only one pointer for the header rather than the two
 required for the circular list, as shown in
 \cref{fig:defer:Linux Linear Linked List (hlist)}.
 Thus, use of \co{hlist} can halve the memory consumption for the hash-bucket
 arrays of large hash tables.
 As before, this notation is cumbersome, so \co{hlist} structures will
 be abbreviated in the same way \co{list_head}-style lists are, as shown in
 \cref{fig:defer:Linux Linked List Abbreviated}.

 A variant of Linux's \co{hlist}, named \co{hlist_nulls}, provides multiple
 distinct \co{NULL} pointers, but otherwise uses the same layout as shown in
 \cref{fig:defer:Linux Linear Linked List (hlist)}.
 In this variant, a \co{->next} pointer having a zero low-order bit is
 considered to be a pointer.
 However, if the low-order bit is set to one, the upper bits identify
 the type of \co{NULL} pointer.
 This type of list is used to allow lockless readers to detect when a
 node has been moved from one list to another.
 For example, each bucket of a hash table might use its index to mark
 its \co{NULL} pointer.
 Should a reader encounter a \co{NULL} pointer not matching the index of
 the bucket it started from, that reader knows that an element it was
 traversing was moved to some other bucket during the traversal, taking
 that reader with it.
 The reader can use the \apik{is_a_nulls()} function (which returns \co{true}
 if passed an \co{hlist_nulls} \co{NULL} pointer) to determine when
 it reaches the end of a list, and the \apik{get_nulls_value()} function
 (which returns its argument's \co{NULL}-pointer identifier) to fetch
 the type of \co{NULL} pointer.
 When \co{get_nulls_value()} returns an unexpected value, the reader
 can take corrective action, for example, restarting its traversal from
 the beginning.

 \QuickQuiz{
 	But what if an \co{hlist_nulls} reader gets moved to some other
 	bucket and then back again?
 }\QuickQuizAnswer{
 	One way to handle this is to always move nodes to the beginning
 	of the destination bucket, ensuring that when the reader reaches
 	the end of the list having a matching \co{NULL} pointer, it will
 	have searched the entire list.

 	Of course, if there are too many move operations in a hash table
 	with many elements per bucket, the reader might never reach the
 	end of a list.
 	One way of avoiding this in the common case is to keep hash
 	tables well-tuned, thus with short lists.
 	One way of detecting the problem and handling it is for the
 	reader to terminate the search after traversing some large
 	number of nodes, acquire the update-side lock, and redo the
 	search, but this might introduce deadlocks.
 	Another way of avoiding the problem entirely is for readers to
 	search within RCU read-side critical sections, and to wait for
 	an RCU grace period between successive updates.
 	An intermediate position might wait for an RCU grace period
 	every $N$ updates, for some suitable value of $N$.
 }\QuickQuizEnd

 More information on \co{hlist_nulls} is available in the Linux-kernel
 source tree, with helpful example code provided in the
 \path{rculist_nulls.rst} file (\path{rculist_nulls.txt} in older kernels).

 Another variant of Linux's \co{hlist} incorporates bit-locking,
 and is named \co{hlist_bl}.
 This variant uses the same layout as shown in
 \cref{fig:defer:Linux Linear Linked List (hlist)},
 but reserves the low-order bit of the head pointer (``first'' in the
 figure) to lock the list.
 This approach also reduces memory usage, as it allows what would otherwise
 be a separate spinlock to be stored with the pointer itself.

 \begin{sidewaystable*}[htbp]
 \rowcolors{1}{}{lightgray}
 \renewcommand*{\arraystretch}{1.3}
 \centering
 \caption{RCU-Protected List APIs}
 \label{tab:defer:RCU-Protected List APIs}
 \footnotesize
 \newlength{\cwa}\newlength{\cwb}\newlength{\cwc}\newlength{\cwd}
 \IfNimbusAvail{
   \renewcommand{\ttdefault}{NimbusMonoN}
   \setlength{\cwa}{1.9in}\setlength{\cwb}{2.1in}
   \setlength{\cwc}{1.8in}\setlength{\cwd}{1.6in}
 }{
   \setlength{\cwa}{1.95in}\setlength{\cwb}{2.15in}
   \setlength{\cwc}{1.9in}\setlength{\cwd}{1.7in}
 }
 \ebresizewidthsw{
 \begin{tabular}{>{\raggedright\arraybackslash}p{\cwa}
     >{\raggedright\arraybackslash}p{\cwb}
     >{\raggedright\arraybackslash}p{\cwc}
     >{\raggedright\arraybackslash}p{\cwd}}
 \toprule
 \pmb{\tco{list}}: Circular doubly linked list &
     \pmb{\tco{hlist}}: Linear doubly linked list &
 	\pmb{\tco{hlist_nulls}}: Linear doubly linked list with marked
 	NULL pointer, with up to 31~bits of marking &
 	    \pmb{\tco{hlist_bl}}: Linear doubly linked list with bit locking \\
 \midrule
 \multicolumn{4}{l}{{\bf Structures}} \\
 \tco{struct list_head} &
     \tco{struct}{\tt ~}\tco{hlist_head} ~~~~~~~~~~~~~~
     \tco{struct}{\tt ~}\tco{hlist_node} &
 	\tco{struct}{\tt ~}\tco{hlist_nulls_head}
 	\tco{struct}{\tt ~}\tco{hlist_nulls_node} &
 	    \tco{struct}{\tt ~}\tco{hlist_bl_head}
 	    \tco{struct}{\tt ~}\tco{hlist_bl_node} \\
 \multicolumn{4}{l}{{\bf Initialization}} \\
 &
     \tco{INIT_LIST_HEAD_RCU()} &
 	&
 	    \\
 \multicolumn{4}{l}{{\bf Full traversal}} \\
 \tco{list_for_each_entry_rcu()}
 \tco{list_for_each_entry_lockless()} &
     \tco{hlist_for_each_entry_rcu()}
     \tco{hlist_for_each_entry_rcu_bh()}
     \tco{hlist_for_each_entry_rcu_notrace()} &
 	\tco{hlist_nulls_for_each_entry_rcu()}
 	\tco{hlist_nulls_for_each_entry_safe()} &
 	    \tco{hlist_bl_for_each_entry_rcu()} \\
 \multicolumn{4}{l}{{\bf Resume traversal}} \\
 \tco{list_for_each_entry_continue_rcu()}
 \tco{list_for_each_entry_from_rcu()} &
     \tco{hlist_for_each_entry_continue_rcu()}
     \tco{hlist_for_each_entry_continue_rcu_bh()}
     \tco{hlist_for_each_entry_from_rcu()} &
 	&
 	    \\
 \multicolumn{4}{l}{{\bf Stepwise traversal}} \\
 \tco{list_entry_rcu()}
 \tco{list_entry_lockless()}
 \tco{list_first_or_null_rcu()}
 \tco{list_next_rcu()}
 \tco{list_next_or_null_rcu()} &
     \multicolumn{1}{p{1.2in}}{\tco{hlist_first_rcu()}
 			      \tco{hlist_next_rcu()}
 			      \tco{hlist_pprev_rcu()}} &
 	\tco{hlist_nulls_first_rcu()}
 	\tco{hlist_nulls_next_rcu()} &
 	    \tco{hlist_bl_first_rcu()} \\
 \multicolumn{4}{l}{{\bf Add}} \\
 \multicolumn{1}{p{1.2in}}{\tco{list_add_rcu()}
 			  \tco{list_add_tail_rcu()}} &
     \tco{hlist_add_before_rcu()}
     \tco{hlist_add_behind_rcu()}
     \tco{hlist_add_head_rcu()}
     \tco{hlist_add_tail_rcu()} &
 	\tco{hlist_nulls_add_head_rcu()} &
 	    \tco{hlist_bl_add_head_rcu()}
 	    \tco{hlist_bl_set_first_rcu()} \\
 \multicolumn{4}{l}{{\bf Delete}} \\
 \tco{list_del_rcu()} &
     \multicolumn{1}{p{1.2in}}{\tco{hlist_del_rcu()}
 			      \tco{hlist_del_init_rcu()}} &
 	\tco{hlist_nulls_del_rcu()}
 	\tco{hlist_nulls_del_init_rcu()} &
 	    \tco{hlist_bl_del_rcu()}
 	    \tco{hlist_bl_del_init_rcu()} \\
 \multicolumn{4}{l}{{\bf Replace}} \\
 \tco{list_replace_rcu()} &
     \tco{hlist_replace_rcu()} &
 	&
 	    \\
 \multicolumn{4}{l}{{\bf Splice}} \\
 \tco{list_splice_init_rcu()} &
     \tco{list_splice_tail_init_rcu()} &
 	&
 	    \\
 \bottomrule
 \end{tabular}
 }
 \end{sidewaystable*}

 The API members for these linked-list variants are summarized in
 \cref{tab:defer:RCU-Protected List APIs}.
 More information is available in the \path{Documentation/RCU}
 directory of the Linux-kernel source tree and at
 Linux Weekly News~\cite{PaulEMcKenney2024RCUAPI}.

 However, the remainder of this section expands on the use of
 \apik{list_replace_rcu()}, given that this API member gave RCU its name.
 This API member is used to carry out more complex updates in which an
 element in the middle of the list having multiple fields is atomically
 updated, so that a given reader sees either the old set of values or
 the new set of values, but not a mixture of the two sets.
 For example, each node of a linked list might have integer fields
 \co{->a}, \co{->b}, and~\co{->c}, and it might be necessary to update
 a given node's fields from 5, 6, and~7 to 5, 2, and~3, respectively.

 The code implementing this atomic update is straightforward:

 \begin{fcvlabel}[ln:defer:Canonical RCU Replacement Example (2nd)]
 \begin{VerbatimN}[samepage=true,commandchars=\\\[\],firstnumber=15]
 q = kmalloc(sizeof(*p), GFP_KERNEL);	\lnlbl[kmalloc]
 *q = *p;				\lnlbl[copy]
 q->b = 2;				\lnlbl[update1]
 q->c = 3;				\lnlbl[update2]
 list_replace_rcu(&p->list, &q->list);	\lnlbl[replace]
 synchronize_rcu();			\lnlbl[sync_rcu]
 kfree(p);				\lnlbl[kfree]
 \end{VerbatimN}
 \end{fcvlabel}

 \begin{figure}
 \centering
 \IfEbookSize{
 \resizebox{2in}{!}{\includegraphics{defer/RCUReplacement}}
 }{
 \resizebox{2.7in}{!}{\includegraphics{defer/RCUReplacement}}
 }
 \caption{RCU Replacement in Linked List}
 \label{fig:defer:RCU Replacement in Linked List}
 \end{figure}

 The following discussion walks through this code, using
 \cref{fig:defer:RCU Replacement in Linked List} to illustrate
 the state changes.
 The triples in each element represent the values of fields \co{->a},
 \co{->b}, and~\co{->c}, respectively.
 The red-shaded elements might be referenced by readers,
 and because readers do not synchronize directly with updaters,
 readers might run concurrently with this entire replacement process.
 Please note that backwards pointers and the link from the tail to the
 head are omitted for clarity.

 The initial state of the list, including the pointer \co{p},
 is the same as for the deletion example, as shown on the
 first row of the figure.

 The following text describes how to replace the \co{5,6,7} element
 with \co{5,2,3} in such a way that any given reader sees one of these
 two values.

 \begin{fcvref}[ln:defer:Canonical RCU Replacement Example (2nd)]
 \Clnref{kmalloc} allocates a replacement element,
 resulting in the state as shown in the second row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 At this point, no reader can hold a reference to the newly allocated
 element (as indicated by its green shading), and it is uninitialized
 (as indicated by the question marks).

 \Clnref{copy} copies the old element to the new one, resulting in the
 state as shown in the third row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 The newly allocated element still cannot be referenced by readers, but
 it is now initialized.

 \Clnref{update1} updates \co{q->b} to the value ``2'', and
 \clnref{update2} updates \co{q->c} to the value ``3'',
 as shown on the fourth row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 Note that the newly allocated structure is still inaccessible to readers.

 Now, \clnref{replace} does the replacement, so that the new element is
 finally visible to readers, and hence is shaded red, as shown on
 the fifth row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 At this point, as shown below, we have two versions of the list.
 Pre-existing readers might see the \co{5,6,7} element (which is
 therefore now shaded yellow), but
 new readers will instead see the \co{5,2,3} element.
 But any given reader is guaranteed to see one set of values or the
 other, not a mixture of the two.

 After the \co{synchronize_rcu()} on \clnref{sync_rcu} returns,
 a grace period will have elapsed, and so all reads that started before the
 \apik{list_replace_rcu()} will have completed.
 In particular, any readers that might have been holding references
 to the \co{5,6,7} element are guaranteed to have exited
 their RCU read-side critical sections, and are thus prohibited from
 continuing to hold a reference.
 Therefore, there can no longer be any readers holding references
 to the old element, as indicated its green shading in the sixth row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 As far as the readers are concerned, we are back to having a single version
 of the list, but with the new element in place of the old.

 After the \apik{kfree()} on \clnref{kfree} completes, the list will
 appear as shown on the final row of
 \cref{fig:defer:RCU Replacement in Linked List}.
 \end{fcvref}

 Despite the fact that RCU was named after the replacement case,
 the vast majority of RCU usage within the Linux kernel relies on
 the simple independent insertion and deletion, as was shown in
 \cref{fig:defer:Multiple RCU Data-Structure Versions} in
 \cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}.

 The next section looks at APIs that assist developers in debugging
 their code that makes use of RCU\@.

 \subsubsection{RCU Has Diagnostic APIs}
 \label{sec:defer:RCU Has Diagnostic APIs}

 \begin{table}
 \renewcommand*{\arraystretch}{1.15}
 \footnotesize
 \centering
 \begin{tabular}{ll}
 \toprule
 Category &
 	Primitives \\
 \midrule
 Mark RCU pointer &
 	\tco{__rcu} \\
 \midrule
 Debug-object support &
 	\tco{init_rcu_head()} \\
 &	\tco{destroy_rcu_head()} \\
 &	\tco{init_rcu_head_on_stack()} \\
 &	\tco{destroy_rcu_head_on_stack()} \\
 \midrule
 Stall-warning control &
 	\tco{rcu_cpu_stall_reset()} \\
 \midrule
 Callback checking &
 	\tco{rcu_head_init()} \\
 &	\tco{rcu_head_after_call_rcu()} \\
 \midrule
 lockdep support &
 	\tco{rcu_read_lock_held()} \\
 &	\tco{rcu_read_lock_bh_held()} \\
 &	\tco{rcu_read_lock_sched_held()} \\
 &	\tco{srcu_read_lock_held()} \\
 &	\tco{rcu_is_watching()} \\
 &	\tco{RCU_LOCKDEP_WARN()} \\
 &	\tco{RCU_NONIDLE()} \\
 &	\tco{rcu_sleep_check()} \\
 \bottomrule
 \end{tabular}
 \caption{RCU Diagnostic APIs}
 \label{tab:defer:RCU Diagnostic APIs}
 \end{table}

 \Cref{tab:defer:RCU Diagnostic APIs}
 shows RCU's diagnostic APIs.

 The \co{__rcu} tag marks an RCU-protected pointer, for example,
 \qtco{struct foo __rcu *p;}.
 Pointers that might be passed to \co{rcu_dereference()} can be marked,
 but pointers holding values returned from \co{rcu_dereference()}
 should not be.
 Providing these markings on variables, structure fields, function
 parameters, and return values allows the Linux kernel's \co{sparse}
 tool to detect situations where RCU-protected pointers are
 incorrectly accessed using plain C-language loads and stores.

 Debug-object support is automatic for any \apik{rcu_head} structures
 that are part of a structure obtained from the Linux kernel's
 memory allocators, but those building their own special-purpose
 memory allocators can use \apik{init_rcu_head()} and \apik{destroy_rcu_head()}
 at allocation and free time, respectively.
 Those using \co{rcu_head} structures allocated on the function-call
 stack (it happens!\@) may use \apik{init_rcu_head_on_stack()}
 before first use and \apik{destroy_rcu_head_on_stack()} after last use,
 but before returning from the function.
 Debug-object support allows detection of bugs involving passing the
 same \co{rcu_head} structure to \co{call_rcu()} and friends in
 quick succession, which is the \co{call_rcu()} counterpart to the
 infamous double-free class of memory-allocation bugs.

 Stall-warning control is provided by \apik{rcu_cpu_stall_reset()}, which
 allows the caller to suppress RCU CPU stall warnings for the remainder
 of the current grace period.
 RCU CPU stall warnings help pinpoint situations where an RCU read-side
 critical section runs for an excessive length of time, and it is useful
 for things like kernel debuggers to be able to suppress them, for example,
 when encountering a breakpoint.

 Callback checking is provided by \apik{rcu_head_init()} and
 \apik{rcu_head_after_call_rcu()}.
 The former is invoked on an \co{rcu_head} structure before it is passed
 to \co{call_rcu()}, and then \co{rcu_head_after_call_rcu()} will
 check to see if the callback has been invoked with the specified
 function.

 Support for lockdep~\cite{JonathanCorbet2006lockdep} includes
 \apik{rcu_read_lock_held()},
 \apik{rcu_read_lock_bh_held()},
 \apik{rcu_read_lock_sched_held()}, and
 \apik{srcu_read_lock_held()},
 each of which returns \co{true} if invoked within the corresponding
 type of RCU read-side critical section.

 \QuickQuiz{
 	Why isn't there a \co{rcu_read_lock_tasks_held()} for Tasks RCU?
 }\QuickQuizAnswer{
 	Because Tasks RCU does not have read-side markers.
 	Instead, Tasks RCU read-side critical sections are
 	bounded by voluntary context switches.
 }\QuickQuizEnd

 Because \co{rcu_read_lock()} cannot be used from the idle loop,
 and because energy-efficiency concerns have caused the idle loop
 to become quite ornate, \apik{rcu_is_watching()} returns \co{true} if
 invoked in a context where use of \co{rcu_read_lock()} is legal.
 Note again that \co{srcu_read_lock()} may be used from idle and
 even offline CPUs, which means that \co{rcu_is_watching()} does not
 apply to SRCU\@.

 \apik{RCU_LOCKDEP_WARN()} emits a warning if lockdep is enabled and if
 its argument evaluates to \co{true}.
 For example, \co{RCU_LOCKDEP_WARN(!rcu_read_lock_held())} would emit a
 warning if invoked outside of an RCU read-side critical section.

 \apik{RCU_NONIDLE()} may be used to force RCU to watch when executing
 the statement that is passed in as the sole argument.
 For example, \co{RCU_NONIDLE(WARN_ON(!rcu_is_watching()))}
 would never emit a warning.
 However, changes in the 2020--2021 timeframe extend RCU's reach deeper
 into the idle loop, which should greatly reduce or even eliminate the
 need for \co{RCU_NONIDLE()}.

 Finally,  \apik{rcu_sleep_check()} emits a warning if invoked within
 an RCU, RCU-bh, or RCU-sched read-side critical section.

 \subsubsection{Where Can RCU's APIs Be Used?}
 \label{sec:defer:Where Can RCU's APIs Be Used?}

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{defer/RCUenvAPI}}
 \caption{RCU API Usage Constraints}
 \label{fig:defer:RCU API Usage Constraints}
 \end{figure}

 \Cref{fig:defer:RCU API Usage Constraints}
 shows which APIs may be used in which in-kernel environments.
 The RCU read-side primitives may be used in any environment, including NMI,
 the RCU mutation and asynchronous grace-period primitives may be used in any
 environment other than NMI, and, finally, the RCU synchronous grace-period
 primitives may be used only in process context.
 The RCU list-traversal primitives include \apik{list_for_each_entry_rcu()},
 \apik{hlist_for_each_entry_rcu()}, etc.
 Similarly, the RCU list-mutation primitives include
 \apik{list_add_rcu()}, \apik{hlist_del_rcu()}, etc.

 Note that primitives from other families of RCU may be substituted,
 for example, \co{srcu_read_lock()} may be used in any context
 in which \co{rcu_read_lock()} may be used.

 \subsubsection{So, What \emph{is} RCU Really?}
 \label{sec:defer:So; What is RCU Really?}

 At its core, RCU is nothing more nor less than an API that supports
 publication and subscription for insertions, waiting for all RCU readers
 to complete, and maintenance of multiple versions.
 That said, it is possible to build higher-level constructs
 on top of RCU, including the reader-writer-locking, reference-counting,
 and existence-guarantee constructs listed in
 \cref{sec:defer:RCU Usage}.
 Furthermore, I have no doubt that the Linux community will continue to
 find interesting new uses for RCU,
 just as they do for any of a number of synchronization
 primitives throughout the kernel.

 Of course, a more-complete view of RCU would also include
 all of the things you can do with these APIs.

 However, for many people, a complete view of RCU must include sample
 RCU implementations.
 \Cref{chp:app:``Toy'' RCU Implementations} therefore presents a series
 of ``toy'' RCU implementations of increasing complexity and capability,
 though others might prefer the classic
 ``User-Level Implementations of Read-Copy
 Update''~\cite{MathieuDesnoyers2012URCU}.
 For everyone else, the next section gives an overview of some RCU use cases.