% together/applyrcu.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{RCU Rescues}
\label{sec:together:RCU Rescues}
%
\epigraph{With great doubts comes great understanding, with little doubts
comes little understanding.}
{Chinese proverb}
This section shows how to apply RCU to some examples discussed earlier
in this book.
In some cases, RCU provides simpler code, in other cases better
performance and scalability, and in still other cases, both.
\subsection{RCU and Per-Thread-Variable-Based Statistical Counters}
\label{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
\Cref{sec:count:Per-Thread-Variable-Based Implementation}
described an implementation of statistical counters that provided
excellent
performance, roughly that of simple increment (as in the C \co{++}
operator), and linear scalability---but only for incrementing
via \co{inc_count()}.
Unfortunately, threads needing to read out the value via \co{read_count()}
were required to acquire a global
lock, and thus incurred high overhead and suffered poor scalability.
The code for the lock-based implementation is shown in
\cref{lst:count:Per-Thread Statistical Counters} on
\cpageref{lst:count:Per-Thread Statistical Counters}.
\QuickQuiz{
Why on earth did we need that global lock in the first place?
}\QuickQuizAnswer{
A given thread's \co{__thread} variables vanish when that
thread exits.
It is therefore necessary to synchronize any operation that
accesses other threads' \co{__thread} variables with
thread exit.
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
}\QuickQuizEnd
\subsubsection{Design}
The hope is to use RCU rather than \co{final_mutex} to protect the
thread traversal in \co{read_count()} in order to obtain excellent
performance and scalability from \co{read_count()}, rather than just
from \co{inc_count()}.
However, we do not want to give up any accuracy in the computed sum.
In particular, when a given thread exits, we absolutely cannot
lose the exiting thread's count, nor can we double-count it.
Such an error could result in inaccuracies equal to the full
precision of the result; in other words, such an error would
make the result completely useless.
And in fact, one of the purposes of \co{final_mutex} is to
ensure that threads do not come and go in the middle of \co{read_count()}
execution.
Therefore, if we are to dispense with \co{final_mutex}, we will need
to come up with some other method for ensuring consistency.
One approach is to place the total count for all previously exited
threads and the array of pointers to the per-thread counters into a single
structure.
Such a structure, once made available to \co{read_count()}, is
held constant, ensuring that \co{read_count()} sees consistent data.
\subsubsection{Implementation}
\begin{fcvref}[ln:count:count_end_rcu:whole]
\Clnrefrange{struct:b}{struct:e} of
\cref{lst:together:RCU and Per-Thread Statistical Counters}
show the \co{countarray} structure, which contains a
\co{->total} field for the count from previously exited threads,
and a \co{counterp[]} array of pointers to the per-thread
\co{counter} for each currently running thread.
This structure allows a given execution of \co{read_count()}
to see a total that is consistent with the indicated set of running
threads.
\begin{listing}
\ebresizeverb{.71}{\input{CodeSamples/count/count_end_rcu@whole.fcv}}
\caption{RCU and Per-Thread Statistical Counters}
\label{lst:together:RCU and Per-Thread Statistical Counters}
\end{listing}
\Clnrefrange{perthread:b}{perthread:e}
contain the definition of the per-thread \co{counter}
variable, the global pointer \co{countarrayp} referencing
the current \co{countarray} structure, and
the \co{final_mutex} spinlock.
\Clnrefrange{inc:b}{inc:e} show \co{inc_count()}, which is unchanged from
\cref{lst:count:Per-Thread Statistical Counters}.
\end{fcvref}
\begin{fcvref}[ln:count:count_end_rcu:whole:read]
\Clnrefrange{b}{e} show \co{read_count()}, which has changed significantly.
\Clnref{rrl,rru} substitute \co{rcu_read_lock()} and
\co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}.
\Clnref{deref} uses \co{rcu_dereference()} to snapshot the
current \co{countarray} structure into local variable \co{cap}.
Proper use of RCU will guarantee that this \co{countarray} structure
will remain with us through at least the end of the current RCU
read-side critical section at \clnref{rru}.
\Clnref{init} initializes \co{sum} to \co{cap->total}, which is the
sum of the counts of threads that have previously exited.
\Clnrefrange{add:b}{add:e} add up the per-thread counters corresponding
to currently
running threads, and, finally, \clnref{ret} returns the sum.
\end{fcvref}
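For reference, a condensed sketch of this pattern is as follows,
with the listing remaining authoritative and with the book's
\co{for_each_thread()} primitive iterating over thread indexes:

\begin{VerbatimN}[tabsize=8]
unsigned long read_count(void) /* condensed sketch */
{
        struct countarray *cap;
        unsigned long sum;
        int t;

        rcu_read_lock();                    /* Replaces final_mutex. */
        cap = rcu_dereference(countarrayp); /* Snapshot current structure. */
        sum = cap->total;                   /* Previously exited threads. */
        for_each_thread(t)                  /* Currently running threads. */
                if (cap->counterp[t] != NULL)
                        sum += *cap->counterp[t];
        rcu_read_unlock();
        return sum;
}
\end{VerbatimN}
\vspace{5pt}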
\begin{fcvref}[ln:count:count_end_rcu:whole:init]
The initial value for \co{countarrayp} is
provided by \co{count_init()} on \clnrefrange{b}{e}.
This function runs before the first thread is created, and its job
is to allocate
and zero the initial structure, and then assign it to \co{countarrayp}.
\end{fcvref}
\begin{fcvref}[ln:count:count_end_rcu:whole:reg]
\Clnrefrange{b}{e} show the \co{count_register_thread()} function, which
is invoked by each newly created thread.
\Clnref{idx} picks up the current thread's index, \clnref{acq} acquires
\co{final_mutex}, \clnref{set} installs a pointer to this thread's
\co{counter}, and \clnref{rel} releases \co{final_mutex}.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:count:count_end_rcu:whole:reg]
Hey!!!
\Clnref{set} of
\cref{lst:together:RCU and Per-Thread Statistical Counters}
modifies a value in a pre-existing \co{countarray} structure!
Didn't you say that this structure, once made available to
\co{read_count()}, remained constant???
\end{fcvref}
}\QuickQuizAnswer{
Indeed I did say that.
And it would be possible to make \co{count_register_thread()}
allocate a new structure, much as \co{count_unregister_thread()}
currently does.
But this is unnecessary.
Recall the derivation of the error bounds of \co{read_count()}
that was based on the snapshots of memory.
Because new threads start with initial \co{counter} values of
zero, the derivation holds even if we add a new thread partway
through \co{read_count()}'s execution.
So, interestingly enough, when adding a new thread, this
implementation gets the effect of allocating a new structure,
but without actually having to do the allocation.
}\QuickQuizEnd
\begin{fcvref}[ln:count:count_end_rcu:whole:unreg]
\Clnrefrange{b}{e} show \co{count_unregister_thread()}, which is invoked
by each thread just before it exits.
\Clnrefrange{alloc:b}{alloc:e} allocate a new \co{countarray} structure,
\clnref{acq} acquires \co{final_mutex} and \clnref{rel} releases it.
\Clnref{copy} copies the contents of the current \co{countarray} into
the newly allocated version, \clnref{add} adds the exiting thread's \co{counter}
to the new structure's \co{->total}, and \clnref{null} \co{NULL}s the exiting thread's
\co{counterp[]} array element.
\Clnref{retain} then retains a pointer to the current (soon to be old)
\co{countarray} structure, and \clnref{assign} uses \co{rcu_assign_pointer()}
to install the new version of the \co{countarray} structure.
\Clnref{sync} waits for a grace period to elapse, so that any threads that
might be concurrently executing in \co{read_count()}, and thus might
have references to the old \co{countarray} structure, will be allowed
to exit their RCU read-side critical sections, thus dropping any such
references.
\Clnref{free} can then safely free the old \co{countarray} structure.
\end{fcvref}
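Again, a condensed sketch of this update pattern follows, using the
book's \co{smp_thread_id()} primitive and condensing the error handling;
as before, the listing remains authoritative:

\begin{VerbatimN}[tabsize=8]
void count_unregister_thread(int nthreadsexpected) /* condensed sketch */
{
        int idx = smp_thread_id();
        struct countarray *cap;
        struct countarray *capold;

        cap = malloc(sizeof(*cap));           /* Allocate new version... */
        if (cap == NULL)
                abort();                      /* ...error handling condensed. */
        spin_lock(&final_mutex);
        *cap = *countarrayp;                  /* Copy old version... */
        cap->total += counter;                /* ...fold in exiting thread... */
        cap->counterp[idx] = NULL;            /* ...and forget its counter. */
        capold = countarrayp;                 /* Retain pointer to old version. */
        rcu_assign_pointer(countarrayp, cap); /* Publish new version. */
        spin_unlock(&final_mutex);
        synchronize_rcu();                    /* Wait for pre-existing readers. */
        free(capold);                         /* Then free the old version. */
}
\end{VerbatimN}
\vspace{5pt}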
\QuickQuiz{
Given the fixed-size \co{counterp} array, exactly how does this
code avoid a fixed upper bound on the number of threads???
}\QuickQuizAnswer{
You are quite right, that array does in fact reimpose the fixed
upper limit.
This limit may be avoided by tracking threads with a linked list,
as is done in userspace RCU~\cite{MathieuDesnoyers2012URCU}.
Doing something similar for this code is left as an exercise for
the reader.
}\QuickQuizEnd
\subsubsection{Discussion}
\EQuickQuiz{
Wow!
\Cref{lst:together:RCU and Per-Thread Statistical Counters}
contains 70 lines of code, compared to only 42 in
\cref{lst:count:Per-Thread Statistical Counters}.
Is this extra complexity really worth it?
}\EQuickQuizAnswer{
This of course needs to be decided on a case-by-case basis.
If you need an implementation of \co{read_count()} that
scales linearly, then the lock-based implementation shown in
\cref{lst:count:Per-Thread Statistical Counters}
simply will not work for you.
On the other hand, if calls to \co{read_count()} are sufficiently
rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due
to the structure definition, memory allocation, and \co{NULL}
return checking.
Of course, a better question is ``Why doesn't the language
implement cross-thread access to \co{__thread} variables?''
After all, such an implementation would make both the locking
and the use of RCU unnecessary.
This would in turn enable an implementation that
was even simpler than the one shown in
\cref{lst:count:Per-Thread Statistical Counters}, but
with all the scalability and performance benefits of the
implementation shown in
\cref{lst:together:RCU and Per-Thread Statistical Counters}!
}\EQuickQuizEnd
Use of RCU enables exiting threads to wait until other threads are
guaranteed to be done using the exiting threads' \co{__thread} variables.
This allows the \co{read_count()} function to dispense with locking,
thereby providing
excellent performance and scalability for both the \co{inc_count()}
and \co{read_count()} functions.
However, this performance and scalability come at the cost of some increase
in code complexity.
It is hoped that compiler and library writers will employ user-level
RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread
access to \co{__thread} variables, greatly reducing the
complexity seen by users of \co{__thread} variables.
\subsection{RCU and Counters for Removable I/O Devices}
\label{sec:together:RCU and Counters for Removable I/O Devices}
\Cref{sec:count:Applying Exact Limit Counters}
showed a fanciful pair of code fragments for dealing with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
This section shows how RCU may be used to avoid this overhead.
The code for performing an I/O is quite similar to the original, with
an RCU read-side critical section substituted for the original's
reader-writer-lock read-side critical section:
\begin{VerbatimN}[tabsize=8]
rcu_read_lock();
if (removing) {
        rcu_read_unlock();
        cancel_io();
} else {
        add_count(1);
        rcu_read_unlock();
        do_io();
        sub_count(1);
}
\end{VerbatimN}
\vspace{5pt}
The RCU read-side primitives have minimal overhead, thus speeding up
the fastpath, as desired.
The updated code fragment removing a device is as follows:
\begin{fcvlabel}[ln:together:applyrcu:Removing Device]
\begin{VerbatimN}[tabsize=8,commandchars=\\\[\]]
spin_lock(&mylock);
removing = 1;
sub_count(mybias);
spin_unlock(&mylock);
synchronize_rcu();
while (read_count() != 0) { \lnlbl[nextofsync]
        poll(NULL, 0, 1);
}
remove_device();
\end{VerbatimN}
\end{fcvlabel}
\begin{fcvref}[ln:together:applyrcu:Removing Device]
Here we replace the reader-writer lock with an exclusive spinlock and
add a \co{synchronize_rcu()} to wait for all of the RCU read-side
critical sections to complete.
Because of the \co{synchronize_rcu()},
once we reach \clnref{nextofsync},
we know that all remaining I/Os have been accounted for.
Of course, the overhead of \co{synchronize_rcu()} can be large,
but given that device removal is quite rare, this is usually a good
tradeoff.
\end{fcvref}
\FloatBarrier
\subsection{Array and Length}
\label{sec:together:Array and Length}
Suppose we have an RCU-protected variable-length array, as shown in
\cref{lst:together:RCU-Protected Variable-Length Array}.
The length of the array \co{->a[]} can change dynamically, and at any
given time, its length is given by the field \co{->length}.
Of course, this introduces the following \IX{race condition}:
\begin{listing}
\begin{VerbatimL}[tabsize=8]
struct foo {
        int length;
        char *a;
};
\end{VerbatimL}
\caption{RCU-Protected Variable-Length Array}
\label{lst:together:RCU-Protected Variable-Length Array}
\end{listing}
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new 8-character block of memory into \co{->a[]}.
\item CPU~0 picks up the new pointer from \co{->a[]}, and stores a
new value into element 12.
Because the array has only 8 characters, this results in
a SEGV or (worse yet) memory corruption.
\end{enumerate}
How can we prevent this?
One approach is to make careful use of memory barriers, which are
covered in \cref{chp:Advanced Synchronization: Memory Ordering}.
This works, but incurs read-side overhead and, perhaps worse, requires
use of explicit memory barriers.
\begin{listing}
\begin{VerbatimL}[tabsize=8]
struct foo_a {
        int length;
        char a[0];
};
struct foo {
        struct foo_a *fa;
};
\end{VerbatimL}
\caption{Improved RCU-Protected Variable-Length Array}
\label{lst:together:Improved RCU-Protected Variable-Length Array}
\end{listing}
A better approach is to put the value and the array into the same structure,
as shown in
\cref{lst:together:Improved RCU-Protected Variable-Length Array}~\cite{Arcangeli03}.
Allocating a new array (\co{foo_a} structure) then automatically provides
a new place for the array length.
This means that if any CPU picks up a reference to \co{->fa}, it is
guaranteed that the \co{->length} will match the \co{->a[]}.
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->fa}, obtaining a pointer to
the structure containing the value 16 and the 16-byte array.
\item CPU~0 loads the value of \co{->fa->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new \co{foo_a} structure containing an 8-character block
of memory into \co{->fa}.
\item CPU~0 stores a new value into element 12 of the array in the
\co{foo_a} structure that it loaded in step 2.
But because CPU~0 is still referencing the old \co{foo_a}
structure, which contains the 16-byte array, all is well.
\end{enumerate}
Of course, in both cases, CPU~1 must wait for a grace period before
freeing the old array.
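For example, a resize operation for this improved layout might be
sketched as follows, where \co{resize_foo()} is an illustrative name
and where error handling and serialization of concurrent updaters
are elided:

\begin{VerbatimN}[tabsize=8]
int resize_foo(struct foo *fp, int new_length)
{
        struct foo_a *new_fa;
        struct foo_a *old_fa;

        new_fa = malloc(sizeof(*new_fa) + new_length);
        if (new_fa == NULL)
                return -1;
        new_fa->length = new_length;
        memset(new_fa->a, 0, new_length);   /* Or copy from the old array. */
        old_fa = fp->fa;
        rcu_assign_pointer(fp->fa, new_fa); /* Publish length and array together. */
        synchronize_rcu();                  /* Wait for pre-existing readers. */
        free(old_fa);
        return 0;
}
\end{VerbatimN}
\vspace{5pt}

Readers then fetch \co{->fa} using \co{rcu_dereference()} within an
RCU read-side critical section and take both the length and the array
from that single \co{foo_a} structure.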
A more general version of this approach is presented in the next section.
\subsection{Correlated Fields}
\label{sec:together:Correlated Fields}
\OriginallyPublished{sec:together:Correlated Fields}{Correlated Fields}{Oregon Graduate Institute}{PaulEdwardMcKenneyPhD}
Suppose that each of Schr\"odinger's animals is represented by the
data element shown in
\cref{lst:together:Uncorrelated Measurement Fields}.
The \co{meas_1}, \co{meas_2}, and \co{meas_3} fields are a set
of correlated measurements that are updated periodically.
It is critically important that readers see these three values from
a single measurement update:
If a reader sees an old value of \co{meas_1} but new values of
\co{meas_2} and \co{meas_3}, that reader will become fatally confused.
How can we guarantee that readers will see coordinated sets of these
three values?\footnote{
This situation is similar to that described in
\cref{sec:together:Correlated Data Elements},
except that here readers need only see a consistent view of a
given single data element, not the consistent view of a
group of data elements that was required in that earlier
\lcnamecref{sec:together:Correlated Data Elements}.}
\begin{listing}
\begin{VerbatimL}[tabsize=8]
struct animal {
        char name[40];
        double age;
        double meas_1;
        double meas_2;
        double meas_3;
        char photo[0]; /* large bitmap. */
};
\end{VerbatimL}
\caption{Uncorrelated Measurement Fields}
\label{lst:together:Uncorrelated Measurement Fields}
\end{listing}
One approach would be to allocate a new \co{animal} structure,
copy the old structure into the new structure, update the new
structure's \co{meas_1}, \co{meas_2}, and \co{meas_3} fields,
and then replace the old structure with a new one by updating
the pointer.
This does guarantee that all readers see coordinated sets of
measurement values, but it requires copying a large structure due
to the \co{->photo[]} field.
This copying might incur unacceptably large overhead.
\begin{listing}
\begin{VerbatimL}[tabsize=8]
struct measurement {
        double meas_1;
        double meas_2;
        double meas_3;
};
struct animal {
        char name[40];
        double age;
        struct measurement *mp;
        char photo[0]; /* large bitmap. */
};
\end{VerbatimL}
\caption{Correlated Measurement Fields}
\label{lst:together:Correlated Measurement Fields}
\end{listing}
Another approach is to impose a level of indirection, as shown in
\cref{lst:together:Correlated Measurement Fields}~\cite[Section 5.3.4]{PaulEdwardMcKenneyPhD}.
When a new measurement is taken, a new \co{measurement} structure
is allocated, filled in with the measurements, and the \co{animal}
structure's \co{->mp} field is updated to point to this new
\co{measurement} structure using \co{rcu_assign_pointer()}.
After a grace period elapses, the old \co{measurement} structure
can be freed.
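For example, assuming a hypothetical \co{animal_set_measurements()}
updater whose callers are serialized, the update might be sketched
as follows:

\begin{VerbatimN}[tabsize=8]
void animal_set_measurements(struct animal *ap,
                             double m1, double m2, double m3)
{
        struct measurement *new_mp;
        struct measurement *old_mp;

        new_mp = malloc(sizeof(*new_mp));
        if (new_mp == NULL)
                abort();                    /* Error handling condensed. */
        new_mp->meas_1 = m1;
        new_mp->meas_2 = m2;
        new_mp->meas_3 = m3;
        old_mp = ap->mp;
        rcu_assign_pointer(ap->mp, new_mp); /* Readers see all three new values. */
        synchronize_rcu();                  /* Wait for readers of old values. */
        free(old_mp);
}
\end{VerbatimN}
\vspace{5pt}

Readers use \co{rcu_dereference()} to fetch \co{->mp} within an RCU
read-side critical section, taking all three measurements from that
single \co{measurement} structure.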
\QuickQuiz{
But can't the approach shown in
\cref{lst:together:Correlated Measurement Fields}
result in extra cache misses, in turn resulting in additional
read-side overhead?
}\QuickQuizAnswer{
Indeed it can.
\begin{listing}
\begin{VerbatimL}[tabsize=8]
struct measurement {
        double meas_1;
        double meas_2;
        double meas_3;
};
struct animal {
        char name[40];
        double age;
        struct measurement *mp;
        struct measurement meas;
        char photo[0]; /* large bitmap. */
};
\end{VerbatimL}
\caption{Localized Correlated Measurement Fields}
\label{lst:together:Localized Correlated Measurement Fields}
\end{listing}
One way to avoid this cache-miss overhead is shown in
\cref{lst:together:Localized Correlated Measurement Fields}:
Simply embed an instance of a \co{measurement} structure
named \co{meas}
into the \co{animal} structure, and point the \co{->mp}
field at this \co{->meas} field.
Measurement updates can then be carried out as follows:
\begin{enumerate}
\item Allocate a new \co{measurement} structure and place
the new measurements into it.
\item Use \co{rcu_assign_pointer()} to point \co{->mp} to
this new structure.
\item Wait for a grace period to elapse, for example using
either \co{synchronize_rcu()} or \co{call_rcu()}.
\item Copy the measurements from the new \co{measurement}
structure into the embedded \co{->meas} field.
\item Use \co{rcu_assign_pointer()} to point \co{->mp}
back to the old embedded \co{->meas} field.
\item After another grace period elapses, free up the
new \co{measurement} structure.
\end{enumerate}
This approach uses a heavier-weight update procedure to eliminate
the extra cache miss in the common case.
The extra cache miss will be incurred only while an update is
actually in progress.
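
In code, and again assuming a hypothetical
\co{animal_set_measurements()} updater whose callers are
serialized, this procedure might be sketched as follows:

\begin{VerbatimN}[tabsize=8]
void animal_set_measurements(struct animal *ap,
                             double m1, double m2, double m3)
{
        struct measurement *new_mp;

        new_mp = malloc(sizeof(*new_mp));      /* Step 1. */
        if (new_mp == NULL)
                abort();                       /* Error handling condensed. */
        new_mp->meas_1 = m1;
        new_mp->meas_2 = m2;
        new_mp->meas_3 = m3;
        rcu_assign_pointer(ap->mp, new_mp);    /* Step 2. */
        synchronize_rcu();                     /* Step 3. */
        ap->meas = *new_mp;                    /* Step 4. */
        rcu_assign_pointer(ap->mp, &ap->meas); /* Step 5. */
        synchronize_rcu();                     /* Step 6: wait again... */
        free(new_mp);                          /* ...then free. */
}
\end{VerbatimN}
\vspace{5pt}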
}\QuickQuizEnd
This approach enables readers to see correlated values for selected
fields while incurring minimal read-side overhead.
This per-data-element consistency suffices in the common case where
a reader looks only at a single data element.
% Flag for deletion (if not already covered in the defer chapter).
% @@@ Issaquah Challenge.
% @@@ RLU & MV-RLU (Eventually the corresponding patents.)
\subsection{Update-Friendly Traversal}
\label{sec:together:Update-Friendly Traversal}
Suppose that a statistical scan of all elements in a hash table is
required.
For example, Schr\"odinger might wish to compute the average
length-to-weight ratio over all of his animals.\footnote{
Why would such a quantity be useful?
Beats me!
But group statistics are often useful.}
Suppose further that Schr\"odinger is willing to ignore slight
errors due to animals being added to and removed from the hash
table while this statistical scan is being carried out.
What should Schr\"odinger do to control concurrency?
One approach is to enclose the statistical scan in an RCU read-side
critical section.
This permits updates to proceed concurrently without unduly impeding
the scan.
In particular, the scan does not block the updates and vice versa,
which allows scans of hash tables containing very large numbers of
elements to proceed gracefully, even in the face of very high
update rates.
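For example, such a scan might be sketched as shown below, where the
bucket array, the \co{->scan_node} list member, and the \co{->len}
and \co{->weight} fields are all assumptions made for the sake of
illustration:

\begin{VerbatimN}[tabsize=8]
double avg_len_to_weight(struct cds_list_head *buckets, int nbuckets)
{
        struct animal *ap;
        double sum = 0.0;
        long n = 0;
        int i;

        rcu_read_lock();   /* Updates proceed concurrently with the scan. */
        for (i = 0; i < nbuckets; i++) {
                cds_list_for_each_entry_rcu(ap, &buckets[i], scan_node) {
                        if (ap->weight != 0.0) {
                                sum += ap->len / ap->weight;
                                n++;
                        }
                }
        }
        rcu_read_unlock();
        return n ? sum / n : 0.0;
}
\end{VerbatimN}
\vspace{5pt}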
\QuickQuiz{
But how does this scan work while a resizable hash table
is being resized?
In that case, neither the old nor the new hash table is
guaranteed to contain all the elements in the hash table!
}\QuickQuizAnswer{
True, resizable hash tables as described in
\cref{sec:datastruct:Non-Partitionable Data Structures}
cannot be fully scanned while being resized.
One simple way around this is to acquire the
\co{hashtab} structure's \co{->ht_lock} while scanning,
but this prevents more than one scan from proceeding
concurrently.
Another approach is for updates to mutate the old hash
table as well as the new one while resizing is in
progress.
This would allow scans to find all elements in the old
hash table.
Implementing this is left as an exercise for the reader.
}\QuickQuizEnd
\subsection{Scalable Reference Count Two}
\label{sec:together:Scalable Reference Count Two}
Suppose a \IX{reference count} is becoming a performance or scalability
bottleneck.
What can you do?
One approach is to use per-CPU counters for each reference count,
somewhat similar to the algorithms in \cref{chp:Counting}, in particular,
the exact limit counters described in
\cref{sec:count:Exact Limit Counters}.
The need to switch between per-CPU and global modes for these counters
results either in expensive increments and decrements on the one hand
(\cref{sec:count:Atomic Limit Counter Implementation})
or in the use of POSIX signals on the other
(\cref{sec:count:Signal-Theft Limit Counter Design}).
Another approach is to use RCU to mediate the switch between per-CPU
and global counting modes.
Each update is carried out within an RCU read-side critical section,
and each update checks a flag to determine whether to update the
per-CPU counters on the one hand or the global counter on the other.
To switch modes, update the flag, wait for a grace period, and then
move any remaining counts from the per-CPU counters to the global
counter or vice versa.
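A sketch of this technique using Linux-kernel per-CPU primitives
follows; the names are illustrative, only the switch from per-CPU to
global mode is shown, and switching back is symmetric:

\begin{VerbatimN}[tabsize=8]
static DEFINE_PER_CPU(unsigned long, pcpu_count);
static atomic_long_t global_count;
static bool percpu_mode = true;

static void ref_adjust(long delta)     /* +1 to get, -1 to put. */
{
        rcu_read_lock();
        if (READ_ONCE(percpu_mode))
                this_cpu_add(pcpu_count, delta);    /* Cheap common case. */
        else
                atomic_long_add(delta, &global_count);
        rcu_read_unlock();
}

static void switch_to_global_mode(void)
{
        int cpu;

        WRITE_ONCE(percpu_mode, false);
        synchronize_rcu();   /* Wait for in-flight per-CPU-mode updates. */
        for_each_possible_cpu(cpu) {
                atomic_long_add(per_cpu(pcpu_count, cpu), &global_count);
                per_cpu(pcpu_count, cpu) = 0;
        }
}
\end{VerbatimN}
\vspace{5pt}

The grace period guarantees that any update still in flight after
\co{synchronize_rcu()} returns has seen the new value of
\co{percpu_mode}, so that the per-CPU counters may then be safely
folded into the global counter.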
The Linux kernel uses this RCU-mediated approach in its \co{percpu_ref}
style of reference counter.
Code using this reference counter must initialize the \co{percpu_ref}
structure using \co{percpu_ref_init()}, which takes as arguments
a pointer to the structure, a pointer to a function to invoke when
the reference count reaches zero, a set of mode flags, and a set
of \co{kmalloc()} \co{GFP_} flags.
After normal initialization, the structure has one reference and
is in per-CPU mode.
The mode flags are usually zero, but can include the
\co{PERCPU_REF_INIT_ATOMIC} bit if the counter is to start in slow
non-per-CPU (that is, atomic) mode.
There is also a \co{PERCPU_REF_ALLOW_REINIT} bit that allows
a given \co{percpu_ref} counter to be reused via a call to
\co{percpu_ref_reinit()} without needing to be freed and reallocated.
Regardless of how the \co{percpu_ref} structure is initialized,
\co{percpu_ref_get()} may be used to acquire a reference and
\co{percpu_ref_put()} may be used to release a reference.
When in per-CPU mode, the \co{percpu_ref} structure cannot determine whether
or not its value has reached zero.
When such a determination is necessary, \co{percpu_ref_kill()} may
be invoked.
This function switches the structure into atomic mode and removes the
initial reference installed by the call to \co{percpu_ref_init()}.
Of course, when in atomic mode, calls to \co{percpu_ref_get()} and
\co{percpu_ref_put()} are quite expensive, but \co{percpu_ref_put()}
can tell when the value reaches zero.
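A minimal usage sketch, with a hypothetical \co{struct mydev} and
release function, might look as follows:

\begin{VerbatimN}[tabsize=8]
#include <linux/percpu-refcount.h>

struct mydev {                         /* Hypothetical structure. */
        struct percpu_ref ref;
        /* ... other fields ... */
};

static void mydev_release(struct percpu_ref *ref)
{
        struct mydev *mdp = container_of(ref, struct mydev, ref);

        pr_info("mydev %p: last reference dropped\n", mdp);
        /* Tear down and free mdp here (details elided). */
}

static int mydev_setup(struct mydev *mdp)
{
        /* On success: one reference held, counter in per-CPU mode. */
        return percpu_ref_init(&mdp->ref, mydev_release, 0, GFP_KERNEL);
}

/*
 * Fastpath: percpu_ref_get(&mdp->ref); ... percpu_ref_put(&mdp->ref);
 * Teardown: percpu_ref_kill(&mdp->ref) switches to atomic mode and drops
 * the initial reference, so the final percpu_ref_put() calls mydev_release().
 */
\end{VerbatimN}
\vspace{5pt}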
Readers desiring more \co{percpu_ref} information are referred to
the Linux-kernel documentation and source code.
\subsection{Retriggered Grace Periods}
\label{sec:together:Retriggered Grace Periods}
There is no \co{call_rcu_cancel()}, so once an \co{rcu_head} structure
is passed to \co{call_rcu()}, there is no calling it back.
It must be left alone until the callback is invoked.
In the common case, this is as it should be because the \co{rcu_head}
structure is on a one-way journey to deallocation.
However, there are use cases that combine RCU and explicit \co{open()}
and \co{close()} calls.
After a \co{close()} call, readers are not supposed to begin new accesses
to the data structure, but there might well be readers completing their
traversal.
This situation can be handled in the usual manner:
Wait for a grace period following the \co{close()} call before freeing
the data structures.
But what if \co{open()} is called before the grace period ends?
Again, there is no \co{call_rcu_cancel()}, so another approach is to
set a flag that is checked by the callback function, which can opt out
of actually freeing anything.
Problem solved!
But what if \co{open()} and then another \co{close()} are both called
before the grace period ends?
One approach is to have a second value for the flag that causes the
callback to requeue itself.
But what if there is not only an \co{open()} and then another \co{close()},
but also another \co{open()} before the grace period ends?
In this case, the callback needs to set state to reflect that last
\co{open()} still being in effect.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{together/retriggergp}}
\caption{Retrigger-Grace-Period State Machine}
\label{fig:count:Retrigger-Grace-Period State Machine}
\end{figure}
Continuing this line of thought leads us to the state machine
shown in \cref{fig:count:Retrigger-Grace-Period State Machine}.
The initial state is CLOSED and the operational state is OPEN\@.
The diamond-shaped arrowheads denote \co{call_rcu()} invocation, while
the arrows labeled ``CB'' denote callback invocation.
The normal path through this state machine traverses the states CLOSED,
OPEN, CLOSING (with an invocation of \co{call_rcu()}), and back to CLOSED
once the callback has been invoked.
If \co{open()} is invoked before the grace period completes, the
state machine traverses the cycle OPEN, CLOSING (with
an invocation of \co{call_rcu()}), REOPENING, and back to OPEN once the
callback has been invoked.
If \co{open()} and then \co{close()} are invoked before the grace period
completes, the state machine traverses the cycle OPEN, CLOSING (with
an invocation of \co{call_rcu()}), REOPENING, RECLOSING, and back to CLOSING once the
callback has been invoked.
Given an indefinite alternating sequence of \co{close()} and \co{open()}
invocations, the state machine would traverse OPEN and CLOSING (with
an invocation of \co{call_rcu()}), followed by alternating sojourns in
the REOPENING and RECLOSING states.
Once the grace period ends, the state machine would transition to
either of the CLOSING or the OPEN state, depending on which of the
RECLOSING or REOPENING states the callback was invoked in.
\begin{listing}
\ebresizeverb{.88}{\input{CodeSamples/together/retrigger-gp@whole.fcv}}
\caption{Retriggering a Grace Period (Pseudocode)}
\label{lst:together:Retriggering a Grace Period}
\end{listing}
Rough pseudocode of this state machine is shown in
\cref{lst:together:Retriggering a Grace Period}.
\begin{fcvref}[ln:together:retrigger-gp:whole]
The five states are shown on \clnrefrange{states:b}{states:e},
and the current state is held in \co{rtrg_status} on \clnref{status},
which is protected by the lock on \clnref{lock}.
The three CB transitions (emanating from states CLOSING, REOPENING,
and RECLOSING) are implemented by the \co{close_cb()} function shown
on \clnrefrange{close_cb:b}{close_cb:e}.
\Clnref{cleanup} invokes a user-supplied \co{close_cleanup()} to
take any final cleanup actions such as freeing memory when
transitioning to the CLOSED state.
\Clnref{call_rcu1} contains the \co{call_rcu()} invocation that
causes a later transition to the CLOSED state.
The \co{open()} function on \clnrefrange{open:b}{open:e} implements
the transitions to the OPEN, CLOSING, and REOPENING states, with
\clnref{do_open} invoking a \co{do_open()} function to implement
allocation and initialization of any needed data structures.
The \co{close()} function on \clnrefrange{close:b}{close:e}
implements the transitions to the CLOSING and RECLOSING states,
with \clnref{do_close} invoking a \co{do_close()} function to take
any actions that might be required to finalize this transition,
for example, causing later read-only traversals to return errors.
\Clnref{call_rcu2} contains the \co{call_rcu()} invocation that
causes a later transition to the CLOSED state.
\end{fcvref}
This state machine and pseudocode show how to get the effect of a
\co{call_rcu_cancel()} in those rare situations needing such semantics.
\subsection{Long-Duration Accesses Two}
\label{sec:together:Long-Duration Accesses Two}
Suppose a reader-writer-locking reader is holding the lock for so
long that updates are excessively delayed.
Suppose further that this reader cannot reasonably be converted to
use reference counting
(otherwise, see \cref{sec:together:Long-Duration Accesses}).
If that reader can be reasonably converted to use RCU, that might
solve the problem.
The reason is that RCU readers do not completely block updates, but
rather block only the cleanup portions of those updates (including
memory reclamation).
Therefore, if the system has ample memory, converting the reader-writer
lock to RCU may suffice.
However, converting to RCU does not always suffice.
For example, the code might traverse an extremely large linked data
structure within a single RCU read-side critical section, which might
so greatly extend the RCU grace period that the system runs out of memory.
These situations can be handled in a couple of different ways:
\begin{enumerate*}[(1)]
\item Use SRCU instead of RCU and
\item Acquire a reference to exit the RCU reader.
\end{enumerate*}
\subsubsection{Use SRCU}
\label{sec:together:Use SRCU}
In the Linux kernel, RCU is global.
In other words, any long-running RCU reader anywhere in the kernel will
delay the current RCU grace period.
If the long-running RCU reader is traversing a small data structure,
that small amount of data is delaying freeing of all other data structures,
which can result in memory exhaustion.
One way to avoid this problem is to use SRCU for that long-running RCU
reader's data structure, with its own \co{srcu_struct} structure.
The resulting long-running SRCU readers will then delay only that
\co{srcu_struct} structure's grace periods, and not those of RCU,
thus avoiding memory exhaustion.
For more details, see the SRCU API in \cref{sec:defer:RCU Linux-Kernel API}.
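A minimal sketch of this approach, using hypothetical \co{mytable}
names, is shown below; note that each SRCU reader passes the index
returned by \co{srcu_read_lock()} to the corresponding
\co{srcu_read_unlock()}:

\begin{VerbatimN}[tabsize=8]
#include <linux/srcu.h>
#include <linux/slab.h>

struct mytable_elem;                     /* Hypothetical element type. */

DEFINE_STATIC_SRCU(mytable_srcu);        /* Private grace-period domain. */

void mytable_scan(void)                  /* Potentially long-running reader. */
{
        int idx;

        idx = srcu_read_lock(&mytable_srcu);
        /* Long traversal of mytable goes here. */
        srcu_read_unlock(&mytable_srcu, idx);
}

void mytable_free_element(struct mytable_elem *p)
{
        /* Caller has already unlinked p from mytable. */
        synchronize_srcu(&mytable_srcu); /* Waits only for mytable readers. */
        kfree(p);
}
\end{VerbatimN}
\vspace{5pt}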
Unfortunately, this approach does have some drawbacks.
For one thing, SRCU readers are not subject to priority boosting, which can
result in additional delays to low-priority SRCU readers on busy
systems.
Worse yet, defining a separate \co{srcu_struct} structure reduces the
number of RCU updaters, which in turn increases the grace-period
overhead per updater.
This means that giving each current Linux-kernel RCU use case its own
\co{srcu_struct} structure could multiply system-wide grace-period
overhead by the number of such structures.
Therefore, it is often better to acquire some sort of non-RCU reference
on the needed data to permit a momentary exit from the RCU read-side
critical section, as described in the next section.
\subsubsection{Acquire a Reference}
\label{sec:together:Acquire a Reference}
If the RCU read-side critical section is too long, shorten it!
In some cases, this can be done trivially.
For example, code that scans all of the hash chains of a statically
allocated array of hash buckets can just as easily scan each hash chain
within its own critical section.
This works because hash chains are, by design, normally quite short.
When traversing long linked structures, it is necessary to have some
way of stopping in the middle and resuming later.
For example, in Linux kernel v5.16, the \co{khugepaged_scan_file()}
function checks to see if some other task needs the current CPU
using \co{need_resched()}, and if so invokes \co{xas_pause()} to
adjust the traversal's iterator appropriately, and then invokes
\co{cond_resched_rcu()} to yield the CPU\@.
In turn, the \co{cond_resched_rcu()} function invokes \co{rcu_read_unlock()},
\co{cond_resched()}, and finally \co{rcu_read_lock()} to drop out of
the RCU read-side critical section in order to yield the CPU.
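The overall shape of this pattern might be sketched as follows, where
the \co{my_iter_*()} helpers and \co{process()} are hypothetical
stand-ins for structure-specific code such as the XArray's
\co{xas_pause()}:

\begin{VerbatimN}[tabsize=8]
static void long_scan(struct my_structure *sp)
{
        struct my_iter iter;
        struct my_elem *p;

        rcu_read_lock();
        my_iter_start(&iter, sp);
        while ((p = my_iter_next(&iter)) != NULL) {
                process(p);                   /* Per-element work. */
                if (need_resched()) {
                        my_iter_pause(&iter); /* Record safe resumption point. */
                        cond_resched_rcu();   /* rcu_read_unlock();  */
                }                             /* cond_resched();     */
        }                                     /* rcu_read_lock();    */
        rcu_read_unlock();
}
\end{VerbatimN}
\vspace{5pt}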
Of course, where feasible, another approach would be to switch to a
data structure such as a hash table that is more friendly to momentarily
dropping out of an RCU read-side critical section.
\QuickQuiz{
But how would this work with a resizable hash table, such
as the one described in
\cref{sec:datastruct:Non-Partitionable Data Structures}?
}\QuickQuizAnswer{
In this case, more care is required because the hash table
might well be resized during the time that we momentarily
exited the RCU read-side critical section.
Worse yet, the resize operation can be expected to free the
old hash buckets, leaving us pointing to the freelist.
But it is not sufficient to prevent the old hash buckets
from being freed.
It is also necessary to ensure that those buckets continue
to be updated.
One way to handle this is to have a reference count on each
set of buckets, which is initially set to the value one.
A full-table scan would acquire a reference at the beginning of
the scan (but only if the reference is non-zero) and release it
at the end of the scan.
The resizing would populate the new buckets, release the
reference, wait for a grace period, and then wait for the
reference to go to zero.
Once the reference was zero, the resizing could let updaters
forget about the old hash buckets and then free them.
Actual implementation is left to the interested reader, who will
gain much insight from this task.
}\QuickQuizEnd
% @@@ RCU link counts