% defer/toyrcu.tex
\subsection{``Toy'' RCU Implementations}
\label{sec:defer:``Toy'' RCU Implementations}
The toy RCU implementations in this section are designed not for
high performance, practicality, or any kind of production use,\footnote{
However, production-quality user-level RCU implementations
are available~\cite{MathieuDesnoyers2009URCU}.}
but rather for clarity.
Nevertheless, you will need a thorough understanding of
Chapters~\ref{chp:Introduction},
\ref{chp:Hardware and its Habits},
\ref{chp:Tools of the Trade}, and
\ref{cha:Partitioning and Synchronization Design},
as well as the previous portions of
Chapter~\ref{chp:Deferred Processing}
for even these toy RCU implementations to be easily understandable.
This section provides a series of RCU implementations in order of
increasing sophistication, from the viewpoint of solving the
existence-guarantee problem.
Section~\ref{defer:Lock-Based RCU} presents a rudimentary
RCU implementation based on simple locking, while
Sections~\ref{defer:Per-Thread Lock-Based RCU} through
\ref{defer:RCU Based on Quiescent States}
present a series of
simple RCU implementations based on locking, reference counters,
and free-running counters.
Finally, Section~\ref{defer:Summary of Toy RCU Implementations}
provides a summary and a list of desirable RCU properties.
\subsubsection{Lock-Based RCU}
\label{defer:Lock-Based RCU}
\begin{figure}[bp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 spin_lock(&rcu_gp_lock);
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&rcu_gp_lock);
9 }
10
11 void synchronize_rcu(void)
12 {
13 spin_lock(&rcu_gp_lock);
14 spin_unlock(&rcu_gp_lock);
15 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Lock-Based RCU Implementation}
\label{fig:defer:Lock-Based RCU Implementation}
\end{figure}
Perhaps the simplest RCU implementation leverages locking, as
shown in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
(\path{rcu_lock.h} and \path{rcu_lock.c}).
In this implementation, \co{rcu_read_lock()} acquires a global
spinlock, \co{rcu_read_unlock()} releases it, and
\co{synchronize_rcu()} acquires it then immediately releases it.
Because \co{synchronize_rcu()} does not return until it has acquired
(and released) the lock, it cannot return until all prior RCU read-side
critical sections have completed, thus faithfully implementing
RCU semantics.
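For example, a reader and an updater might use these primitives as
follows.
This is a minimal sketch, in which the global pointer \co{gp}, its
\co{struct foo} type, and the \co{do_something_with()} function are
hypothetical stand-ins:
{ \scriptsize
\begin{verbatim}
struct foo *p;

/* Reader. */
rcu_read_lock();
p = rcu_dereference(gp);
if (p != NULL)
        do_something_with(p);
rcu_read_unlock();

/* Updater, removing the element referenced by gp. */
p = gp;
rcu_assign_pointer(gp, NULL);
synchronize_rcu();  /* wait for pre-existing readers */
free(p);
\end{verbatim}
}
Because \co{synchronize_rcu()} cannot return until the reader has
released \co{rcu_gp_lock}, the \co{free()} cannot possibly pull the
element out from under a pre-existing reader.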
Of course, only one RCU reader may be in its read-side critical section
at a time, which almost entirely defeats the purpose of RCU.
In addition, the lock operations in \co{rcu_read_lock()} and
\co{rcu_read_unlock()} are extremely heavyweight,
with read-side overhead ranging from about 100~nanoseconds on a single Power5
CPU up to more than 17~\emph{microseconds} on a 64-CPU system.
Worse yet,
these same lock operations permit \co{rcu_read_lock()}
to participate in deadlock cycles.
Furthermore, in the absence of recursive locks,
RCU read-side critical sections cannot be nested, and, finally,
although concurrent RCU updates could in principle be satisfied by
a common grace period, this implementation serializes grace periods,
preventing grace-period sharing.
\QuickQuiz{}
Why wouldn't any deadlock in the RCU implementation in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
also be a deadlock in any other RCU implementation?
\QuickQuizAnswer{
%
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void foo(void)
2 {
3 spin_lock(&my_lock);
4 rcu_read_lock();
5 do_something();
6 rcu_read_unlock();
7 do_something_else();
8 spin_unlock(&my_lock);
9 }
10
11 void bar(void)
12 {
13 rcu_read_lock();
14 spin_lock(&my_lock);
15 do_some_other_thing();
16 spin_unlock(&my_lock);
17 do_whatever();
18 rcu_read_unlock();
19 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Deadlock in Lock-Based RCU Implementation}
\label{fig:defer:Deadlock in Lock-Based RCU Implementation}
\end{figure}
%
Suppose the functions \co{foo()} and \co{bar()} in
Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation}
are invoked concurrently from different CPUs.
Then \co{foo()} will acquire \co{my_lock} on line~3,
while \co{bar()} will acquire \co{rcu_gp_lock} on
line~13.
When \co{foo()} advances to line~4, it will attempt to
acquire \co{rcu_gp_lock}, which is held by \co{bar()}.
Then when \co{bar()} advances to line~14, it will attempt
to acquire \co{my_lock}, which is held by \co{foo()}.
Each function is then waiting for a lock that the other
holds, a classic deadlock.
Other RCU implementations neither spin nor block in
\co{rcu_read_lock()}, hence avoiding deadlocks.
} \QuickQuizEnd
\QuickQuiz{}
Why not simply use reader-writer locks in the RCU implementation
in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
in order to allow RCU readers to proceed in parallel?
\QuickQuizAnswer{
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel
execution~\cite{McKenney03a}.
On the other hand, use of a reader-writer lock that is
read-acquired in \co{rcu_read_lock()} would avoid the
deadlock condition noted above.
} \QuickQuizEnd
It is hard to imagine this implementation being useful
in a production setting, though it does have the virtue
of being implementable in almost any user-level application.
Furthermore, similar implementations having one lock per CPU
or using reader-writer locks have been used in production
in the 2.4 Linux kernel.
A modified version of this one-lock-per-CPU approach, but instead using
one lock per thread, is described
in the next section.
\subsubsection{Per-Thread Lock-Based RCU}
\label{defer:Per-Thread Lock-Based RCU}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 spin_lock(&__get_thread_var(rcu_gp_lock));
4 }
5
6 static void rcu_read_unlock(void)
7 {
8 spin_unlock(&__get_thread_var(rcu_gp_lock));
9 }
10
11 void synchronize_rcu(void)
12 {
13 int t;
14
15 for_each_running_thread(t) {
16 spin_lock(&per_thread(rcu_gp_lock, t));
17 spin_unlock(&per_thread(rcu_gp_lock, t));
18 }
19 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread Lock-Based RCU Implementation}
\label{fig:defer:Per-Thread Lock-Based RCU Implementation}
\end{figure}
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
(\path{rcu_lock_percpu.h} and \path{rcu_lock_percpu.c})
shows an implementation based on one lock per thread.
The \co{rcu_read_lock()} and \co{rcu_read_unlock()} functions
acquire and release, respectively, the current thread's lock.
The \co{synchronize_rcu()} function acquires and releases each thread's
lock in turn.
Therefore, all RCU read-side critical sections running
when \co{synchronize_rcu()} starts must have completed before
\co{synchronize_rcu()} can return.
This implementation does have the virtue of permitting concurrent
RCU readers, and does avoid the deadlock condition that can arise
with a single global lock.
Furthermore, the read-side overhead, though high at roughly 140~nanoseconds,
remains at that level regardless of the number of CPUs.
However, the update-side overhead ranges from about 600~nanoseconds
on a single Power5 CPU
up to more than 100~\emph{microseconds} on 64 CPUs.
\QuickQuiz{}
Wouldn't it be cleaner to acquire all the locks, and then
release them all in the loop from lines~15-18 of
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}?
After all, with this change, there would be a point in time
when there were no readers, simplifying things greatly.
\QuickQuizAnswer{
Making this change would re-introduce the deadlock, so
no, it would not be cleaner.
} \QuickQuizEnd
\QuickQuiz{}
Is the implementation shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
free from deadlocks?
Why or why not?
\QuickQuizAnswer{
One deadlock is where a lock is
held across \co{synchronize_rcu()}, and that same lock is
acquired within an RCU read-side critical section.
However, this situation could deadlock any correctly designed
RCU implementation.
After all, the \co{synchronize_rcu()} primitive must wait for all
pre-existing RCU read-side critical sections to complete,
but if one of those critical sections is spinning on a lock
held by the thread executing the \co{synchronize_rcu()},
we have a deadlock inherent in the definition of RCU.
Another deadlock happens when attempting to nest RCU read-side
critical sections.
This deadlock is peculiar to this implementation, and might
be avoided by using recursive locks, or by using reader-writer
locks that are read-acquired by \co{rcu_read_lock()} and
write-acquired by \co{synchronize_rcu()}.
However, if we exclude the above two cases,
this implementation of RCU does not introduce any deadlock
situations.
This is because the only time one thread acquires another thread's lock
is when executing \co{synchronize_rcu()}, and in that case the lock
is immediately released, prohibiting any deadlock cycle that
does not involve a lock held across the \co{synchronize_rcu()},
which is the first case above.
} \QuickQuizEnd
\QuickQuiz{}
Isn't one advantage of the RCU algorithm shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
that it uses only primitives that are widely available,
for example, in POSIX pthreads?
\QuickQuizAnswer{
This is indeed an advantage, but do not forget that
\co{rcu_dereference()} and \co{rcu_assign_pointer()}
are still required, which means \co{volatile} manipulation
for \co{rcu_dereference()} and memory barriers for
\co{rcu_assign_pointer()}.
Of course, many Alpha CPUs require memory barriers for both
primitives.
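For reference, minimal sketches of these two primitives might
look as follows, assuming the \co{ACCESS_ONCE()} and
memory-barrier primitives used elsewhere in this section:
{ \scriptsize
\begin{verbatim}
#define rcu_assign_pointer(p, v) \
({ \
        smp_mb(); /* order initialization before publish */ \
        (p) = (v); \
})

#define rcu_dereference(p) \
({ \
        typeof(p) _p1 = ACCESS_ONCE(p); \
        smp_read_barrier_depends(); /* for DEC Alpha */ \
        _p1; \
})
\end{verbatim}
}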
} \QuickQuizEnd
This approach could be useful in some situations, given that a similar
approach was used in the
Linux 2.4 kernel~\cite{Molnar00a}.
The counter-based RCU implementation described next overcomes some of
the shortcomings of the lock-based implementation.
\subsubsection{Simple Counter-Based RCU}
\label{defer:Simple Counter-Based RCU}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic_t rcu_refcnt;
2
3 static void rcu_read_lock(void)
4 {
5 atomic_inc(&rcu_refcnt);
6 smp_mb();
7 }
8
9 static void rcu_read_unlock(void)
10 {
11 smp_mb();
12 atomic_dec(&rcu_refcnt);
13 }
14
15 void synchronize_rcu(void)
16 {
17 smp_mb();
18 while (atomic_read(&rcu_refcnt) != 0) {
19 poll(NULL, 0, 10);
20 }
21 smp_mb();
22 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Implementation Using Single Global Reference Counter}
\label{fig:defer:RCU Implementation Using Single Global Reference Counter}
\end{figure}
A slightly more sophisticated RCU implementation is shown in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}
(\path{rcu_rcg.h} and \path{rcu_rcg.c}).
This implementation makes use of a global reference counter
\co{rcu_refcnt} defined on line~1.
The \co{rcu_read_lock()} primitive atomically increments this
counter, then executes a memory barrier to ensure that the
RCU read-side critical section is ordered after the atomic
increment.
Similarly, \co{rcu_read_unlock()} executes a memory barrier to
confine the RCU read-side critical section, then atomically
decrements the counter.
The \co{synchronize_rcu()} primitive spins waiting for the reference
counter to reach zero, surrounded by memory barriers.
The \co{poll()} on line~19 merely provides pure delay, and from
a pure RCU-semantics point of view could be omitted.
Again, once \co{synchronize_rcu()} returns, all prior
RCU read-side critical sections are guaranteed to have completed.
In happy contrast to the lock-based implementation shown in
Section~\ref{defer:Lock-Based RCU}, this implementation
allows parallel execution of RCU read-side critical sections.
In happy contrast to the per-thread lock-based implementation shown in
Section~\ref{defer:Per-Thread Lock-Based RCU},
it also allows them to be nested.
In addition, the \co{rcu_read_lock()} primitive cannot possibly
participate in deadlock cycles, as it never spins nor blocks.
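For example, in the absence of other readers, a nested pair of
read-side critical sections manipulates \co{rcu_refcnt} as shown
in the following sketch, with \co{gp} and \co{do_something_with()}
again being hypothetical:
{ \scriptsize
\begin{verbatim}
rcu_read_lock();                 /* rcu_refcnt: 0 -> 1 */
rcu_read_lock();                 /* rcu_refcnt: 1 -> 2 */
do_something_with(rcu_dereference(gp));
rcu_read_unlock();               /* rcu_refcnt: 2 -> 1 */
rcu_read_unlock();               /* rcu_refcnt: 1 -> 0 */
\end{verbatim}
}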
\QuickQuiz{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
\QuickQuizAnswer{
Indeed, this would deadlock any legal RCU implementation.
But is \co{rcu_read_lock()} \emph{really} participating in
the deadlock cycle?
If you believe that it is, then please
ask yourself this same question when looking at the
RCU implementation in
Section~\ref{defer:RCU Based on Quiescent States}.
} \QuickQuizEnd
However, this implementation still has some serious shortcomings.
First, the atomic operations in \co{rcu_read_lock()} and
\co{rcu_read_unlock()} are still quite heavyweight,
with read-side overhead ranging from about 100~nanoseconds on
a single Power5 CPU up to almost 40~\emph{microseconds}
on a 64-CPU system.
This means that the RCU read-side critical sections
have to be extremely long in order to get any real
read-side parallelism.
On the other hand, in the absence of readers, grace periods elapse
in about 40~\emph{nanoseconds}, many orders of magnitude faster
than production-quality implementations in the Linux kernel.
\QuickQuiz{}
How can the grace period possibly elapse in 40 nanoseconds when
\co{synchronize_rcu()} contains a 10-millisecond delay?
\QuickQuizAnswer{
The update-side test was run in absence of readers, so the
\co{poll()} system call was never invoked.
In addition, the actual code has this \co{poll()}
system call commented out, the better to evaluate the
true overhead of the update-side code.
Any production uses of this code would be better served by
using the \co{poll()} system call, but then again,
production uses would be even better served by other implementations
shown later in this section.
} \QuickQuizEnd
Second, if there are many concurrent \co{rcu_read_lock()}
and \co{rcu_read_unlock()} operations, there will
be extreme memory contention on \co{rcu_refcnt},
resulting in expensive cache misses.
Both of these first two shortcomings largely defeat a major purpose of
RCU, namely to provide low-overhead read-side synchronization primitives.
Finally, a large number of RCU readers with long read-side
critical sections could prevent \co{synchronize_rcu()}
from ever completing, as the global counter might
never reach zero.
This could result in starvation of RCU updates, which
is of course unacceptable in production settings.
\QuickQuiz{}
Why not simply make \co{rcu_read_lock()} wait when a concurrent
\co{synchronize_rcu()} has been waiting too long in
the RCU implementation in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}?
Wouldn't that prevent \co{synchronize_rcu()} from starving?
\QuickQuizAnswer{
Although this would in fact eliminate the starvation, it would
also mean that \co{rcu_read_lock()} would spin or block waiting
for the writer, which is in turn waiting on readers.
If one of these readers is attempting to acquire a lock that
the spinning/blocking \co{rcu_read_lock()} holds, we again
have deadlock.
In short, the cure is worse than the disease.
See Section~\ref{defer:Starvation-Free Counter-Based RCU}
for a proper cure.
} \QuickQuizEnd
Therefore, it is still hard to imagine this implementation being useful
in a production setting, though it has a bit more potential
than the lock-based mechanism, for example, as an RCU implementation
suitable for a high-stress debugging environment.
The next section describes a variation on the reference-counting
scheme that is more favorable to writers.
\subsubsection{Starvation-Free Counter-Based RCU}
\label{defer:Starvation-Free Counter-Based RCU}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Global Reference-Count Pair Data}
\label{fig:defer:RCU Global Reference-Count Pair Data}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
9 __get_thread_var(rcu_read_idx) = i;
10 atomic_inc(&rcu_refcnt[i]);
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 atomic_dec(&rcu_refcnt[i]);
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Read-Side Using Global Reference-Count Pair}
\label{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
\end{figure}
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
(\path{rcu_rcgp.h})
shows the read-side primitives of an RCU implementation that uses a pair
of reference counters (\co{rcu_refcnt[]}),
along with a global index that
selects one counter out of the pair (\co{rcu_idx}),
a per-thread nesting counter \co{rcu_nesting},
a per-thread snapshot of the global index (\co{rcu_read_idx}),
and a global lock (\co{rcu_gp_lock}),
which are themselves shown in
Figure~\ref{fig:defer:RCU Global Reference-Count Pair Data}.
\paragraph{Design}
It is the two-element \co{rcu_refcnt[]} array that provides the freedom
from starvation.
The key point is that \co{synchronize_rcu()} is only required to wait
for pre-existing readers.
If a new reader starts after a given instance of \co{synchronize_rcu()}
has already begun execution, then that instance of \co{synchronize_rcu()}
need not wait on that new reader.
At any given time, when a given reader enters its RCU read-side critical
section via \co{rcu_read_lock()},
it increments the element of the \co{rcu_refcnt[]} array indicated by
the \co{rcu_idx} variable.
When that same reader exits its RCU read-side critical section via
\co{rcu_read_unlock()}, it decrements whichever element it incremented,
ignoring any possible subsequent changes to the \co{rcu_idx} value.
This arrangement means that \co{synchronize_rcu()} can avoid starvation
by complementing the value of \co{rcu_idx}, as in \co{rcu_idx = !rcu_idx}.
Suppose that the old value of \co{rcu_idx} was zero, so that the new
value is one.
New readers that arrive after the complement operation will increment
\co{rcu_refcnt[1]}, while the old readers that previously incremented
\co{rcu_refcnt[0]} will decrement \co{rcu_refcnt[0]} when they exit their
RCU read-side critical sections.
This means that the value of \co{rcu_refcnt[0]} will no longer be incremented,
and thus will be monotonically decreasing.\footnote{
There is a race condition that this ``monotonically decreasing''
statement ignores.
This race condition will be dealt with by the code for
\co{synchronize_rcu()}.
In the meantime, I suggest suspending disbelief.}
This means that all that \co{synchronize_rcu()} need do is wait for the
value of \co{rcu_refcnt[0]} to reach zero.
With this background, we are ready to look at the implementation of the
actual primitives.
\paragraph{Implementation}
The \co{rcu_read_lock()} primitive atomically increments the member of the
\co{rcu_refcnt[]} pair indexed by \co{rcu_idx}, and keeps a
snapshot of this index in the per-thread variable \co{rcu_read_idx}.
The \co{rcu_read_unlock()} primitive then atomically decrements
whichever counter of the pair that the corresponding \co{rcu_read_lock()}
incremented.
However, because only one value of \co{rcu_idx} is remembered per thread,
additional measures must be taken to permit nesting.
These additional measures use the per-thread \co{rcu_nesting} variable
to track nesting.
To make all this work, line~6 of \co{rcu_read_lock()} in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
picks up the
current thread's instance of \co{rcu_nesting}, and if line~7 finds
that this is the outermost \co{rcu_read_lock()},
then lines~8-10 pick up the current value of
\co{rcu_idx}, save it in this thread's instance of \co{rcu_read_idx},
and atomically increment the selected element of \co{rcu_refcnt}.
Regardless of the value of \co{rcu_nesting}, line~12 increments it.
Line~13 executes a memory barrier to ensure that the RCU read-side
critical section does not bleed out before the \co{rcu_read_lock()} code.
Similarly, the \co{rcu_read_unlock()} function executes a memory barrier
at line~21
to ensure that the RCU read-side critical section does not bleed out
after the \co{rcu_read_unlock()} code.
Line~22 picks up this thread's instance of \co{rcu_nesting}, and if
line~23 finds that this is the outermost \co{rcu_read_unlock()},
then lines~24 and 25 pick up this thread's instance of \co{rcu_read_idx}
(saved by the outermost \co{rcu_read_lock()}) and atomically decrement
the selected element of \co{rcu_refcnt}.
Regardless of the nesting level, line~27 decrements this thread's
instance of \co{rcu_nesting}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void synchronize_rcu(void)
2 {
3 int i;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 i = atomic_read(&rcu_idx);
8 atomic_set(&rcu_idx, !i);
9 smp_mb();
10 while (atomic_read(&rcu_refcnt[i]) != 0) {
11 poll(NULL, 0, 10);
12 }
13 smp_mb();
14 atomic_set(&rcu_idx, i);
15 smp_mb();
16 while (atomic_read(&rcu_refcnt[!i]) != 0) {
17 poll(NULL, 0, 10);
18 }
19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
21 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Update Using Global Reference-Count Pair}
\label{fig:defer:RCU Update Using Global Reference-Count Pair}
\end{figure}
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
(\path{rcu_rcgp.c})
shows the corresponding \co{synchronize_rcu()} implementation.
Lines~6 and 19 acquire and release \co{rcu_gp_lock} in order to
prevent more than one concurrent instance of \co{synchronize_rcu()}.
Lines~7-8 pick up the value of \co{rcu_idx} and complement it,
respectively, so that subsequent instances of \co{rcu_read_lock()}
will use a different element of \co{rcu_refcnt[]} than did preceding
instances.
Lines~10-12 then wait for the prior element of \co{rcu_refcnt[]} to
reach zero, with the memory barrier on line~9 ensuring that the check
of \co{rcu_refcnt} is not reordered to precede the complementing of
\co{rcu_idx}.
Lines~13-18 repeat this process, and line~20 ensures that any
subsequent reclamation operations are not reordered to precede the
checking of \co{rcu_refcnt}.
\QuickQuiz{}
Why the memory barrier on line~5 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
given that there is a spin-lock acquisition immediately after?
\QuickQuizAnswer{
The spin-lock acquisition only guarantees that the spin-lock's
critical section will not ``bleed out'' to precede the
acquisition.
It in no way guarantees that code preceding the spin-lock
acquisition won't be reordered into the critical section.
Such reordering could cause a removal from an RCU-protected
list to be reordered to follow the complementing of
\co{rcu_idx}, which could allow a newly starting RCU
read-side critical section to see the recently removed
data element.
Exercise for the reader: use a tool such as Promela/spin
to determine which (if any) of the memory barriers in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
are really needed.
See Chapter~\ref{chp:Formal Verification}
for information on using these tools.
The first correct and complete response will be credited.
} \QuickQuizEnd
\QuickQuiz{}
Why is the counter flipped twice in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}?
Shouldn't a single flip-and-wait cycle be sufficient?
\QuickQuizAnswer{
Both flips are absolutely required.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Line~8 of \co{rcu_read_lock()} in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
picks up \co{rcu_idx}, finding its value to be zero.
\item Line~8 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
complements the value of \co{rcu_idx}, setting its
value to one.
\item Lines~10-13 of \co{synchronize_rcu()} find that the
value of \co{rcu_refcnt[0]} is zero, so
\co{synchronize_rcu()} returns.
(Recall that the question is asking what happens if
lines~14-20 are omitted.)
\item Lines~9 and 10 of \co{rcu_read_lock()} store the
value zero to this thread's instance of \co{rcu_read_idx}
and increment \co{rcu_refcnt[0]}, respectively.
Execution then proceeds into the RCU read-side critical
section.
\label{defer:rcu_rcgp:RCU Read Side Start}
\item Another instance of \co{synchronize_rcu()} again complements
\co{rcu_idx}, this time setting its value to zero.
Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()}
returns immediately.
(Recall that \co{rcu_read_lock()} incremented
\co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!)
\label{defer:rcu_rcgp:RCU Grace Period Start}
\item The grace period that started in
step~\ref{defer:rcu_rcgp:RCU Grace Period Start}
has been allowed to end, despite
the fact that the RCU read-side critical section
that started beforehand in
step~\ref{defer:rcu_rcgp:RCU Read Side Start}
has not completed.
This violates RCU semantics, and could allow the update
to free a data element that the RCU read-side critical
section was still referencing.
\end{enumerate}
Exercise for the reader: What happens if \co{rcu_read_lock()}
is preempted for a very long time (hours!) just after
line~8?
Does this implementation operate correctly in that case?
Why or why not?
The first correct and complete response will be credited.
} \QuickQuizEnd
This implementation avoids the update-starvation issues that could
occur in the single-counter implementation shown in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}.
\paragraph{Discussion}
There are still some serious shortcomings.
First, the atomic operations in \co{rcu_read_lock()}
and \co{rcu_read_unlock()}
are still quite heavyweight.
In fact, they are more complex than those
of the single-counter variant shown in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter},
with the read-side primitives consuming about 150~nanoseconds on a single
Power5 CPU and almost 40~\emph{microseconds} on a 64-CPU system.
The update-side \co{synchronize_rcu()} primitive is more costly as
well, ranging from about 200~nanoseconds on a single Power5 CPU to
more than 40~\emph{microseconds} on a 64-CPU system.
This means that the RCU read-side critical sections
have to be extremely long in order to get any real
read-side parallelism.
Second, if there are many concurrent \co{rcu_read_lock()}
and \co{rcu_read_unlock()} operations, there will
be extreme memory contention on the \co{rcu_refcnt}
elements, resulting in expensive cache misses.
This further extends the RCU read-side critical-section
duration required to provide parallel read-side access.
These first two shortcomings defeat the purpose of RCU in most
situations.
Third, the need to flip \co{rcu_idx} twice imposes substantial
overhead on updates, especially if there are large
numbers of threads.
Finally, despite the fact that concurrent RCU updates could in principle be
satisfied by a common grace period, this implementation
serializes grace periods, preventing grace-period
sharing.
\QuickQuiz{}
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line~10 and a
non-atomic decrement on line~25 of
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}?
\QuickQuizAnswer{
Using non-atomic operations would cause increments and decrements
to be lost, in turn causing the implementation to fail.
See Section~\ref{defer:Scalable Counter-Based RCU}
for a safe way to use non-atomic operations in
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
} \QuickQuizEnd
Despite these shortcomings, one could imagine this variant
of RCU being used on small tightly coupled multiprocessors,
perhaps as a memory-conserving implementation that maintains
API compatibility with more complex implementations.
However, it would not likely scale well beyond a few CPUs.
The next section describes yet another variation on the reference-counting
scheme that provides greatly improved read-side performance and scalability.
\subsubsection{Scalable Counter-Based RCU}
\label{defer:Scalable Counter-Based RCU}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Per-Thread Reference-Count Pair Data}
\label{fig:defer:RCU Per-Thread Reference-Count Pair Data}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = atomic_read(&rcu_idx);
9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Read-Side Using Per-Thread Reference-Count Pair}
\label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair}
\end{figure}
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair}
(\path{rcu_rcpl.h})
shows the read-side primitives of an RCU implementation that uses per-thread
pairs of reference counters.
This implementation is quite similar to that shown in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair},
the only difference being that \co{rcu_refcnt} is now a per-thread
array (as shown in
Figure~\ref{fig:defer:RCU Per-Thread Reference-Count Pair Data}).
As with the algorithm in the previous section, use of this two-element
array prevents readers from starving updaters.
One benefit of the per-thread \co{rcu_refcnt[]} array is that the
\co{rcu_read_lock()} and \co{rcu_read_unlock()} primitives no longer
perform atomic operations.
\QuickQuiz{}
Come off it!
We can see the \co{atomic_read()} primitive in
\co{rcu_read_lock()}!!!
So why are you trying to pretend that \co{rcu_read_lock()}
contains no atomic operations???
\QuickQuizAnswer{
The \co{atomic_read()} primitive does not actually execute
atomic machine instructions, but rather does a normal load
from an \co{atomic_t}.
Its sole purpose is to keep the compiler's type-checking happy.
If the Linux kernel ran on 8-bit CPUs, it would also need to
prevent ``store tearing'', which could happen due to the need
to store a 16-bit pointer with two eight-bit accesses on some
8-bit systems.
But thankfully, it seems that no one runs Linux on 8-bit systems.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void flip_counter_and_wait(int i)
2 {
3 int t;
4
5 atomic_set(&rcu_idx, !i);
6 smp_mb();
7 for_each_thread(t) {
8 while (per_thread(rcu_refcnt, t)[i] != 0) {
9 poll(NULL, 0, 10);
10 }
11 }
12 smp_mb();
13 }
14
15 void synchronize_rcu(void)
16 {
17 int i;
18
19 smp_mb();
20 spin_lock(&rcu_gp_lock);
21 i = atomic_read(&rcu_idx);
22 flip_counter_and_wait(i);
23 flip_counter_and_wait(!i);
24 spin_unlock(&rcu_gp_lock);
25 smp_mb();
26 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Update Using Per-Thread Reference-Count Pair}
\label{fig:defer:RCU Update Using Per-Thread Reference-Count Pair}
\end{figure}
Figure~\ref{fig:defer:RCU Update Using Per-Thread Reference-Count Pair}
(\path{rcu_rcpl.c})
shows the implementation of \co{synchronize_rcu()}, along with a helper
function named \co{flip_counter_and_wait()}.
The \co{synchronize_rcu()} function resembles that shown in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair},
except that the repeated counter flip is replaced by a pair of calls
on lines~22 and 23 to the new helper function.
The new \co{flip_counter_and_wait()} function updates the
\co{rcu_idx} variable on line~5, executes a memory barrier on line~6,
then lines~7-11 spin on each thread's prior \co{rcu_refcnt} element,
waiting for it to go to zero.
Once all such elements have gone to zero,
it executes another memory barrier on line~12 and returns.
This RCU implementation imposes important new requirements on its
software environment, namely, (1) that it be possible to declare
per-thread variables, (2) that these per-thread variables be accessible
from other threads, and (3) that it be possible to enumerate all threads.
These requirements can be met in almost all software environments,
but often result in fixed upper bounds on the number of threads.
More-complex implementations might avoid such bounds, for example, by using
expandable hash tables.
Such implementations might dynamically track threads, for example, by
adding them on their first call to \co{rcu_read_lock()}.
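For example, a lazily registering \co{rcu_read_lock()} might look
as follows; this is only a sketch, and both the \co{rcu_registered}
flag and the \co{rcu_thread_online()} registration function are
hypothetical:
{ \scriptsize
\begin{verbatim}
DEFINE_PER_THREAD(int, rcu_registered);

static void rcu_read_lock(void)
{
        int i;
        int n;

        if (!__get_thread_var(rcu_registered)) {
                /* Hypothetical: add this thread to the
                 * set scanned by synchronize_rcu(). */
                rcu_thread_online();
                __get_thread_var(rcu_registered) = 1;
        }
        n = __get_thread_var(rcu_nesting);
        if (n == 0) {
                i = atomic_read(&rcu_idx);
                __get_thread_var(rcu_read_idx) = i;
                __get_thread_var(rcu_refcnt)[i]++;
        }
        __get_thread_var(rcu_nesting) = n + 1;
        smp_mb();
}
\end{verbatim}
}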
\QuickQuiz{}
Great, if we have $N$ threads, we can have $2N$ ten-millisecond
waits (one set per \co{flip_counter_and_wait()} invocation),
and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete \emph{much} more quickly?
\QuickQuizAnswer{
Keep in mind that we only wait for a given thread if that thread
is still in a pre-existing RCU read-side critical section,
and that waiting for one hold-out thread gives all the other
threads a chance to complete any pre-existing RCU read-side
critical sections that they might still be executing.
So the only way that we would wait for $2N$ intervals
would be if the last thread still remained in a pre-existing
RCU read-side critical section despite all the waiting for
all the prior threads.
In short, this implementation will not wait unnecessarily.
However, if you are stress-testing code that uses RCU, you
might want to comment out the \co{poll()} statement in
order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU
read-side critical section.
} \QuickQuizEnd
This implementation still has several shortcomings.
First, the need to flip \co{rcu_idx} twice imposes substantial overhead
on updates, especially if there are large numbers of threads.
Second, \co{synchronize_rcu()} must now examine a number of variables
that increases linearly with the number of threads, imposing substantial
overhead on applications with large numbers of threads.
Third, as before, although concurrent RCU updates could in principle
be satisfied by a common grace period, this implementation serializes
grace periods, preventing grace-period sharing.
Finally, as noted in the text, the need for per-thread variables
and for enumerating threads may be problematic in some software
environments.
That said, the read-side primitives scale very nicely, requiring about
115~nanoseconds regardless of whether running on a single-CPU or a 64-CPU
Power5 system.
As noted above, the \co{synchronize_rcu()} primitive does not scale,
ranging in overhead from almost a microsecond on a single Power5 CPU
up to almost 200~microseconds on a 64-CPU system.
This implementation could conceivably form the basis for a
production-quality user-level RCU implementation.
The next section describes an algorithm permitting more efficient
concurrent RCU updates.
\subsubsection{Scalable Counter-Based RCU With Shared Grace Periods}
\label{defer:Scalable Counter-Based RCU With Shared Grace Periods}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 long rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data}
\label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 int i;
4 int n;
5
6 n = __get_thread_var(rcu_nesting);
7 if (n == 0) {
8 i = ACCESS_ONCE(rcu_idx) & 0x1;
9 __get_thread_var(rcu_read_idx) = i;
10 __get_thread_var(rcu_refcnt)[i]++;
11 }
12 __get_thread_var(rcu_nesting) = n + 1;
13 smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18 int i;
19 int n;
20
21 smp_mb();
22 n = __get_thread_var(rcu_nesting);
23 if (n == 1) {
24 i = __get_thread_var(rcu_read_idx);
25 __get_thread_var(rcu_refcnt)[i]--;
26 }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update}
\label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update}
\end{figure}
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update}
(\path{rcu_rcpls.h})
shows the read-side primitives for an RCU implementation using per-thread
reference count pairs, as before, but permitting updates to share
grace periods.
The main difference from the earlier implementation shown in
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair}
is that \co{rcu_idx} is now a \co{long} that counts freely,
so that line~8 of
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update}
must mask off the low-order bit.
We also switched from using \co{atomic_read()} and \co{atomic_set()}
to using \co{ACCESS_ONCE()}.
The data is also quite similar, as shown in
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data},
with \co{rcu_idx} now being a \co{long} instead of an
\co{atomic_t}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void flip_counter_and_wait(int ctr)
2 {
3 int i;
4 int t;
5
6 ACCESS_ONCE(rcu_idx) = ctr + 1;
7 i = ctr & 0x1;
8 smp_mb();
9 for_each_thread(t) {
10 while (per_thread(rcu_refcnt, t)[i] != 0) {
11 poll(NULL, 0, 10);
12 }
13 }
14 smp_mb();
15 }
16
17 void synchronize_rcu(void)
18 {
19 int ctr;
20 int oldctr;
21
22 smp_mb();
23 oldctr = ACCESS_ONCE(rcu_idx);
24 smp_mb();
25 spin_lock(&rcu_gp_lock);
26 ctr = ACCESS_ONCE(rcu_idx);
27 if (ctr - oldctr >= 3) {
28 spin_unlock(&rcu_gp_lock);
29 smp_mb();
30 return;
31 }
32 flip_counter_and_wait(ctr);
33 if (ctr - oldctr < 2)
34 flip_counter_and_wait(ctr + 1);
35 spin_unlock(&rcu_gp_lock);
36 smp_mb();
37 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Shared Update Using Per-Thread Reference-Count Pair}
\label{fig:defer:RCU Shared Update Using Per-Thread Reference-Count Pair}
\end{figure}
Figure~\ref{fig:defer:RCU Shared Update Using Per-Thread Reference-Count Pair}
(\path{rcu_rcpls.c})
shows the implementation of \co{synchronize_rcu()} and its helper
function \co{flip_counter_and_wait()}.
These are similar to those in
Figure~\ref{fig:defer:RCU Update Using Per-Thread Reference-Count Pair}.
The differences in \co{flip_counter_and_wait()} include:
\begin{enumerate}
\item Line~6 uses \co{ACCESS_ONCE()} instead of \co{atomic_set()},
and increments rather than complementing.
\item A new line~7 masks the counter down to its bottom bit.
\end{enumerate}
The changes to \co{synchronize_rcu()} are more pervasive:
\begin{enumerate}
\item There is a new \co{oldctr} local variable that captures
the pre-lock-acquisition value of \co{rcu_idx} on
line~23.
\item Line~26 uses \co{ACCESS_ONCE()} instead of \co{atomic_read()}.
\item Lines~27-30 check to see if at least three counter flips were
performed by other threads while the lock was being acquired,
and, if so, release the lock, execute a memory barrier, and return.
In this case, there were two full waits for the counters to
go to zero, so those other threads already did all the required work.
\item At lines~33-34, \co{flip_counter_and_wait()} is only
invoked a second time if there were fewer than two counter flips
while the lock was being acquired.
On the other hand, if there were two counter flips, some other
thread did one full wait for all the counters to go to zero,
so only one more is required.
\end{enumerate}
With this approach, if an arbitrarily large number of threads invoke
\co{synchronize_rcu()} concurrently, with one CPU for each thread, there
will be a total of only three waits for counters to go to zero.
Despite the improvements, this implementation of RCU still
has a few shortcomings.
First, as before, the need to flip \co{rcu_idx} twice imposes substantial
overhead on updates, especially if there are large
numbers of threads.
Second, each updater still acquires \co{rcu_gp_lock}, even if there
is no work to be done.
This can result in a severe scalability limitation
if there are large numbers of concurrent updates.
There are ways of avoiding this, as was done in a
production-quality real-time implementation of RCU for the Linux
kernel~\cite{PaulEMcKenney2007PreemptibleRCU}.
Third, this implementation requires per-thread variables
and the ability to enumerate threads, which again can be
problematic in some software environments.
Finally, on 32-bit machines, a given update thread might be
preempted long enough for the \co{rcu_idx}
counter to overflow.
This could cause such a thread to force an unnecessary
pair of counter flips.
However, even if each grace period took only one
microsecond, the offending thread would need to be
preempted for more than an hour, in which case an
extra pair of counter flips is likely the least of
your worries.
As with the implementation described in
Section~\ref{defer:Simple Counter-Based RCU},
the read-side primitives scale extremely well, incurring roughly
115~nanoseconds of overhead regardless of the number of CPUs.
The \co{synchronize_rcu()} primitive is still expensive,
ranging from about one microsecond up to about 16~microseconds.
This is nevertheless much cheaper than the roughly 200~microseconds
incurred by the implementation in
Section~\ref{defer:Scalable Counter-Based RCU}.
So, despite its shortcomings, one could imagine this
RCU implementation being used in production in real-life applications.
\QuickQuiz{}
All of these toy RCU implementations have either atomic operations
in \co{rcu_read_lock()} and \co{rcu_read_unlock()},
or \co{synchronize_rcu()}
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic ($O\left(1\right)$) overheads and latencies?
\QuickQuizAnswer{
Special-purpose uniprocessor implementations of RCU can attain
this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}.
} \QuickQuizEnd
Referring back to
Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update},
we see that there is one global-variable access and no fewer than four
accesses to thread-local variables.
Given the relatively high cost of thread-local accesses on systems
implementing POSIX threads, it is tempting to collapse the three
thread-local variables into a single structure, permitting
\co{rcu_read_lock()} and \co{rcu_read_unlock()} to access their
thread-local data with a single thread-local-storage access.
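A sketch of this consolidation might look as follows, with the
\co{rcu_reader} structure and its name being hypothetical:
{ \scriptsize
\begin{verbatim}
struct rcu_reader {
        int refcnt[2];
        int nesting;
        int read_idx;
};
DEFINE_PER_THREAD(struct rcu_reader, rcu_reader);

static void rcu_read_lock(void)
{
        struct rcu_reader *rrp;

        rrp = &__get_thread_var(rcu_reader);
        if (rrp->nesting++ == 0) {
                rrp->read_idx = ACCESS_ONCE(rcu_idx) & 0x1;
                rrp->refcnt[rrp->read_idx]++;
        }
        smp_mb();
}
\end{verbatim}
}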
However, an even better approach would be to reduce the number of
thread-local accesses to one, as is done in the next section.
\subsubsection{RCU Based on Free-Running Counter}
\label{defer:RCU Based on Free-Running Counter}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_gp);
4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);
\end{verbbox}
}
\centering
\theverbbox
\caption{Data for Free-Running Counter Using RCU}
\label{fig:defer:Data for Free-Running Counter Using RCU}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 __get_thread_var(rcu_reader_gp) =
4 ACCESS_ONCE(rcu_gp_ctr) + 1;
5 smp_mb();
6 }
7
8 static void rcu_read_unlock(void)
9 {
10 smp_mb();
11 __get_thread_var(rcu_reader_gp) =
12 ACCESS_ONCE(rcu_gp_ctr);
13 }
14
15 void synchronize_rcu(void)
16 {
17 int t;
18
19 smp_mb();
20 spin_lock(&rcu_gp_lock);
21 ACCESS_ONCE(rcu_gp_ctr) += 2;
22 smp_mb();
23 for_each_thread(t) {
24 while ((per_thread(rcu_reader_gp, t) & 0x1) &&
25 ((per_thread(rcu_reader_gp, t) -
26 ACCESS_ONCE(rcu_gp_ctr)) < 0)) {
27 poll(NULL, 0, 10);
28 }
29 }
30 spin_unlock(&rcu_gp_lock);
31 smp_mb();
32 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Free-Running Counter Using RCU}
\label{fig:defer:Free-Running Counter Using RCU}
\end{figure}
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
(\path{rcu.h} and \path{rcu.c})
shows an RCU implementation based on a single global free-running counter
that takes on only even-numbered values, with data shown in
Figure~\ref{fig:defer:Data for Free-Running Counter Using RCU}.
The resulting \co{rcu_read_lock()} implementation is extremely
straightforward.
Lines~3 and~4 simply add one to the value of the global free-running
\co{rcu_gp_ctr} variable and store the resulting odd-numbered value into the
\co{rcu_reader_gp} per-thread variable.
Line~5 executes a memory barrier to prevent the content of the
subsequent RCU read-side critical section from ``leaking out''.
The \co{rcu_read_unlock()} implementation is similar.
Line~10 executes a memory barrier, again to prevent the prior RCU
read-side critical section from ``leaking out''.
Lines~11 and~12 then copy the \co{rcu_gp_ctr} global variable to the
\co{rcu_reader_gp} per-thread variable, leaving this per-thread
variable with an even-numbered value so that a concurrent instance
of \co{synchronize_rcu()} will know to ignore it.
\QuickQuiz{}
If any even value is sufficient to tell \co{synchronize_rcu()}
to ignore a given task, why don't lines~11 and~12 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
simply assign zero to \co{rcu_reader_gp}?
\QuickQuizAnswer{
Assigning zero (or any other even-numbered constant)
would in fact work, but assigning the value of
\co{rcu_gp_ctr} can provide a valuable debugging aid,
as it gives the developer an idea of when the corresponding
thread last exited an RCU read-side critical section.
} \QuickQuizEnd
Thus, \co{synchronize_rcu()} could wait for all of the per-thread
\co{rcu_reader_gp} variables to take on even-numbered values.
However, it is possible to do much better than that because
\co{synchronize_rcu()} need only wait on \emph{pre-existing}
RCU read-side critical sections.
Line~19 executes a memory barrier to prevent prior manipulations
of RCU-protected data structures from being reordered (by either
the CPU or the compiler) to follow the increment on line~21.
Line~20 acquires the \co{rcu_gp_lock} (and line~30 releases it)
in order to prevent multiple
\co{synchronize_rcu()} instances from running concurrently.
Line~21 then increments the global \co{rcu_gp_ctr} variable by
two, so that all pre-existing RCU read-side critical sections will
have corresponding per-thread \co{rcu_reader_gp} variables with
values less than that of \co{rcu_gp_ctr}, modulo the machine's
word size.
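The signed subtraction on lines~25 and~26 tolerates this wrap.
For example, on a hypothetical machine with a 16-bit \co{long}:
{ \scriptsize
\begin{verbatim}
/* Suppose rcu_gp_ctr has wrapped around to 0x0004, and a
 * pre-existing reader's rcu_reader_gp holds 0xfffd (odd,
 * hence an active reader).  Then 0xfffd - 0x0004 == 0xfff9,
 * which interpreted as a signed 16-bit quantity is -7, and
 * thus less than zero.  synchronize_rcu() therefore waits
 * for this reader, where an unsigned comparison would have
 * incorrectly skipped it. */
\end{verbatim}
}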
Recall also that threads with even-numbered values of \co{rcu_reader_gp}
are not in an RCU read-side critical section,
so that lines~23-29 scan the \co{rcu_reader_gp} values until they
all are either even (line~24) or are greater than the global
\co{rcu_gp_ctr} (lines~25-26).
Line~27 blocks for a short period of time to wait for a
pre-existing RCU read-side critical section, but this can be replaced with
a spin-loop if grace-period latency is of the essence.
Finally, the memory barrier at line~31 ensures that any subsequent
destruction will not be reordered into the preceding loop.
\QuickQuiz{}
Why are the memory barriers on lines~19 and~31 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
needed?
Aren't the memory barriers inherent in the locking
primitives on lines~20 and~30 sufficient?
\QuickQuizAnswer{
These memory barriers are required because the locking
primitives are only guaranteed to confine the critical
section.
The locking primitives are under absolutely no obligation
to keep other code from bleeding into the critical section.
The pair of memory barriers is therefore required to prevent
this sort of code motion, whether performed by the compiler
or by the CPU.
} \QuickQuizEnd
This approach achieves much better read-side performance, incurring
roughly 63~nanoseconds of overhead regardless of the number of
Power5 CPUs.
Updates incur more overhead, ranging from about 500~nanoseconds on
a single Power5 CPU to more than 100~\emph{microseconds} on 64
such CPUs.
\QuickQuiz{}
Couldn't the update-side batching optimization described in
Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods}
be applied to the implementation shown in
Figure~\ref{fig:defer:Free-Running Counter Using RCU}?
\QuickQuizAnswer{
Indeed it could, with a few modifications.
This work is left as an exercise for the reader.
} \QuickQuizEnd
This implementation suffers from some serious shortcomings in
addition to the high update-side overhead noted earlier.
First, it is no longer permissible to nest RCU read-side critical
sections, a topic that is taken up in the next section.
Second, if a reader is preempted at line~3 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU} after fetching from
\co{rcu_gp_ctr} but before storing to \co{rcu_reader_gp},
and if the \co{rcu_gp_ctr} counter then runs through more than half
but less than all of its possible values, then \co{synchronize_rcu()}
will ignore the subsequent RCU read-side critical section.
Third and finally, this implementation requires that the enclosing software
environment be able to enumerate threads and maintain per-thread
variables.
\QuickQuiz{}
Is the possibility of readers being preempted in
lines~3-4 of Figure~\ref{fig:defer:Free-Running Counter Using RCU}
a real problem, in other words, is there a real sequence
of events that could lead to failure?
If not, why not?
If so, what is the sequence of events, and how can the
failure be addressed?
\QuickQuizAnswer{
It is a real problem, there is a sequence of events leading to
failure, and there are a number of possible ways of
addressing it.
For more details, see the Quick Quizzes near the end of
Section~\ref{defer:Nestable RCU Based on Free-Running Counter}.
The reason for locating the discussion there is (1) to give you
more time to think about it, and (2) that the nesting support
added in that section greatly reduces the time required to
overflow the counter.
} \QuickQuizEnd
\subsubsection{Nestable RCU Based on Free-Running Counter}
\label{defer:Nestable RCU Based on Free-Running Counter}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 #define RCU_GP_CTR_SHIFT 7
3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
5 long rcu_gp_ctr = 0;
6 DEFINE_PER_THREAD(long, rcu_reader_gp);
\end{verbbox}
}
\centering
\theverbbox
\caption{Data for Nestable RCU Using a Free-Running Counter}
\label{fig:defer:Data for Nestable RCU Using a Free-Running Counter}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 long tmp;
4 long *rrgp;
5
6 rrgp = &__get_thread_var(rcu_reader_gp);
7 tmp = *rrgp;
8 if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
9 tmp = ACCESS_ONCE(rcu_gp_ctr);
10 tmp++;
11 *rrgp = tmp;
12 smp_mb();
13 }
14
15 static void rcu_read_unlock(void)
16 {
17 long tmp;
18
19 smp_mb();
20 __get_thread_var(rcu_reader_gp)--;
21 }
22
23 void synchronize_rcu(void)
24 {
25 int t;
26
27 smp_mb();
28 spin_lock(&rcu_gp_lock);
29 ACCESS_ONCE(rcu_gp_ctr) +=
30 RCU_GP_CTR_BOTTOM_BIT;
31 smp_mb();
32 for_each_thread(t) {
33 while (rcu_gp_ongoing(t) &&
34 ((per_thread(rcu_reader_gp, t) -
35 rcu_gp_ctr) < 0)) {
36 poll(NULL, 0, 10);
37 }
38 }
39 spin_unlock(&rcu_gp_lock);
40 smp_mb();
41 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Nestable RCU Using a Free-Running Counter}
\label{fig:defer:Nestable RCU Using a Free-Running Counter}
\end{figure}
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}
(\path{rcu_nest.h} and \path{rcu_nest.c})
shows an RCU implementation based on a single global free-running counter,
but that permits nesting of RCU read-side critical sections.
This nestability is accomplished by reserving the low-order bits of the
global \co{rcu_gp_ctr} to count nesting, using the definitions shown in
Figure~\ref{fig:defer:Data for Nestable RCU Using a Free-Running Counter}.
This is a generalization of the scheme in
Section~\ref{defer:RCU Based on Free-Running Counter},
which can be thought of as having a single low-order bit reserved
for counting nesting depth.
Two C-preprocessor macros are used to arrange this,
\co{RCU_GP_CTR_NEST_MASK} and
\co{RCU_GP_CTR_BOTTOM_BIT}.
These are related: \co{RCU_GP_CTR_NEST_MASK=RCU_GP_CTR_BOTTOM_BIT-1}.
The \co{RCU_GP_CTR_BOTTOM_BIT} macro contains a single bit that is
positioned just above the bits reserved for counting nesting,
and the \co{RCU_GP_CTR_NEST_MASK} has all one bits covering the
region of \co{rcu_gp_ctr} used to count nesting.
Obviously, these two C-preprocessor macros must reserve enough
of the low-order bits of the counter to permit the maximum required
nesting of RCU read-side critical sections, and this implementation
reserves seven bits, for a maximum RCU read-side critical-section
nesting depth of 127, which should be well in excess of that needed
by most applications.
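For example, with \co{RCU_GP_CTR_SHIFT} equal to seven, the
counter is laid out as follows:
{ \scriptsize
\begin{verbatim}
/* RCU_GP_CTR_BOTTOM_BIT == 0x80
 * RCU_GP_CTR_NEST_MASK  == 0x7f
 *
 * A reader that entered its outermost critical section when
 * rcu_gp_ctr was 0x180, and that is now at nesting depth 3,
 * has rcu_reader_gp == 0x183:
 *
 *   0x183 & RCU_GP_CTR_NEST_MASK  == 0x003 (nesting depth)
 *   0x183 & ~RCU_GP_CTR_NEST_MASK == 0x180 (grace period)
 */
\end{verbatim}
}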
The resulting \co{rcu_read_lock()} implementation is still reasonably
straightforward.
Line~6 places a pointer to this thread's instance of \co{rcu_reader_gp}
into the local variable \co{rrgp}, minimizing the number of expensive
calls to the pthreads thread-local-state API.
Line~7 records the current value of \co{rcu_reader_gp} into another
local variable \co{tmp}, and line~8 checks to see if the low-order
bits are zero, which would indicate that this is the outermost
\co{rcu_read_lock()}.
If so, line~9 places the global \co{rcu_gp_ctr} into \co{tmp} because
the current value previously fetched by line~7 is likely to be obsolete.
In either case, line~10 increments the nesting depth, which you will
recall is stored in the seven low-order bits of the counter.
Line~11 stores the updated counter back into this thread's instance
of \co{rcu_reader_gp}, and, finally, line~12 executes a memory
barrier to prevent the RCU read-side critical section from bleeding out
into the code preceding the call to \co{rcu_read_lock()}.
In other words, this implementation of \co{rcu_read_lock()} picks up a copy
of the global \co{rcu_gp_ctr} unless the current invocation of
\co{rcu_read_lock()} is nested within an RCU read-side critical section,
in which case it instead fetches the contents of the current thread's
instance of \co{rcu_reader_gp}.
Either way, it increments whatever value it fetched in order to record
an additional nesting level, and stores the result in the current
thread's instance of \co{rcu_reader_gp}.
Interestingly enough, despite their \co{rcu_read_lock()} differences,
the implementation of \co{rcu_read_unlock()}
is broadly similar to that shown in
Section~\ref{defer:RCU Based on Free-Running Counter}.
Line~19 executes a memory barrier in order to prevent the RCU read-side
critical section from bleeding out into code following the call
to \co{rcu_read_unlock()}, and
line~20 decrements this thread's instance of \co{rcu_reader_gp},
which has the effect of decrementing the nesting count contained in
\co{rcu_reader_gp}'s low-order bits.
Debugging versions of this primitive would check (before decrementing!)
that these low-order bits were non-zero.
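A sketch of such a debugging version, assuming the standard
\co{assert()} from \co{<assert.h>}, might be:
{ \scriptsize
\begin{verbatim}
static void rcu_read_unlock(void)
{
        smp_mb();
        /* Catch unbalanced rcu_read_unlock() calls. */
        assert((__get_thread_var(rcu_reader_gp) &
                RCU_GP_CTR_NEST_MASK) != 0);
        __get_thread_var(rcu_reader_gp)--;
}
\end{verbatim}
}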
The implementation of \co{synchronize_rcu()} is quite similar to
that shown in
Section~\ref{defer:RCU Based on Free-Running Counter}.
There are two differences.
The first is that lines~29 and~30 add \co{RCU_GP_CTR_BOTTOM_BIT}
to the global \co{rcu_gp_ctr} instead of adding the constant ``2'',
and the second is that the comparison on line~33 has been abstracted
out to a separate function, where it checks the bits indicated
by \co{RCU_GP_CTR_NEST_MASK} instead of unconditionally checking
the low-order bit.
This approach achieves read-side performance almost equal to that
shown in
Section~\ref{defer:RCU Based on Free-Running Counter}, incurring
roughly 65~nanoseconds of overhead regardless of the number of
Power5 CPUs.
Updates again incur more overhead, ranging from about 600~nanoseconds on
a single Power5 CPU to more than 100~\emph{microseconds} on 64
such CPUs.
\QuickQuiz{}
Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having
all this complicated bit manipulation?
\QuickQuizAnswer{
The apparent simplicity of the separate per-thread variable
is a red herring.
This approach incurs much greater complexity in the guise
of careful ordering of operations, especially if signal
handlers are to be permitted to contain RCU read-side
critical sections.
	But don't take my word for it: code it up and see what you
	end up with!
} \QuickQuizEnd
This implementation suffers from the same shortcomings as does that of
Section~\ref{defer:RCU Based on Free-Running Counter}, except that
nesting of RCU read-side critical sections is now permitted.
In addition, on 32-bit systems, this approach shortens the time
required to overflow the global \co{rcu_gp_ctr} variable.
The following section shows one way to greatly increase the time
required for overflow to occur, while greatly reducing read-side
overhead.
\QuickQuiz{}
Given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
how could you double the time required to overflow the global
\co{rcu_gp_ctr}?
\QuickQuizAnswer{
One way would be to replace the magnitude comparison on
lines~33 and 34 with an inequality check of the per-thread
\co{rcu_reader_gp} variable against
\co{rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT}.
} \QuickQuizEnd
\QuickQuiz{}
Again, given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
\QuickQuizAnswer{
It can indeed be fatal.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~0 enters \co{rcu_read_lock()}, determines
that it is not nested, and therefore fetches the
value of the global \co{rcu_gp_ctr}.
Thread~0 is then preempted for an extremely long time
(before storing to its per-thread \co{rcu_reader_gp}
variable).
\item Other threads repeatedly invoke \co{synchronize_rcu()},
so that the new value of the global \co{rcu_gp_ctr}
is now \co{RCU_GP_CTR_BOTTOM_BIT}
less than it was when thread~0 fetched it.
\item Thread~0 now starts running again, and stores into
its per-thread \co{rcu_reader_gp} variable.
The value it stores is
\co{RCU_GP_CTR_BOTTOM_BIT+1}
greater than that of the global \co{rcu_gp_ctr}.
\item Thread~0 acquires a reference to RCU-protected data
element~A.
\item Thread~1 now removes the data element~A that thread~0
just acquired a reference to.
\item Thread~1 invokes \co{synchronize_rcu()}, which
increments the global \co{rcu_gp_ctr} by
\co{RCU_GP_CTR_BOTTOM_BIT}.
It then checks all of the per-thread \co{rcu_reader_gp}
variables, but thread~0's value (incorrectly) indicates
that it started after thread~1's call to
\co{synchronize_rcu()}, so thread~1 does not wait
for thread~0 to complete its RCU read-side critical
section.
\item Thread~1 then frees up data element~A, which thread~0
is still referencing.
\end{enumerate}
	Note that this scenario can also occur in the implementation presented in
	Section~\ref{defer:RCU Based on Free-Running Counter}.
One strategy for fixing this problem is to use 64-bit
counters so that the time required to overflow them would exceed
the useful lifetime of the computer system.
	Note that non-antique members of the 32-bit x86 CPU family
	allow atomic manipulation of 64-bit counters via the
	\co{cmpxchg8b} instruction.
Another strategy is to limit the rate at which grace periods are
permitted to occur in order to achieve a similar effect.
For example, \co{synchronize_rcu()} could record the last time
that it was invoked, and any subsequent invocation would then
check this time and block as needed to force the desired
spacing.
	If, for example, the low-order four bits of the counter were
	reserved for nesting, and if grace periods were permitted to
	occur at most ten times per second, then more than 300 days
	would be required for the 32-bit counter to overflow.
However, this approach is not helpful if there is any possibility
that the system will be fully loaded with CPU-bound high-priority
real-time threads for the full 300 days.
(A remote possibility, perhaps, but best to consider it ahead
of time.)
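	A sketch of this rate-limiting strategy appears below.
	It is illustrative only: the \co{monotonic_ms()} helper
	(assumed to return monotonic time in milliseconds), the
	\co{rcu_gp_rate_lock}, and the wrapper's name are inventions
	of this example:

\begin{verbatim}
DEFINE_SPINLOCK(rcu_gp_rate_lock);
long rcu_gp_last_ms; /* Protected by rcu_gp_rate_lock. */

void synchronize_rcu_rate_limited(void)
{
  long delta;

  spin_lock(&rcu_gp_rate_lock);
  delta = monotonic_ms() - rcu_gp_last_ms;
  if (delta < 100)
    poll(NULL, 0, 100 - delta); /* Enforce 100-ms spacing. */
  rcu_gp_last_ms = monotonic_ms();
  spin_unlock(&rcu_gp_rate_lock);
  synchronize_rcu(); /* Now wait for a real grace period. */
}
\end{verbatim}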
A third approach is to administratively abolish real-time threads
from the system in question.
	In this case, the preempted process will age up in priority,
	thus getting to run long before the counter has a chance to
	overflow.
Of course, this approach is less than helpful for real-time
applications.
A final approach would be for \co{rcu_read_lock()} to recheck
the value of the global \co{rcu_gp_ctr} after storing to its
per-thread \co{rcu_reader_gp} counter, retrying if the new
value of the global \co{rcu_gp_ctr} is inappropriate.
This works, but introduces non-deterministic execution time
into \co{rcu_read_lock()}.
On the other hand, if your application is being preempted long
enough for the counter to overflow, you have no hope of
deterministic execution time in any case!
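	A sketch of this recheck, again assuming that the nesting bits
	are those below \co{RCU_GP_CTR_BOTTOM_BIT}, might restructure
	\co{rcu_read_lock()} as follows:

\begin{verbatim}
static void rcu_read_lock(void)
{
  long tmp;
  long *rrgp;

  rrgp = &__get_thread_var(rcu_reader_gp);
  tmp = *rrgp;
  if ((tmp & (RCU_GP_CTR_BOTTOM_BIT - 1)) != 0) {
    *rrgp = tmp + 1; /* Nested: no snapshot needed. */
    smp_mb();
    return;
  }
  for (;;) { /* Outermost: snapshot, store, then recheck. */
    tmp = ACCESS_ONCE(rcu_gp_ctr);
    *rrgp = tmp + 1;
    smp_mb();
    /* Conservatively retry if a grace period moved the
     * global counter while the snapshot was being stored. */
    if (ACCESS_ONCE(rcu_gp_ctr) == tmp)
      break;
  }
}
\end{verbatim}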
%
% @@@ A fourth approach is rcu_nest32.[hc].
} \QuickQuizEnd
\subsubsection{RCU Based on Quiescent States}
\label{defer:RCU Based on Quiescent States}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 long rcu_gp_ctr = 0;
3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);
\end{verbbox}
}
\centering
\theverbbox
\caption{Data for Quiescent-State-Based RCU}
\label{fig:defer:Data for Quiescent-State-Based RCU}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void rcu_read_lock(void)
2 {
3 }
4
5 static void rcu_read_unlock(void)
6 {
7 }
8
9 static void rcu_quiescent_state(void)
10 {
11 smp_mb();
12 __get_thread_var(rcu_reader_qs_gp) =
13 ACCESS_ONCE(rcu_gp_ctr) + 1;
14 smp_mb();
15 }
16
17 static void rcu_thread_offline(void)
18 {
19 smp_mb();
20 __get_thread_var(rcu_reader_qs_gp) =
21 ACCESS_ONCE(rcu_gp_ctr);
22 smp_mb();
23 }
24
25 static void rcu_thread_online(void)
26 {
27 rcu_quiescent_state();
28 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Quiescent-State-Based RCU Read Side}
\label{fig:defer:Quiescent-State-Based RCU Read Side}
\end{figure}
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side}
(\path{rcu_qs.h})
shows the read-side primitives used to construct a user-level
implementation of RCU based on quiescent states, with the data shown in
Figure~\ref{fig:defer:Data for Quiescent-State-Based RCU}.
As can be seen from lines~1-7 in the figure, the \co{rcu_read_lock()}
and \co{rcu_read_unlock()} primitives do nothing, and can in fact
be expected to be inlined and optimized away, as they are in
server builds of the Linux kernel.
This is because quiescent-state-based RCU implementations
\emph{approximate} the extents of RCU read-side critical sections
using the aforementioned quiescent states.
Each of these quiescent states contains a call to
\co{rcu_quiescent_state()}, which is shown on lines~9-15 of the figure.
Threads entering an extended quiescent state (for example, when blocking)
may instead call \co{rcu_thread_offline()} (lines~17-23) upon entering
that state and \co{rcu_thread_online()} (lines~25-28) upon leaving it.
As such, \co{rcu_thread_online()} is analogous to \co{rcu_read_lock()}
and \co{rcu_thread_offline()} is analogous to \co{rcu_read_unlock()}.
In addition, \co{rcu_quiescent_state()} can be thought of as a
\co{rcu_thread_online()} immediately followed by a
\co{rcu_thread_offline()}.\footnote{
Although the code in the figure is consistent with
\co{rcu_quiescent_state()}
being the same as \co{rcu_thread_online()} immediately followed by
\co{rcu_thread_offline()}, this relationship is obscured by
performance optimizations.}
It is illegal to invoke \co{rcu_quiescent_state()}, \co{rcu_thread_offline()},
or \co{rcu_thread_online()} from an RCU read-side critical section.
In \co{rcu_quiescent_state()}, line~11 executes a memory barrier
to prevent any code prior to the quiescent state (including possible
RCU read-side critical sections) from being reordered
into the quiescent state.
Lines~12-13 pick up a copy of the global \co{rcu_gp_ctr}, using
\co{ACCESS_ONCE()} to ensure that the compiler does not employ any
optimizations that would result in \co{rcu_gp_ctr} being fetched
more than once,
then add one to the value fetched and store the sum into
the per-thread \co{rcu_reader_qs_gp} variable, so that any concurrent
instance of \co{synchronize_rcu()} will see an odd-numbered value,
thus becoming aware that a new RCU read-side critical section has started.
Instances of \co{synchronize_rcu()} that are waiting on older
RCU read-side critical sections will thus know to ignore this new one.
Finally, line~14 executes a memory barrier, which prevents subsequent
code (including a possible RCU read-side critical section) from being
reordered with lines~12-13.
\QuickQuiz{}
Doesn't the additional memory barrier shown on line~14 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side}
	greatly increase the overhead of \co{rcu_quiescent_state()}?
\QuickQuizAnswer{
Indeed it does!
An application using this implementation of RCU should therefore
	invoke \co{rcu_quiescent_state()} sparingly, instead using
\co{rcu_read_lock()} and \co{rcu_read_unlock()} most of the
time.
However, this memory barrier is absolutely required so that
other threads will see the store on lines~12-13 before any
subsequent RCU read-side critical sections executed by the
caller.
} \QuickQuizEnd
Some applications might use RCU only occasionally, but use it very
heavily when they do.
Such applications might choose to use \co{rcu_thread_online()} when
starting to use RCU and \co{rcu_thread_offline()} when no longer
using RCU.
The time between a call to \co{rcu_thread_offline()} and a subsequent
call to \co{rcu_thread_online()} is an extended quiescent state,
so that RCU will not expect explicit quiescent states to be registered
during this time.
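For example, a worker thread's main loop might register quiescent
states between operations and go offline across a blocking wait.
In the following sketch, \co{do_rcu_protected_work()},
\co{nothing_to_do()}, and \co{wait_for_work()} are hypothetical:

\begin{verbatim}
for (;;) {
  do_rcu_protected_work(); /* May contain read-side
                            * critical sections. */
  rcu_quiescent_state();   /* Announce a quiescent state. */
  if (nothing_to_do()) {
    rcu_thread_offline();  /* Enter extended quiescent state. */
    wait_for_work();       /* Block without stalling
                            * grace periods. */
    rcu_thread_online();   /* Leave extended quiescent state. */
  }
}
\end{verbatim}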
The \co{rcu_thread_offline()} function simply sets the
per-thread \co{rcu_reader_qs_gp} variable to the current value of
\co{rcu_gp_ctr}, which has an even-numbered value.
Any concurrent instances of \co{synchronize_rcu()} will thus know to
ignore this thread.
\QuickQuiz{}
Why are the two memory barriers on lines~19 and 22 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side}
needed?
\QuickQuizAnswer{
	The memory barrier on line~19 ensures that any RCU read-side
	critical sections that might precede the
	call to \co{rcu_thread_offline()} won't be reordered by either
	the compiler or the CPU to follow the assignment on lines~20-21.
The memory barrier on line~22 is, strictly speaking, unnecessary,
as it is illegal to have any RCU read-side critical sections
following the call to \co{rcu_thread_offline()}.
} \QuickQuizEnd
The \co{rcu_thread_online()} function simply invokes
\co{rcu_quiescent_state()}, thus marking the end of the extended
quiescent state.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void synchronize_rcu(void)
2 {
3 int t;
4
5 smp_mb();
6 spin_lock(&rcu_gp_lock);
7 rcu_gp_ctr += 2;
8 smp_mb();
9 for_each_thread(t) {
10 while (rcu_gp_ongoing(t) &&
11 ((per_thread(rcu_reader_qs_gp, t) -
12 rcu_gp_ctr) < 0)) {
13 poll(NULL, 0, 10);
14 }
15 }
16 spin_unlock(&rcu_gp_lock);
17 smp_mb();
18 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU Update Side Using Quiescent States}
\label{fig:defer:RCU Update Side Using Quiescent States}
\end{figure}
Figure~\ref{fig:defer:RCU Update Side Using Quiescent States}
(\path{rcu_qs.c})
shows the implementation of \co{synchronize_rcu()}, which is
quite similar to that of the preceding sections.
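The \co{rcu_gp_ongoing()} helper invoked on line~10 is not shown in
the figure; one plausible form simply tests whether the specified
thread's counter is odd, indicating a thread that is participating
in RCU rather than residing in an extended quiescent state:

\begin{verbatim}
static int rcu_gp_ongoing(int thread)
{
  /* Odd value: the thread is online, so synchronize_rcu()
   * must wait for its counter to catch up. */
  return per_thread(rcu_reader_qs_gp, thread) & 1;
}
\end{verbatim}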
This implementation has blazingly fast read-side primitives, with
an \co{rcu_read_lock()}-\co{rcu_read_unlock()} round trip incurring
an overhead of roughly 50~\emph{picoseconds}.
The \co{synchronize_rcu()} overhead ranges from about 600~nanoseconds
on a single-CPU Power5 system up to more than 100~microseconds on
a 64-CPU system.
\QuickQuiz{}
To be sure, the clock frequencies of Power
systems in 2008 were quite high, but even a 5GHz clock
frequency is insufficient to allow
loops to be executed in 50~picoseconds!
What is going on here?
\QuickQuizAnswer{
Since the measurement loop contains a pair of empty functions,
the compiler optimizes it away.
The measurement loop takes 1,000 passes between each call to
\co{rcu_quiescent_state()}, so this measurement is roughly
one thousandth of the overhead of a single call to
\co{rcu_quiescent_state()}.
} \QuickQuizEnd
However, this implementation requires that each thread either
invoke \co{rcu_quiescent_state()} periodically or invoke
\co{rcu_thread_offline()} for extended quiescent states.
The need to invoke these functions periodically can make this
implementation difficult to use in some situations, such as for
certain types of library functions.
\QuickQuiz{}
Why would the fact that the code is in a library make
any difference for how easy it is to use the RCU
implementation shown in
Figures~\ref{fig:defer:Quiescent-State-Based RCU Read Side} and
\ref{fig:defer:RCU Update Side Using Quiescent States}?
\QuickQuizAnswer{
A library function has absolutely no control over the caller,
and thus cannot force the caller to invoke \co{rcu_quiescent_state()}
periodically.
On the other hand, a library function that made many references
to a given RCU-protected data structure might be able to invoke
\co{rcu_thread_online()} upon entry,
\co{rcu_quiescent_state()} periodically, and
\co{rcu_thread_offline()} upon exit.
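	A hypothetical sketch of such a library function follows, with
	\co{do_lookup()} and the surrounding names invented for
	illustration:

\begin{verbatim}
void library_lookup_many(struct item **items, int n)
{
  int i;

  rcu_thread_online();     /* Start participating in RCU. */
  for (i = 0; i < n; i++) {
    rcu_read_lock();
    do_lookup(items[i]);   /* Assumed RCU-protected access. */
    rcu_read_unlock();
    rcu_quiescent_state(); /* Between critical sections only! */
  }
  rcu_thread_offline();    /* Stop participating in RCU. */
}
\end{verbatim}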
} \QuickQuizEnd
\QuickQuiz{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
This should be a deadlock, but how can a primitive that
generates absolutely no code possibly participate in a
deadlock cycle?
\QuickQuizAnswer{
Please note that the RCU read-side critical section is in
effect extended beyond the enclosing
\co{rcu_read_lock()} and \co{rcu_read_unlock()}, out to
the previous and next call to \co{rcu_quiescent_state()}.
	This \co{rcu_quiescent_state()} can be thought of as an
	\co{rcu_read_unlock()} immediately followed by an
	\co{rcu_read_lock()}.
Even so, the actual deadlock itself will involve the lock
acquisition in the RCU read-side critical section and
the \co{synchronize_rcu()}, never the \co{rcu_quiescent_state()}.
} \QuickQuizEnd
In addition, this implementation does not permit concurrent calls
to \co{synchronize_rcu()} to share grace periods.
That said, one could easily imagine a production-quality RCU
implementation based on this version of RCU.
\subsubsection{Summary of Toy RCU Implementations}
\label{defer:Summary of Toy RCU Implementations}
If you made it this far, congratulations!
You should now have a much clearer understanding
not only of RCU itself, but also of the requirements of enclosing
software environments and applications.
Those wishing an even deeper understanding are invited to read
descriptions of production-quality RCU
implementations~\cite{MathieuDesnoyers2012URCU,PaulEMcKenney2007PreemptibleRCU,PaulEMcKenney2008HierarchicalRCU,PaulEMcKenney2009BloatwatchRCU}.
The preceding sections listed some desirable properties of the
various RCU primitives.
The following list is provided for easy reference for those wishing to
create a new RCU implementation.
\begin{enumerate}
\item There must be read-side primitives (such as \co{rcu_read_lock()}
and \co{rcu_read_unlock()}) and grace-period primitives
(such as \co{synchronize_rcu()} and \co{call_rcu()}), such
that any RCU read-side critical section in existence at the
start of a grace period has completed by the end of the
grace period.
\item RCU read-side primitives should have minimal overhead.
In particular, expensive operations such as cache misses,
atomic instructions, memory barriers, and branches should
be avoided.
\item RCU read-side primitives should have $O\left(1\right)$ computational
complexity to enable real-time use.
(This implies that readers run concurrently with updaters.)
\item RCU read-side primitives should be usable in all contexts
(in the Linux kernel, they are permitted everywhere except in
the idle loop).
An important special case is that RCU read-side primitives be
usable within an RCU read-side critical section, in other words,
that it be possible to nest RCU read-side critical sections.
\item RCU read-side primitives should be unconditional, with no
failure returns.
This property is extremely important, as failure checking
increases complexity and complicates testing and validation.
\item Any operation other than a quiescent state (and thus a grace
period) should be permitted in an RCU read-side critical section.
In particular, irrevocable operations such as I/O should be
permitted.
\item It should be possible to update an RCU-protected data structure
while executing within an RCU read-side critical section.
\item Both RCU read-side and update-side primitives should be independent
of memory allocator design and implementation, in other words,
the same RCU implementation should be able to protect a given
data structure regardless of how the data elements are allocated
and freed.
\item RCU grace periods should not be blocked by threads that
halt outside of RCU read-side critical sections.
(But note that most quiescent-state-based implementations
violate this desideratum.)
\end{enumerate}
\QuickQuiz{}
Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be
updated while in an RCU read-side critical section?
\QuickQuizAnswer{
This situation is one reason for the existence of asynchronous
grace-period primitives such as \co{call_rcu()}.
This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked
at a later time, after a grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side
critical section can be extremely convenient, and is analogous
to a (mythical) unconditional read-to-write upgrade for
reader-writer locking.
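	For example, following Linux-kernel list conventions, an update
	from within a read-side critical section might look as follows.
	The \co{struct foo} layout, \co{list_lock}, and the function
	names are assumptions of this sketch:

\begin{verbatim}
static void free_foo_cb(struct rcu_head *rhp)
{
  struct foo *p = container_of(rhp, struct foo, rh);

  free(p); /* Safe: a grace period has elapsed. */
}

void remove_foo(struct foo *p)
{
  rcu_read_lock();
  spin_lock(&list_lock);
  list_del_rcu(&p->list); /* Unlink; readers may still hold p. */
  spin_unlock(&list_lock);
  call_rcu(&p->rh, free_foo_cb); /* Defer the free. */
  rcu_read_unlock();
}
\end{verbatim}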
} \QuickQuizEnd