| % defer/toyrcu.tex |
| |
| \subsection{``Toy'' RCU Implementations} |
| \label{sec:defer:``Toy'' RCU Implementations} |
| |
| The toy RCU implementations in this section are designed not for |
| high performance, practicality, or any kind of production use,\footnote{ |
| However, production-quality user-level RCU implementations |
| are available~\cite{MathieuDesnoyers2009URCU}.} |
| but rather for clarity. |
| Nevertheless, you will need a thorough understanding of |
| Chapters~\ref{chp:Introduction}, |
| \ref{chp:Hardware and its Habits}, |
| \ref{chp:Tools of the Trade}, and |
| \ref{cha:Partitioning and Synchronization Design}, |
| as well as the previous portions of |
| Chapter~\ref{chp:Deferred Processing} |
| for even these toy RCU implementations to be easily understandable. |
| |
| This section provides a series of RCU implementations in order of |
| increasing sophistication, from the viewpoint of solving the |
| existence-guarantee problem. |
| Section~\ref{defer:Lock-Based RCU} presents a rudimentary |
| RCU implementation based on simple locking, while |
| Sections~\ref{defer:Per-Thread Lock-Based RCU} through |
| \ref{defer:RCU Based on Quiescent States} |
| present a series of |
| simple RCU implementations based on locking, reference counters, |
| and free-running counters. |
| Finally, Section~\ref{defer:Summary of Toy RCU Implementations} |
| provides a summary and a list of desirable RCU properties. |
| |
| \subsubsection{Lock-Based RCU} |
| \label{defer:Lock-Based RCU} |
| |
| \begin{figure}[bp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 spin_lock(&rcu_gp_lock); |
| 4 } |
| 5 |
| 6 static void rcu_read_unlock(void) |
| 7 { |
| 8 spin_unlock(&rcu_gp_lock); |
| 9 } |
| 10 |
| 11 void synchronize_rcu(void) |
| 12 { |
| 13 spin_lock(&rcu_gp_lock); |
| 14 spin_unlock(&rcu_gp_lock); |
| 15 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Lock-Based RCU Implementation} |
| \label{fig:defer:Lock-Based RCU Implementation} |
| \end{figure} |
| |
| Perhaps the simplest RCU implementation leverages locking, as |
| shown in |
| Figure~\ref{fig:defer:Lock-Based RCU Implementation} |
| (\path{rcu_lock.h} and \path{rcu_lock.c}). |
| In this implementation, \co{rcu_read_lock()} acquires a global |
| spinlock, \co{rcu_read_unlock()} releases it, and |
| \co{synchronize_rcu()} acquires it then immediately releases it. |
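For experimentation outside the kernel, the figure's three primitives can be sketched in portable C, with a POSIX mutex standing in for \co{spin_lock()}. The reader thread, the \co{reader_in_cs}/\co{reader_done} handshake, and the \co{lock_rcu_demo()} wrapper are additions of this sketch, not part of the original code; they merely demonstrate that \co{synchronize_rcu()} waits out a pre-existing reader.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Global lock standing in for the figure's spin_lock()/spin_unlock(). */
static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;

static void rcu_read_lock(void) { pthread_mutex_lock(&rcu_gp_lock); }
static void rcu_read_unlock(void) { pthread_mutex_unlock(&rcu_gp_lock); }

/* Acquiring and releasing the lock waits for all pre-existing readers. */
static void synchronize_rcu(void)
{
	pthread_mutex_lock(&rcu_gp_lock);
	pthread_mutex_unlock(&rcu_gp_lock);
}

static atomic_int reader_in_cs, reader_done;

static void *reader(void *arg)
{
	(void)arg;
	rcu_read_lock();
	atomic_store(&reader_in_cs, 1);
	usleep(10000);			/* simulate read-side work */
	atomic_store(&reader_done, 1);	/* set before rcu_read_unlock() */
	rcu_read_unlock();
	return NULL;
}

/* Returns 1 if synchronize_rcu() waited out the pre-existing reader. */
static int lock_rcu_demo(void)
{
	pthread_t t;
	int ok;

	pthread_create(&t, NULL, reader, NULL);
	while (!atomic_load(&reader_in_cs))
		;			/* wait for the reader to take the lock */
	synchronize_rcu();		/* cannot return while reader holds lock */
	ok = atomic_load(&reader_done);
	pthread_join(t, NULL);
	return ok;
}
```

Because the reader sets \co{reader_done} before releasing the lock, and \co{synchronize_rcu()} cannot acquire the lock until that release, the demo deterministically observes the grace-period guarantee.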
| |
| Because \co{synchronize_rcu()} does not return until it has acquired |
| (and released) the lock, it cannot return until all prior RCU read-side |
| critical sections have completed, thus faithfully implementing |
| RCU semantics. |
| Of course, only one RCU reader may be in its read-side critical section |
| at a time, which almost entirely defeats the purpose of RCU. |
| In addition, the lock operations in \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} are extremely heavyweight, |
| with read-side overhead ranging from about 100~nanoseconds on a single Power5 |
| CPU up to more than 17~\emph{microseconds} on a 64-CPU system. |
| Worse yet, |
| these same lock operations permit \co{rcu_read_lock()} |
| to participate in deadlock cycles. |
Furthermore, in the absence of recursive locks,
RCU read-side critical sections cannot be nested, and, finally,
| although concurrent RCU updates could in principle be satisfied by |
| a common grace period, this implementation serializes grace periods, |
| preventing grace-period sharing. |
| |
| \QuickQuiz{} |
| Why wouldn't any deadlock in the RCU implementation in |
| Figure~\ref{fig:defer:Lock-Based RCU Implementation} |
| also be a deadlock in any other RCU implementation? |
| \QuickQuizAnswer{ |
| % |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 void foo(void) |
| 2 { |
| 3 spin_lock(&my_lock); |
| 4 rcu_read_lock(); |
| 5 do_something(); |
| 6 rcu_read_unlock(); |
| 7 do_something_else(); |
| 8 spin_unlock(&my_lock); |
| 9 } |
| 10 |
| 11 void bar(void) |
| 12 { |
| 13 rcu_read_lock(); |
| 14 spin_lock(&my_lock); |
| 15 do_some_other_thing(); |
| 16 spin_unlock(&my_lock); |
| 17 do_whatever(); |
| 18 rcu_read_unlock(); |
| 19 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Deadlock in Lock-Based RCU Implementation} |
| \label{fig:defer:Deadlock in Lock-Based RCU Implementation} |
| \end{figure} |
| % |
| Suppose the functions \co{foo()} and \co{bar()} in |
| Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation} |
| are invoked concurrently from different CPUs. |
Then \co{foo()} will acquire \co{my_lock} on line~3,
| while \co{bar()} will acquire \co{rcu_gp_lock} on |
| line~13. |
| When \co{foo()} advances to line~4, it will attempt to |
| acquire \co{rcu_gp_lock}, which is held by \co{bar()}. |
| Then when \co{bar()} advances to line~14, it will attempt |
| to acquire \co{my_lock}, which is held by \co{foo()}. |
| |
| Each function is then waiting for a lock that the other |
| holds, a classic deadlock. |
| |
| Other RCU implementations neither spin nor block in |
| \co{rcu_read_lock()}, hence avoiding deadlocks. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why not simply use reader-writer locks in the RCU implementation |
| in |
| Figure~\ref{fig:defer:Lock-Based RCU Implementation} |
| in order to allow RCU readers to proceed in parallel? |
| \QuickQuizAnswer{ |
| One could in fact use reader-writer locks in this manner. |
| However, textbook reader-writer locks suffer from memory |
| contention, so that the RCU read-side critical sections would |
| need to be quite long to actually permit parallel |
| execution~\cite{McKenney03a}. |
| |
| On the other hand, use of a reader-writer lock that is |
| read-acquired in \co{rcu_read_lock()} would avoid the |
| deadlock condition noted above. |
| } \QuickQuizEnd |
| |
| It is hard to imagine this implementation being useful |
| in a production setting, though it does have the virtue |
| of being implementable in almost any user-level application. |
| Furthermore, similar implementations having one lock per CPU |
| or using reader-writer locks have been used in production |
| in the 2.4 Linux kernel. |
| |
| A modified version of this one-lock-per-CPU approach, but instead using |
| one lock per thread, is described |
| in the next section. |
| |
| \subsubsection{Per-Thread Lock-Based RCU} |
| \label{defer:Per-Thread Lock-Based RCU} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 spin_lock(&__get_thread_var(rcu_gp_lock)); |
| 4 } |
| 5 |
| 6 static void rcu_read_unlock(void) |
| 7 { |
| 8 spin_unlock(&__get_thread_var(rcu_gp_lock)); |
| 9 } |
| 10 |
| 11 void synchronize_rcu(void) |
| 12 { |
| 13 int t; |
| 14 |
| 15 for_each_running_thread(t) { |
| 16 spin_lock(&per_thread(rcu_gp_lock, t)); |
| 17 spin_unlock(&per_thread(rcu_gp_lock, t)); |
| 18 } |
| 19 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Per-Thread Lock-Based RCU Implementation} |
| \label{fig:defer:Per-Thread Lock-Based RCU Implementation} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation} |
| (\path{rcu_lock_percpu.h} and \path{rcu_lock_percpu.c}) |
| shows an implementation based on one lock per thread. |
| The \co{rcu_read_lock()} and \co{rcu_read_unlock()} functions |
| acquire and release, respectively, the current thread's lock. |
| The \co{synchronize_rcu()} function acquires and releases each thread's |
| lock in turn. |
| Therefore, all RCU read-side critical sections running |
| when \co{synchronize_rcu()} starts must have completed before |
| \co{synchronize_rcu()} can return. |
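As a user-level sketch of this scheme, one can replace the per-thread spinlocks with a fixed array of POSIX mutexes indexed by a thread-local slot number. The \co{NR_THREADS} limit, the \co{my_id} slot assignment, and the demo handshake are simplifications of this sketch; a real implementation would track running threads dynamically.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

#define NR_THREADS 2	/* fixed thread slots; a simplification of this sketch */

/* One lock per thread, standing in for the per-thread rcu_gp_lock. */
static pthread_mutex_t rcu_gp_lock_tab[NR_THREADS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static __thread int my_id;	/* this thread's slot */

static void rcu_read_lock(void)
{
	pthread_mutex_lock(&rcu_gp_lock_tab[my_id]);
}

static void rcu_read_unlock(void)
{
	pthread_mutex_unlock(&rcu_gp_lock_tab[my_id]);
}

/* Acquire and release each thread's lock in turn. */
static void synchronize_rcu(void)
{
	for (int t = 0; t < NR_THREADS; t++) {
		pthread_mutex_lock(&rcu_gp_lock_tab[t]);
		pthread_mutex_unlock(&rcu_gp_lock_tab[t]);
	}
}

static atomic_int reader_in_cs, reader_done;

static void *reader(void *arg)
{
	(void)arg;
	my_id = 1;		/* the updater below uses slot 0 */
	rcu_read_lock();
	atomic_store(&reader_in_cs, 1);
	usleep(10000);		/* simulate read-side work */
	atomic_store(&reader_done, 1);
	rcu_read_unlock();
	return NULL;
}

/* Returns 1 if synchronize_rcu() waited out the pre-existing reader. */
static int per_thread_lock_demo(void)
{
	pthread_t t;
	int ok;

	my_id = 0;
	pthread_create(&t, NULL, reader, NULL);
	while (!atomic_load(&reader_in_cs))
		;
	synchronize_rcu();	/* blocks on slot 1 until the reader exits */
	ok = atomic_load(&reader_done);
	pthread_join(t, NULL);
	return ok;
}
```

Note that because each thread locks only its own slot in the read path, two readers in different slots now proceed in parallel, unlike the single-lock version.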
| |
| This implementation does have the virtue of permitting concurrent |
| RCU readers, and does avoid the deadlock condition that can arise |
| with a single global lock. |
Furthermore, the read-side overhead, though high at roughly 140~nanoseconds,
is independent of the number of CPUs.
| However, the update-side overhead ranges from about 600 nanoseconds |
| on a single Power5 CPU |
| up to more than 100 \emph{microseconds} on 64 CPUs. |
| |
| \QuickQuiz{} |
| Wouldn't it be cleaner to acquire all the locks, and then |
| release them all in the loop from lines~15-18 of |
| Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}? |
| After all, with this change, there would be a point in time |
| when there were no readers, simplifying things greatly. |
| \QuickQuizAnswer{ |
| Making this change would re-introduce the deadlock, so |
| no, it would not be cleaner. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Is the implementation shown in |
| Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation} |
| free from deadlocks? |
| Why or why not? |
| \QuickQuizAnswer{ |
| One deadlock is where a lock is |
| held across \co{synchronize_rcu()}, and that same lock is |
| acquired within an RCU read-side critical section. |
| However, this situation could deadlock any correctly designed |
| RCU implementation. |
| After all, the \co{synchronize_rcu()} primitive must wait for all |
| pre-existing RCU read-side critical sections to complete, |
| but if one of those critical sections is spinning on a lock |
| held by the thread executing the \co{synchronize_rcu()}, |
| we have a deadlock inherent in the definition of RCU. |
| |
| Another deadlock happens when attempting to nest RCU read-side |
| critical sections. |
| This deadlock is peculiar to this implementation, and might |
| be avoided by using recursive locks, or by using reader-writer |
| locks that are read-acquired by \co{rcu_read_lock()} and |
| write-acquired by \co{synchronize_rcu()}. |
| |
| However, if we exclude the above two cases, |
| this implementation of RCU does not introduce any deadlock |
| situations. |
This is because the only time one thread acquires another thread's
lock is while executing \co{synchronize_rcu()}, and in that case the
lock is immediately released, prohibiting any deadlock cycle that
does not involve a lock held across the \co{synchronize_rcu()},
which is the first case above.
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Isn't one advantage of the RCU algorithm shown in |
| Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation} |
| that it uses only primitives that are widely available, |
| for example, in POSIX pthreads? |
| \QuickQuizAnswer{ |
| This is indeed an advantage, but do not forget that |
| \co{rcu_dereference()} and \co{rcu_assign_pointer()} |
| are still required, which means \co{volatile} manipulation |
| for \co{rcu_dereference()} and memory barriers for |
| \co{rcu_assign_pointer()}. |
| Of course, many Alpha CPUs require memory barriers for both |
| primitives. |
| } \QuickQuizEnd |
| |
| This approach could be useful in some situations, given that a similar |
| approach was used in the |
| Linux 2.4 kernel~\cite{Molnar00a}. |
| |
| The counter-based RCU implementation described next overcomes some of |
| the shortcomings of the lock-based implementation. |
| |
| \subsubsection{Simple Counter-Based RCU} |
| \label{defer:Simple Counter-Based RCU} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 atomic_t rcu_refcnt; |
| 2 |
| 3 static void rcu_read_lock(void) |
| 4 { |
| 5 atomic_inc(&rcu_refcnt); |
| 6 smp_mb(); |
| 7 } |
| 8 |
| 9 static void rcu_read_unlock(void) |
| 10 { |
| 11 smp_mb(); |
| 12 atomic_dec(&rcu_refcnt); |
| 13 } |
| 14 |
| 15 void synchronize_rcu(void) |
| 16 { |
| 17 smp_mb(); |
| 18 while (atomic_read(&rcu_refcnt) != 0) { |
| 19 poll(NULL, 0, 10); |
| 20 } |
| 21 smp_mb(); |
| 22 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Implementation Using Single Global Reference Counter} |
| \label{fig:defer:RCU Implementation Using Single Global Reference Counter} |
| \end{figure} |
| |
| A slightly more sophisticated RCU implementation is shown in |
| Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter} |
| (\path{rcu_rcg.h} and \path{rcu_rcg.c}). |
| This implementation makes use of a global reference counter |
| \co{rcu_refcnt} defined on line~1. |
| The \co{rcu_read_lock()} primitive atomically increments this |
| counter, then executes a memory barrier to ensure that the |
| RCU read-side critical section is ordered after the atomic |
| increment. |
| Similarly, \co{rcu_read_unlock()} executes a memory barrier to |
| confine the RCU read-side critical section, then atomically |
| decrements the counter. |
| The \co{synchronize_rcu()} primitive spins waiting for the reference |
| counter to reach zero, surrounded by memory barriers. |
The \co{poll()} on line~19 merely provides a delay, and from
a pure RCU-semantics point of view could be omitted.
| Again, once \co{synchronize_rcu()} returns, all prior |
| RCU read-side critical sections are guaranteed to have completed. |
| |
| In happy contrast to the lock-based implementation shown in |
| Section~\ref{defer:Lock-Based RCU}, this implementation |
| allows parallel execution of RCU read-side critical sections. |
| In happy contrast to the per-thread lock-based implementation shown in |
| Section~\ref{defer:Per-Thread Lock-Based RCU}, |
| it also allows them to be nested. |
| In addition, the \co{rcu_read_lock()} primitive cannot possibly |
| participate in deadlock cycles, as it never spins nor blocks. |
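This counter-based scheme translates directly to C11 atomics, whose sequentially consistent operations supply the ordering that the figure's explicit \co{smp_mb()} calls provide. The \co{sched_yield()} delay and the \co{refcnt_rcu_demo()} function are additions of this sketch.

```c
#include <stdatomic.h>
#include <sched.h>

/* Global reference counter standing in for the figure's atomic_t. */
static atomic_int rcu_refcnt;

/* Seq_cst atomic RMWs order the read-side critical section after the
 * increment and before the decrement, as the figure's smp_mb() does. */
static void rcu_read_lock(void) { atomic_fetch_add(&rcu_refcnt, 1); }
static void rcu_read_unlock(void) { atomic_fetch_sub(&rcu_refcnt, 1); }

static void synchronize_rcu(void)
{
	while (atomic_load(&rcu_refcnt) != 0)
		sched_yield();	/* stands in for the figure's poll() delay */
}

/* Single-threaded check: read-side critical sections nest, and
 * synchronize_rcu() returns once the counter drains to zero. */
static int refcnt_rcu_demo(void)
{
	rcu_read_lock();
	rcu_read_lock();		/* nesting is legal here */
	if (atomic_load(&rcu_refcnt) != 2)
		return 0;
	rcu_read_unlock();
	rcu_read_unlock();
	synchronize_rcu();		/* returns immediately: no readers */
	return atomic_load(&rcu_refcnt) == 0;
}
```

The nested pair of \co{rcu_read_lock()} calls illustrates the advantage over the lock-based implementations: nesting simply increments the counter again, and \co{rcu_read_lock()} never spins or blocks.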
| |
| \QuickQuiz{} |
| But what if you hold a lock across a call to |
| \co{synchronize_rcu()}, and then acquire that same lock within |
| an RCU read-side critical section? |
| \QuickQuizAnswer{ |
| Indeed, this would deadlock any legal RCU implementation. |
| But is \co{rcu_read_lock()} \emph{really} participating in |
| the deadlock cycle? |
| If you believe that it is, then please |
| ask yourself this same question when looking at the |
| RCU implementation in |
| Section~\ref{defer:RCU Based on Quiescent States}. |
| } \QuickQuizEnd |
| |
However, this implementation still has some serious shortcomings.
| First, the atomic operations in \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} are still quite heavyweight, |
| with read-side overhead ranging from about 100~nanoseconds on |
| a single Power5 CPU up to almost 40~\emph{microseconds} |
| on a 64-CPU system. |
| This means that the RCU read-side critical sections |
| have to be extremely long in order to get any real |
| read-side parallelism. |
| On the other hand, in the absence of readers, grace periods elapse |
| in about 40~\emph{nanoseconds}, many orders of magnitude faster |
| than production-quality implementations in the Linux kernel. |
| |
| \QuickQuiz{} |
| How can the grace period possibly elapse in 40 nanoseconds when |
| \co{synchronize_rcu()} contains a 10-millisecond delay? |
| \QuickQuizAnswer{ |
| The update-side test was run in absence of readers, so the |
| \co{poll()} system call was never invoked. |
| In addition, the actual code has this \co{poll()} |
| system call commented out, the better to evaluate the |
| true overhead of the update-side code. |
| Any production uses of this code would be better served by |
| using the \co{poll()} system call, but then again, |
| production uses would be even better served by other implementations |
| shown later in this section. |
| } \QuickQuizEnd |
| |
| Second, if there are many concurrent \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} operations, there will |
| be extreme memory contention on \co{rcu_refcnt}, |
| resulting in expensive cache misses. |
| Both of these first two shortcomings largely defeat a major purpose of |
| RCU, namely to provide low-overhead read-side synchronization primitives. |
| |
| Finally, a large number of RCU readers with long read-side |
| critical sections could prevent \co{synchronize_rcu()} |
| from ever completing, as the global counter might |
| never reach zero. |
| This could result in starvation of RCU updates, which |
| is of course unacceptable in production settings. |
| |
| \QuickQuiz{} |
| Why not simply make \co{rcu_read_lock()} wait when a concurrent |
| \co{synchronize_rcu()} has been waiting too long in |
| the RCU implementation in |
| Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}? |
| Wouldn't that prevent \co{synchronize_rcu()} from starving? |
| \QuickQuizAnswer{ |
| Although this would in fact eliminate the starvation, it would |
| also mean that \co{rcu_read_lock()} would spin or block waiting |
| for the writer, which is in turn waiting on readers. |
| If one of these readers is attempting to acquire a lock that |
| the spinning/blocking \co{rcu_read_lock()} holds, we again |
| have deadlock. |
| |
| In short, the cure is worse than the disease. |
| See Section~\ref{defer:Starvation-Free Counter-Based RCU} |
| for a proper cure. |
| } \QuickQuizEnd |
| |
| Therefore, it is still hard to imagine this implementation being useful |
| in a production setting, though it has a bit more potential |
| than the lock-based mechanism, for example, as an RCU implementation |
| suitable for a high-stress debugging environment. |
| The next section describes a variation on the reference-counting |
| scheme that is more favorable to writers. |
| |
| \subsubsection{Starvation-Free Counter-Based RCU} |
| \label{defer:Starvation-Free Counter-Based RCU} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 atomic_t rcu_refcnt[2]; |
| 3 atomic_t rcu_idx; |
| 4 DEFINE_PER_THREAD(int, rcu_nesting); |
| 5 DEFINE_PER_THREAD(int, rcu_read_idx); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Global Reference-Count Pair Data} |
| \label{fig:defer:RCU Global Reference-Count Pair Data} |
| \end{figure} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 int i; |
| 4 int n; |
| 5 |
| 6 n = __get_thread_var(rcu_nesting); |
| 7 if (n == 0) { |
| 8 i = atomic_read(&rcu_idx); |
| 9 __get_thread_var(rcu_read_idx) = i; |
| 10 atomic_inc(&rcu_refcnt[i]); |
| 11 } |
| 12 __get_thread_var(rcu_nesting) = n + 1; |
| 13 smp_mb(); |
| 14 } |
| 15 |
| 16 static void rcu_read_unlock(void) |
| 17 { |
| 18 int i; |
| 19 int n; |
| 20 |
| 21 smp_mb(); |
| 22 n = __get_thread_var(rcu_nesting); |
| 23 if (n == 1) { |
| 24 i = __get_thread_var(rcu_read_idx); |
| 25 atomic_dec(&rcu_refcnt[i]); |
| 26 } |
| 27 __get_thread_var(rcu_nesting) = n - 1; |
| 28 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Read-Side Using Global Reference-Count Pair} |
| \label{fig:defer:RCU Read-Side Using Global Reference-Count Pair} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair} |
| (\path{rcu_rcgp.h}) |
| shows the read-side primitives of an RCU implementation that uses a pair |
| of reference counters (\co{rcu_refcnt[]}), |
| along with a global index that |
| selects one counter out of the pair (\co{rcu_idx}), |
| a per-thread nesting counter \co{rcu_nesting}, |
| a per-thread snapshot of the global index (\co{rcu_read_idx}), |
| and a global lock (\co{rcu_gp_lock}), |
| which are themselves shown in |
| Figure~\ref{fig:defer:RCU Global Reference-Count Pair Data}. |
| |
| \paragraph{Design} |
| |
| It is the two-element \co{rcu_refcnt[]} array that provides the freedom |
| from starvation. |
| The key point is that \co{synchronize_rcu()} is only required to wait |
| for pre-existing readers. |
| If a new reader starts after a given instance of \co{synchronize_rcu()} |
| has already begun execution, then that instance of \co{synchronize_rcu()} |
| need not wait on that new reader. |
| At any given time, when a given reader enters its RCU read-side critical |
| section via \co{rcu_read_lock()}, |
| it increments the element of the \co{rcu_refcnt[]} array indicated by |
| the \co{rcu_idx} variable. |
| When that same reader exits its RCU read-side critical section via |
| \co{rcu_read_unlock()}, it decrements whichever element it incremented, |
| ignoring any possible subsequent changes to the \co{rcu_idx} value. |
| |
| This arrangement means that \co{synchronize_rcu()} can avoid starvation |
| by complementing the value of \co{rcu_idx}, as in \co{rcu_idx = !rcu_idx}. |
| Suppose that the old value of \co{rcu_idx} was zero, so that the new |
| value is one. |
New readers that arrive after the complement operation will increment
\co{rcu_refcnt[1]}, while the old readers that previously incremented
\co{rcu_refcnt[0]} will decrement \co{rcu_refcnt[0]} when they exit their
RCU read-side critical sections.
This means that the value of \co{rcu_refcnt[0]} will no longer be incremented,
and thus will be monotonically decreasing.\footnote{
| There is a race condition that this ``monotonically decreasing'' |
| statement ignores. |
| This race condition will be dealt with by the code for |
| \co{synchronize_rcu()}. |
| In the meantime, I suggest suspending disbelief.} |
| This means that all that \co{synchronize_rcu()} need do is wait for the |
| value of \co{rcu_refcnt[0]} to reach zero. |
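This starvation-avoidance argument can be checked with a single-threaded walk-through. The \co{reader_enter()} and \co{reader_exit()} helpers below are illustrative stand-ins for the real primitives (no nesting, no memory barriers), intended only to show how the flip drains the old counter.

```c
/* Counter pair plus selecting index, as in the figure's data. */
static int rcu_refcnt[2];
static int rcu_idx;

/* Illustrative stand-in: a reader increments the currently
 * selected counter and remembers which one it used. */
static int reader_enter(void)
{
	int i = rcu_idx;
	rcu_refcnt[i]++;
	return i;
}

/* Decrement whichever counter was incremented, ignoring any
 * later changes to rcu_idx. */
static void reader_exit(int i)
{
	rcu_refcnt[i]--;
}

/* Old readers drain rcu_refcnt[0] after the flip, while new readers
 * land on rcu_refcnt[1], so the updater need only wait for
 * rcu_refcnt[0] to reach zero. */
static int flip_demo(void)
{
	int old_reader = reader_enter();	/* increments rcu_refcnt[0] */
	rcu_idx = !rcu_idx;			/* updater flips the index */
	int new_reader = reader_enter();	/* increments rcu_refcnt[1] */
	reader_exit(old_reader);		/* drains rcu_refcnt[0] */
	int ok = (rcu_refcnt[0] == 0 && rcu_refcnt[1] == 1);
	reader_exit(new_reader);
	return ok;
}
```

After the flip, \co{rcu_refcnt[0]} can only decrease, so the updater's wait terminates no matter how many new readers arrive.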
| |
With this background, we are ready to look at the implementation of the
actual primitives.
| |
| \paragraph{Implementation} |
| |
| The \co{rcu_read_lock()} primitive atomically increments the member of the |
| \co{rcu_refcnt[]} pair indexed by \co{rcu_idx}, and keeps a |
| snapshot of this index in the per-thread variable \co{rcu_read_idx}. |
| The \co{rcu_read_unlock()} primitive then atomically decrements |
| whichever counter of the pair that the corresponding \co{rcu_read_lock()} |
| incremented. |
| However, because only one value of \co{rcu_idx} is remembered per thread, |
| additional measures must be taken to permit nesting. |
| These additional measures use the per-thread \co{rcu_nesting} variable |
| to track nesting. |
| |
| To make all this work, line~6 of \co{rcu_read_lock()} in |
| Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair} |
| picks up the |
| current thread's instance of \co{rcu_nesting}, and if line~7 finds |
| that this is the outermost \co{rcu_read_lock()}, |
| then lines~8-10 pick up the current value of |
| \co{rcu_idx}, save it in this thread's instance of \co{rcu_read_idx}, |
| and atomically increment the selected element of \co{rcu_refcnt}. |
| Regardless of the value of \co{rcu_nesting}, line~12 increments it. |
| Line~13 executes a memory barrier to ensure that the RCU read-side |
| critical section does not bleed out before the \co{rcu_read_lock()} code. |
| |
| Similarly, the \co{rcu_read_unlock()} function executes a memory barrier |
| at line~21 |
| to ensure that the RCU read-side critical section does not bleed out |
| after the \co{rcu_read_unlock()} code. |
| Line~22 picks up this thread's instance of \co{rcu_nesting}, and if |
| line~23 finds that this is the outermost \co{rcu_read_unlock()}, |
then lines~24 and 25 pick up this thread's instance of \co{rcu_read_idx}
(saved by the outermost \co{rcu_read_lock()}) and atomically decrement
the selected element of \co{rcu_refcnt}.
| Regardless of the nesting level, line~27 decrements this thread's |
| instance of \co{rcu_nesting}. |
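The nesting behavior just described can be exercised directly. The sketch below transliterates the figure into C11 atomics, with \co{__thread} variables standing in for the per-thread-variable primitives; the explicit memory barriers are subsumed by the seq\_cst atomic operations, and \co{nesting_demo()} is an addition of this sketch.

```c
#include <stdatomic.h>

static atomic_int rcu_refcnt[2];
static atomic_int rcu_idx;
static __thread int rcu_nesting;
static __thread int rcu_read_idx;

static void rcu_read_lock(void)
{
	int n = rcu_nesting;

	if (n == 0) {		/* outermost: pick and pin a counter */
		int i = atomic_load(&rcu_idx);
		rcu_read_idx = i;
		atomic_fetch_add(&rcu_refcnt[i], 1);
	}
	rcu_nesting = n + 1;
}

static void rcu_read_unlock(void)
{
	int n = rcu_nesting;

	if (n == 1)		/* outermost: release the pinned counter */
		atomic_fetch_sub(&rcu_refcnt[rcu_read_idx], 1);
	rcu_nesting = n - 1;
}

/* Only the outermost rcu_read_lock()/rcu_read_unlock() pair
 * touches the reference counters. */
static int nesting_demo(void)
{
	rcu_read_lock();
	rcu_read_lock();	/* nested: counter unchanged */
	int i = rcu_read_idx;
	if (atomic_load(&rcu_refcnt[i]) != 1 || rcu_nesting != 2)
		return 0;
	rcu_read_unlock();
	rcu_read_unlock();
	return atomic_load(&rcu_refcnt[i]) == 0 && rcu_nesting == 0;
}
```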
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 void synchronize_rcu(void) |
| 2 { |
| 3 int i; |
| 4 |
| 5 smp_mb(); |
| 6 spin_lock(&rcu_gp_lock); |
| 7 i = atomic_read(&rcu_idx); |
| 8 atomic_set(&rcu_idx, !i); |
| 9 smp_mb(); |
| 10 while (atomic_read(&rcu_refcnt[i]) != 0) { |
| 11 poll(NULL, 0, 10); |
| 12 } |
| 13 smp_mb(); |
| 14 atomic_set(&rcu_idx, i); |
| 15 smp_mb(); |
| 16 while (atomic_read(&rcu_refcnt[!i]) != 0) { |
| 17 poll(NULL, 0, 10); |
| 18 } |
| 19 spin_unlock(&rcu_gp_lock); |
| 20 smp_mb(); |
| 21 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Update Using Global Reference-Count Pair} |
| \label{fig:defer:RCU Update Using Global Reference-Count Pair} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} |
| (\path{rcu_rcpg.c}) |
| shows the corresponding \co{synchronize_rcu()} implementation. |
| Lines~6 and 19 acquire and release \co{rcu_gp_lock} in order to |
| prevent more than one concurrent instance of \co{synchronize_rcu()}. |
Lines~7 and 8 pick up the value of \co{rcu_idx} and complement it,
respectively, so that subsequent instances of \co{rcu_read_lock()}
will use a different element of \co{rcu_refcnt[]} than did preceding
instances.
Lines~10-12 then wait for the prior element of \co{rcu_refcnt[]} to
reach zero, with the memory barrier on line~9 ensuring that the check
of \co{rcu_refcnt[]} is not reordered to precede the complementing of
\co{rcu_idx}.
| Lines~13-18 repeat this process, and line~20 ensures that any |
| subsequent reclamation operations are not reordered to precede the |
| checking of \co{rcu_refcnt}. |
| |
| \QuickQuiz{} |
| Why the memory barrier on line~5 of \co{synchronize_rcu()} in |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} |
| given that there is a spin-lock acquisition immediately after? |
| \QuickQuizAnswer{ |
| The spin-lock acquisition only guarantees that the spin-lock's |
| critical section will not ``bleed out'' to precede the |
| acquisition. |
| It in no way guarantees that code preceding the spin-lock |
| acquisition won't be reordered into the critical section. |
| Such reordering could cause a removal from an RCU-protected |
| list to be reordered to follow the complementing of |
| \co{rcu_idx}, which could allow a newly starting RCU |
| read-side critical section to see the recently removed |
| data element. |
| |
| Exercise for the reader: use a tool such as Promela/spin |
| to determine which (if any) of the memory barriers in |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} |
| are really needed. |
| See Chapter~\ref{chp:Formal Verification} |
| for information on using these tools. |
| The first correct and complete response will be credited. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why is the counter flipped twice in |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}? |
| Shouldn't a single flip-and-wait cycle be sufficient? |
| \QuickQuizAnswer{ |
| Both flips are absolutely required. |
| To see this, consider the following sequence of events: |
| \begin{enumerate} |
| \item Line~8 of \co{rcu_read_lock()} in |
| Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair} |
| picks up \co{rcu_idx}, finding its value to be zero. |
| \item Line~8 of \co{synchronize_rcu()} in |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair} |
| complements the value of \co{rcu_idx}, setting its |
| value to one. |
\item	Lines~10-13 of \co{synchronize_rcu()} find that the
	value of \co{rcu_refcnt[0]} is zero, and thus
	return.
| (Recall that the question is asking what happens if |
| lines~14-20 are omitted.) |
\item	Lines~9 and 10 of \co{rcu_read_lock()} store the
	value zero to this thread's instance of \co{rcu_read_idx}
	and increment \co{rcu_refcnt[0]}, respectively.
| Execution then proceeds into the RCU read-side critical |
| section. |
| \label{defer:rcu_rcgp:RCU Read Side Start} |
| \item Another instance of \co{synchronize_rcu()} again complements |
| \co{rcu_idx}, this time setting its value to zero. |
| Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()} |
| returns immediately. |
| (Recall that \co{rcu_read_lock()} incremented |
| \co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!) |
| \label{defer:rcu_rcgp:RCU Grace Period Start} |
| \item The grace period that started in |
| step~\ref{defer:rcu_rcgp:RCU Grace Period Start} |
| has been allowed to end, despite |
| the fact that the RCU read-side critical section |
| that started beforehand in |
| step~\ref{defer:rcu_rcgp:RCU Read Side Start} |
| has not completed. |
| This violates RCU semantics, and could allow the update |
| to free a data element that the RCU read-side critical |
| section was still referencing. |
| \end{enumerate} |
| |
| Exercise for the reader: What happens if \co{rcu_read_lock()} |
| is preempted for a very long time (hours!) just after |
| line~8? |
| Does this implementation operate correctly in that case? |
| Why or why not? |
| The first correct and complete response will be credited. |
| } \QuickQuizEnd |
| |
| This implementation avoids the update-starvation issues that could |
| occur in the single-counter implementation shown in |
| Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}. |
| |
| \paragraph{Discussion} |
| |
| There are still some serious shortcomings. |
| First, the atomic operations in \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} |
| are still quite heavyweight. |
| In fact, they are more complex than those |
| of the single-counter variant shown in |
| Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}, |
| with the read-side primitives consuming about 150~nanoseconds on a single |
| Power5 CPU and almost 40~\emph{microseconds} on a 64-CPU system. |
| The update-side \co{synchronize_rcu()} primitive is more costly as |
| well, ranging from about 200~nanoseconds on a single Power5 CPU to |
| more than 40~\emph{microseconds} on a 64-CPU system. |
| This means that the RCU read-side critical sections |
| have to be extremely long in order to get any real |
| read-side parallelism. |
| |
| Second, if there are many concurrent \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} operations, there will |
| be extreme memory contention on the \co{rcu_refcnt} |
| elements, resulting in expensive cache misses. |
| This further extends the RCU read-side critical-section |
| duration required to provide parallel read-side access. |
| These first two shortcomings defeat the purpose of RCU in most |
| situations. |
| |
| Third, the need to flip \co{rcu_idx} twice imposes substantial |
| overhead on updates, especially if there are large |
| numbers of threads. |
| |
| Finally, despite the fact that concurrent RCU updates could in principle be |
| satisfied by a common grace period, this implementation |
| serializes grace periods, preventing grace-period |
| sharing. |
| |
| \QuickQuiz{} |
| Given that atomic increment and decrement are so expensive, |
| why not just use non-atomic increment on line~10 and a |
| non-atomic decrement on line~25 of |
| Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}? |
| \QuickQuizAnswer{ |
| Using non-atomic operations would cause increments and decrements |
| to be lost, in turn causing the implementation to fail. |
| See Section~\ref{defer:Scalable Counter-Based RCU} |
| for a safe way to use non-atomic operations in |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()}. |
| } \QuickQuizEnd |
| |
| Despite these shortcomings, one could imagine this variant |
| of RCU being used on small tightly coupled multiprocessors, |
| perhaps as a memory-conserving implementation that maintains |
| API compatibility with more complex implementations. |
| However, it would not likely scale well beyond a few CPUs. |
| |
| The next section describes yet another variation on the reference-counting |
| scheme that provides greatly improved read-side performance and scalability. |
| |
| \subsubsection{Scalable Counter-Based RCU} |
| \label{defer:Scalable Counter-Based RCU} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 DEFINE_PER_THREAD(int [2], rcu_refcnt); |
| 3 atomic_t rcu_idx; |
| 4 DEFINE_PER_THREAD(int, rcu_nesting); |
| 5 DEFINE_PER_THREAD(int, rcu_read_idx); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Per-Thread Reference-Count Pair Data} |
| \label{fig:defer:RCU Per-Thread Reference-Count Pair Data} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 int i; |
| 4 int n; |
| 5 |
| 6 n = __get_thread_var(rcu_nesting); |
| 7 if (n == 0) { |
| 8 i = atomic_read(&rcu_idx); |
| 9 __get_thread_var(rcu_read_idx) = i; |
| 10 __get_thread_var(rcu_refcnt)[i]++; |
| 11 } |
| 12 __get_thread_var(rcu_nesting) = n + 1; |
| 13 smp_mb(); |
| 14 } |
| 15 |
| 16 static void rcu_read_unlock(void) |
| 17 { |
| 18 int i; |
| 19 int n; |
| 20 |
| 21 smp_mb(); |
| 22 n = __get_thread_var(rcu_nesting); |
| 23 if (n == 1) { |
| 24 i = __get_thread_var(rcu_read_idx); |
| 25 __get_thread_var(rcu_refcnt)[i]--; |
| 26 } |
| 27 __get_thread_var(rcu_nesting) = n - 1; |
| 28 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Read-Side Using Per-Thread Reference-Count Pair} |
| \label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair} |
| (\path{rcu_rcpl.h}) |
| shows the read-side primitives of an RCU implementation that uses per-thread |
| pairs of reference counters. |
| This implementation is quite similar to that shown in |
| Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}, |
| the only difference being that \co{rcu_refcnt} is now a per-thread |
| array (as shown in |
| Figure~\ref{fig:defer:RCU Per-Thread Reference-Count Pair Data}). |
| As with the algorithm in the previous section, use of this two-element |
| array prevents readers from starving updaters. |
One benefit of the per-thread \co{rcu_refcnt[]} array is that the
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} primitives no longer |
| perform atomic operations. |
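
The nesting bookkeeping on lines~6-12 and~22-27 can be distilled into the
following single-threaded sketch (the \co{toy_} names are illustrative,
not part of \path{rcu_rcpl.h}), which shows why only the outermost
lock/unlock pair touches the reference count:

```c
/* Single-threaded model of the nesting bookkeeping: only the
 * outermost rcu_read_lock()/rcu_read_unlock() pair adjusts the
 * reference count, so nested read-side critical sections are cheap.
 * The real implementation also selects an rcu_refcnt[] element and
 * executes memory barriers, both omitted here. */
static int toy_nesting;
static int toy_refcnt;

static void toy_read_lock(void)
{
	if (toy_nesting++ == 0)		/* Outermost lock? */
		toy_refcnt++;		/* Then acquire a reference. */
}

static void toy_read_unlock(void)
{
	if (--toy_nesting == 0)		/* Outermost unlock? */
		toy_refcnt--;		/* Then release the reference. */
}
```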
| |
| \QuickQuiz{} |
| Come off it! |
| We can see the \co{atomic_read()} primitive in |
| \co{rcu_read_lock()}!!! |
| So why are you trying to pretend that \co{rcu_read_lock()} |
| contains no atomic operations??? |
| \QuickQuizAnswer{ |
		The \co{atomic_read()} primitive does not actually execute
| atomic machine instructions, but rather does a normal load |
| from an \co{atomic_t}. |
| Its sole purpose is to keep the compiler's type-checking happy. |
| If the Linux kernel ran on 8-bit CPUs, it would also need to |
| prevent ``store tearing'', which could happen due to the need |
| to store a 16-bit pointer with two eight-bit accesses on some |
| 8-bit systems. |
| But thankfully, it seems that no one runs Linux on 8-bit systems. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void flip_counter_and_wait(int i) |
| 2 { |
| 3 int t; |
| 4 |
| 5 atomic_set(&rcu_idx, !i); |
| 6 smp_mb(); |
| 7 for_each_thread(t) { |
| 8 while (per_thread(rcu_refcnt, t)[i] != 0) { |
| 9 poll(NULL, 0, 10); |
| 10 } |
| 11 } |
| 12 smp_mb(); |
| 13 } |
| 14 |
| 15 void synchronize_rcu(void) |
| 16 { |
| 17 int i; |
| 18 |
| 19 smp_mb(); |
| 20 spin_lock(&rcu_gp_lock); |
| 21 i = atomic_read(&rcu_idx); |
| 22 flip_counter_and_wait(i); |
| 23 flip_counter_and_wait(!i); |
| 24 spin_unlock(&rcu_gp_lock); |
| 25 smp_mb(); |
| 26 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Update Using Per-Thread Reference-Count Pair} |
| \label{fig:defer:RCU Update Using Per-Thread Reference-Count Pair} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Update Using Per-Thread Reference-Count Pair} |
| (\path{rcu_rcpl.c}) |
| shows the implementation of \co{synchronize_rcu()}, along with a helper |
| function named \co{flip_counter_and_wait()}. |
| The \co{synchronize_rcu()} function resembles that shown in |
| Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}, |
| except that the repeated counter flip is replaced by a pair of calls |
| on lines~22 and 23 to the new helper function. |
| |
| The new \co{flip_counter_and_wait()} function updates the |
| \co{rcu_idx} variable on line~5, executes a memory barrier on line~6, |
| then lines~7-11 spin on each thread's prior \co{rcu_refcnt} element, |
| waiting for it to go to zero. |
| Once all such elements have gone to zero, |
| it executes another memory barrier on line~12 and returns. |
| |
| This RCU implementation imposes important new requirements on its |
| software environment, namely, (1) that it be possible to declare |
| per-thread variables, (2) that these per-thread variables be accessible |
from other threads, and (3) that it be possible to enumerate all threads.
| These requirements can be met in almost all software environments, |
| but often result in fixed upper bounds on the number of threads. |
| More-complex implementations might avoid such bounds, for example, by using |
| expandable hash tables. |
| Such implementations might dynamically track threads, for example, by |
| adding them on their first call to \co{rcu_read_lock()}. |
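
One minimal sketch of such dynamic tracking follows; the
\co{toy_rcu_read_lock()} helper and registry variables are hypothetical
illustrations, not part of \path{rcu_rcpl.h}:

```c
/* Hypothetical dynamic thread tracking: a thread adds itself to the
 * registry on its first call to the read-side primitive, so that
 * synchronize_rcu() need enumerate only threads that have actually
 * entered an RCU read-side critical section.  A real implementation
 * would protect the registry with a lock or use a lock-free list. */
#define MAX_THREADS 128

static int thread_registry[MAX_THREADS];	/* registered thread IDs */
static int nr_registered;
static int is_registered[MAX_THREADS];

static void toy_rcu_read_lock(int tid)
{
	if (!is_registered[tid]) {	/* First read-side entry? */
		thread_registry[nr_registered++] = tid;
		is_registered[tid] = 1;
	}
	/* ... the normal read-side entry code would follow ... */
}
```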
| |
| \QuickQuiz{} |
	Great, if we have $N$ threads, we can have $2N$ ten-millisecond
	waits (one set per \co{flip_counter_and_wait()} invocation),
	and even that assumes that we wait only once for each thread.
| Don't we need the grace period to complete \emph{much} more quickly? |
| \QuickQuizAnswer{ |
| Keep in mind that we only wait for a given thread if that thread |
| is still in a pre-existing RCU read-side critical section, |
| and that waiting for one hold-out thread gives all the other |
| threads a chance to complete any pre-existing RCU read-side |
| critical sections that they might still be executing. |
| So the only way that we would wait for $2N$ intervals |
| would be if the last thread still remained in a pre-existing |
| RCU read-side critical section despite all the waiting for |
| all the prior threads. |
| In short, this implementation will not wait unnecessarily. |
| |
| However, if you are stress-testing code that uses RCU, you |
| might want to comment out the \co{poll()} statement in |
| order to better catch bugs that incorrectly retain a reference |
| to an RCU-protected data element outside of an RCU |
| read-side critical section. |
| } \QuickQuizEnd |
| |
| This implementation still has several shortcomings. |
| First, the need to flip \co{rcu_idx} twice imposes substantial overhead |
| on updates, especially if there are large numbers of threads. |
| |
| Second, \co{synchronize_rcu()} must now examine a number of variables |
| that increases linearly with the number of threads, imposing substantial |
| overhead on applications with large numbers of threads. |
| |
| Third, as before, although concurrent RCU updates could in principle |
| be satisfied by a common grace period, this implementation serializes |
| grace periods, preventing grace-period sharing. |
| |
| Finally, as noted in the text, the need for per-thread variables |
| and for enumerating threads may be problematic in some software |
| environments. |
| |
| That said, the read-side primitives scale very nicely, requiring about |
| 115~nanoseconds regardless of whether running on a single-CPU or a 64-CPU |
| Power5 system. |
| As noted above, the \co{synchronize_rcu()} primitive does not scale, |
| ranging in overhead from almost a microsecond on a single Power5 CPU |
| up to almost 200~microseconds on a 64-CPU system. |
| This implementation could conceivably form the basis for a |
| production-quality user-level RCU implementation. |
| |
| The next section describes an algorithm permitting more efficient |
| concurrent RCU updates. |
| |
| \subsubsection{Scalable Counter-Based RCU With Shared Grace Periods} |
| \label{defer:Scalable Counter-Based RCU With Shared Grace Periods} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 DEFINE_PER_THREAD(int [2], rcu_refcnt); |
| 3 long rcu_idx; |
| 4 DEFINE_PER_THREAD(int, rcu_nesting); |
| 5 DEFINE_PER_THREAD(int, rcu_read_idx); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data} |
| \label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data} |
| \end{figure} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 int i; |
| 4 int n; |
| 5 |
| 6 n = __get_thread_var(rcu_nesting); |
| 7 if (n == 0) { |
| 8 i = ACCESS_ONCE(rcu_idx) & 0x1; |
| 9 __get_thread_var(rcu_read_idx) = i; |
| 10 __get_thread_var(rcu_refcnt)[i]++; |
| 11 } |
| 12 __get_thread_var(rcu_nesting) = n + 1; |
| 13 smp_mb(); |
| 14 } |
| 15 |
| 16 static void rcu_read_unlock(void) |
| 17 { |
| 18 int i; |
| 19 int n; |
| 20 |
| 21 smp_mb(); |
| 22 n = __get_thread_var(rcu_nesting); |
| 23 if (n == 1) { |
| 24 i = __get_thread_var(rcu_read_idx); |
| 25 __get_thread_var(rcu_refcnt)[i]--; |
| 26 } |
| 27 __get_thread_var(rcu_nesting) = n - 1; |
| 28 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update} |
| \label{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update} |
| (\path{rcu_rcpls.h}) |
| shows the read-side primitives for an RCU implementation using per-thread |
| reference count pairs, as before, but permitting updates to share |
| grace periods. |
| The main difference from the earlier implementation shown in |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair} |
| is that \co{rcu_idx} is now a \co{long} that counts freely, |
| so that line~8 of |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update} |
| must mask off the low-order bit. |
| We also switched from using \co{atomic_read()} and \co{atomic_set()} |
| to using \co{ACCESS_ONCE()}. |
| The data is also quite similar, as shown in |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data}, |
| with \co{rcu_idx} now being a \co{long} instead of an |
| \co{atomic_t}. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void flip_counter_and_wait(int ctr) |
| 2 { |
| 3 int i; |
| 4 int t; |
| 5 |
| 6 ACCESS_ONCE(rcu_idx) = ctr + 1; |
| 7 i = ctr & 0x1; |
| 8 smp_mb(); |
| 9 for_each_thread(t) { |
| 10 while (per_thread(rcu_refcnt, t)[i] != 0) { |
| 11 poll(NULL, 0, 10); |
| 12 } |
| 13 } |
| 14 smp_mb(); |
| 15 } |
| 16 |
| 17 void synchronize_rcu(void) |
| 18 { |
| 19 int ctr; |
| 20 int oldctr; |
| 21 |
| 22 smp_mb(); |
| 23 oldctr = ACCESS_ONCE(rcu_idx); |
| 24 smp_mb(); |
| 25 spin_lock(&rcu_gp_lock); |
| 26 ctr = ACCESS_ONCE(rcu_idx); |
| 27 if (ctr - oldctr >= 3) { |
| 28 spin_unlock(&rcu_gp_lock); |
| 29 smp_mb(); |
| 30 return; |
| 31 } |
| 32 flip_counter_and_wait(ctr); |
| 33 if (ctr - oldctr < 2) |
| 34 flip_counter_and_wait(ctr + 1); |
| 35 spin_unlock(&rcu_gp_lock); |
| 36 smp_mb(); |
| 37 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Shared Update Using Per-Thread Reference-Count Pair} |
| \label{fig:defer:RCU Shared Update Using Per-Thread Reference-Count Pair} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Shared Update Using Per-Thread Reference-Count Pair} |
| (\path{rcu_rcpls.c}) |
| shows the implementation of \co{synchronize_rcu()} and its helper |
| function \co{flip_counter_and_wait()}. |
| These are similar to those in |
| Figure~\ref{fig:defer:RCU Update Using Per-Thread Reference-Count Pair}. |
| The differences in \co{flip_counter_and_wait()} include: |
| \begin{enumerate} |
\item	Line~6 uses \co{ACCESS_ONCE()} instead of \co{atomic_set()},
	and increments \co{rcu_idx} rather than complementing it.
| \item A new line~7 masks the counter down to its bottom bit. |
| \end{enumerate} |
| |
| The changes to \co{synchronize_rcu()} are more pervasive: |
| \begin{enumerate} |
| \item There is a new \co{oldctr} local variable that captures |
| the pre-lock-acquisition value of \co{rcu_idx} on |
| line~23. |
| \item Line~26 uses \co{ACCESS_ONCE()} instead of \co{atomic_read()}. |
\item	Lines~27-30 check to see if at least three counter flips were
	performed by other threads while the lock was being acquired,
	and, if so, release the lock, execute a memory barrier, and return.
	In this case, there were two full waits for the counters to
	go to zero, so those other threads have already done all the
	required work.
| \item At lines~33-34, \co{flip_counter_and_wait()} is only |
| invoked a second time if there were fewer than two counter flips |
| while the lock was being acquired. |
| On the other hand, if there were two counter flips, some other |
| thread did one full wait for all the counters to go to zero, |
| so only one more is required. |
| \end{enumerate} |
| |
| With this approach, if an arbitrarily large number of threads invoke |
| \co{synchronize_rcu()} concurrently, with one CPU for each thread, there |
| will be a total of only three waits for counters to go to zero. |
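
The grace-period-sharing decision embodied in lines~27 and~33 can be
isolated as follows; \co{flips_needed()} is an illustrative helper, not
part of \path{rcu_rcpls.c}:

```c
/* Illustrative distillation of the grace-period-sharing logic in
 * synchronize_rcu(): given the value of rcu_idx sampled before
 * acquiring rcu_gp_lock (oldctr) and the value sampled afterward
 * (ctr), how many counter flips must this updater itself perform?
 * Three or more intervening flips mean that other updaters already
 * did two full waits on our behalf; exactly two mean that one flip
 * remains; otherwise, both flips are required. */
static int flips_needed(int ctr, int oldctr)
{
	if (ctr - oldctr >= 3)
		return 0;	/* Early return: no flips needed. */
	if (ctr - oldctr >= 2)
		return 1;	/* Second flip may be skipped. */
	return 2;		/* Both flips required. */
}
```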
| |
| Despite the improvements, this implementation of RCU still |
| has a few shortcomings. |
| First, as before, the need to flip \co{rcu_idx} twice imposes substantial |
| overhead on updates, especially if there are large |
| numbers of threads. |
| |
| Second, each updater still acquires \co{rcu_gp_lock}, even if there |
| is no work to be done. |
| This can result in a severe scalability limitation |
| if there are large numbers of concurrent updates. |
| There are ways of avoiding this, as was done in a |
| production-quality real-time implementation of RCU for the Linux |
| kernel~\cite{PaulEMcKenney2007PreemptibleRCU}. |
| |
| Third, this implementation requires per-thread variables |
| and the ability to enumerate threads, which again can be |
| problematic in some software environments. |
| |
| Finally, on 32-bit machines, a given update thread might be |
| preempted long enough for the \co{rcu_idx} |
| counter to overflow. |
| This could cause such a thread to force an unnecessary |
| pair of counter flips. |
| However, even if each grace period took only one |
| microsecond, the offending thread would need to be |
| preempted for more than an hour, in which case an |
| extra pair of counter flips is likely the least of |
| your worries. |
| |
| As with the implementation described in |
| Section~\ref{defer:Simple Counter-Based RCU}, |
| the read-side primitives scale extremely well, incurring roughly |
| 115~nanoseconds of overhead regardless of the number of CPUs. |
| The \co{synchronize_rcu()} primitive is still expensive, |
| ranging from about one microsecond up to about 16~microseconds. |
| This is nevertheless much cheaper than the roughly 200~microseconds |
| incurred by the implementation in |
| Section~\ref{defer:Scalable Counter-Based RCU}. |
| So, despite its shortcomings, one could imagine this |
| RCU implementation being used in production in real-life applications. |
| |
| \QuickQuiz{} |
| All of these toy RCU implementations have either atomic operations |
| in \co{rcu_read_lock()} and \co{rcu_read_unlock()}, |
| or \co{synchronize_rcu()} |
| overhead that increases linearly with the number of threads. |
| Under what circumstances could an RCU implementation enjoy |
| light-weight implementations for all three of these primitives, |
| all having deterministic ($O\left(1\right)$) overheads and latencies? |
| \QuickQuizAnswer{ |
| Special-purpose uniprocessor implementations of RCU can attain |
| this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}. |
| } \QuickQuizEnd |
| |
| Referring back to |
| Figure~\ref{fig:defer:RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update}, |
| we see that there is one global-variable access and no fewer than four |
| accesses to thread-local variables. |
| Given the relatively high cost of thread-local accesses on systems |
| implementing POSIX threads, it is tempting to collapse the three |
| thread-local variables into a single structure, permitting |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} to access their |
| thread-local data with a single thread-local-storage access. |
| However, an even better approach would be to reduce the number of |
| thread-local accesses to one, as is done in the next section. |
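
A sketch of the structure-collapsing approach follows; the
\co{rcu_reader_state} and \co{toy_read_lock()} names are hypothetical,
and the sketch is single-threaded for clarity (in a real implementation,
\co{rcu_reader} would be a per-thread variable):

```c
/* Collapsing the three per-thread variables into a single structure
 * lets the read-side primitives pay for only one thread-local-storage
 * lookup apiece. */
struct rcu_reader_state {
	int refcnt[2];	/* replaces per-thread rcu_refcnt */
	int nesting;	/* replaces per-thread rcu_nesting */
	int read_idx;	/* replaces per-thread rcu_read_idx */
};

static long rcu_idx;	/* global grace-period index, as before */
static struct rcu_reader_state rcu_reader;

static void toy_read_lock(void)
{
	struct rcu_reader_state *s = &rcu_reader; /* one lookup */

	if (s->nesting++ == 0) {
		s->read_idx = (int)(rcu_idx & 0x1);
		s->refcnt[s->read_idx]++;
	}
	/* A real implementation would execute smp_mb() here. */
}
```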
| |
| \subsubsection{RCU Based on Free-Running Counter} |
| \label{defer:RCU Based on Free-Running Counter} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 long rcu_gp_ctr = 0; |
| 3 DEFINE_PER_THREAD(long, rcu_reader_gp); |
| 4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Data for Free-Running Counter Using RCU} |
| \label{fig:defer:Data for Free-Running Counter Using RCU} |
| \end{figure} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 __get_thread_var(rcu_reader_gp) = |
| 4 ACCESS_ONCE(rcu_gp_ctr) + 1; |
| 5 smp_mb(); |
| 6 } |
| 7 |
| 8 static void rcu_read_unlock(void) |
| 9 { |
| 10 smp_mb(); |
| 11 __get_thread_var(rcu_reader_gp) = |
| 12 ACCESS_ONCE(rcu_gp_ctr); |
| 13 } |
| 14 |
| 15 void synchronize_rcu(void) |
| 16 { |
| 17 int t; |
| 18 |
| 19 smp_mb(); |
| 20 spin_lock(&rcu_gp_lock); |
| 21 ACCESS_ONCE(rcu_gp_ctr) += 2; |
| 22 smp_mb(); |
| 23 for_each_thread(t) { |
| 24 while ((per_thread(rcu_reader_gp, t) & 0x1) && |
| 25 ((per_thread(rcu_reader_gp, t) - |
| 26 ACCESS_ONCE(rcu_gp_ctr)) < 0)) { |
| 27 poll(NULL, 0, 10); |
| 28 } |
| 29 } |
| 30 spin_unlock(&rcu_gp_lock); |
| 31 smp_mb(); |
| 32 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Free-Running Counter Using RCU} |
| \label{fig:defer:Free-Running Counter Using RCU} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Free-Running Counter Using RCU} |
| (\path{rcu.h} and \path{rcu.c}) |
| shows an RCU implementation based on a single global free-running counter |
| that takes on only even-numbered values, with data shown in |
| Figure~\ref{fig:defer:Data for Free-Running Counter Using RCU}. |
| The resulting \co{rcu_read_lock()} implementation is extremely |
| straightforward. |
Lines~3 and~4 simply fetch the value of the global free-running
\co{rcu_gp_ctr} variable, add one to it, and store the resulting
odd-numbered value into the \co{rcu_reader_gp} per-thread variable.
| Line~5 executes a memory barrier to prevent the content of the |
| subsequent RCU read-side critical section from ``leaking out''. |
| |
| The \co{rcu_read_unlock()} implementation is similar. |
| Line~10 executes a memory barrier, again to prevent the prior RCU |
| read-side critical section from ``leaking out''. |
| Lines~11 and~12 then copy the \co{rcu_gp_ctr} global variable to the |
| \co{rcu_reader_gp} per-thread variable, leaving this per-thread |
| variable with an even-numbered value so that a concurrent instance |
| of \co{synchronize_rcu()} will know to ignore it. |
| |
| \QuickQuiz{} |
| If any even value is sufficient to tell \co{synchronize_rcu()} |
	to ignore a given task, why don't lines~11 and~12 of
| Figure~\ref{fig:defer:Free-Running Counter Using RCU} |
| simply assign zero to \co{rcu_reader_gp}? |
| \QuickQuizAnswer{ |
| Assigning zero (or any other even-numbered constant) |
| would in fact work, but assigning the value of |
| \co{rcu_gp_ctr} can provide a valuable debugging aid, |
| as it gives the developer an idea of when the corresponding |
| thread last exited an RCU read-side critical section. |
| } \QuickQuizEnd |
| |
| Thus, \co{synchronize_rcu()} could wait for all of the per-thread |
| \co{rcu_reader_gp} variables to take on even-numbered values. |
| However, it is possible to do much better than that because |
| \co{synchronize_rcu()} need only wait on \emph{pre-existing} |
| RCU read-side critical sections. |
| Line~19 executes a memory barrier to prevent prior manipulations |
| of RCU-protected data structures from being reordered (by either |
| the CPU or the compiler) to follow the increment on line~21. |
| Line~20 acquires the \co{rcu_gp_lock} (and line~30 releases it) |
| in order to prevent multiple |
| \co{synchronize_rcu()} instances from running concurrently. |
| Line~21 then increments the global \co{rcu_gp_ctr} variable by |
| two, so that all pre-existing RCU read-side critical sections will |
| have corresponding per-thread \co{rcu_reader_gp} variables with |
| values less than that of \co{rcu_gp_ctr}, modulo the machine's |
| word size. |
| Recall also that threads with even-numbered values of \co{rcu_reader_gp} |
| are not in an RCU read-side critical section, |
| so that lines~23-29 scan the \co{rcu_reader_gp} values until they |
| all are either even (line~24) or are greater than the global |
| \co{rcu_gp_ctr} (lines~25-26). |
| Line~27 blocks for a short period of time to wait for a |
| pre-existing RCU read-side critical section, but this can be replaced with |
| a spin-loop if grace-period latency is of the essence. |
| Finally, the memory barrier at line~31 ensures that any subsequent |
| destruction will not be reordered into the preceding loop. |
| |
| \QuickQuiz{} |
| Why are the memory barriers on lines~19 and~31 of |
| Figure~\ref{fig:defer:Free-Running Counter Using RCU} |
| needed? |
| Aren't the memory barriers inherent in the locking |
| primitives on lines~20 and~30 sufficient? |
| \QuickQuizAnswer{ |
| These memory barriers are required because the locking |
| primitives are only guaranteed to confine the critical |
| section. |
		The locking primitives are under absolutely no obligation
		to keep other code from bleeding into the critical section.
		The pair of memory barriers is therefore required to prevent
		this sort of code motion, whether performed by the compiler
		or by the CPU.
| } \QuickQuizEnd |
| |
| This approach achieves much better read-side performance, incurring |
| roughly 63~nanoseconds of overhead regardless of the number of |
| Power5 CPUs. |
| Updates incur more overhead, ranging from about 500~nanoseconds on |
| a single Power5 CPU to more than 100~\emph{microseconds} on 64 |
| such CPUs. |
| |
| \QuickQuiz{} |
| Couldn't the update-side batching optimization described in |
| Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods} |
| be applied to the implementation shown in |
| Figure~\ref{fig:defer:Free-Running Counter Using RCU}? |
| \QuickQuizAnswer{ |
| Indeed it could, with a few modifications. |
| This work is left as an exercise for the reader. |
| } \QuickQuizEnd |
| |
| This implementation suffers from some serious shortcomings in |
| addition to the high update-side overhead noted earlier. |
| First, it is no longer permissible to nest RCU read-side critical |
| sections, a topic that is taken up in the next section. |
| Second, if a reader is preempted at line~3 of |
| Figure~\ref{fig:defer:Free-Running Counter Using RCU} after fetching from |
| \co{rcu_gp_ctr} but before storing to \co{rcu_reader_gp}, |
| and if the \co{rcu_gp_ctr} counter then runs through more than half |
| but less than all of its possible values, then \co{synchronize_rcu()} |
| will ignore the subsequent RCU read-side critical section. |
| Third and finally, this implementation requires that the enclosing software |
| environment be able to enumerate threads and maintain per-thread |
| variables. |
| |
| \QuickQuiz{} |
| Is the possibility of readers being preempted in |
| lines~3-4 of Figure~\ref{fig:defer:Free-Running Counter Using RCU} |
| a real problem, in other words, is there a real sequence |
| of events that could lead to failure? |
| If not, why not? |
| If so, what is the sequence of events, and how can the |
| failure be addressed? |
| \QuickQuizAnswer{ |
| It is a real problem, there is a sequence of events leading to |
| failure, and there are a number of possible ways of |
| addressing it. |
| For more details, see the Quick Quizzes near the end of |
| Section~\ref{defer:Nestable RCU Based on Free-Running Counter}. |
		The reason for locating the discussion there is (1) to give you
		more time to think about it, and (2) that the nesting support
		added in that section greatly reduces the time required to
		overflow the counter.
| } \QuickQuizEnd |
| |
| \subsubsection{Nestable RCU Based on Free-Running Counter} |
| \label{defer:Nestable RCU Based on Free-Running Counter} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 #define RCU_GP_CTR_SHIFT 7 |
| 3 #define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT) |
| 4 #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1) |
| 5 long rcu_gp_ctr = 0; |
| 6 DEFINE_PER_THREAD(long, rcu_reader_gp); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Data for Nestable RCU Using a Free-Running Counter} |
| \label{fig:defer:Data for Nestable RCU Using a Free-Running Counter} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 long tmp; |
| 4 long *rrgp; |
| 5 |
| 6 rrgp = &__get_thread_var(rcu_reader_gp); |
| 7 tmp = *rrgp; |
| 8 if ((tmp & RCU_GP_CTR_NEST_MASK) == 0) |
| 9 tmp = ACCESS_ONCE(rcu_gp_ctr); |
| 10 tmp++; |
| 11 *rrgp = tmp; |
| 12 smp_mb(); |
| 13 } |
| 14 |
| 15 static void rcu_read_unlock(void) |
| 16 { |
| 17 long tmp; |
| 18 |
| 19 smp_mb(); |
| 20 __get_thread_var(rcu_reader_gp)--; |
| 21 } |
| 22 |
| 23 void synchronize_rcu(void) |
| 24 { |
| 25 int t; |
| 26 |
| 27 smp_mb(); |
| 28 spin_lock(&rcu_gp_lock); |
| 29 ACCESS_ONCE(rcu_gp_ctr) += |
| 30 RCU_GP_CTR_BOTTOM_BIT; |
| 31 smp_mb(); |
| 32 for_each_thread(t) { |
| 33 while (rcu_gp_ongoing(t) && |
| 34 ((per_thread(rcu_reader_gp, t) - |
| 35 rcu_gp_ctr) < 0)) { |
| 36 poll(NULL, 0, 10); |
| 37 } |
| 38 } |
| 39 spin_unlock(&rcu_gp_lock); |
| 40 smp_mb(); |
| 41 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Nestable RCU Using a Free-Running Counter} |
| \label{fig:defer:Nestable RCU Using a Free-Running Counter} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter} |
| (\path{rcu_nest.h} and \path{rcu_nest.c}) |
shows an RCU implementation based on a single global free-running counter,
| but that permits nesting of RCU read-side critical sections. |
| This nestability is accomplished by reserving the low-order bits of the |
| global \co{rcu_gp_ctr} to count nesting, using the definitions shown in |
| Figure~\ref{fig:defer:Data for Nestable RCU Using a Free-Running Counter}. |
| This is a generalization of the scheme in |
| Section~\ref{defer:RCU Based on Free-Running Counter}, |
| which can be thought of as having a single low-order bit reserved |
| for counting nesting depth. |
| Two C-preprocessor macros are used to arrange this, |
| \co{RCU_GP_CTR_NEST_MASK} and |
| \co{RCU_GP_CTR_BOTTOM_BIT}. |
| These are related: \co{RCU_GP_CTR_NEST_MASK=RCU_GP_CTR_BOTTOM_BIT-1}. |
| The \co{RCU_GP_CTR_BOTTOM_BIT} macro contains a single bit that is |
| positioned just above the bits reserved for counting nesting, |
| and the \co{RCU_GP_CTR_NEST_MASK} has all one bits covering the |
| region of \co{rcu_gp_ctr} used to count nesting. |
| Obviously, these two C-preprocessor macros must reserve enough |
| of the low-order bits of the counter to permit the maximum required |
| nesting of RCU read-side critical sections, and this implementation |
| reserves seven bits, for a maximum RCU read-side critical-section |
| nesting depth of 127, which should be well in excess of that needed |
| by most applications. |
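
The field layout can be illustrated directly; \co{nesting_depth()} and
\co{gp_number()} are hypothetical helpers for extracting the two fields
of a combined counter, not part of \path{rcu_nest.h}:

```c
#define RCU_GP_CTR_SHIFT 7
#define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)

/* The low seven bits of rcu_reader_gp count read-side nesting depth;
 * the remaining upper bits snapshot the grace-period counter. */
static long nesting_depth(long reader_gp)
{
	return reader_gp & RCU_GP_CTR_NEST_MASK;
}

static long gp_number(long reader_gp)
{
	return reader_gp >> RCU_GP_CTR_SHIFT;
}
```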
| |
| The resulting \co{rcu_read_lock()} implementation is still reasonably |
| straightforward. |
| Line~6 places a pointer to this thread's instance of \co{rcu_reader_gp} |
| into the local variable \co{rrgp}, minimizing the number of expensive |
| calls to the pthreads thread-local-state API. |
| Line~7 records the current value of \co{rcu_reader_gp} into another |
| local variable \co{tmp}, and line~8 checks to see if the low-order |
| bits are zero, which would indicate that this is the outermost |
| \co{rcu_read_lock()}. |
| If so, line~9 places the global \co{rcu_gp_ctr} into \co{tmp} because |
| the current value previously fetched by line~7 is likely to be obsolete. |
| In either case, line~10 increments the nesting depth, which you will |
| recall is stored in the seven low-order bits of the counter. |
| Line~11 stores the updated counter back into this thread's instance |
| of \co{rcu_reader_gp}, and, finally, line~12 executes a memory |
| barrier to prevent the RCU read-side critical section from bleeding out |
| into the code preceding the call to \co{rcu_read_lock()}. |
| |
| In other words, this implementation of \co{rcu_read_lock()} picks up a copy |
| of the global \co{rcu_gp_ctr} unless the current invocation of |
| \co{rcu_read_lock()} is nested within an RCU read-side critical section, |
| in which case it instead fetches the contents of the current thread's |
| instance of \co{rcu_reader_gp}. |
| Either way, it increments whatever value it fetched in order to record |
| an additional nesting level, and stores the result in the current |
| thread's instance of \co{rcu_reader_gp}. |
| |
| Interestingly enough, despite their \co{rcu_read_lock()} differences, |
| the implementation of \co{rcu_read_unlock()} |
| is broadly similar to that shown in |
| Section~\ref{defer:RCU Based on Free-Running Counter}. |
| Line~19 executes a memory barrier in order to prevent the RCU read-side |
| critical section from bleeding out into code following the call |
| to \co{rcu_read_unlock()}, and |
| line~20 decrements this thread's instance of \co{rcu_reader_gp}, |
| which has the effect of decrementing the nesting count contained in |
| \co{rcu_reader_gp}'s low-order bits. |
| Debugging versions of this primitive would check (before decrementing!) |
| that these low-order bits were non-zero. |
| |
| The implementation of \co{synchronize_rcu()} is quite similar to |
| that shown in |
| Section~\ref{defer:RCU Based on Free-Running Counter}. |
| There are two differences. |
The first is that lines~29 and~30 add \co{RCU_GP_CTR_BOTTOM_BIT}
to the global \co{rcu_gp_ctr} instead of adding the constant ``2'',
and the second is that the comparison on line~33 has been abstracted
out to a separate function, which checks the nesting bits covered
by \co{RCU_GP_CTR_NEST_MASK} instead of unconditionally checking
the low-order bit.
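
One plausible form of that abstracted comparison is shown below as a
value-based sketch; the actual \co{rcu_gp_ongoing()} helper in
\path{rcu_nest.c} takes a thread identifier and reads that thread's
per-thread variable:

```c
#define RCU_GP_CTR_SHIFT 7
#define RCU_GP_CTR_BOTTOM_BIT (1 << RCU_GP_CTR_SHIFT)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)

/* A reader is inside an RCU read-side critical section exactly when
 * the nesting bits of its rcu_reader_gp counter are nonzero. */
static int rcu_gp_ongoing(long reader_gp)
{
	return (reader_gp & RCU_GP_CTR_NEST_MASK) != 0;
}
```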
| |
| This approach achieves read-side performance almost equal to that |
| shown in |
| Section~\ref{defer:RCU Based on Free-Running Counter}, incurring |
| roughly 65~nanoseconds of overhead regardless of the number of |
| Power5 CPUs. |
| Updates again incur more overhead, ranging from about 600~nanoseconds on |
| a single Power5 CPU to more than 100~\emph{microseconds} on 64 |
| such CPUs. |
| |
| \QuickQuiz{} |
| Why not simply maintain a separate per-thread nesting-level |
	variable, as was done in previous sections, rather than having
| all this complicated bit manipulation? |
| \QuickQuizAnswer{ |
| The apparent simplicity of the separate per-thread variable |
| is a red herring. |
| This approach incurs much greater complexity in the guise |
| of careful ordering of operations, especially if signal |
| handlers are to be permitted to contain RCU read-side |
| critical sections. |
| But don't take my word for it, code it up and see what you |
| end up with! |
| } \QuickQuizEnd |
| |
| This implementation suffers from the same shortcomings as does that of |
| Section~\ref{defer:RCU Based on Free-Running Counter}, except that |
| nesting of RCU read-side critical sections is now permitted. |
| In addition, on 32-bit systems, this approach shortens the time |
| required to overflow the global \co{rcu_gp_ctr} variable. |
| The following section shows one way to greatly increase the time |
| required for overflow to occur, while greatly reducing read-side |
| overhead. |
| |
| \QuickQuiz{} |
| Given the algorithm shown in |
| Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}, |
| how could you double the time required to overflow the global |
| \co{rcu_gp_ctr}? |
| \QuickQuizAnswer{ |
| One way would be to replace the magnitude comparison on |
| lines~33 and 34 with an inequality check of the per-thread |
| \co{rcu_reader_gp} variable against |
| \co{rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Again, given the algorithm shown in |
| Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter}, |
| is counter overflow fatal? |
| Why or why not? |
| If it is fatal, what can be done to fix it? |
| \QuickQuizAnswer{ |
| It can indeed be fatal. |
| To see this, consider the following sequence of events: |
| \begin{enumerate} |
| \item Thread~0 enters \co{rcu_read_lock()}, determines |
| that it is not nested, and therefore fetches the |
| value of the global \co{rcu_gp_ctr}. |
| Thread~0 is then preempted for an extremely long time |
| (before storing to its per-thread \co{rcu_reader_gp} |
| variable). |
| \item Other threads repeatedly invoke \co{synchronize_rcu()}, |
| so that the new value of the global \co{rcu_gp_ctr} |
| is now \co{RCU_GP_CTR_BOTTOM_BIT} |
| less than it was when thread~0 fetched it. |
| \item Thread~0 now starts running again, and stores into |
| its per-thread \co{rcu_reader_gp} variable. |
| The value it stores is |
| \co{RCU_GP_CTR_BOTTOM_BIT+1} |
| greater than that of the global \co{rcu_gp_ctr}. |
| \item Thread~0 acquires a reference to RCU-protected data |
| element~A. |
| \item Thread~1 now removes the data element~A that thread~0 |
| just acquired a reference to. |
| \item Thread~1 invokes \co{synchronize_rcu()}, which |
| increments the global \co{rcu_gp_ctr} by |
| \co{RCU_GP_CTR_BOTTOM_BIT}. |
| It then checks all of the per-thread \co{rcu_reader_gp} |
| variables, but thread~0's value (incorrectly) indicates |
| that it started after thread~1's call to |
| \co{synchronize_rcu()}, so thread~1 does not wait |
| for thread~0 to complete its RCU read-side critical |
| section. |
| \item Thread~1 then frees up data element~A, which thread~0 |
| is still referencing. |
| \end{enumerate} |
| |
| 	Note that this scenario can also occur in the implementation presented in
| 	Section~\ref{defer:RCU Based on Free-Running Counter}.
| |
| One strategy for fixing this problem is to use 64-bit |
| counters so that the time required to overflow them would exceed |
| the useful lifetime of the computer system. |
| 	Note that non-antique members of the 32-bit x86 CPU family
| 	allow atomic manipulation of 64-bit counters via the
| 	\co{cmpxchg8b} instruction.
| |
| Another strategy is to limit the rate at which grace periods are |
| permitted to occur in order to achieve a similar effect. |
| For example, \co{synchronize_rcu()} could record the last time |
| that it was invoked, and any subsequent invocation would then |
| check this time and block as needed to force the desired |
| spacing. |
| For example, if the low-order four bits of the counter were |
| reserved for nesting, and if grace periods were permitted to |
| occur at most ten times per second, then it would take more |
| than 300 days for the counter to overflow. |
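|
| 	To check the arithmetic: reserving the low-order four bits
| 	for nesting leaves $2^{32-4} = 2^{28}$ counter values for
| 	grace periods, and at ten grace periods per second this gives
| 	\begin{displaymath}
| 	\frac{2^{28}}{10~\mbox{grace periods/second}}
| 	\approx 2.7 \times 10^{7}~\mbox{seconds}
| 	\approx 310~\mbox{days}.
| 	\end{displaymath}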
| However, this approach is not helpful if there is any possibility |
| that the system will be fully loaded with CPU-bound high-priority |
| real-time threads for the full 300 days. |
| (A remote possibility, perhaps, but best to consider it ahead |
| of time.) |
| |
| A third approach is to administratively abolish real-time threads |
| from the system in question. |
| In this case, the preempted process will age up in priority, |
| thus getting to run long before the counter had a chance to |
| overflow. |
| Of course, this approach is less than helpful for real-time |
| applications. |
| |
| A final approach would be for \co{rcu_read_lock()} to recheck |
| the value of the global \co{rcu_gp_ctr} after storing to its |
| per-thread \co{rcu_reader_gp} counter, retrying if the new |
| value of the global \co{rcu_gp_ctr} is inappropriate. |
| This works, but introduces non-deterministic execution time |
| into \co{rcu_read_lock()}. |
| On the other hand, if your application is being preempted long |
| enough for the counter to overflow, you have no hope of |
| deterministic execution time in any case! |
| % |
| % @@@ A fourth approach is rcu_nest32.[hc]. |
| } \QuickQuizEnd |
| |
| \subsubsection{RCU Based on Quiescent States} |
| \label{defer:RCU Based on Quiescent States} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 DEFINE_SPINLOCK(rcu_gp_lock); |
| 2 long rcu_gp_ctr = 0; |
| 3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Data for Quiescent-State-Based RCU} |
| \label{fig:defer:Data for Quiescent-State-Based RCU} |
| \end{figure} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 static void rcu_read_lock(void) |
| 2 { |
| 3 } |
| 4 |
| 5 static void rcu_read_unlock(void) |
| 6 { |
| 7 } |
| 8 |
|   9 static void rcu_quiescent_state(void)
| 10 { |
| 11 smp_mb(); |
| 12 __get_thread_var(rcu_reader_qs_gp) = |
| 13 ACCESS_ONCE(rcu_gp_ctr) + 1; |
| 14 smp_mb(); |
| 15 } |
| 16 |
| 17 static void rcu_thread_offline(void) |
| 18 { |
| 19 smp_mb(); |
| 20 __get_thread_var(rcu_reader_qs_gp) = |
| 21 ACCESS_ONCE(rcu_gp_ctr); |
| 22 smp_mb(); |
| 23 } |
| 24 |
| 25 static void rcu_thread_online(void) |
| 26 { |
| 27 rcu_quiescent_state(); |
| 28 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Quiescent-State-Based RCU Read Side} |
| \label{fig:defer:Quiescent-State-Based RCU Read Side} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side} |
| (\path{rcu_qs.h}) |
| shows the read-side primitives used to construct a user-level |
| implementation of RCU based on quiescent states, with the data shown in |
| Figure~\ref{fig:defer:Data for Quiescent-State-Based RCU}. |
| As can be seen from lines~1-7 in the figure, the \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} primitives do nothing, and can in fact |
| be expected to be inlined and optimized away, as they are in |
| server builds of the Linux kernel. |
| This is due to the fact that quiescent-state-based RCU implementations |
| \emph{approximate} the extents of RCU read-side critical sections |
| using the aforementioned quiescent states. |
| Each of these quiescent states contains a call to
| \co{rcu_quiescent_state()}, which is shown on lines~9-15 of the figure.
| A thread entering an extended quiescent state (for example, when
| blocking) may instead call \co{rcu_thread_offline()} (lines~17-23)
| upon entry and \co{rcu_thread_online()} (lines~25-28) upon exit.
| As such, \co{rcu_thread_online()} is analogous to \co{rcu_read_lock()} |
| and \co{rcu_thread_offline()} is analogous to \co{rcu_read_unlock()}. |
| In addition, \co{rcu_quiescent_state()} can be thought of as a |
| \co{rcu_thread_online()} immediately followed by a |
| \co{rcu_thread_offline()}.\footnote{ |
| Although the code in the figure is consistent with |
| \co{rcu_quiescent_state()} |
| being the same as \co{rcu_thread_online()} immediately followed by |
| \co{rcu_thread_offline()}, this relationship is obscured by |
| performance optimizations.} |
| It is illegal to invoke \co{rcu_quiescent_state()}, \co{rcu_thread_offline()}, |
| or \co{rcu_thread_online()} from an RCU read-side critical section. |
| |
| In \co{rcu_quiescent_state()}, line~11 executes a memory barrier |
| to prevent any code prior to the quiescent state (including possible |
| RCU read-side critical sections) from being reordered |
| into the quiescent state. |
| Lines~12-13 pick up a copy of the global \co{rcu_gp_ctr}, using |
| \co{ACCESS_ONCE()} to ensure that the compiler does not employ any |
| optimizations that would result in \co{rcu_gp_ctr} being fetched |
| more than once, |
| and then adds one to the value fetched and stores it into |
| the per-thread \co{rcu_reader_qs_gp} variable, so that any concurrent |
| instance of \co{synchronize_rcu()} will see an odd-numbered value, |
| thus becoming aware that a new RCU read-side critical section has started. |
| Instances of \co{synchronize_rcu()} that are waiting on older |
| RCU read-side critical sections will thus know to ignore this new one. |
| Finally, line~14 executes a memory barrier, which prevents subsequent
| code (including a possible RCU read-side critical section) from being
| reordered with lines~12-13.
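|
| The even/odd encoding at work here can be illustrated with a short
| single-threaded sketch.
| This is a hypothetical mock of the counters, with the memory barriers
| and per-thread machinery of the real implementation omitted:
|
| ```c
| #include <assert.h>
|
| /* Single-threaded stand-ins for the variables in the
|  * quiescent-state-based RCU figures (a mock, not rcu_qs.h).
|  * rcu_gp_ctr advances by two per grace period, so it stays even. */
| static long rcu_gp_ctr = 0;
| static long rcu_reader_qs_gp = 0;  /* per-thread in the real code */
|
| int main(void)
| {
| 	/* rcu_quiescent_state(): store an odd value, announcing a
| 	 * new quiescent state and thus a potential new reader. */
| 	rcu_reader_qs_gp = rcu_gp_ctr + 1;
| 	assert(rcu_reader_qs_gp & 1);
|
| 	/* synchronize_rcu() advances the grace-period counter. */
| 	rcu_gp_ctr += 2;
|
| 	/* The reader's snapshot is odd and predates the current
| 	 * grace period, so the updater must wait for it. */
| 	assert((rcu_reader_qs_gp & 1) &&
| 	       (rcu_reader_qs_gp - rcu_gp_ctr) < 0);
|
| 	/* rcu_thread_offline(): store an even value, so that
| 	 * updaters ignore this thread. */
| 	rcu_reader_qs_gp = rcu_gp_ctr;
| 	assert(!(rcu_reader_qs_gp & 1));
| 	return 0;
| }
| ```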
| |
| \QuickQuiz{} |
| Doesn't the additional memory barrier shown on line~14 of |
| Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side} |
| 	greatly increase the overhead of \co{rcu_quiescent_state()}?
| \QuickQuizAnswer{ |
| Indeed it does! |
| 	An application using this implementation of RCU should therefore
| 	invoke \co{rcu_quiescent_state()} sparingly, instead using
| 	\co{rcu_read_lock()} and \co{rcu_read_unlock()} most of the
| 	time.
| |
| However, this memory barrier is absolutely required so that |
| other threads will see the store on lines~12-13 before any |
| subsequent RCU read-side critical sections executed by the |
| caller. |
| } \QuickQuizEnd |
| |
| Some applications might use RCU only occasionally, but use it very heavily |
| when they do use it. |
| Such applications might choose to use \co{rcu_thread_online()} when |
| starting to use RCU and \co{rcu_thread_offline()} when no longer |
| using RCU. |
| The time between a call to \co{rcu_thread_offline()} and a subsequent |
| call to \co{rcu_thread_online()} is an extended quiescent state, |
| so that RCU will not expect explicit quiescent states to be registered |
| during this time. |
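|
| The shape of such an application can be sketched as follows, again
| using hypothetical single-threaded mocks of the primitives (the real
| implementation's memory barriers and per-thread variables are omitted):
|
| ```c
| #include <assert.h>
|
| /* Hypothetical mocks of the quiescent-state API, tracking only the
|  * counter protocol so the offline/online pattern can be shown. */
| static long rcu_gp_ctr = 0;        /* always even */
| static long rcu_reader_qs_gp = 0;  /* per-thread in the real code */
|
| static void rcu_quiescent_state(void) { rcu_reader_qs_gp = rcu_gp_ctr + 1; }
| static void rcu_thread_offline(void)  { rcu_reader_qs_gp = rcu_gp_ctr; }
| static void rcu_thread_online(void)   { rcu_quiescent_state(); }
| static int  rcu_gp_ongoing(void)      { return rcu_reader_qs_gp & 1; }
|
| int main(void)
| {
| 	/* Phase with no RCU use: enter an extended quiescent state. */
| 	rcu_thread_offline();
| 	assert(!rcu_gp_ongoing());  /* updaters ignore this thread */
|
| 	/* Phase of heavy RCU use: come online, then announce
| 	 * quiescent states periodically between batches of
| 	 * read-side critical sections. */
| 	rcu_thread_online();
| 	assert(rcu_gp_ongoing());
| 	rcu_quiescent_state();
|
| 	/* Done with RCU: go offline again. */
| 	rcu_thread_offline();
| 	assert(!rcu_gp_ongoing());
| 	return 0;
| }
| ```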
| |
| The \co{rcu_thread_offline()} function simply sets the |
| per-thread \co{rcu_reader_qs_gp} variable to the current value of |
| \co{rcu_gp_ctr}, which has an even-numbered value. |
| Any concurrent instances of \co{synchronize_rcu()} will thus know to |
| ignore this thread. |
| |
| \QuickQuiz{} |
| Why are the two memory barriers on lines~19 and 22 of |
| Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side} |
| needed? |
| \QuickQuizAnswer{
| 	The memory barrier on line~19 ensures that any RCU read-side
| 	critical sections that might precede the
| 	call to \co{rcu_thread_offline()} won't be reordered by either
| 	the compiler or the CPU to follow the assignment on lines~20-21.
| The memory barrier on line~22 is, strictly speaking, unnecessary, |
| as it is illegal to have any RCU read-side critical sections |
| following the call to \co{rcu_thread_offline()}. |
| } \QuickQuizEnd |
| |
| The \co{rcu_thread_online()} function simply invokes |
| \co{rcu_quiescent_state()}, thus marking the end of the extended |
| quiescent state. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 void synchronize_rcu(void) |
| 2 { |
| 3 int t; |
| 4 |
| 5 smp_mb(); |
| 6 spin_lock(&rcu_gp_lock); |
| 7 rcu_gp_ctr += 2; |
| 8 smp_mb(); |
| 9 for_each_thread(t) { |
| 10 while (rcu_gp_ongoing(t) && |
| 11 ((per_thread(rcu_reader_qs_gp, t) - |
| 12 rcu_gp_ctr) < 0)) { |
| 13 poll(NULL, 0, 10); |
| 14 } |
| 15 } |
| 16 spin_unlock(&rcu_gp_lock); |
| 17 smp_mb(); |
| 18 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Update Side Using Quiescent States} |
| \label{fig:defer:RCU Update Side Using Quiescent States} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:RCU Update Side Using Quiescent States} |
| (\path{rcu_qs.c}) |
| shows the implementation of \co{synchronize_rcu()}, which is |
| quite similar to that of the preceding sections. |
| |
| This implementation has blazingly fast read-side primitives, with |
| an \co{rcu_read_lock()}-\co{rcu_read_unlock()} round trip incurring |
| an overhead of roughly 50~\emph{picoseconds}. |
| The \co{synchronize_rcu()} overhead ranges from about 600~nanoseconds |
| on a single-CPU Power5 system up to more than 100~microseconds on |
| a 64-CPU system. |
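|
| One subtlety in the comparison on lines~10-12 is that it is a
| wraparound-safe comparison: taking the difference of the two counters
| first and only then checking its sign gives the right answer even
| after the global counter wraps.
| The following sketch illustrates the idea; \co{counter_before()} is a
| hypothetical helper, not part of \path{rcu_qs.c}, and it uses unsigned
| arithmetic so that the wraparound is well-defined in C:
|
| ```c
| #include <assert.h>
| #include <limits.h>
|
| /* Does the reader's counter snapshot precede the current
|  * grace-period counter?  Unsigned subtraction is modular, so the
|  * sign of the (signed) difference is meaningful across wraparound. */
| static int counter_before(unsigned long reader, unsigned long gp)
| {
| 	return (long)(reader - gp) < 0;
| }
|
| int main(void)
| {
| 	/* Ordinary case: reader snapshot predates the grace period. */
| 	assert(counter_before(4, 6));
|
| 	/* Wraparound: the grace-period counter has wrapped past zero
| 	 * while the reader's snapshot is near the maximum; modular
| 	 * subtraction still yields a small negative difference. */
| 	assert(counter_before(ULONG_MAX - 1, 2));
|
| 	/* Reader at or after the current grace period: no wait. */
| 	assert(!counter_before(8, 6));
| 	return 0;
| }
| ```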
| |
| \QuickQuiz{} |
| 	To be sure, the clock frequencies of Power
| 	systems in 2008 were quite high, but even a 5~GHz clock
| 	frequency is insufficient to allow
| 	loops to be executed in 50~picoseconds!
| What is going on here? |
| \QuickQuizAnswer{ |
| Since the measurement loop contains a pair of empty functions, |
| the compiler optimizes it away. |
| The measurement loop takes 1,000 passes between each call to |
| \co{rcu_quiescent_state()}, so this measurement is roughly |
| one thousandth of the overhead of a single call to |
| \co{rcu_quiescent_state()}. |
| } \QuickQuizEnd |
| |
| However, this implementation requires that each thread either
| invoke \co{rcu_quiescent_state()} periodically or invoke
| \co{rcu_thread_offline()} for extended quiescent states.
| The need to invoke these functions periodically can make this |
| implementation difficult to use in some situations, such as for |
| certain types of library functions. |
| |
| \QuickQuiz{} |
| Why would the fact that the code is in a library make |
| any difference for how easy it is to use the RCU |
| implementation shown in |
| Figures~\ref{fig:defer:Quiescent-State-Based RCU Read Side} and |
| \ref{fig:defer:RCU Update Side Using Quiescent States}? |
| \QuickQuizAnswer{ |
| A library function has absolutely no control over the caller, |
| and thus cannot force the caller to invoke \co{rcu_quiescent_state()} |
| periodically. |
| On the other hand, a library function that made many references |
| to a given RCU-protected data structure might be able to invoke |
| \co{rcu_thread_online()} upon entry, |
| \co{rcu_quiescent_state()} periodically, and |
| \co{rcu_thread_offline()} upon exit. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But what if you hold a lock across a call to |
| \co{synchronize_rcu()}, and then acquire that same lock within |
| an RCU read-side critical section? |
| This should be a deadlock, but how can a primitive that |
| generates absolutely no code possibly participate in a |
| deadlock cycle? |
| \QuickQuizAnswer{ |
| Please note that the RCU read-side critical section is in |
| effect extended beyond the enclosing |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()}, out to |
| the previous and next call to \co{rcu_quiescent_state()}. |
| 	This \co{rcu_quiescent_state()} can be thought of as an
| 	\co{rcu_read_unlock()} immediately followed by an
| 	\co{rcu_read_lock()}.
| |
| Even so, the actual deadlock itself will involve the lock |
| acquisition in the RCU read-side critical section and |
| the \co{synchronize_rcu()}, never the \co{rcu_quiescent_state()}. |
| } \QuickQuizEnd |
| |
| In addition, this implementation does not permit concurrent calls |
| to \co{synchronize_rcu()} to share grace periods. |
| That said, one could easily imagine a production-quality RCU |
| implementation based on this version of RCU. |
| |
| \subsubsection{Summary of Toy RCU Implementations} |
| \label{defer:Summary of Toy RCU Implementations} |
| |
| If you made it this far, congratulations! |
| You should now have a much clearer understanding |
| not only of RCU itself, but also of the requirements of enclosing |
| software environments and applications. |
| Those wishing for an even deeper understanding are invited to read
| descriptions of production-quality RCU |
| implementations~\cite{MathieuDesnoyers2012URCU,PaulEMcKenney2007PreemptibleRCU,PaulEMcKenney2008HierarchicalRCU,PaulEMcKenney2009BloatwatchRCU}. |
| |
| The preceding sections listed some desirable properties of the |
| various RCU primitives. |
| The following list is provided for easy reference for those wishing to |
| create a new RCU implementation. |
| |
| \begin{enumerate} |
| \item There must be read-side primitives (such as \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()}) and grace-period primitives |
| (such as \co{synchronize_rcu()} and \co{call_rcu()}), such |
| that any RCU read-side critical section in existence at the |
| start of a grace period has completed by the end of the |
| grace period. |
| \item RCU read-side primitives should have minimal overhead. |
| In particular, expensive operations such as cache misses, |
| atomic instructions, memory barriers, and branches should |
| be avoided. |
| \item RCU read-side primitives should have $O\left(1\right)$ computational |
| complexity to enable real-time use. |
| (This implies that readers run concurrently with updaters.) |
| \item RCU read-side primitives should be usable in all contexts |
| (in the Linux kernel, they are permitted everywhere except in |
| the idle loop). |
| An important special case is that RCU read-side primitives be |
| usable within an RCU read-side critical section, in other words, |
| that it be possible to nest RCU read-side critical sections. |
| \item RCU read-side primitives should be unconditional, with no |
| failure returns. |
| This property is extremely important, as failure checking |
| increases complexity and complicates testing and validation. |
| \item Any operation other than a quiescent state (and thus a grace |
| period) should be permitted in an RCU read-side critical section. |
| In particular, irrevocable operations such as I/O should be |
| permitted. |
| \item It should be possible to update an RCU-protected data structure |
| while executing within an RCU read-side critical section. |
| \item Both RCU read-side and update-side primitives should be independent |
| of memory allocator design and implementation, in other words, |
| the same RCU implementation should be able to protect a given |
| data structure regardless of how the data elements are allocated |
| and freed. |
| \item RCU grace periods should not be blocked by threads that |
| halt outside of RCU read-side critical sections. |
| (But note that most quiescent-state-based implementations |
| violate this desideratum.) |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| Given that grace periods are prohibited within RCU read-side |
| critical sections, how can an RCU data structure possibly be |
| updated while in an RCU read-side critical section? |
| \QuickQuizAnswer{ |
| This situation is one reason for the existence of asynchronous |
| grace-period primitives such as \co{call_rcu()}. |
| This primitive may be invoked within an RCU read-side critical |
| section, and the specified RCU callback will in turn be invoked |
| at a later time, after a grace period has elapsed. |
| |
| The ability to perform an RCU update while within an RCU read-side |
| critical section can be extremely convenient, and is analogous |
| to a (mythical) unconditional read-to-write upgrade for |
| reader-writer locking. |
| } \QuickQuizEnd |