| % defer/rcuusage.tex |
| |
| \subsection{RCU Usage} |
| \label{sec:defer:RCU Usage} |
| \OriginallyPublished{Section}{sec:defer:RCU Usage}{RCU Usage}{Linux Weekly News}{PaulEMcKenney2008WhatIsRCUUsage} |
| |
| \begin{table}[tb] |
| \centering |
| \scriptsize |
| \begin{tabular}{l|l} |
| Mechanism RCU Replaces & Section \\ |
| \hline |
| \hline |
| Reader-writer locking & |
| Section~\ref{sec:defer:RCU is a Reader-Writer Lock Replacement} \\ |
| Restricted reference-counting mechanism & |
| Section~\ref{sec:defer:RCU is a Restricted Reference-Counting Mechanism} \\ |
| Bulk reference-counting mechanism & |
| Section~\ref{sec:defer:RCU is a Bulk Reference-Counting Mechanism} \\ |
| Poor man's garbage collector & |
| Section~\ref{sec:defer:RCU is a Poor Man's Garbage Collector} \\ |
Existence guarantees &
Section~\ref{sec:defer:RCU is a Way of Providing Existence Guarantees} \\
Type-safe memory &
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory} \\
| Wait for things to finish & |
| Section~\ref{sec:defer:RCU is a Way of Waiting for Things to Finish} \\ |
| \end{tabular} |
| \caption{RCU Usage} |
| \label{tab:defer:RCU Usage} |
| \end{table} |
| |
| This section answers the question ``What is RCU?'' from the viewpoint |
| of the uses to which RCU can be put. |
| Because RCU is most frequently used to replace some existing mechanism, |
| we look at it primarily in terms of its relationship to such mechanisms, |
| as listed in Table~\ref{tab:defer:RCU Usage}. |
| Following the sections listed in this table, |
| Section~\ref{sec:defer:RCU Usage Summary} provides a summary. |
| |
| \subsubsection{RCU for Pre-BSD Routing} |
| \label{sec:defer:RCU for Pre-BSD Routing} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct route_entry { |
| 2 struct rcu_head rh; |
| 3 struct cds_list_head re_next; |
| 4 unsigned long addr; |
| 5 unsigned long iface; |
| 6 int re_freed; |
| 7 }; |
| 8 CDS_LIST_HEAD(route_list); |
| 9 DEFINE_SPINLOCK(routelock); |
| 10 |
| 11 unsigned long route_lookup(unsigned long addr) |
| 12 { |
| 13 struct route_entry *rep; |
| 14 unsigned long ret; |
| 15 |
| 16 rcu_read_lock(); |
| 17 cds_list_for_each_entry_rcu(rep, &route_list, |
| 18 re_next) { |
| 19 if (rep->addr == addr) { |
| 20 ret = rep->iface; |
| 21 if (ACCESS_ONCE(rep->re_freed)) |
| 22 abort(); |
| 23 rcu_read_unlock(); |
| 24 return ret; |
| 25 } |
| 26 } |
| 27 rcu_read_unlock(); |
| 28 return ULONG_MAX; |
| 29 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Pre-BSD Routing Table Lookup} |
| \label{fig:defer:RCU Pre-BSD Routing Table Lookup} |
| \end{figure} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int route_add(unsigned long addr, |
| 2 unsigned long interface) |
| 3 { |
| 4 struct route_entry *rep; |
| 5 |
| 6 rep = malloc(sizeof(*rep)); |
| 7 if (!rep) |
| 8 return -ENOMEM; |
| 9 rep->addr = addr; |
| 10 rep->iface = interface; |
| 11 rep->re_freed = 0; |
| 12 spin_lock(&routelock); |
| 13 cds_list_add_rcu(&rep->re_next, &route_list); |
| 14 spin_unlock(&routelock); |
| 15 return 0; |
| 16 } |
| 17 |
| 18 static void route_cb(struct rcu_head *rhp) |
| 19 { |
| 20 struct route_entry *rep; |
| 21 |
| 22 rep = container_of(rhp, struct route_entry, rh); |
| 23 ACCESS_ONCE(rep->re_freed) = 1; |
| 24 free(rep); |
| 25 } |
| 26 |
| 27 int route_del(unsigned long addr) |
| 28 { |
| 29 struct route_entry *rep; |
| 30 |
| 31 spin_lock(&routelock); |
| 32 cds_list_for_each_entry(rep, &route_list, |
| 33 re_next) { |
| 34 if (rep->addr == addr) { |
| 35 cds_list_del_rcu(&rep->re_next); |
| 36 spin_unlock(&routelock); |
| 37 call_rcu(&rep->rh, route_cb); |
| 38 return 0; |
| 39 } |
| 40 } |
| 41 spin_unlock(&routelock); |
| 42 return -ENOENT; |
| 43 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU Pre-BSD Routing Table Add/Delete} |
| \label{fig:defer:RCU Pre-BSD Routing Table Add/Delete} |
| \end{figure} |
| |
| Figures~\ref{fig:defer:RCU Pre-BSD Routing Table Lookup} |
| and~\ref{fig:defer:RCU Pre-BSD Routing Table Add/Delete} |
| show code for an RCU-protected Pre-BSD routing table |
| (\path{route_rcu.c}). |
| The former shows data structures and \co{route_lookup()}, |
| and the latter shows \co{route_add()} and \co{route_del()}. |
| |
| In Figure~\ref{fig:defer:RCU Pre-BSD Routing Table Lookup}, |
| line~2 adds the \co{->rh} field used by RCU reclamation, |
| line~6 adds the \co{->re_freed} use-after-free-check field, |
| lines~16, 17, 23, and~27 add RCU read-side protection, |
| and lines~21 and~22 add the use-after-free check. |
| In Figure~\ref{fig:defer:RCU Pre-BSD Routing Table Add/Delete}, |
| lines~12, 14, 31, 36, and~41 add update-side locking, |
| lines~13 and~35 add RCU update-side protection, |
| line~37 causes \co{route_cb()} to be invoked after a grace period elapses, |
and lines~18--25 define \co{route_cb()}.
| This is minimal added code for a working concurrent implementation. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/perf-rcu}} |
| \caption{Pre-BSD Routing Table Protected by RCU} |
| \label{fig:defer:Pre-BSD Routing Table Protected by RCU} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Pre-BSD Routing Table Protected by RCU} |
| shows the performance on the read-only workload. |
| RCU scales quite well, and offers nearly ideal performance. |
| However, this data was generated using the \co{RCU_SIGNAL} |
| flavor of userspace |
| RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU}, |
| for which \co{rcu_read_lock()} and \co{rcu_read_unlock()} |
| generate a small amount of code. |
| What happens for the QSBR flavor of RCU, which generates no code at all |
| for \co{rcu_read_lock()} and \co{rcu_read_unlock()}? |
| (See Section~\ref{sec:defer:Introduction to RCU}, |
| and especially |
| Figure~\ref{fig:defer:RCU QSBR: Waiting for Pre-Existing Readers}, |
| for a discussion of RCU QSBR.) |
| |
The answer to this question is shown in
Figure~\ref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
where the RCU QSBR results appear as the trace between the RCU and
the ideal traces.
RCU QSBR's performance and scalability are very nearly those of an
ideal synchronization-free workload, as desired.
| |
| \QuickQuiz{} |
| Why doesn't RCU QSBR give \emph{exactly} ideal results? |
| \QuickQuizAnswer{ |
| The \co{rcu_dereference()} primitive does constrain the |
| compiler's optimizations somewhat, which can result in |
| slightly slower code. |
| This effect would normally be insignificant, but |
| each search is taking on average about 13~nanoseconds, |
| which is short enough for small differences in code |
| generation to make their presence felt. |
| The difference ranges from about 1.5\% to about 11.1\%, which is |
| quite small when you consider that the RCU QSBR code can handle |
| concurrent updates and the ``ideal'' code cannot. |
| |
It is hoped that C11 \co{memory_order_consume}
loads~\cite{RichardSmith2015N4527}
might someday allow \co{rcu_dereference()} to provide the
needed protection at lower cost.
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Given RCU QSBR's read-side performance, why bother with any |
| other flavor of userspace RCU? |
| \QuickQuizAnswer{ |
| Because RCU QSBR places constraints on the overall application |
| that might not be tolerable, |
| for example, requiring that each and every thread in the |
| application regularly pass through a quiescent state. |
| Among other things, this means that RCU QSBR is not helpful |
| to library writers, who might be better served by other |
| flavors of userspace RCU~\cite{PaulMcKenney2013LWNURCU}. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/perf-rcu-qsbr}} |
| \caption{Pre-BSD Routing Table Protected by RCU QSBR} |
| \label{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR} |
| \end{figure} |
| |
| \subsubsection{RCU is a Reader-Writer Lock Replacement} |
| \label{sec:defer:RCU is a Reader-Writer Lock Replacement} |
| |
| Perhaps the most common use of RCU within the Linux kernel is as |
| a replacement for reader-writer locking in read-intensive situations. |
Nevertheless, this use of RCU was not immediately apparent to me
at the outset; in fact, I chose to implement a lightweight reader-writer
| lock~\cite{WilsonCHsieh92a}\footnote{ |
| Similar to \co{brlock} in the 2.4 Linux kernel and to |
| \co{lglock} in more recent Linux kernels.} |
| before implementing a general-purpose RCU implementation |
| back in the early 1990s. |
| Each and every one of the uses I envisioned for the lightweight reader-writer |
| lock was instead implemented using RCU. |
| In fact, it was more than |
| three years before the lightweight reader-writer lock saw its first use. |
| Boy, did I feel foolish! |
| |
| The key similarity between RCU and reader-writer locking is that |
| both have read-side critical sections that can execute in parallel. |
| In fact, in some cases, it is possible to mechanically substitute RCU API |
| members for the corresponding reader-writer lock API members. |
| But first, why bother? |
| |
| Advantages of RCU include performance, |
| deadlock immunity, and realtime latency. |
| There are, of course, limitations to RCU, including the fact that |
| readers and updaters run concurrently, that low-priority RCU readers |
| can block high-priority threads waiting for a grace period to elapse, |
| and that grace-period latencies can extend for many milliseconds. |
| These advantages and limitations are discussed in the following sections. |
| |
| \paragraph{Performance} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/rwlockRCUperf}} |
| \caption{Performance Advantage of RCU Over Reader-Writer Locking} |
| \label{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking} |
| \end{figure} |
| |
| The read-side performance advantages of RCU over reader-writer locking |
| are shown in |
| Figure~\ref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}. |
| |
| \QuickQuiz{} |
| WTF? |
| How the heck do you expect me to believe that RCU has a |
| 100-femtosecond overhead when the clock period at 3GHz is more than |
| 300 \emph{picoseconds}? |
| \QuickQuizAnswer{ |
| First, consider that the inner loop used to |
| take this measurement is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 for (i = 0; i < CSCOUNT_SCALE; i++) { |
| 2 rcu_read_lock(); |
| 3 rcu_read_unlock(); |
| 4 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Next, consider the effective definitions of \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 #define rcu_read_lock() do { } while (0) |
| 2 #define rcu_read_unlock() do { } while (0) |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Consider also that the compiler does simple optimizations, |
| allowing it to replace the loop with: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| i = CSCOUNT_SCALE; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| So the ``measurement'' of 100 femtoseconds is simply the fixed |
| overhead of the timing measurements divided by the number of |
| passes through the inner loop containing the calls |
| to \co{rcu_read_lock()} and \co{rcu_read_unlock()}. |
This measurement therefore really is in error; in fact, it is in
error by an arbitrary number of orders of magnitude.
| As you can see by the definition of \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} above, the actual overhead |
| is precisely zero. |
| |
| It certainly is not every day that a timing measurement of |
| 100 femtoseconds turns out to be an overestimate! |
| } \QuickQuizEnd |
| |
| Note that reader-writer locking is orders of magnitude slower than RCU |
| on a single CPU, and is almost two \emph{additional} orders of magnitude |
| slower on 16 CPUs. |
| In contrast, RCU scales quite well. |
| In both cases, the error bars span a single standard deviation in either |
| direction. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/rwlockRCUperfPREEMPT}} |
| \caption{Performance Advantage of Preemptible RCU Over Reader-Writer Locking} |
| \label{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking} |
| \end{figure} |
| |
| A more moderate view may be obtained from a \co{CONFIG_PREEMPT} |
| kernel, though RCU still beats reader-writer locking by between one and |
| three orders of magnitude, as shown in |
| Figure~\ref{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking}. |
| Note the high variability of reader-writer locking at larger numbers of CPUs. |
| The error bars span a single standard deviation in either direction. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/rwlockRCUperfwtPREEMPT}} |
| \caption{Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration} |
| \label{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration} |
| \end{figure} |
| |
| Of course, the low performance of reader-writer locking in |
| Figure~\ref{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking} |
| is exaggerated by the unrealistic zero-length critical sections. |
| The performance advantages of RCU become less significant as the overhead |
| of the critical section increases, as shown in |
| Figure~\ref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration} |
| for a 16-CPU system, |
| in which the y-axis represents the sum of the overhead |
| of the read-side primitives and that of the critical section. |
| |
| \QuickQuiz{} |
Why do both the variability and the overhead of rwlock decrease as the
critical-section overhead increases?
| \QuickQuizAnswer{ |
| Because the contention on the underlying |
| \co{rwlock_t} decreases as the critical-section overhead |
| increases. |
| However, the rwlock overhead will not quite drop to that on a single |
| CPU because of cache-thrashing overhead. |
| } \QuickQuizEnd |
| |
| However, this observation must be tempered by the fact that |
| a number of system calls (and thus any RCU read-side critical sections |
| that they contain) can complete within a few microseconds. |
| |
| In addition, as is discussed in the next section, |
| RCU read-side primitives are almost entirely deadlock-immune. |
| |
| |
| \paragraph{Deadlock Immunity} |
| |
| Although RCU offers significant performance advantages for |
| read-mostly workloads, one of the primary reasons for creating |
| RCU in the first place was in fact its immunity to read-side |
| deadlocks. |
| This immunity stems from the fact that |
| RCU read-side primitives do not block, spin, or even |
| do backwards branches, so that their execution time is deterministic. |
| It is therefore impossible for them to participate in a deadlock |
| cycle. |
| |
| \QuickQuiz{} |
| Is there an exception to this deadlock immunity, and if so, |
| what sequence of events could lead to deadlock? |
| \QuickQuizAnswer{ |
| One way to cause a deadlock cycle involving |
| RCU read-side primitives is via the following (illegal) sequence |
| of statements: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| rcu_read_lock(); |
| synchronize_rcu(); |
| rcu_read_unlock(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The \co{synchronize_rcu()} cannot return until all |
| pre-existing RCU read-side critical sections complete, but |
| is enclosed in an RCU read-side critical section that cannot |
| complete until the \co{synchronize_rcu()} returns. |
| The result is a classic self-deadlock---you get the same |
| effect when attempting to write-acquire a reader-writer lock |
| while read-holding it. |
| |
| Note that this self-deadlock scenario does not apply to |
| RCU QSBR, because the context switch performed by the |
| \co{synchronize_rcu()} would act as a quiescent state |
| for this CPU, allowing a grace period to complete. |
However, this is, if anything, even worse, because data used
| by the RCU read-side critical section might be freed as a |
| result of the grace period completing. |
| |
| In short, do not invoke synchronous RCU update-side primitives |
| from within an RCU read-side critical section. |
| } \QuickQuizEnd |
| |
| An interesting consequence of RCU's read-side deadlock immunity is |
| that it is possible to unconditionally upgrade an RCU |
| reader to an RCU updater. |
| Attempting to do such an upgrade with reader-writer locking results |
| in deadlock. |
| A sample code fragment that does an RCU read-to-update upgrade follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 rcu_read_lock(); |
| 2 list_for_each_entry_rcu(p, &head, list_field) { |
| 3 do_something_with(p); |
| 4 if (need_update(p)) { |
5     spin_lock(&my_lock);
| 6 do_update(p); |
| 7 spin_unlock(&my_lock); |
| 8 } |
| 9 } |
| 10 rcu_read_unlock(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Note that \co{do_update()} is executed under |
| the protection of the lock \emph{and} under RCU read-side protection. |
| |
| Another interesting consequence of RCU's deadlock immunity is its |
| immunity to a large class of priority inversion problems. |
| For example, low-priority RCU readers cannot prevent a high-priority |
| RCU updater from acquiring the update-side lock. |
| Similarly, a low-priority RCU updater cannot prevent high-priority |
| RCU readers from entering an RCU read-side critical section. |
| |
| \QuickQuiz{} |
| Immunity to both deadlock and priority inversion??? |
| Sounds too good to be true. |
| Why should I believe that this is even possible? |
| \QuickQuizAnswer{ |
| It really does work. |
| After all, if it didn't work, the Linux kernel would not run. |
| } \QuickQuizEnd |
| |
| \paragraph{Realtime Latency} |
| |
| Because RCU read-side primitives neither spin nor block, they offer |
| excellent realtime latencies. |
| In addition, as noted earlier, this means that they are |
| immune to priority inversion |
| involving the RCU read-side primitives and locks. |
| |
| However, RCU is susceptible to more subtle priority-inversion scenarios, |
| for example, a high-priority process blocked waiting for an RCU |
| grace period to elapse can be blocked by low-priority RCU readers |
| in -rt kernels. |
| This can be solved by using RCU priority |
| boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}. |
| |
| \paragraph{RCU Readers and Updaters Run Concurrently} |
| |
Because RCU readers neither spin nor block, and because updaters are not
| subject to any sort of rollback or abort semantics, RCU readers and |
| updaters must necessarily run concurrently. |
| This means that RCU readers might access stale data, and might even |
| see inconsistencies, either of which can render conversion from reader-writer |
| locking to RCU non-trivial. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/rwlockRCUupdate}} |
| \caption{Response Time of RCU vs. Reader-Writer Locking} |
| \label{fig:defer:Response Time of RCU vs. Reader-Writer Locking} |
| \end{figure} |
| |
| However, in a surprisingly large number of situations, inconsistencies and |
| stale data are not problems. |
| The classic example is the networking routing table. |
| Because routing updates can take considerable time to reach a given |
| system (seconds or even minutes), the system will have been sending |
| packets the wrong way for quite some time when the update arrives. |
It is usually not a problem to continue sending packets the wrong
way for a few additional milliseconds.
| Furthermore, because RCU updaters can make changes without waiting for |
| RCU readers to finish, |
| the RCU readers might well see the change more quickly than would |
| batch-fair |
| reader-writer-locking readers, as shown in |
| Figure~\ref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}. |
| |
| Once the update is received, the rwlock writer cannot proceed until the |
| last reader completes, and subsequent readers cannot proceed until the |
| writer completes. |
| However, these subsequent readers are guaranteed to see the new value, |
| as indicated by the green shading of the rightmost boxes. |
| In contrast, RCU readers and updaters do not block each other, which permits |
| the RCU readers to see the updated values sooner. |
| Of course, because their execution overlaps that of the RCU updater, |
| \emph{all} of the RCU readers might well see updated values, including |
| the three readers that started before the update. |
Nevertheless, only the green-shaded rightmost RCU readers
| are \emph{guaranteed} to see the updated values. |
| |
| Reader-writer locking and RCU simply provide different guarantees. |
| With reader-writer locking, any reader that begins after the writer begins |
| is guaranteed to see new values, and any reader that attempts to |
| begin while the writer is spinning might or might not see new values, |
| depending on the reader/writer preference of the rwlock implementation in |
| question. |
| In contrast, with RCU, any reader that begins after the updater completes |
| is guaranteed to see new values, and any reader that completes after the |
| updater begins might or might not see new values, depending on timing. |
| |
| The key point here is that, although reader-writer locking does |
| indeed guarantee consistency within the confines of the computer system, |
| there are situations where this consistency comes at the price of |
| increased \emph{inconsistency} with the outside world. |
| In other words, reader-writer locking obtains internal consistency at the |
| price of silently stale data with respect to the outside world. |
| |
| Nevertheless, there are situations where inconsistency and stale |
| data within the confines of the system cannot be tolerated. |
| Fortunately, |
| there are a number of approaches that avoid inconsistency and stale |
| data~\cite{PaulEdwardMcKenneyPhD,Arcangeli03}, and some |
| methods based on reference counting are discussed in |
| Section~\ref{sec:defer:Reference Counting}. |
| |
| \paragraph{Low-Priority RCU Readers Can Block High-Priority Reclaimers} |
| |
| In Realtime RCU~\cite{DinakarGuniguntala2008IBMSysJ}, |
| SRCU~\cite{PaulEMcKenney2006c}, or |
| QRCU~\cite{PaulEMcKenney2007QRCUspin} (see |
| Section~\ref{sec:formal:Promela Example: QRCU}), |
| a preempted reader will prevent |
| a grace period from completing, even if a high-priority task is |
| blocked waiting for that grace period to complete. |
| Realtime RCU can avoid this problem by substituting \co{call_rcu()} |
| for \co{synchronize_rcu()} or by using RCU priority |
| boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}, |
| which is still in experimental status as of early 2008. |
| It might become necessary to augment SRCU and QRCU with priority boosting, |
| but not before a clear real-world need is demonstrated. |
| |
| \paragraph{RCU Grace Periods Extend for Many Milliseconds} |
| |
| With the exception of QRCU and several of the ``toy'' RCU implementations |
| described in |
| Section~\ref{sec:defer:``Toy'' RCU Implementations}, |
| RCU grace periods extend for multiple milliseconds. |
| Although there are a number of techniques to render such long delays |
| harmless, including use of the asynchronous interfaces where available |
| (\co{call_rcu()} and \co{call_rcu_bh()}), this situation |
| is a major reason for the rule of thumb that RCU be used in read-mostly |
| situations. |
| |
| \paragraph{Comparison of Reader-Writer Locking and RCU Code} |
| |
| In the best case, the conversion from reader-writer locking to RCU |
| is quite simple, as shown in |
| Figures~\ref{fig:defer:Converting Reader-Writer Locking to RCU: Data}, |
| \ref{fig:defer:Converting Reader-Writer Locking to RCU: Search}, |
| and |
| \ref{fig:defer:Converting Reader-Writer Locking to RCU: Deletion}, |
| all taken from |
| Wikipedia~\cite{WikipediaRCU}. |
| |
| \begin{figure*}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct el { 1 struct el { |
| 2 struct list_head lp; 2 struct list_head lp; |
| 3 long key; 3 long key; |
| 4 spinlock_t mutex; 4 spinlock_t mutex; |
| 5 int data; 5 int data; |
| 6 /* Other data fields */ 6 /* Other data fields */ |
| 7 }; 7 }; |
| 8 DEFINE_RWLOCK(listmutex); 8 DEFINE_SPINLOCK(listmutex); |
| 9 LIST_HEAD(head); 9 LIST_HEAD(head); |
| \end{verbbox} |
| } |
| \hspace*{0.9in}\OneColumnHSpace{-0.5in} |
| \theverbbox |
| \caption{Converting Reader-Writer Locking to RCU: Data} |
| \label{fig:defer:Converting Reader-Writer Locking to RCU: Data} |
| \end{figure*} |
| |
| \begin{figure*}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int search(long key, int *result) 1 int search(long key, int *result) |
| 2 { 2 { |
| 3 struct el *p; 3 struct el *p; |
| 4 4 |
| 5 read_lock(&listmutex); 5 rcu_read_lock(); |
| 6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry_rcu(p, &head, lp) { |
| 7 if (p->key == key) { 7 if (p->key == key) { |
| 8 *result = p->data; 8 *result = p->data; |
| 9 read_unlock(&listmutex); 9 rcu_read_unlock(); |
| 10 return 1; 10 return 1; |
| 11 } 11 } |
| 12 } 12 } |
| 13 read_unlock(&listmutex); 13 rcu_read_unlock(); |
| 14 return 0; 14 return 0; |
| 15 } 15 } |
| \end{verbbox} |
| } |
| \hspace*{0.9in}\OneColumnHSpace{-0.5in} |
| \theverbbox |
| \caption{Converting Reader-Writer Locking to RCU: Search} |
| \label{fig:defer:Converting Reader-Writer Locking to RCU: Search} |
| \end{figure*} |
| |
| \begin{figure*}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int delete(long key) 1 int delete(long key) |
| 2 { 2 { |
| 3 struct el *p; 3 struct el *p; |
| 4 4 |
| 5 write_lock(&listmutex); 5 spin_lock(&listmutex); |
| 6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry(p, &head, lp) { |
| 7 if (p->key == key) { 7 if (p->key == key) { |
| 8 list_del(&p->lp); 8 list_del_rcu(&p->lp); |
| 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); |
| 10 synchronize_rcu(); |
| 10 kfree(p); 11 kfree(p); |
| 11 return 1; 12 return 1; |
| 12 } 13 } |
| 13 } 14 } |
| 14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); |
| 15 return 0; 16 return 0; |
| 16 } 17 } |
| \end{verbbox} |
| } |
| \hspace*{0.9in}\OneColumnHSpace{-0.5in} |
| \theverbbox |
| \caption{Converting Reader-Writer Locking to RCU: Deletion} |
| \label{fig:defer:Converting Reader-Writer Locking to RCU: Deletion} |
| \end{figure*} |
| |
| More-elaborate cases of replacing reader-writer locking with RCU |
| are beyond the scope of this document. |
| |
| \subsubsection{RCU is a Restricted Reference-Counting Mechanism} |
| \label{sec:defer:RCU is a Restricted Reference-Counting Mechanism} |
| |
| Because grace periods are not allowed to complete while |
| there is an RCU read-side critical section in progress, |
| the RCU read-side primitives may be used as a restricted |
| reference-counting mechanism. |
| For example, consider the following code fragment: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 rcu_read_lock(); /* acquire reference. */ |
| 2 p = rcu_dereference(head); |
| 3 /* do something with p. */ |
| 4 rcu_read_unlock(); /* release reference. */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The \co{rcu_read_lock()} primitive can be thought of as |
| acquiring a reference to \co{p}, because a grace period |
| starting after the \co{rcu_dereference()} assigns to \co{p} |
| cannot possibly end until after we reach the matching |
| \co{rcu_read_unlock()}. |
| This reference-counting scheme is restricted in that |
| we are not allowed to block in RCU read-side critical sections, |
| nor are we permitted to hand off an RCU read-side critical section |
| from one task to another. |
| |
| Regardless of these restrictions, |
| the following code can safely delete \co{p}: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 spin_lock(&mylock); |
| 2 p = head; |
| 3 rcu_assign_pointer(head, NULL); |
| 4 spin_unlock(&mylock); |
| 5 /* Wait for all references to be released. */ |
| 6 synchronize_rcu(); |
| 7 kfree(p); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The assignment to \co{head} prevents any future references |
| to \co{p} from being acquired, and the \co{synchronize_rcu()} |
| waits for any previously acquired references to be released. |
| |
| \QuickQuiz{} |
| But wait! |
| This is exactly the same code that might be used when thinking |
| of RCU as a replacement for reader-writer locking! |
| What gives? |
| \QuickQuizAnswer{ |
| This is an effect of the Law of Toy Examples: |
| beyond a certain point, the code fragments look the same. |
| The only difference is in how we think about the code. |
| However, this difference can be extremely important. |
| For but one example of the importance, consider that if we think |
| of RCU as a restricted reference counting scheme, we would never |
| be fooled into thinking that the updates would exclude the RCU |
| read-side critical sections. |
| |
| It nevertheless is often useful to think of RCU as a replacement |
| for reader-writer locking, for example, when you are replacing |
| reader-writer locking with RCU. |
| } \QuickQuizEnd |
| |
| Of course, RCU can also be combined with traditional reference counting, |
| as discussed in |
| Section~\ref{sec:together:Refurbish Reference Counting}. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/refRCUperfPREEMPT}} |
| \caption{Performance of RCU vs. Reference Counting} |
| \label{fig:defer:Performance of RCU vs. Reference Counting} |
| \end{figure} |
| |
| But why bother? |
| Again, part of the answer is performance, as shown in |
| Figure~\ref{fig:defer:Performance of RCU vs. Reference Counting}, |
| again showing data taken on a 16-CPU 3GHz Intel x86 system. |
| |
| \QuickQuiz{} |
| Why the dip in refcnt overhead near 6 CPUs? |
| \QuickQuizAnswer{ |
| Most likely NUMA effects. |
| However, there is substantial variance in the values measured for the |
| refcnt line, as can be seen by the error bars. |
| In fact, standard deviations range in excess of 10\% of measured |
| values in some cases. |
| The dip in overhead therefore might well be a statistical aberration. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/refRCUperfwtPREEMPT}} |
| \caption{Response Time of RCU vs. Reference Counting} |
| \label{fig:defer:Response Time of RCU vs. Reference Counting} |
| \end{figure} |
| |
| And, as with reader-writer locking, the performance advantages |
| of RCU are most pronounced for short-duration critical sections, as shown |
in Figure~\ref{fig:defer:Response Time of RCU vs. Reference Counting}
| for a 16-CPU system. |
| In addition, as with reader-writer locking, many system calls (and thus |
| any RCU read-side critical sections that they contain) complete in |
| a few microseconds. |
| |
| However, the restrictions that go with RCU can be quite onerous. |
| For example, in many cases, the prohibition against sleeping while in an RCU |
| read-side critical section would defeat the entire purpose. |
| The next section looks at ways of addressing this problem, while also |
| reducing the complexity of traditional reference counting, at least in |
| some cases. |
| |
| \subsubsection{RCU is a Bulk Reference-Counting Mechanism} |
| \label{sec:defer:RCU is a Bulk Reference-Counting Mechanism} |
| |
| As noted in the preceding section, |
| traditional reference counters are usually associated with a specific |
| data structure, or perhaps a specific group of data structures. |
| However, maintaining a single global reference counter for a large |
| variety of data structures typically results in bouncing |
| the cache line containing the reference count. |
| Such cache-line bouncing can severely degrade performance. |
| |
| In contrast, RCU's light-weight read-side primitives permit |
| extremely frequent read-side usage with negligible performance |
| degradation, permitting RCU to be used as a ``bulk reference-counting'' |
| mechanism with little or no performance penalty. |
| Situations where a reference must be held by a single task across a |
| section of code that blocks may be accommodated with |
| Sleepable RCU (SRCU)~\cite{PaulEMcKenney2006c}. |
However, SRCU fails to cover the not-uncommon situation where a reference is ``passed''
| from one task to another, for example, when a reference is acquired |
| when starting an I/O and released in the corresponding completion |
| interrupt handler. |
| (In principle, this could be handled by the SRCU implementation, |
| but in practice, it is not yet clear whether this is a good tradeoff.) |
| |
| Of course, SRCU brings restrictions of its own, namely that the |
| return value from \co{srcu_read_lock()} be passed into the |
| corresponding \co{srcu_read_unlock()}, and that no SRCU primitives |
| be invoked from hardware interrupt handlers or from non-maskable interrupt |
| (NMI) handlers. |
| The jury is still out as to how much of a problem is presented by |
| these restrictions, and as to how they can best be handled. |
| |
| \subsubsection{RCU is a Poor Man's Garbage Collector} |
| \label{sec:defer:RCU is a Poor Man's Garbage Collector} |
| |
| A not-uncommon exclamation made by people first learning about |
| RCU is ``RCU is sort of like a garbage collector!'' |
| This exclamation has a large grain of truth, but it can also be |
| misleading. |
| |
| Perhaps the best way to think of the relationship between RCU |
| and automatic garbage collectors (GCs) is that RCU resembles |
| a GC in that the \emph{timing} of collection is automatically |
| determined, but that RCU differs from a GC in that: (1) the programmer |
| must manually indicate when a given data structure is eligible |
| to be collected, and (2) the programmer must manually mark the |
| RCU read-side critical sections where references might legitimately |
| be held. |
| |
| Despite these differences, the resemblance does go quite deep, |
| and has appeared in at least one theoretical analysis of RCU. |
| Furthermore, the first RCU-like mechanism I am aware of used |
| a garbage collector to handle the grace periods. |
| Nevertheless, a better way of thinking of RCU is described in the |
| following section. |
| |
| \subsubsection{RCU is a Way of Providing Existence Guarantees} |
| \label{sec:defer:RCU is a Way of Providing Existence Guarantees} |
| |
| Gamsa et al.~\cite{Gamsa99} |
| discuss existence guarantees and describe how a mechanism |
| resembling RCU can be used to provide these existence guarantees |
| (see section~5 on page 7 of the PDF), and |
| Section~\ref{sec:locking:Lock-Based Existence Guarantees} |
| discusses how to guarantee existence via locking, along with the |
| ensuing disadvantages of doing so. |
| The effect is that if any RCU-protected data element is accessed |
| within an RCU read-side critical section, that data element is |
| guaranteed to remain in existence for the duration of that RCU |
| read-side critical section. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int delete(int key) |
| 2 { |
| 3 struct element *p; |
| 4 int b; |
| 5 |
| 6 b = hashfunction(key); |
| 7 rcu_read_lock(); |
| 8 p = rcu_dereference(hashtable[b]); |
| 9 if (p == NULL || p->key != key) { |
| 10 rcu_read_unlock(); |
| 11 return 0; |
| 12 } |
| 13 spin_lock(&p->lock); |
| 14 if (hashtable[b] == p && p->key == key) { |
| 15 rcu_read_unlock(); |
| 16 rcu_assign_pointer(hashtable[b], NULL); |
| 17 spin_unlock(&p->lock); |
| 18 synchronize_rcu(); |
| 19 kfree(p); |
| 20 return 1; |
| 21 } |
| 22 spin_unlock(&p->lock); |
| 23 rcu_read_unlock(); |
| 24 return 0; |
| 25 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Existence Guarantees Enable Per-Element Locking} |
| \label{fig:defer:Existence Guarantees Enable Per-Element Locking} |
| \end{figure} |
| |
| Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking} |
| demonstrates how RCU-based existence guarantees can enable |
| per-element locking via a function that deletes an element from |
| a hash table. |
| Line~6 computes a hash function, and line~7 enters an RCU |
| read-side critical section. |
| If line~9 finds that the corresponding bucket of the hash table is |
| empty or that the element present is not the one we wish to delete, |
| then line~10 exits the RCU read-side critical section and line~11 |
| indicates failure. |
| |
| \QuickQuiz{} |
| What if the element we need to delete is not the first element |
| of the list on line~9 of |
| Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}? |
| \QuickQuizAnswer{ |
| As with |
| Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}, |
| this is a very simple hash table with no chaining, so the only |
| element in a given bucket is the first element. |
| The reader is again invited to adapt this example to a hash table with |
| full chaining. |
| } \QuickQuizEnd |
| |
| Otherwise, line~13 acquires the update-side spinlock, and |
| line~14 then checks that the element is still the one that we want. |
| If so, line~15 leaves the RCU read-side critical section, |
| line~16 removes it from the table, line~17 releases |
| the lock, line~18 waits for all pre-existing RCU read-side critical |
| sections to complete, line~19 frees the newly removed element, |
| and line~20 indicates success. |
| If the element is no longer the one we want, line~22 releases |
| the lock, line~23 leaves the RCU read-side critical section, |
| and line~24 indicates failure to delete the specified key. |
| |
| \QuickQuiz{} |
| Why is it OK to exit the RCU read-side critical section on |
| line~15 of |
| Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking} |
| before releasing the lock on line~17? |
| \QuickQuizAnswer{ |
| First, please note that the second check on line~14 is |
| necessary because some other |
| CPU might have removed this element while we were waiting |
| to acquire the lock. |
| However, the fact that we were in an RCU read-side critical section |
| while acquiring the lock guarantees that this element could not |
| possibly have been re-allocated and re-inserted into this |
| hash table. |
| Furthermore, once we acquire the lock, the lock itself guarantees |
| the element's existence, so we no longer need to be in an |
| RCU read-side critical section. |
| |
| The question as to whether it is necessary to re-check the |
| element's key is left as an exercise to the reader. |
| % A re-check is necessary if the key can mutate or if it is |
| % necessary to reject deleted entries (in cases where deletion |
| % is recorded by mutating the key. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why not exit the RCU read-side critical section on |
| line~23 of |
| Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking} |
| before releasing the lock on line~22? |
| \QuickQuizAnswer{ |
| Suppose we reverse the order of these two lines. |
| Then this code is vulnerable to the following sequence of |
| events: |
| \begin{enumerate} |
| \item CPU~0 invokes \co{delete()}, and finds the element |
| to be deleted, executing through line~15. |
| It has not yet actually deleted the element, but |
| is about to do so. |
| \item CPU~1 concurrently invokes \co{delete()}, attempting |
| to delete this same element. |
| However, CPU~0 still holds the lock, so CPU~1 waits |
| for it at line~13. |
| \item CPU~0 executes lines~16 and 17, and blocks at |
| line~18 waiting for CPU~1 to exit its RCU read-side |
| critical section. |
| \item CPU~1 now acquires the lock, but the test on line~14 |
| fails because CPU~0 has already removed the element. |
| CPU~1 now executes line~22 (which we switched with line~23 |
| for the purposes of this Quick Quiz) |
| and exits its RCU read-side critical section. |
| \item CPU~0 can now return from \co{synchronize_rcu()}, |
| and thus executes line~19, sending the element to |
| the freelist. |
| \item CPU~1 now attempts to release a lock for an element |
| that has been freed, and, worse yet, possibly |
| reallocated as some other type of data structure. |
| This is a fatal memory-corruption error. |
| \end{enumerate} |
| } \QuickQuizEnd |
| |
| Alert readers will recognize this as only a slight variation on |
| the original ``RCU is a way of waiting for things to finish'' theme, |
| which is addressed in |
| Section~\ref{sec:defer:RCU is a Way of Waiting for Things to Finish}. |
| They might also note the deadlock-immunity advantages over the lock-based |
| existence guarantees discussed in |
| Section~\ref{sec:locking:Lock-Based Existence Guarantees}. |
| |
| \subsubsection{RCU is a Way of Providing Type-Safe Memory} |
| \label{sec:defer:RCU is a Way of Providing Type-Safe Memory} |
| |
| A number of lockless algorithms do not require that a given data |
| element keep the same identity through a given RCU read-side critical |
| section referencing it---but only if that data element retains the |
| same type. |
| In other words, these lockless algorithms can tolerate a given data |
| element being freed and reallocated as the same type of structure |
| while they are referencing it, but must prohibit a change in type. |
| This guarantee, called ``type-safe memory'' in |
| academic literature~\cite{Cheriton96a}, |
| is weaker than the existence guarantees in the |
| previous section, and is therefore quite a bit harder to work with. |
| Type-safe memory algorithms in the Linux kernel make use of slab caches, |
| specially marking these caches with \co{SLAB_DESTROY_BY_RCU} |
| so that RCU is used when returning a freed-up |
| slab to system memory. |
| This use of RCU guarantees that any in-use element of |
| such a slab will remain in that slab, thus retaining its type, |
| for the duration of any pre-existing RCU read-side critical sections. |
| |
| \QuickQuiz{} |
| But what if there is an arbitrarily long series of RCU |
| read-side critical sections in multiple threads, so that at |
| any point in time there is at least one thread in the system |
| executing in an RCU read-side critical section? |
| Wouldn't that prevent any data from a \co{SLAB_DESTROY_BY_RCU} |
| slab ever being returned to the system, possibly resulting |
| in OOM events? |
| \QuickQuizAnswer{ |
| There could certainly be an arbitrarily long period of time |
| during which at least one thread is always in an RCU read-side |
| critical section. |
| However, the key words in the description in |
| Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory} |
| are ``in-use'' and ``pre-existing''. |
| Keep in mind that a given RCU read-side critical section is |
| conceptually only permitted to gain references to data elements |
| that were in use at the beginning of that critical section. |
	Furthermore, remember that a slab cannot be returned to the
	system until all of its data elements have been freed; in fact,
	the RCU grace period cannot start until after they have all been
	freed.
| |
| Therefore, the slab cache need only wait for those RCU read-side |
| critical sections that started before the freeing of the last element |
| of the slab. |
| This in turn means that any RCU grace period that begins after |
| the freeing of the last element will do---the slab may be returned |
| to the system after that grace period ends. |
| } \QuickQuizEnd |
| |
| These algorithms typically use a validation step that checks to make |
| sure that the newly referenced data structure really is the one that |
| was requested~\cite[Section 2.5]{LaninShasha1986TSM}. |
| These validation checks require that portions of the data structure |
| remain untouched by the free-reallocate process. |
| Such validation checks are usually very hard to get right, and can |
| hide subtle and difficult bugs. |
| |
| Therefore, although type-safety-based lockless algorithms can be extremely |
| helpful in a very few difficult situations, you should instead use existence |
| guarantees where possible. |
| Simpler is after all almost always better! |
| |
| \subsubsection{RCU is a Way of Waiting for Things to Finish} |
| \label{sec:defer:RCU is a Way of Waiting for Things to Finish} |
| |
As noted in Section~\ref{sec:defer:RCU Fundamentals},
an important component
of RCU is a way of waiting for RCU readers to finish.
| One of |
| RCU's great strengths is that it allows you to wait for each of |
| thousands of different things to finish without having to explicitly |
| track each and every one of them, and without having to worry about |
| the performance degradation, scalability limitations, complex deadlock |
| scenarios, and memory-leak hazards that are inherent in schemes that |
| use explicit tracking. |
| |
| In this section, we will show how \co{synchronize_sched()}'s |
| read-side counterparts (which include anything that disables preemption, |
| along with hardware operations and |
| primitives that disable interrupts) permit you to implement interactions with |
| non-maskable interrupt |
| (NMI) handlers that would be quite difficult if using locking. |
| This approach has been called ``Pure RCU''~\cite{PaulEdwardMcKenneyPhD}, |
| and it is used in a number of places in the Linux kernel. |
| |
| The basic form of such ``Pure RCU'' designs is as follows: |
| |
| \begin{enumerate} |
| \item Make a change, for example, to the way that the OS reacts to an NMI. |
| \item Wait for all pre-existing read-side critical sections to |
| completely finish (for example, by using the |
| \co{synchronize_sched()} primitive). |
| The key observation here is that subsequent RCU read-side critical |
| sections are guaranteed to see whatever change was made. |
| \item Clean up, for example, return status indicating that the |
| change was successfully made. |
| \end{enumerate} |
| |
| The remainder of this section presents example code adapted from |
| the Linux kernel. |
| In this example, the \co{timer_stop} function uses |
| \co{synchronize_sched()} to ensure that all in-flight NMI |
| notifications have completed before freeing the associated resources. |
A simplified version of this code is shown in
Figure~\ref{fig:defer:Using RCU to Wait for NMIs to Finish}.
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct profile_buffer { |
| 2 long size; |
| 3 atomic_t entry[0]; |
| 4 }; |
| 5 static struct profile_buffer *buf = NULL; |
| 6 |
| 7 void nmi_profile(unsigned long pcvalue) |
| 8 { |
| 9 struct profile_buffer *p = rcu_dereference(buf); |
| 10 |
| 11 if (p == NULL) |
| 12 return; |
| 13 if (pcvalue >= p->size) |
| 14 return; |
| 15 atomic_inc(&p->entry[pcvalue]); |
| 16 } |
| 17 |
| 18 void nmi_stop(void) |
| 19 { |
| 20 struct profile_buffer *p = buf; |
| 21 |
| 22 if (p == NULL) |
| 23 return; |
| 24 rcu_assign_pointer(buf, NULL); |
| 25 synchronize_sched(); |
| 26 kfree(p); |
| 27 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Using RCU to Wait for NMIs to Finish} |
| \label{fig:defer:Using RCU to Wait for NMIs to Finish} |
| \end{figure} |
| |
| Lines~1-4 define a \co{profile_buffer} structure, containing a |
| size and an indefinite array of entries. |
| Line~5 defines a pointer to a profile buffer, which is |
| presumably initialized elsewhere to point to a dynamically allocated |
| region of memory. |
| |
| Lines~7-16 define the \co{nmi_profile()} function, |
| which is called from within an NMI handler. |
As such, it cannot be preempted, nor can it be interrupted by a normal
interrupt handler.
However, it is still subject to delays due to cache misses,
ECC errors, and cycle stealing by other hardware threads within the same
core.
| Line~9 gets a local pointer to the profile buffer using the |
| \co{rcu_dereference()} primitive to ensure memory ordering on |
| DEC Alpha, and |
| lines~11 and 12 exit from this function if there is no |
| profile buffer currently allocated, while lines~13 and 14 |
| exit from this function if the \co{pcvalue} argument |
| is out of range. |
| Otherwise, line~15 increments the profile-buffer entry indexed |
| by the \co{pcvalue} argument. |
| Note that storing the size with the buffer guarantees that the |
| range check matches the buffer, even if a large buffer is suddenly |
| replaced by a smaller one. |
| |
| Lines~18-27 define the \co{nmi_stop()} function, |
| where the caller is responsible for mutual exclusion (for example, |
| holding the correct lock). |
| Line~20 fetches a pointer to the profile buffer, and |
| lines~22 and 23 exit the function if there is no buffer. |
| Otherwise, line~24 \co{NULL}s out the profile-buffer pointer |
| (using the \co{rcu_assign_pointer()} primitive to maintain |
| memory ordering on weakly ordered machines), |
| and line~25 waits for an RCU Sched grace period to elapse, |
| in particular, waiting for all non-preemptible regions of code, |
| including NMI handlers, to complete. |
| Once execution continues at line~26, we are guaranteed that |
| any instance of \co{nmi_profile()} that obtained a |
| pointer to the old buffer has returned. |
| It is therefore safe to free the buffer, in this case using the |
| \co{kfree()} primitive. |
| |
| \QuickQuiz{} |
| Suppose that the \co{nmi_profile()} function was preemptible. |
| What would need to change to make this example work correctly? |
| \QuickQuizAnswer{ |
| One approach would be to use |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} |
| in \co{nmi_profile()}, and to replace the |
| \co{synchronize_sched()} with \co{synchronize_rcu()}, |
| perhaps as shown in |
| Figure~\ref{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}. |
| % |
| \begin{figure}[tb] |
| {\scriptsize |
| \begin{verbbox} |
| 1 struct profile_buffer { |
| 2 long size; |
| 3 atomic_t entry[0]; |
| 4 }; |
| 5 static struct profile_buffer *buf = NULL; |
| 6 |
| 7 void nmi_profile(unsigned long pcvalue) |
| 8 { |
| 9 struct profile_buffer *p; |
| 10 |
| 11 rcu_read_lock(); |
| 12 p = rcu_dereference(buf); |
| 13 if (p == NULL) { |
| 14 rcu_read_unlock(); |
| 15 return; |
| 16 } |
| 17 if (pcvalue >= p->size) { |
| 18 rcu_read_unlock(); |
| 19 return; |
| 20 } |
| 21 atomic_inc(&p->entry[pcvalue]); |
| 22 rcu_read_unlock(); |
| 23 } |
| 24 |
| 25 void nmi_stop(void) |
| 26 { |
| 27 struct profile_buffer *p = buf; |
| 28 |
| 29 if (p == NULL) |
| 30 return; |
| 31 rcu_assign_pointer(buf, NULL); |
| 32 synchronize_rcu(); |
| 33 kfree(p); |
| 34 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish} |
| \label{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish} |
| \end{figure} |
| % |
| } \QuickQuizEnd |
| |
| In short, RCU makes it easy to dynamically switch among profile |
| buffers (you just \emph{try} doing this efficiently with atomic |
| operations, or at all with locking!). |
| However, RCU is normally used at a higher level of abstraction, as |
| was shown in the previous sections. |
| |
| \subsubsection{RCU Usage Summary} |
| \label{sec:defer:RCU Usage Summary} |
| |
| At its core, RCU is nothing more nor less than an API that provides: |
| |
| \begin{enumerate} |
| \item a publish-subscribe mechanism for adding new data, |
| \item a way of waiting for pre-existing RCU readers to finish, and |
| \item a discipline of maintaining multiple versions to permit change |
| without harming or unduly delaying concurrent RCU readers. |
| \end{enumerate} |
| |
| That said, it is possible to build higher-level constructs |
| on top of RCU, including the reader-writer-locking, reference-counting, |
| and existence-guarantee constructs listed in the earlier sections. |
| Furthermore, I have no doubt that the Linux community will continue to |
| find interesting new uses for RCU, |
| as well as for any of a number of other synchronization primitives. |
| |
| \begin{figure}[tbh] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUApplicability}} |
| \caption{RCU Areas of Applicability} |
| \label{fig:defer:RCU Areas of Applicability} |
| \end{figure} |
| |
| In the meantime, |
| Figure~\ref{fig:defer:RCU Areas of Applicability} |
| shows some rough rules of thumb on where RCU is most helpful. |
| |
| As shown in the blue box at the top of the figure, RCU works best if |
| you have read-mostly data where stale and inconsistent |
| data is permissible (but see below for more information on stale and |
| inconsistent data). |
| The canonical example of this case in the Linux kernel is routing tables. |
Because it may have taken many seconds or even minutes for the
routing updates to propagate across the Internet, the system
has been sending packets the wrong way for quite some time.
| Having some small probability of continuing to send some of them the wrong |
| way for a few more milliseconds is almost never a problem. |
| |
| If you have a read-mostly workload where consistent data is required, |
| RCU works well, as shown by the green ``read-mostly, need consistent data'' |
| box. |
| One example of this case is the Linux kernel's mapping from user-level |
| System-V semaphore IDs to the corresponding in-kernel data structures. |
| Semaphores tend to be used far more frequently than they are created |
| and destroyed, so this mapping is read-mostly. |
| However, it would be erroneous to perform a semaphore operation on |
| a semaphore that has already been deleted. |
| This need for consistency is handled by using the lock in the |
| in-kernel semaphore data structure, along with a ``deleted'' |
| flag that is set when deleting a semaphore. |
| If a user ID maps to an in-kernel data structure with the |
| ``deleted'' flag set, the data structure is ignored, so that |
| the user ID is flagged as invalid. |
| |
| Although this requires that the readers acquire a lock for the |
| data structure representing the semaphore itself, |
| it allows them to dispense with locking for the |
| mapping data structure. |
| The readers therefore locklessly |
| traverse the tree used to map from ID to data structure, |
| which in turn greatly improves performance, scalability, and |
| real-time response. |
| |
| As indicated by the yellow ``read-write'' box, RCU can also be useful |
| for read-write |
| workloads where consistent data is required, although usually in |
| conjunction with a number of other synchronization primitives. |
| For example, the directory-entry cache in recent Linux kernels uses RCU in |
| conjunction with sequence locks, per-CPU locks, and per-data-structure |
| locks to allow lockless traversal of pathnames in the common case. |
| Although RCU can be very beneficial in this read-write case, such |
| use is often more complex than that of the read-mostly cases. |
| |
| Finally, as indicated by the red box at the bottom of the figure, |
| update-mostly workloads requiring |
| consistent data are rarely good places to use RCU, though there are some |
| exceptions~\cite{MathieuDesnoyers2012URCU}. |
| In addition, as noted in |
| Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}, |
| within the Linux kernel, the \co{SLAB_DESTROY_BY_RCU} |
| slab-allocator flag provides type-safe memory to RCU readers, which can |
| greatly simplify non-blocking synchronization and other lockless |
| algorithms. |
| |
| In short, RCU is an API that includes a publish-subscribe mechanism for |
| adding new data, a way of waiting for pre-existing RCU readers to finish, |
| and a discipline of maintaining multiple versions to allow updates to |
| avoid harming or unduly delaying concurrent RCU readers. |
| This RCU API is best suited for read-mostly situations, especially if |
| stale and inconsistent data can be tolerated by the application. |
| |
| % <h3>Acknowledgements</h3> @@@ |
| % |
| % We are all indebted to Andy Whitcroft, Jon Walpole, and Gautham Shenoy, |
| % whose review of an early draft of this document greatly improved it. |
| % I owe thanks to the members of the Relativistic Programming project |
| % and to members of PNW TEC for many valuable discussions. |
| % I am grateful to Dan Frye for his support of this effort. |