| % defer/rcuintro.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \subsection{Introduction to RCU} |
| \label{sec:defer:Introduction to RCU} |
| |
| The approaches discussed in the preceding sections have provided |
| good scalability but decidedly non-ideal performance for the |
| Pre-BSD routing table. |
| Therefore, in the spirit of ``only those who have gone too far |
| know how far you can go'',\footnote{ |
| With apologies to T.~S.~Eliot.} |
| we will go all the way, looking into algorithms in which concurrent |
| readers might well execute exactly the same sequence of assembly language |
| instructions as would a single-threaded lookup, despite the presence of |
| concurrent updates. |
| Of course, this laudable goal might raise serious implementability |
| questions, but we cannot possibly succeed if we don't even try! |
| |
| And should we succeed, we will have uncovered yet another of the |
| mysteries set forth on |
| \cpageref{sec:defer:Mysteries RCU}. |
| |
| \subsubsection{Minimal Insertion and Deletion} |
| \label{sec:defer:Minimal Insertion and Deletion} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUListInsertClassic}} |
| \caption{Insertion With Concurrent Readers} |
| \label{fig:defer:Insertion With Concurrent Readers} |
| \end{figure} |
| |
| To minimize implementability concerns, we focus on a minimal |
| data structure, which consists of a single global pointer that is either |
| \co{NULL} or references a single structure. |
| Minimal though it might be, this data structure is heavily used in |
| production~\cite{GeoffRomer2018C++DeferredReclamationP0561R4}. |
| A classic approach for insertion is shown in |
| \cref{fig:defer:Insertion With Concurrent Readers}, |
| which shows four states with time advancing from top to bottom. |
| The first row shows the initial state, with \co{gptr} equal to \co{NULL}. |
| In the second row, we have allocated a structure which is uninitialized, |
| as indicated by the question marks. |
| In the third row, we have initialized the structure. |
| Finally, in the fourth and final row, we have updated \co{gptr} to |
| reference the newly allocated and initialized element. |
| |
| We might hope that this assignment to \co{gptr} could use a simple |
| C-language assignment statement. |
| Unfortunately, |
| \cref{sec:toolsoftrade:Shared-Variable Shenanigans} |
| dashes these hopes. |
| Therefore, the updater cannot use a simple C-language assignment, but |
| must instead use \co{smp_store_release()} as shown in the figure, |
| or, as will be seen, \co{rcu_assign_pointer()}. |
| |
| Similarly, one might hope that readers could use a single C-language |
| assignment to fetch the value of \co{gptr}, and be guaranteed to either |
| get the old value of \co{NULL} or to get the newly installed pointer, |
| but either way see a valid result. |
| Unfortunately, \cref{sec:toolsoftrade:Shared-Variable Shenanigans} |
| dashes these hopes as well. |
| To obtain this guarantee, readers must instead use \co{READ_ONCE()}, |
| or, as will be seen, \co{rcu_dereference()}. |
| However, on most modern computer systems, each of these read-side primitives |
| can be implemented with a single load instruction, exactly the instruction |
| that would normally be used in single-threaded code. |
| |
| Reviewing \cref{fig:defer:Insertion With Concurrent Readers} |
| from the viewpoint of readers, in the first three states all readers |
| see \co{gptr} having the value \co{NULL}. |
| Upon entering the fourth state, some readers might see \co{gptr} still |
| having the value \co{NULL} while others might see it referencing the |
| newly inserted element, but after some time, all readers will see this |
| new element. |
| At all times, all readers will see \co{gptr} as containing a valid pointer. |
| Therefore, it really is possible to add new data to linked data structures |
| while allowing concurrent readers to execute the same sequence of machine |
| instructions that is normally used in single-threaded code. |
| This no-cost approach to concurrent reading provides excellent performance |
| and scalability, and also is eminently suitable for real-time use. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUListDeleteClassic}} |
| \caption{Deletion With Concurrent Readers} |
| \label{fig:defer:Deletion With Concurrent Readers} |
| \end{figure} |
| |
| Insertion is of course quite useful, but sooner or later, it will also |
| be necessary to delete data. |
| As can be seen in |
| \cref{fig:defer:Deletion With Concurrent Readers}, |
| the first step is easy. |
| Again taking the lessons from |
| \cref{sec:toolsoftrade:Shared-Variable Shenanigans} |
| to heart, \co{smp_store_release()} is used to \co{NULL} the pointer, |
| thus moving from the first row to the second in the figure. |
| At this point, pre-existing readers see the old structure with |
| \co{->addr} of~42 and \co{->iface} of~1, but new readers will see |
| a \co{NULL} pointer, that is, concurrent readers can disagree on |
| the state, as indicated by the ``2~Versions'' in the figure. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Why does |
| \cref{fig:defer:Deletion With Concurrent Readers} |
| use \co{smp_store_release()} given that it is storing |
| a \co{NULL} pointer? |
| Wouldn't \co{WRITE_ONCE()} work just as well in this case, |
| given that there is no structure initialization to order |
| against the store of the \co{NULL} pointer? |
| }\QuickQuizAnswerB{ |
| Yes, it would. |
| |
| Because a \co{NULL} pointer is being assigned, there is nothing |
| to order against, so there is no need for \co{smp_store_release()}. |
| In contrast, when assigning a non-\co{NULL} pointer, it is |
| necessary to use \co{smp_store_release()} in order to ensure |
| that initialization of the pointed-to structure is carried |
| out before assignment of the pointer. |
| |
| In short, \co{WRITE_ONCE()} would work, and would |
| save a little bit of CPU time on some architectures. |
| However, as we will see, software-engineering concerns |
| will motivate use of a special \co{rcu_assign_pointer()} |
| that is quite similar to \co{smp_store_release()}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Readers running concurrently with each other and with the procedure |
| outlined in |
| \cref{fig:defer:Deletion With Concurrent Readers} |
| can disagree on the value of \co{gptr}. |
| Isn't that just a wee bit problematic??? |
| }\QuickQuizAnswerE{ |
| Not necessarily. |
| |
| As hinted at in |
| \cref{sec:cpu:Hardware Optimizations,sec:cpu:Hardware Free Lunch?}, |
| speed-of-light delays mean that a computer's data is always |
| stale compared to whatever external reality that data is intended |
| to model. |
| |
| Real-world algorithms therefore absolutely must tolerate |
| inconsistancies between external reality and the in-computer |
| data reflecting that reality. |
| Many of those algorithms are also able to tolerate some degree |
| of inconsistency within the in-computer data. |
| \Cref{sec:datastruct:RCU-Protected Hash Table Discussion} |
| discusses this point in more detail. |
| |
| Please note that this need to tolerate inconsistent and stale |
| data is not limited to RCU\@. |
| It also applies to reference counting, hazard pointers, sequence |
| locks, and even to some locking use cases. |
| For example, if you compute some quantity while holding a lock, |
| but use that quantity after releasing that lock, |
| you might well be using stale data. |
| After all, the data that quantity is based on might change |
| arbitrarily as soon as the lock is released. |
| |
| So yes, RCU readers can see stale and inconsistent data, but no, |
| this is not necessarily problematic. |
| And, when needed, there are RCU usage patterns that avoid both |
| staleness and inconsistency~\cite{Arcangeli03}. |
| }\QuickQuizEndE |
| } |
| |
| We get back to a single version simply by waiting for all the |
| pre-existing readers to complete, as shown in row~3. |
| At that point, all the pre-existing readers are done, and no later |
| reader has a path to the old data item, so there can no longer be |
| any readers referencing it. |
| It may therefore be safely freed, as shown on row~4. |
| |
| Thus, given a way to wait for pre-existing readers to complete, |
| it is possible to both add data to and remove data from a linked |
| data structure, despite the readers executing the same sequence |
| of machine instructions that would be appropriate for single-threaded |
| execution. |
| So perhaps going all the way was not too far after all! |
| |
| But how can we tell when all of the pre-existing readers have in |
| fact completed? |
| This question is the topic of \cref{sec:defer:Waiting for Readers}. |
| But first, the next section defines RCU's core API. |
| |
| \subsubsection{Core RCU API} |
| \label{sec:defer:Core RCU API} |
| |
| The full Linux-kernel API is quite extensive, with more than one |
| hundred API members. |
| However, this section will confine itself to six core RCU API members, |
| which suffices for the upcoming sections introducing RCU and covering |
| its fundamentals. |
| The full API is covered in \cref{sec:defer:RCU Linux-Kernel API}. |
| |
| Three members of the core APIs are used by readers. |
| The \apik{rcu_read_lock()} and \apik{rcu_read_unlock()} functions delimit |
| \IXhmrpl{RCU read-side}{critical section}. |
| These may be nested, so that one \co{rcu_read_lock()}--\co{rcu_read_unlock()} |
| pair can be enclosed within another. |
| In this case, the nested set of RCU read-side critical sections act as |
| one large critical section covering the full extent of the nested set. |
| The third read-side API member, \apik{rcu_dereference()}, fetches an |
| \IXr{RCU-protected pointer}. |
| Conceptually, \co{rcu_dereference()} simply loads from memory, but we |
| will see in |
| \cref{sec:defer:Publish-Subscribe Mechanism} |
| that \co{rcu_dereference()} must prevent the compiler and (in |
| one case) the CPU from reordering its load with later memory operations |
| that dereference this pointer. |
| |
| \QuickQuiz{ |
| What is an RCU-protected pointer? |
| }\QuickQuizAnswer{ |
| A pointer to \IXr{RCU-protected data}. |
| RCU-protected data is in turn a block of dynamically allocated |
| memory whose freeing will be deferred such that an RCU grace |
| period will elapse between the time that there were no longer |
| any RCU-reader-accessible pointers to that block and the time |
| that that block is freed. |
| This ensures that no RCU readers will have access to that block at |
| the time that it is freed. |
| |
| RCU-protected pointers must be handled carefully. |
| For example, any reader that intends to dereference an |
| RCU-protected pointer must use \co{rcu_dereference()} (or |
| stronger) to load that pointer. |
| In addition, any updater must use \co{rcu_assign_pointer()} |
| (or stronger) to store to that pointer. |
| }\QuickQuizEnd |
| |
| The other three members of the core APIs are used by updaters. |
| The \apik{synchronize_rcu()} function implements the ``wait for readers'' |
| operation from \cref{fig:defer:Deletion With Concurrent Readers}. |
| The \apik{call_rcu()} function is the asynchronous counterpart of |
| \co{synchronize_rcu()} by invoking the specified function after |
| all pre-existing RCU readers have completed. |
| Finally, the \apik{rcu_assign_pointer()} macro is used to update an |
| RCU-protected pointer. |
| Conceptually, this is simply an assignment statement, but we will |
| see in |
| \cref{sec:defer:Publish-Subscribe Mechanism} |
| that \co{rcu_assign_pointer()} must prevent the compiler and the CPU |
| from reordering this assignment to precede any prior assignments used |
| to initialize the pointed-to structure. |
| |
| \begin{table*} |
| \renewcommand*{\arraystretch}{1.25} |
| \rowcolors{2}{}{lightgray} |
| \small |
| \centering |
| \ebresizewidth{ |
| \begin{tabular}{llp{2.4in}} |
| \toprule |
| & |
| Primitive & |
| Purpose \\ |
| \midrule |
| \cellcolor{white}\emph{Readers} & |
| \tco{rcu_read_lock()} & |
| Start an RCU read-side critical section. \\ |
| & |
| \tco{rcu_read_unlock()} & |
| End an RCU read-side critical section. \\ |
| \cellcolor{white}& |
| \tco{rcu_dereference()} & |
| Safely load an RCU-protected pointer. \\ |
| \midrule |
| \emph{Updaters} & |
| \tco{synchronize_rcu()} & |
| Wait for all pre-existing RCU read-side critical sections to |
| complete. \\ |
| \cellcolor{white}& |
| \tco{call_rcu()} & |
| Invoke the specified function after all pre-existing RCU read-side |
| critical sections complete. \\ |
| & |
| \tco{rcu_assign_pointer()} & |
| Safely update an RCU-protected pointer. \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{Core RCU API} |
| \label{tab:defer:Core RCU API} |
| \end{table*} |
| |
| \QuickQuiz{ |
| What does \co{synchronize_rcu()} do if it starts at about |
| the same time as an \co{rcu_read_lock()}? |
| }\QuickQuizAnswer{ |
| If a \co{synchronize_rcu()} cannot prove that it started |
| before a given \co{rcu_read_lock()}, then it must wait |
| for the corresponding \co{rcu_read_unlock()}. |
| }\QuickQuizEnd |
| |
| The core RCU API is summarized in \cref{tab:defer:Core RCU API} for |
| easy reference. |
| With that, we are ready to continue this introduction to RCU with |
| the key RCU operation, waiting for readers. |
| |
| \subsubsection{Waiting for Readers} |
| \label{sec:defer:Waiting for Readers} |
| |
| It is tempting to base the reader-waiting functionality of |
| \co{synchronize_rcu()} and \co{call_rcu()} on a reference counter |
| updated by \co{rcu_read_lock()} and \co{rcu_read_unlock()}, but |
| \cref{fig:count:Atomic Increment Scalability on x86} |
| in |
| \cref{chp:Counting} |
| shows that concurrent reference counting results in extreme overhead. |
| This extreme overhead was confirmed in the specific case of |
| reference counters in |
| \cref{fig:defer:Pre-BSD Routing Table Protected by Reference Counting} |
| on |
| \cpageref{fig:defer:Pre-BSD Routing Table Protected by Reference Counting}. |
| Hazard pointers profoundly reduce this overhead, but, as we saw in |
| \cref{fig:defer:Pre-BSD Routing Table Protected by Hazard Pointers} |
| on |
| \cpageref{fig:defer:Pre-BSD Routing Table Protected by Hazard Pointers}, |
| not to zero. |
| Nevertheless, many RCU implementations use counters with carefully |
| controlled cache locality. |
| |
| A second approach observes that memory synchronization is expensive, |
| and therefore uses registers instead, namely each CPU's or thread's |
| program counter (PC), thus imposing no overhead on readers, at least |
| in the absence of concurrent updates. |
| The updater polls each relevant PC, and if that PC is not within read-side |
| code, then the corresponding CPU or thread is within a quiescent state, |
| in turn signaling the completion of any reader that might have access |
| to the newly removed data element. |
| Once all CPU's or thread's PCs have been observed to be outside of any |
| reader, the \IX{grace period} has completed. |
| Please note that this approach poses some serious challenges, including |
| memory ordering, functions that are \emph{sometimes} invoked from readers, |
| and ever-exciting code-motion optimizations. |
| Nevertheless, this approach is said to be used in |
| production~\cite{MikeAsh2015Apple}. |
| |
| A third approach is to simply wait for a fixed period of time that is |
| long enough to comfortably exceed the lifetime of any reasonable |
| reader~\cite{Jacobson93,AjuJohn95}. |
| This can work quite well in hard real-time systems~\cite{YuxinRen2018RTRCU}, |
| but in less exotic |
| settings, Murphy says that it is critically important to be prepared |
| even for unreasonably long-lived readers. |
| To see this, consider the consequences of failing do so: |
| A data item will be freed while the unreasonable reader is still |
| referencing it, and that item might well be immediately reallocated, |
| possibly even as a data item of some other type. |
| The unreasonable reader and the unwitting reallocator would then |
| be attempting to use the same memory for two very different purposes. |
| The ensuing mess will be exceedingly difficult to debug. |
| |
| A fourth approach is to wait forever, secure in the knowledge that |
| doing so will accommodate even the most unreasonable reader. |
| This approach is also called ``leaking memory'', and has a bad reputation |
| due to the fact that memory leaks often require untimely and |
| inconvenient reboots. |
| Nevertheless, this is a viable strategy when the update rate and the |
| uptime are both sharply bounded. |
| For example, this approach could work well in a high-availability |
| cluster where systems were periodically crashed in order to ensure |
| that cluster really remained highly available.\footnote{ |
| The program that forces the periodic crashing is sometimes |
| known as a ``chaos monkey'': |
| \url{https://netflix.github.io/chaosmonkey/}. |
| However, it might also be a mistake to neglect chaos caused |
| by systems running for too long.} |
| Leaking the memory is also a viable strategy in environments having |
| garbage collectors, in which case the garbage collector can be thought |
| of as plugging the leak~\cite{Kung80}. |
| However, if your environment lacks a garbage collector, read on! |
| |
| A fifth approach avoids the period crashes in favor of periodically |
| ``stopping the world'', as exemplified by the traditional stop-the-world |
| garbage collector. |
| This approach was also heavily used during the decades before |
| ubiquitous connectivity, when it was common practice to power systems |
| off at the end of each working day. |
| However, in today's always-connected always-on world, stopping the world |
| can gravely degrade response times, which has been one motivation for the |
| development of concurrent garbage collectors~\cite{DavidFBacon2003RTGC}. |
| Furthermore, although we need all pre-existing readers to complete, we do |
| not need them all to complete at the same time. |
| |
| This observation leads to the sixth approach, which is stopping |
| one CPU or thread at a time. |
| This approach has the advantage of not degrading reader response times |
| at all, let alone gravely. |
| Furthermore, numerous applications already have states (termed |
| \emph{\IXpl{quiescent state}}) that can be |
| reached only after all pre-existing readers are done. |
| In transaction-processing systems, the time between a pair of |
| successive transactions might be a quiescent state. |
| In reactive systems, the state between a pair of successive events |
| might be a quiescent state. |
| Within non-preemptive operating-systems kernels, a context switch can be |
| a quiescent state~\cite{McKenney98}. |
| Either way, once all CPUs and/or threads have passed through a quiescent |
| state, the system is said to have completed a \emph{grace period}, |
| at which point all readers in existence at the start of that grace period |
| are guaranteed to have completed. |
| As a result, it is also guaranteed to be safe to free any removed data |
| items that were removed prior to the start of that grace period.\footnote{ |
| It is possible to do much more with RCU than simply defer |
| reclamation of memory, but deferred reclamation is RCU's most |
| common use case, and is therefore an excellent place to start. |
| For an example of the more general case of deferred execution, |
| please see phased state change in \cref{sec:defer:Phased State |
| Change}.} |
| |
| Within a non-preemptive operating-system kernel, for context switch to be |
| a valid quiescent state, readers must be prohibited from blocking while |
| referencing a given instance data structure obtained via the \co{gptr} |
| pointer shown in |
| \cref{fig:defer:Insertion With Concurrent Readers,% |
| fig:defer:Deletion With Concurrent Readers}. |
| This no-blocking constraint is consistent with similar constraints |
| on pure spinlocks, where a CPU is forbidden from blocking while |
| holding a spinlock. |
| Without this constraint, all CPUs might be consumed by threads |
| spinning attempting to acquire a spinlock held by a blocked thread. |
| The spinning threads will not relinquish their CPUs until they acquire |
| the lock, but the thread holding the lock cannot possibly release it |
| until one of the spinning threads relinquishes a CPU\@. |
| This is a classic deadlock situation, and this \IX{deadlock} is avoided |
| by forbidding blocking while holding a spinlock. |
| |
| Again, this same constraint is imposed on reader threads dereferencing |
| \co{gptr}: |
| Such threads are not allowed to block until after |
| they are done using the pointed-to data item. |
| Returning to the second row of |
| \cref{fig:defer:Deletion With Concurrent Readers}, |
| where the updater has just completed executing the \co{smp_store_release()}, |
| imagine that CPU~0 executes a context switch. |
| Because readers are not permitted to block while traversing the linked |
| list, we are guaranteed that all prior readers that might have been running on |
| CPU~0 will have completed. |
| Extending this line of reasoning to the other CPUs, once each CPU has |
| been observed executing a context switch, we are guaranteed that all |
| prior readers have completed, and that there are no longer any reader |
| threads referencing the newly removed data element. |
| The updater can then safely free that data element, resulting in the |
| state shown at the bottom of |
| \cref{fig:defer:Deletion With Concurrent Readers}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/QSBRGracePeriod}} |
| \caption{QSBR\@: |
| Waiting for Pre-Existing Readers} |
| \label{fig:defer:QSBR: Waiting for Pre-Existing Readers} |
| \end{figure} |
| |
| This approach is termed \IXacrfst{qsbr}~\cite{ThomasEHart2006a}. |
| A QSBR schematic is shown in |
| \cref{fig:defer:QSBR: Waiting for Pre-Existing Readers}, |
| with time advancing from the top of the figure to the bottom. |
| The cyan-colored boxes depict RCU read-side critical sections, |
| each of which begins with \co{rcu_read_lock()} and ends with |
| \co{rcu_read_unlock()}. |
| CPU~1 does the \co{WRITE_ONCE()} that removes the current data |
| item (presumably having previously read the pointer value and |
| availed itself of appropriate synchronization), then waits |
| for readers. |
| This wait operation results in an immediate context switch, which is a |
| quiescent state (denoted by the pink circle), which in turn means that |
| all prior reads on CPU~1 have completed. |
| Next, CPU~2 does a context switch, so that all readers on CPUs~1 and~2 |
| are now known to have completed. |
| Finally, CPU~3 does a context switch. |
| At this point, all readers throughout the entire system are known to have |
| completed, so the grace period ends, permitting \co{synchronize_rcu()} to |
| return to its caller, in turn permitting CPU~1 to free the old data item. |
| |
| \QuickQuiz{ |
| In \cref{fig:defer:QSBR: Waiting for Pre-Existing Readers}, |
| the last of CPU~3's readers that could possibly have |
| access to the old data item ended before the grace period |
| even started! |
| So why would anyone bother waiting until CPU~3's later context |
| switch??? |
| }\QuickQuizAnswer{ |
| Because that waiting is exactly what enables readers to use |
| the same sequence of instructions that is appropriate for |
| single-theaded situations. |
| In other words, this additional ``redundant'' waiting enables |
| excellent read-side performance, scalability, and real-time |
| response. |
| }\QuickQuizEnd |
| |
| \subsubsection{Toy Implementation} |
| \label{sec:defer:Toy Implementation} |
| |
| Although production-quality QSBR implementations can be quite complex, |
| a toy non-preemptive Linux-kernel implementation is quite simple: |
| |
| \begin{VerbatimN}[samepage=true] |
| void synchronize_rcu(void) |
| { |
| int cpu; |
| |
| for_each_online_cpu(cpu) |
| sched_setaffinity(current->pid, cpumask_of(cpu)); |
| } |
| \end{VerbatimN} |
| |
| The \co{for_each_online_cpu()} primitive iterates over all CPUs, and |
| the \co{sched_setaffinity()} function causes the current thread to |
| execute on the specified CPU, which forces the destination CPU to execute |
| a context switch. |
| Therefore, once the \co{for_each_online_cpu()} has completed, each CPU |
| has executed a context switch, which in turn guarantees that |
| all pre-existing reader threads have completed. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:defer:Insertion and Deletion With Concurrent Readers] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| struct route *gptr; |
| |
| int access_route(int (*f)(struct route *rp)) |
| { |
| int ret = -1; |
| struct route *rp; |
| |
| rcu_read_lock(); |
| rp = rcu_dereference(gptr); |
| if (rp) |
| ret = f(rp); \lnlbl[access_rp] |
| rcu_read_unlock(); |
| return ret; |
| } |
| |
| struct route *ins_route(struct route *rp) |
| { |
| struct route *old_rp; |
| |
| spin_lock(&route_lock); |
| old_rp = gptr; |
| rcu_assign_pointer(gptr, rp); |
| spin_unlock(&route_lock); |
| return old_rp; |
| } |
| |
| int del_route(void) |
| { |
| struct route *old_rp; |
| |
| spin_lock(&route_lock); |
| old_rp = gptr; |
| RCU_INIT_POINTER(gptr, NULL); |
| spin_unlock(&route_lock); |
| synchronize_rcu(); |
| free(old_rp); |
| return !!old_rp; |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Insertion and Deletion With Concurrent Readers} |
| \label{lst:defer:Insertion and Deletion With Concurrent Readers} |
| \end{listing} |
| |
| Please note that this approach is \emph{not} production quality. |
| Correct handling of a number of corner cases and the need for a number |
| of powerful optimizations mean that production-quality implementations |
| are quite complex. |
| In addition, RCU implementations for preemptible environments |
| require that readers actually do something, which in non-real-time |
| Linux-kernel environments can be as simple as defining |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} as \co{preempt_disable()} |
| and \co{preempt_enable()}, respectively.\footnote{ |
| Some toy RCU implementations that handle preempted |
| read-side critical sections are shown in |
| \cref{chp:app:``Toy'' RCU Implementations}\@.} |
| However, this simple non-preemptible approach is conceptually complete, |
| and demonstrates that it really is possible to provide read-side |
| synchronization at zero cost, even in the face of concurrent updates. |
| In fact, |
| \cref{lst:defer:Insertion and Deletion With Concurrent Readers} |
| shows how reading (\co{access_route()}), |
| \cref{fig:defer:Insertion With Concurrent Readers}'s |
| insertion (\co{ins_route()}) and |
| \cref{fig:defer:Deletion With Concurrent Readers}'s |
| deletion (\co{del_route()}) can |
| be implemented. |
| (A slightly more capable routing table is shown in |
| \cref{sec:defer:RCU for Pre-BSD Routing}.) |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| What is the point of \co{rcu_read_lock()} and \co{rcu_read_unlock()} in |
| \cref{lst:defer:Insertion and Deletion With Concurrent Readers}? |
| Why not just let the quiescent states speak for themselves? |
| }\QuickQuizAnswerB{ |
| Recall that readers are not permitted to pass through a quiescent |
| state. |
| For example, within the Linux kernel, RCU readers are not permitted |
| to execute a context switch. |
| Use of \co{rcu_read_lock()} and \co{rcu_read_unlock()} enables |
| debug checks for improperly placed quiescent states, making it |
| easy to find bugs that would otherwise be difficult to find, |
| intermittent, and quite destructive. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| What is the point of \co{rcu_dereference()}, \co{rcu_assign_pointer()} |
| and \co{RCU_INIT_POINTER()} in |
| \cref{lst:defer:Insertion and Deletion With Concurrent Readers}? |
| Why not just use \co{READ_ONCE()}, \co{smp_store_release()}, and |
| \co{WRITE_ONCE()}, respectively? |
| }\QuickQuizAnswerE{ |
| The RCU-specific APIs do have similar semantics to the suggested |
| replacements, but also enable static-analysis debugging checks |
| that complain if an RCU-specific API is invoked on a non-RCU |
| pointer and vice versa. |
| }\QuickQuizEndE |
| } |
| |
| Referring back to |
| \cref{lst:defer:Insertion and Deletion With Concurrent Readers}, |
| note that \co{route_lock} is used to synchronize between concurrent updaters |
| invoking \co{ins_route()} and \co{del_route()}. |
| However, this lock is not acquired by readers invoking \co{access_route()}: |
| Readers are instead protected by the QSBR techniques described in |
| \cref{sec:defer:Waiting for Readers}. |
| |
| Note that \co{ins_route()} simply returns the old value of \co{gptr}, which |
| \cref{fig:defer:Insertion With Concurrent Readers} assumed would |
| always be \co{NULL}. |
| This means that it is the caller's responsibility to figure out what to |
| do with a non-\co{NULL} value, a task complicated by the fact that |
| readers might still be referencing it for an indeterminate period of time. |
| Callers might use one of the following approaches: |
| |
| \begin{enumerate} |
| \item Use \co{synchronize_rcu()} to safely free the pointed-to structure. |
| Although this approach is correct from an RCU perspective, it |
| arguably has software-engineering leaky-API problems. |
| \item Trip an assertion if the returned pointer is non-\co{NULL}. |
| \item Pass the returned pointer to a later invocation of |
| \co{ins_route()} to restore the earlier value. |
| \end{enumerate} |
| |
| In contrast, \co{del_route()} uses \co{synchronize_rcu()} and |
| \co{free()} to safely free the newly deleted data item. |
| |
| \QuickQuiz{ |
| But what if the old structure needs to be freed, but the caller |
| of \co{ins_route()} cannot block, perhaps due to performance |
| considerations or perhaps because the caller is executing within |
| an RCU read-side critical section? |
| }\QuickQuizAnswer{ |
| A \co{call_rcu()} function, which is described in |
| \cref{sec:defer:Wait For Pre-Existing RCU Readers}, |
| permits asynchronous grace-period waits. |
| }\QuickQuizEnd |
| |
| This example shows one general approach to reading and updating |
| RCU-protected data structures, however, there is quite a variety |
| of use cases, several of which are covered in |
| \cref{sec:defer:RCU Usage}. |
| |
| In summary, it is in fact possible to create concurrent linked data |
| structures that can be traversed by readers executing the same sequence |
| of machine instructions that would be executed by single-threaded readers. |
| The next section summarizes RCU's high-level properties. |
| |
| \subsubsection{RCU Properties} |
| \label{sec:defer:RCU Properties} |
| |
| A key RCU property is that reads need not wait for updates. |
| This property enables RCU implementations to provide low-cost or even |
| no-cost readers, resulting in low overhead and excellent scalability. |
| This property also allows RCU readers and updaters to make useful |
| concurrent forward progress. |
| In contrast, conventional synchronization primitives must enforce strict |
| mutual exclusion using expensive instructions, thus increasing overhead |
| and degrading scalability, but also typically prohibiting readers and |
| updaters from making useful concurrent forward progress. |
| |
| \QuickQuiz{ |
| Doesn't \cref{sec:defer:Sequence Locks}'s seqlock |
| also permit readers and updaters to make useful concurrent |
| forward progress? |
| }\QuickQuizAnswer{ |
| Yes and no. |
| Although seqlock readers can run concurrently with |
| seqlock writers, whenever this happens, the \co{read_seqretry()} |
| primitive will force the reader to retry. |
| This means that any work done by a seqlock reader running concurrently |
| with a seqlock updater will be discarded and then redone upon retry. |
| So seqlock readers can \emph{run} concurrently with updaters, |
| but they cannot actually get any work done in this case. |
| |
| In contrast, RCU readers can perform useful work even in presence |
| of concurrent RCU updaters. |
| |
| However, both reference counters |
| (\cref{sec:defer:Reference Counting}) |
| and hazard pointers |
| (\cref{sec:defer:Hazard Pointers}) |
| really do permit useful concurrent forward progress for both |
| updaters and readers, just at somewhat greater cost. |
| Please see |
| \cref{sec:defer:Which to Choose?} |
| for a comparison of these different solutions to the |
| deferred-reclamation problem. |
| }\QuickQuizEnd |
| |
| As noted earlier, RCU delimits readers with \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()}, and ensures that each reader has a coherent view |
| of each object (see \cref{fig:defer:Deletion With Concurrent Readers}) by |
| maintaining multiple versions of objects and using update-side primitives |
| such as \co{synchronize_rcu()} to ensure that objects are not |
| freed until after the completion of all readers that might be using them. |
| RCU uses \co{rcu_assign_pointer()} and \co{rcu_dereference()} to provide |
| efficient and scalable mechanisms for publishing and reading new versions |
| of an object, respectively. |
| These mechanisms distribute the work among read and |
| update paths in such a way as to make read paths extremely fast, using |
| replication and weakening optimizations in a manner similar to |
| \IXpl{hazard pointer}, but without the need for read-side retries. |
| In some cases, including \co{CONFIG_PREEMPT=n} Linux kernels, |
| RCU's read-side primitives have zero overhead. |
| |
| But are these properties actually useful in practice? |
| This question is taken up by the next section. |
| |
| \subsubsection{Practical Applicability} |
| \label{sec:defer:Practical Applicability} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/linux-RCU}} |
| \caption{RCU Usage in the Linux Kernel} |
| \label{fig:defer:RCU Usage in the Linux Kernel} |
| \end{figure} |
| |
| RCU has been used in the Linux kernel since |
| October 2002~\cite{Torvalds2.5.43}. |
| Use of the RCU API has increased substantially since that time, |
| as can be seen in |
| \cref{fig:defer:RCU Usage in the Linux Kernel}. |
| RCU has enjoyed heavy use both prior to and since its acceptance |
| in the Linux kernel, as discussed in |
| \cref{sec:defer:RCU Related Work}. |
| In short, RCU enjoys wide practical applicability. |
| |
| The minimal example discussed in this section is a good introduction to RCU\@. |
| However, effective use of RCU often requires that you think differently |
| about your problem. |
| It is therefore useful to examine RCU's fundamentals, a task taken up |
| by the following section. |