| % advsync/rcu.tex |
| |
| \section{Read-Copy Update} |
| \label{sec:advsync:Read-Copy Update} |
| |
| Read-copy update (RCU) can be thought of as a replacement for |
| reader-writer locking for |
| which the read-side primitives are extremely light weight, and where |
| readers and writers do not exclude each other. |
| This last means that writers must leave old versions |
| of RCU-protected data structures |
| in place until all concurrent readers have completed, |
| so that RCU write-side primitives can be quite expensive. |
| RCU is thus best-suited for read-mostly situations, or for situations |
| in which realtime constraints require deterministic readers. |
| |
| RCU provides the \co{rcu_read_lock()} and \co{rcu_read_unlock} |
| primitives to demark read-side critical sections, and |
| \co{synchronize_rcu} to allow writers to wait for completion |
| of all pre-existing readers. |
| The time that waiters much wait is termed a ``grace period''. |
| RCU read-side critical sections may be nested, which results in |
| the inner critical sections being subsumed into the outermost critical |
| section. |
| |
| \subsection{RCU Read-Side Critical Sections and Grace Periods} |
| \label{sec:advsync:RCU Read-Side Critical Sections and Grace Periods} |
| |
| Legal and illegal relationships of RCU read-side critical sections |
| to a grace periods are shown in |
| Figure~\ref{fig:advsync:RCU Readers and Grace Periods}. |
| Each blue box in the diagram represents an RCU read-side critical |
| section, as shown below. |
| The green box represents update-initiation code, for example, a |
| \co{list_del_rcu()} removing an element from an RCU-protected linked list, |
| with the color green representing the fact that readers can freely |
| access the element being removed. |
| The yellow box represents an RCU grace period, for example, a |
| \co{synchronize_rcu()}, with the color yell representing the fact that |
| pre-existing readers may still access the removed element, but new |
| readers cannot. |
| Finally, the red box represents completion |
| of the update, for example, a \co{kfree()} call to reclaim the removed |
| element's memory, with the color red indicating that no readers may |
| access the removed element. |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| rcu_read_lock(); |
| /* RCU read-side critical section. */ |
| rcu_read_unlock(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/RCUReaderGP}} |
| \caption{RCU Readers and Grace Periods} |
| \label{fig:advsync:RCU Readers and Grace Periods} |
| \end{figure} |
| |
| This color coding indicates why it is forbidden for Reader~8 to |
| begin before the grace period starts, but persist until after the |
| grace period has ended. |
| This situation might permit Reader~8 to still be holding a reference |
| to the removed element, which could in turn result in memory corruption |
| failures. |
| |
| So what happens if a reader executes for longer than planned? |
| |
| The grace period extends in this case for as long as necessary to |
| accommodate any pre-existing reader, as shown in |
| Figure~\ref{fig:advsync:RCU Readers Extending Grace Periods}. |
| Note, however, that the grace period need not extend to cover Readers~2 |
| and 8, since these two readers started after the beginning of the |
| grace period, and therefore cannot possibly have gained a reference |
| to the removed element. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/RCUReaderGPExtends}} |
| \caption{RCU Readers Extending Grace Period} |
| \label{fig:advsync:RCU Readers Extending Grace Periods} |
| \end{figure} |
| |
| The next sections show how these concepts can be applied to linked-list |
| data structures. |
| |
| \subsection{Deletion From an RCU-Protected Linked List} |
| \label{sec:advsync:Deletion From an RCU-Protected Linked List} |
| |
| Figure~\ref{fig:advsync:RCU Deletion From Linked List} |
| illustrates deletion from an RCU-protected linked list, |
| with time advancing from top to bottom. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/RCUDeletion}} |
| \caption{RCU Deletion From Linked List} |
| \label{fig:advsync:RCU Deletion From Linked List} |
| \end{figure} |
| |
| Initially, the list contains elements~A, B, and C, in that order. |
| RCU readers traverse the list concurrently with the update, |
| perhaps executing code like the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| rcu_read_lock(); |
| list_for_each_entry_rcu(p, h, le) |
| if (p->key == k) |
| break; |
| present = p != NULL; |
| rcu_read_unlock(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The \co{rcu_read_lock()} and \co{rcu_read_unlock()} primitives |
| delimit the RCU read-side critical section, preventing a grace |
| period from elapsing during this section of code. |
| Note that a grace period that commenced prior to the \co{rcu_read_lock()} |
| might complete during this critical section, but that a grace period |
| that begins after the \co{rcu_read_lock()} must not complete until |
| after completion of the \co{rcu_read_unlock()}. |
| |
| Updaters might be executing code like the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| /* p == element to delete. */ |
| down(&update_sema); |
| list_del_rcu(p); |
| up(&update_sema); |
| synchronize_rcu(); |
| kfree(p); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The \co{down()} and \co{up()} Linux-kernel primitives act as sleeplock, |
| with \co{down()} acquiring the sleeplock ``\co{update_sema}'' |
| and \co{up()} releasing it. |
| |
| Note that both the \co{rcu_read_lock()} and \co{rcu_read_unlock()} |
| primitives have $O\left(1\right)$ execution time, and in particular, that |
| neither of them block or spin. |
| This means that the update code has no way of excluding readers. |
| |
| Let us now walk through the execution of the update code |
| while referring to |
| Figure~\ref{fig:advsync:RCU Deletion From Linked List}. |
| A \co{list_del_rcu()} removes element~B from the list, so that |
| pre-existing readers might still be referencing |
| element~B, but new readers will advance from A directly to C. |
| Element~B's pointers remain intact, but so that readers still |
| referencing element~B will still advance to element~C. |
| A \co{synchronize_kernel()} waits for a grace period to elapse, |
| so that all readers that might have been referencing element~B have |
| advanced through the list. |
| Because new readers can no longer gain a reference to element~B, |
| it is now safe to execute \co{kfree()} to free up element~B's memory. |
| |
| The \co{synchronize_rcu()} primitive can be thought of as a temporal |
| mutual-exclusion primitive, since it prevents the following \co{kfree()} |
| from executing until after any pre-existing readers complete. |
| The \co{kfree()} might well run concurrently with \emph{new} |
| RCU read-side critical sections, but it is guaranteed not to run |
| until after all \emph{old} RCU read-side critical sections complete. |
| |
| The next section describes in-place update of an RCU-protected |
| linked list, the procedure that gave RCU its name. |
| |
| \subsection{Replacement in an RCU-Protected Linked List} |
| \label{sec:advsync:Replacement in an RCU-Protected Linked List} |
| |
| Figure~\ref{fig:advsync:RCU Replacement in Linked List} |
| shows how an RCU-protected data structure may be updated via |
| copying. |
| The read-side code is the same as for the previous example, but |
| the update code is outlined as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| down(&update_sema); |
| /* p == element to update */ |
| q = kmalloc(sizeof(*q), GFP_KERNEL); |
| *q = *p; |
| list_replace_rcu(p, q); |
| up(&update_sema); |
| synchronize_rcu(); |
| kfree(p); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| As before, we start with a linked list with elements~A, B, and C |
| in that order, and, again as before, readers might be traversing |
| this list throughout the update process. |
| To update element~B, we first allocate a new element and copy element~B |
| to it, then update the copy to produce element~B'. |
| We then execute \co{list_replace_rcu()} so that element~A now |
| references the new element~B'---however, element~B still references |
| element~C so that any pre-existing readers still referencing old element~B |
| are still able to advance to element~C. |
| New readers will find element~B'. |
| |
| After executing \co{synchronize_rcu()}, which again waits for a grace |
| period to elapse, all readers that were referencing old element~B' |
| will have advanced through the list, so that \co{kfree()} may now |
| be invoked on element~B, resulting in the final state with the list |
| now containing elements~A, B', and C. |
| |
| This procedure where \emph{readers} continue traversing the list |
| while a \emph{copy} operation is used to carry out an \emph{update} |
| is what gives RCU---or read-copy update---its name. |
| |
| \begin{figure}[p] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/RCUReplacement}} |
| \caption{RCU Replacement in Linked List} |
| \label{fig:advsync:RCU Replacement in Linked List} |
| \end{figure} |
| |
| \subsection{Insertion into an RCU-Protected Linked List} |
| \label{sec:advsync:Insertion into an RCU-Protected Linked List} |
| |
| Insertion into an RCU-protected linked list is straightforward, as |
| illustrated by |
| Figure~\ref{fig:advsync:RCU Insertion into Linked List}. |
| The read-side code is again that shown in |
| Section~\ref{sec:advsync:Deletion From an RCU-Protected Linked List}, |
| and the update code might be as follows, where ``h'' is the list header: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| down(&update_sema); |
| p = kmalloc(sizeof(*p), GFP_KERNEL); |
| initialize(p); |
| list_add_rcu(p, h); |
| up(&update_sema); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Referring to |
| Figure~\ref{fig:advsync:RCU Insertion into Linked List}, |
| the \co{kmalloc()} primitive and the programmer-supplied \co{initialize()} |
| function allocate and initialize the new element~B, |
| and the \co{list_add_rcu()} primitive inserts it into the list. |
| There is no need for grace periods in this case: readers will either |
| see the new element~B or not, depending on when they execute relative |
| to the update. |
| |
| \begin{figure}[htb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{advsync/RCUInsertion}} |
| \caption{RCU Insertion into Linked List} |
| \label{fig:advsync:RCU Insertion into Linked List} |
| \end{figure} |
| |
| \subsection{RCU API} |
| \label{sec:advsync:RCU API} |
| |
| RCU can be best thought of as an API, particularly given the large number |
| of implementations available for the RCU API. |
| Figure~\ref{fig:advsync:RCU API} |
| shows a color-coded guide to the major members of the Linux-kernel |
| RCU API, with the read-side primitives in blue boxes, the update-initiation |
| primitives in the green box, the grace-period primitives in the yellow box, |
| and the update-completion primitives in the red box. |
| |
| \begin{figure}[htb] |
| \centering |
| \includegraphics{advsync/RCU-API} |
| \caption{RCU API} |
| \label{fig:advsync:RCU API} |
| \end{figure} |
| |
| As noted earlier, \co{rcu_read_lock()} and \co{rcu_read_unlock()} |
| delimit RCU read-side critical sections. |
| The \co{list_for_each_entry_rcu(p,h,f)} primitive is an iterator that |
| traverses the RCU-protected linked list with header ``h'' threaded |
| through struct field ``f'' using |
| local variable ``p'' as the iteration pointer. |
| The \co{rcu_dereference()} primitive is used to traverse RCU-protected |
| pointers, as in \co{rcu_dereference(p)->next}. |
| |
| The \co{list_add_rcu()}, \co{list_del_rcu()}, and |
| \co{list_replace_rcu()} primitives perform the expected linked-list |
| operations, as illustrated in the examples in the preceding sections. |
| The \co{rcu_assign_pointer()} primitive assigns a new value to an |
| RCU-protected field, for example, \co{rcu_assign_pointer(p->next,q)} |
| would add the structure referenced by ``q'' following that referenced |
| by ``p''. |
| |
| The \co{synchronize_rcu()} primitive waits for an RCU grace period |
| to elapse, as see in the examples in the previous sections. |
| The \co{call_rcu()} primitive causes a function to be invoked |
| at the end of a subsequent grace period, so that \co{call_rcu(p,f)} |
| would cause \co{f(p)} to be invoked after a grace period. |
| The \co{call_rcu()} primitive is thus the continuation form of the |
| \co{synchronize_rcu()} primitive. |
| |
| Finally note that \co{kfree()} is simply an example, as any primitive |
| that frees memory can be used to complete an update. |
| |
| The RCU API in the Linux kernel is somewhat more elaborate in order to |
| allow for the kernel's variety of different contexts and environments, |
| including interrupt and NMI handlers. |
| |
| \QuickQuiz{} |
| Why is an \co{rcu_dereference()} primitive required? |
| \QuickQuizAnswer{ |
| The \co{rcu_dereference()} primitive provides the memory |
| barrier required by the DEC Alpha. |
| Without such a memory barrier, the Alpha can dereference |
| the new pointer and find pre-initialization |
| data~\cite{Compaq01,WilliamPugh2000Gharachorloo}. |
| Perhaps more importantly, it serves to document which |
| pointers are are protected by RCU. |
| But most important of all, it disables compiler optimizations |
| that might otherwise result in double fetches of the |
| argument to \co{rcu_dereference()}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why is a separate \co{list_for_each_entry_rcu()} required? |
| In other words, why can't the existing |
| \co{list_for_each_entry()} primitive be used instead? |
| \QuickQuizAnswer{ |
| The \co{list_for_each_entry_rcu()} primitive includes the |
| use of \co{rcu_dereference()} required by the DEC Alpha. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why would one need to use \co{rcu_assign_pointer(p->next,q)} |
| instead of the straightforward \co{p->next=q}? |
| \QuickQuizAnswer{ |
| On weakly consistent machines, a memory barrier is required |
| between the initialization of a data structure and the |
| assignment of the corresponding pointer that allows other |
| CPUs to access the data structure. |
| Without such a memory barrier, these other CPUs might |
| see the assignments out of order, thus seeing the preinitialization |
| contents of the data structure. |
| The \co{rcu_assign_pointer()} primitive includes such a |
| memory barrier on any architectures requiring it. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why are there separate \co{list_add_rcu()}, |
| \co{list_del_rcu()}, |
| and \co{list_replace_rcu()} primitives? |
| Why couldn't the existing \co{list_add()}, \co{list_del()}, |
| and \co{list_replace()} primitives be used instead? |
| \QuickQuizAnswer{ |
| The \co{list_add_rcu()} and \co{list_replace_rcu()} primitives |
| must use the \co{rcu_assign_pointer()} primitive |
| (or, alternatively, explicit memory barriers). |
| The \co{list_del_rcu()} primitive must omit the ``poisoning'' |
| of list pointers that is used by \co{list_del()} as a |
| debug check. |
| } \QuickQuizEnd |
| |
| \subsection{RCU Properties} |
| \label{sec:advsync:RCU Properties} |
| |
| The following are important properties of RCU, some of which are |
| beneficial, some adverse, and others dependent on the situation. |
| |
| \begin{enumerate} |
| \item Read-side primitives have very fast and deterministic |
| execution time. |
| \item Grace-period-detection primitives can have rather large |
| and non-deterministic overheads. |
| \item Read-side primitives cannot participate in a deadlock cycle. |
| \item Read-side primitives cannot participate in a priority-inversion |
| scenarios. |
| \item When updates are guarded by locking, read-side critical |
| sections can unconditionally acquire the update-side lock. |
| \item Read-side critical sections run concurrently with updates. |
| Without this read/update concurrency, the read-side primitives could |
| not possibly be fast and deterministic, however, some algorithms |
| do not straightforwardly tolerate such concurrency. |
| \item Some implementations of the \co{call_rcu()} primitive |
| can place very low overhead on the thread invoking |
| \co{call_rcu()}, although the large overhead of grace-period |
| detection is merely deferred, rather than avoided, in this |
| case. |
| \end{enumerate} |
| |
| RCU is therefore attractive in read-mostly situations where either |
| good performance and scalability or realtime response are required. |
| These properties will be discussed at greater length @@@. |
| |
| @@@@ move RCU usage to its own chapter? |
| |
| \begin{enumerate} |
| \item Pure RCU~(\ref{sec:advsync:Pure RCU}). |
| \item Reader-Writer-Lock/RCU |
| Analogy~(\ref{sec:advsync:Reader-Writer-Lock/RCU Analogy}). |
| \item RCU Existence Locks(\ref{sec:advsync:RCU Existence Locks}). |
| \item RCU Readers With NBS |
| Writers~(\ref{sec:advsync:RCU Readers With NBS Writers}). |
| \end{enumerate} |
| |
| \subsection{Pure RCU} |
| \label{sec:advsync:Pure RCU} |
| |
| Pure RCU is used to effect a lazy change in mode, for example, preventing |
| any new threads from accessing a subsystem, then waiting for all |
| pre-existing threads to complete before tearing it down. |
| |
| \emph{@@@ Focus on change-in-mode uses. (One can argue that the interrupt/NMI |
| uses are more closely related to reader-writer-lock/RCU analogy.) |
| Potential example: RCU guarding a variable prohibiting additional reference |
| counts for cleanup of the corresponding data structure.} |
| |
| \subsection{Reader-Writer-Lock/RCU Analogy} |
| \label{sec:advsync:Reader-Writer-Lock/RCU Analogy} |
| |
| \subsection{RCU Existence Locks} |
| \label{sec:advsync:RCU Existence Locks} |
| |
| \subsection{RCU Readers With NBS Writers} |
| \label{sec:advsync:RCU Readers With NBS Writers} |
| |
| \emph{@@@ simple intro, along with performance data on example from SMPdesign.} |
| |
| \emph{@@@ consider moving advanced RCU usage to a separate chapter. |
| Add more material from chapter 5 of dissertation, particularly transformational |
| patterns. |
| \url{http://www.rdrop.com/users/paulmck/RCU/RCUdissertation.2004.07.14e1.pdf}.} |
| |
| \emph{@@@ consider creating an RCU applications chapter/appendix, with |
| information from chapter 6 of dissertation.} |