| % defer/rcuAPI.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \subsection{RCU Linux-Kernel API} |
| \label{sec:defer:RCU Linux-Kernel API} |
| \OriginallyPublished{sec:defer:RCU Linux-Kernel API}{RCU Linux-Kernel API}{Linux Weekly News}{PaulEMcKenney2008WhatIsRCUAPI} |
| |
| This section looks at RCU from the viewpoint of its Linux-kernel API\@.\footnote{ |
| Userspace RCU's API is documented |
| elsewhere~\cite{PaulMcKenney2013LWNURCU}.} |
| \Cref{sec:defer:RCU has a Family of Wait-to-Finish APIs} |
| presents RCU's wait-to-finish APIs, |
| \cref{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs} |
| presents RCU's publish-subscribe and version-maintenance APIs, |
| \cref{sec:defer:RCU has List-Processing APIs} |
| presents RCU's list-processing APIs, |
| \cref{sec:defer:RCU Has Diagnostic APIs} |
| presents RCU's diagnostic APIs, and |
| \cref{sec:defer:Where Can RCU's APIs Be Used?} |
| describes in which contexts RCU's various APIs may be used. |
| Finally, |
| \cref{sec:defer:So; What is RCU Really?} |
| presents concluding remarks. |
| |
| Readers who are not excited about kernel internals may wish to skip |
| ahead to \cref{sec:defer:RCU Usage} |
| on \cpageref{sec:defer:RCU Usage}, |
| but preferably after reviewing the next section covering software-engineering |
| considerations. |
| |
| \subsubsection{RCU API and Software Engineering} |
| \label{sec:defer:RCU API and Software Engineering} |
| |
| Readers who have looked ahead to |
| \cref{tab:defer:RCU Wait-to-Finish APIs,% |
| tab:defer:RCU Publish-Subscribe and Version Maintenance APIs,% |
| tab:defer:RCU-Protected List APIs,% |
| tab:defer:RCU Diagnostic APIs} |
| might have noted that the full list of Linux-kernel APIs sports more |
| than 100 members. |
| This is in sharp (and perhaps dismaying) contrast to the mere six |
| API members shown in |
| \cref{tab:defer:Core RCU API}. |
| This situation clearly raises the question ``Why so many???'' |
| |
| This question is answered more thoroughly in the following sections, |
| but in the meantime the rest of this section summarizes the motivations. |
| |
| There is a wise old saying to the effect of ``To err is human.'' |
| This means that purpose of a significant fraction of the RCU API is to |
| provide diagnostics, most notably in \cref{tab:defer:RCU Diagnostic APIs}, |
| but elsewhere as well. |
| |
| Important causes of human error are the limits of the human brain, |
| for example, the limited capacity of short-term memory. |
| The toy examples shown in this book do not stress these limits. |
| This is out of necessity: |
| Many readers push their cognitive limits while learning new material, |
| so the examples need to be kept simple. |
| |
| These examples therefore keep \co{rcu_dereference()} invocations |
| in the same function as the enclosing \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} calls. |
| In contrast, real-world software must frequently invoke these API members |
| from different functions, and even from different translation units. |
| The Linux kernel RCU API has therefore expanded to accommodate lockdep, |
| which allows \co{rcu_dereference()} and friends to complain if it is |
| not protected by \co{rcu_read_lock()}. |
| Linux-kernel RCU also checks for some double-free errors, infinite |
| loops in RCU read-side critical sections, and attempts to invoke |
| quiescent states within RCU read-side critical sections. |
| |
| Another way that real-world software accommodates the limits of human |
| cognition is through abstraction. |
| The Linux-kernel API therefore includes members that operate on lists |
| in addition to the pointer-oriented core API of |
| \cref{tab:defer:Core RCU API}. |
| The Linux kernel itself also provides RCU-protected hash tables and |
| search trees. |
| |
| Operating-systems kernels such as Linux operate near the bottom of the |
| ``iron triangle'' of the software stack shown in |
| \cref{fig:intro:Software Layers and Performance; Productivity; and Generality}, |
| where performance is critically important. |
| There are thus specialized variants of a number of RCU APIs for use on |
| fastpaths, for example, as discussed in |
| \cref{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs}, |
| \co{RCU_INIT_POINTER()} may be used in |
| place of \co{rcu_assign_pointer()} in cases where the RCU-protected pointer |
| is being assigned to \co{NULL} or when that pointer is not yet accessible |
| by readers. |
| Use of \co{RCU_INIT_POINTER()} allows the compiler more leeway in |
| selecting instructions and carrying out optimizations, thus increasing |
| performance. |
| |
| On the other hand, when used incorrectly \co{RCU_INIT_POINTER()} can |
| result in silent memory corruption, so please be careful! |
| Yes, in some cases, the kernel can check for inappropriate use of |
| RCU API members from a given kernel context, but the constraints of |
| \co{RCU_INIT_POINTER()} use are not yet checkable. |
| |
| Finally, within the Linux kernel, the aforementioned limits of human |
| cognition are compounded by the variety and severity of workloads running |
| on Linux. |
| As of v5.16, this has given rise to no fewer than five flavors of |
| RCU, each designed to provide different performance, scalability, |
| response-time, and energy efficiency tradeoffs to RCU readers and writers. |
| These RCU flavors are the subject of the next section. |
| |
| \subsubsection{RCU has a Family of Wait-to-Finish APIs} |
| \label{sec:defer:RCU has a Family of Wait-to-Finish APIs} |
| |
| The most straightforward answer to ``what is RCU'' is that RCU is |
| an API\@. |
| For example, the RCU implementation used in the Linux kernel is |
| summarized by |
| \cref{tab:defer:RCU Wait-to-Finish APIs}, |
| which shows the wait-for-readers portions of the RCU, ``sleepable'' RCU |
| (SRCU), Tasks RCU, and generic APIs, respectively, |
| and by |
| \cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs}, |
| which shows the publish-subscribe portions of the |
| API~\cite{PaulEMcKenney2024RCUAPI}.\footnote{ |
| This citation covers v6.10 and later. |
| Documetation for earlier versions of the Linux-kernel RCU API may |
| be found elsewhere~\cite{PaulEMcKenney2008WhatIsRCUAPI,PaulEMcKenney2014RCUAPI,PaulEMcKenney2019RCUAPI}.} |
| |
| \begin{sidewaystable*}[tbp] |
| \rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.3} |
| \centering |
| \caption{RCU Wait-to-Finish APIs} |
| \label{tab:defer:RCU Wait-to-Finish APIs} |
| \scriptsize\IfEbookSize{\hspace*{-1.8in}}{\hspace*{-.125in}} |
| \ebresizeverb{0.7}{ |
| \begin{tabularx}{8.5in}{>{\raggedright\arraybackslash}p{0.94in} |
| >{\raggedright\arraybackslash}X |
| >{\raggedright\arraybackslash}X |
| >{\raggedright\arraybackslash}p{1.1in} |
| >{\raggedright\arraybackslash}p{1.35in} |
| >{\raggedright\arraybackslash}p{1.45in}} |
| \toprule |
| & |
| {\bf RCU}: Original & |
| {\bf SRCU}: Sleeping readers & |
| {\bf Tasks RCU}: Free tracing trampolines & |
| {\bf Tasks RCU Rude}: Free idle-task tracing trampolines & |
| {\bf Tasks RCU Trace}: Protect sleepable BPF programs \\ |
| \midrule |
| {\bf Initialization and Cleanup} & |
| & |
| \tco{DEFINE_SRCU()} \tco{DEFINE_STATIC_SRCU()} |
| \tco{init_srcu_struct()} \tco{cleanup_srcu_struct()} & |
| & |
| & |
| \\ |
| {\bf Read-side critical-section markers} & |
| \tco{rcu_read_lock()}~! \tco{rcu_read_unlock()}~! |
| \tco{rcu_read_lock_bh()} \tco{rcu_read_unlock_bh()} |
| \tco{rcu_read_lock_sched()} \tco{rcu_read_unlock_sched()} |
| (Plus anything disabing bottom halves, preemption, or interrupts.) & |
| \tco{srcu_read_lock()} \tco{srcu_read_unlock()} & |
| Voluntary context switch & |
| Voluntary context switch and preempt-enable regions of code & |
| \tco{rcu_read_lock_trace()} \tco{rcu_read_unlock_trace()} \\ |
| {\bf Update-side primitives (synchronous) } & |
| { \tco{synchronize_rcu()} |
| \tco{synchronize_net()} |
| \tco{synchronize_rcu_expedited()} } & |
| \tco{synchronize_srcu()} \tco{synchronize_srcu_expedited()} & |
| \tco{synchronize_rcu_tasks()} & |
| \tco{synchronize_rcu_tasks_rude()} & |
| \tco{synchronize_rcu_tasks_trace()} \\ |
| {\bf Update-side primitives (asynchronous / callback) } & |
| \tco{call_rcu()} ! & |
| \tco{call_srcu()} & |
| \tco{call_rcu_tasks()} & |
| \tco{call_rcu_tasks_rude()} & |
| \tco{call_rcu_tasks_trace()} \\ |
| {\bf Update-side primitives (wait for callbacks) } & |
| \tco{rcu_barrier()} & |
| \tco{srcu_barrier()} & |
| \tco{rcu_barrier_tasks()} & |
| \tco{rcu_barrier_tasks_rude()} & |
| \tco{rcu_barrier_tasks_trace()} \\ |
| {\bf Update-side primitives (initiate / wait)} & |
| \tco{get_state_synchronize_rcu()} |
| \tco{cond_synchronize_rcu()} & |
| & |
| & |
| & |
| \\ |
| {\bf Update-side primitives (free memory) } & |
| \tco{kfree_rcu()} & |
| & |
| & |
| & |
| \\ |
| {\bf Type-safe memory } & |
| \tco{SLAB_TYPESAFE_BY_RCU} & |
| & |
| & |
| & |
| \\ |
| {\bf Read side constraints } & |
| No blocking (only preemption) & |
| No \tco{synchronize_srcu()} with same \tco{srcu_struct} & |
| No voluntary context switch & |
| Neither blocking nor preemption & |
| No RCU tasks trace grace period \\ |
| {\bf Read side overhead } & |
| CPU-local accesses (\tco{barrier()} on \tco{PREEMPT=n}) & |
| Simple instructions, memory barriers & |
| Free & |
| CPU-local accesses (free on \tco{PREEMPT=n}) & |
| CPU-local accesses \\ |
| {\bf Asynchronous update-side overhead } & |
| sub-microsecond & |
| sub-microsecond & |
| sub-microsecond & |
| sub-microsecond & |
| sub-microsecond \\ |
| {\bf Grace-period latency } & |
| 10s of milliseconds & |
| Milliseconds & |
| Seconds & |
| Milliseconds & |
| 10s of milliseconds \\ |
| {\bf Expedited grace-period latency } & |
| 10s of microseconds & |
| Microseconds & |
| N/A & |
| N/A & |
| N/A \\ |
| \bottomrule |
| \end{tabularx} |
| } |
| \end{sidewaystable*} |
| |
| If you are new to RCU, you might consider focusing on just one |
| of the columns in |
| \cref{tab:defer:RCU Wait-to-Finish APIs}, |
| each of which summarizes one member of the Linux kernel's RCU API family. |
| For example, if you are primarily interested in understanding how RCU |
| is used in the Linux kernel, ``RCU'' would be the place to start, |
| as it is used most frequently. |
| On the other hand, if you want to understand RCU for its own sake, |
| ``Tasks RCU'' has the simplest API\@. |
| You can always come back for the other columns later. |
| |
| If you are already familiar with RCU, these tables can |
| serve as a useful reference. |
| |
| \QuickQuiz{ |
| Why do some of the cells in |
| \cref{tab:defer:RCU Wait-to-Finish APIs} |
| have exclamation marks (``!'')? |
| }\QuickQuizAnswer{ |
| The API members with exclamation marks (\co{rcu_read_lock()}, |
| \co{rcu_read_unlock()}, and \co{call_rcu()}) were the |
| only members of the Linux RCU API that Paul E. McKenney was aware |
| of back in the mid-90s. |
| During this timeframe, he was under the mistaken impression that |
| he knew all that there is to know about RCU\@. |
| }\QuickQuizEnd |
| |
| The ``RCU'' column corresponds to the consolidation of the |
| three Linux-kernel RCU |
| implementations~\cite{PaulEMcKenney2019RCUCVE,McKenney:2019:CRS:3319647.3325836}, |
| in which RCU read-side critical sections start with |
| \apik{rcu_read_lock()}, \apik{rcu_read_lock_bh()}, or \apik{rcu_read_lock_sched()} |
| and end with \apik{rcu_read_unlock()}, \apik{rcu_read_unlock_bh()}, |
| or \apik{rcu_read_unlock_sched()}, respectively. |
| Any region of code that disables bottom halves, interrupts, or preemption |
| also acts as an RCU read-side critical section. |
| RCU read-side critical sections may be nested. |
| The corresponding synchronous update-side primitives, |
| \apik{synchronize_rcu()} and \apik{synchronize_rcu_expedited()}, along with |
| their synonym \apik{synchronize_net()}, wait for any type of currently |
| executing RCU read-side critical sections to complete. |
| The length of this wait is known as a ``\IX{grace period}'', and |
| \apik{synchronize_rcu_expedited()} is designed to reduce \IXh{grace-period} |
| {latency} at the expense of increased CPU overhead and IPIs. |
| The asynchronous update-side primitive, \apik{call_rcu()}, |
| invokes a specified function with a specified argument after a |
| subsequent grace period. |
| For example, \co{call_rcu(p,f);} will result in |
| the ``RCU callback'' \co{f(p)} |
| being invoked after a subsequent grace period. |
| There are situations, |
| such as when unloading a Linux-kernel module that uses \co{call_rcu()}, |
| when it is necessary to wait for all |
| outstanding RCU callbacks to complete~\cite{PaulEMcKenney2007rcubarrier}. |
| The \apik{rcu_barrier()} primitive does this job. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How do you prevent a huge number of RCU read-side critical |
| sections from indefinitely blocking a \co{synchronize_rcu()} |
| invocation? |
| }\QuickQuizAnswerB{ |
| There is no need to do anything to prevent RCU read-side |
| critical sections from indefinitely blocking a |
| \co{synchronize_rcu()} invocation, because the |
| \co{synchronize_rcu()} invocation need wait only for |
| \emph{pre-existing} RCU read-side critical sections. |
| So as long as each RCU read-side critical section is |
| of finite duration, RCU grace periods will also remain |
| finite. |
| }\QuickQuizEndB |
| % |
| \QuickQuizM{ |
| The \co{synchronize_rcu()} API waits for all pre-existing |
| interrupt handlers to complete, right? |
| }\QuickQuizAnswerM{ |
| In v4.20 and later Linux kernels, |
| yes~\cite{PaulEMcKenney2019RCUCVE,McKenney:2019:CRS:3319647.3325836}. |
| |
| But not in earlier kernels, and especially not when using |
| preemptible RCU\@! |
| You instead want \apik{synchronize_irq()}. |
| Alternatively, you can place calls to \co{rcu_read_lock()} |
| and \co{rcu_read_unlock()} in the specific interrupt handlers that |
| you want \co{synchronize_rcu()} to wait for. |
| But even then, be careful, as preemptible RCU will not be guaranteed |
| to wait for that portion of the interrupt handler preceding the |
| \co{rcu_read_lock()} or following the \co{rcu_read_unlock()}. |
| }\QuickQuizEndM |
| % |
| \QuickQuizM{ |
| Given that any region of code that disables interrupts acts as |
| an RCU read-side critical section, do atomic operations and |
| single machine instructions also act as tiny RCU read-side |
| critical sections? |
| }\QuickQuizAnswerM{ |
| Yes, they do. |
| |
| But if you are tempted to use this trick, think again, and harder |
| this time. |
| |
| And if you have absolutely no alternative to using this trick, |
| then carefully document your use of it. |
| Otherwise, everyone reading the code (yourself included) will |
| hate you for it. |
| |
| Why so much hate? |
| |
| To see why, keep in mind that the big advantage of Jack |
| Slingwine's and my invention of RCU was that we marked the |
| RCU readers. |
| Take it from all the people who independently invented something |
| kind of like RCU, but failed to mark their read-side critical |
| sections: |
| You really do need to mark your RCU readers! |
| Including your single-instruction RCU readers. |
| }\QuickQuizEndM |
| % |
| \QuickQuizE{ |
| What is the difference between \co{synchronize_rcu()} and |
| \co{rcu_barrier()}? |
| }\QuickQuizAnswerE{ |
| They wait on different things. |
| While \co{synchronize_rcu()} waits for pre-existing RCU read-side |
| critical sections to complete, \co{rcu_barrier()} instead waits |
| for callbacks from prior calls to \co{call_rcu()} to be invoked. |
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.2} |
| \centering |
| \small |
| \tcresizewidth{ |
| \begin{fcvref}[ln:defer:synchonize-rcu vs rcu-barrier] |
| \begin{tabular}{lll} |
| \toprule |
| & \multicolumn{2}{c}{Must Wait Until (in the snippet):} \\ |
| \cmidrule{2-3} |
| \multicolumn{1}{c}{Invoked at:} & \multicolumn{1}{c}{\tco{synchronize_rcu()}} |
| & \multicolumn{1}{c}{\tco{rcu_barrier()}} \\ |
| \cmidrule(r){1-1} \cmidrule{2-3} |
| \tco{do_something_1()} & & \\ |
| \tco{do_something_2()} & \tco{rcu_read_unlock()} (\clnref{rrul}) & \\ |
| \tco{do_something_3()} & \tco{rcu_read_unlock()} (\clnref{rrul}) |
| & \tco{f(&p->rh)} (\clnref{cb}) \\ |
| \tco{do_something_4()} & & \tco{f(&p->rh)} (\clnref{cb}) \\ |
| \tco{do_something_5()} & & \\ |
| \bottomrule |
| \end{tabular} |
| \end{fcvref} |
| } |
| |
| \begin{fcvlabel}[ln:defer:synchonize-rcu vs rcu-barrier] |
| \begin{VerbatimN}[commandchars=\\\@\$] |
| do_something_1(); \lnlbl@ds1$ |
| rcu_read_lock(); \lnlbl@rrl$ |
| do_something_2(); \lnlbl@ds2$ |
| call_rcu(&p->rh, f); \lnlbl@cr$ |
| do_something_3(); \lnlbl@ds3$ |
| rcu_read_unlock(); \lnlbl@rrul$ |
| do_something_4(); \lnlbl@ds4$ |
| // f(&p->rh) invoked \lnlbl@cb$ |
| do_something_5(); \lnlbl@ds5$ |
| \end{VerbatimN} |
| \end{fcvlabel} |
| \caption{\tco{synchronize_rcu()} vs. \tco{rcu_barrier()}} |
| \label{tab:defer:synchonize-rcu vs rcu-barrier} |
| \end{table} |
| |
| This distinction is illustrated by |
| \cref{tab:defer:synchonize-rcu vs rcu-barrier}, in which |
| the snippet at the bottom shows code being executed by a given CPU\@. |
| For simplicity, assume that no other CPU is executing |
| \co{rcu_read_lock()}, \co{rcu_read_unlock()}, or |
| \co{call_rcu()}. |
| |
| The table shows how long each primitive must wait if invoked |
| concurrently with each of the \co{do_something_*()} |
| functions, with empty cells indicating that no |
| waiting is necessary. |
| As you can see, \co{synchronize_rcu()} need not wait unless |
| it is in an RCU read-side critical section, in which case |
| it must wait for the \co{rcu_read_unlock()} that ends that |
| critical section. |
| In contrast, RCU read-side critical sections have no effect |
| on \co{rcu_barrier()}. |
| However, when \co{rcu_barrier()} executes after a |
| \co{call_rcu()} invocation, it must wait until the |
| corresponding RCU callback is invoked. |
| |
| All that said, there is a special case where each call to |
| \co{rcu_barrier()} can be replaced by a direct call to |
| \co{synchronize_rcu()}, and that is where \co{synchronize_rcu()} |
| is implemented in terms of \co{call_rcu()} and where there is |
| a single global list of callbacks. |
| But please do not do this in portable code!!! |
| }\QuickQuizEndE |
| } |
| |
| Finally, RCU may be used to provide |
| \IX{type-safe memory}~\cite{Cheriton96a}, as described in |
| \cref{sec:defer:Type-Safe Memory}. |
| In the context of RCU, type-safe memory guarantees that a given |
| data element will not change type during any RCU read-side critical section |
| that accesses it. |
| To make use of RCU-based type-safe memory, pass |
| \apik{SLAB_TYPESAFE_BY_RCU} to \apik{kmem_cache_create()}. |
| |
| The ``SRCU'' column in |
| \cref{tab:defer:RCU Wait-to-Finish APIs} |
| displays a specialized RCU API that permits general sleeping in SRCU |
| read-side critical |
| sections~\cite{PaulEMcKenney2006c} |
| delimited by \apik{srcu_read_lock()} and \apik{srcu_read_unlock()}. |
| However, unlike RCU, SRCU's \apik{srcu_read_lock()} returns a value that |
| must be passed into the corresponding \apik{srcu_read_unlock()}. |
| This difference is due to the fact that the SRCU user allocates an |
| \apik{srcu_struct} for each distinct SRCU usage, so that there is no |
| convenient place to store a per-task reader-nesting count. |
| (Keep in mind that although the Linux kernel provides dynamically |
| allocated per-CPU storage, there is not yet dynamically allocated |
| per-task storage.) |
| |
| A given \co{srcu_struct} structure may be defined as a global |
| variable with \co{DEFINE_SRCU()} if the structure must be used in |
| multiple translation units, or with \co{DEFINE_STATIC_SRCU()} otherwise. |
| For example, \co{DEFINE_SRCU(my_srcu)} would create a global variable |
| named \co{my_srcu} that could be used by any file in the program. |
| Alternatively, an \co{srcu_struct} structure may be either an on-stack |
| variable or a dynamically allocated region of memory. |
| In both of these non-global-variable cases, the memory must be initialized |
| using \co{init_srcu_struct()} prior to its first use and cleaned up |
| using \co{cleanup_srcu_struct()} after its last use (but before the |
| underlying storage disappears). |
| |
| However they are created, these distinct \co{srcu_struct} structures |
| prevent SRCU read-side critical sections from blocking unrelated |
| \apik{synchronize_srcu()} and \apik{synchronize_srcu_expedited()} |
| invocations. |
| Of course, use of either \apik{synchronize_srcu()} or |
| \apik{synchronize_srcu_expedited()} within an SRCU read-side critical |
| section can result in self-deadlock, so should be avoided. |
| As with RCU, SRCU's \co{synchronize_srcu_expedited()} decreases |
| grace-period latency compared to \co{synchronize_srcu()}, but at |
| the expense of increased CPU overhead. |
| |
| \QuickQuiz{ |
| Under what conditions can \co{synchronize_srcu()} be safely |
| used within an SRCU read-side critical section? |
| }\QuickQuizAnswer{ |
| In principle, you can use either \co{synchronize_srcu()} or |
| \co{synchronize_srcu_expedited()} with a given \co{srcu_struct} |
| within an SRCU read-side critical section that uses some other |
| \co{srcu_struct}. |
| In practice, however, doing this is almost certainly a bad idea. |
| In particular, the code shown in |
| \cref{lst:defer:Multistage SRCU Deadlocks} |
| could still result in deadlock. |
| |
| \begin{listing} |
| \begin{VerbatimL} |
| idx = srcu_read_lock(&ssa); |
| synchronize_srcu(&ssb); |
| srcu_read_unlock(&ssa, idx); |
| |
| /* . . . */ |
| |
| idx = srcu_read_lock(&ssb); |
| synchronize_srcu(&ssa); |
| srcu_read_unlock(&ssb, idx); |
| \end{VerbatimL} |
| \caption{Multistage SRCU Deadlocks} |
| \label{lst:defer:Multistage SRCU Deadlocks} |
| \end{listing} |
| }\QuickQuizEnd |
| |
| Similar to normal RCU, self-deadlock can be avoided using the |
| asynchronous \apik{call_srcu()} function. |
| However, special care must be taken when using \co{call_srcu()} because |
| a single task could register SRCU callbacks very quickly. |
| Given that SRCU allows readers to block for arbitrary periods of |
| time, this could consume an arbitrarily large quantity of memory. |
| In contrast, given the synchronous \co{synchronize_srcu()} |
| interface, a given task must finish waiting for a given grace period |
| before it can start waiting for the next one. |
| |
| Also similar to RCU, there is an \apik{srcu_barrier()} function that waits |
| for all prior \co{call_srcu()} callbacks to be invoked. |
| |
| In other words, SRCU compensates for its extremely weak |
| forward-progress guarantees by permitting the developer to restrict |
| its scope. |
| |
| The ``Tasks RCU'' column in |
| \cref{tab:defer:RCU Wait-to-Finish APIs} displays a specialized |
| RCU API that mediates freeing of the trampolines used in Linux-kernel |
| tracing. |
| These trampolines are used to transfer control from a point in the |
| code being traced to the code doing the actual tracing. |
| It is of course necessary to ensure that all code executing within |
| a given trampoline has finished before freeing that trampoline. |
| |
| Changes to the code being traced are typically limited to a single jump |
| or call instruction, and thus cannot accommodate the sequence of code |
| required to implement \co{rcu_read_lock()} and \co{rcu_read_unlock()}. |
| Nor can the trampoline contain these calls to \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()}. |
| To see this, consider a CPU that is just about to start executing a |
| given trampoline. |
| Because it has not yet executed the \co{rcu_read_lock()}, that |
| trampoline could be freed at any time, which would come as a fatal |
| surprise to this CPU\@. |
| Therefore, trampolines cannot be protected by synchronization primitives |
| executed in either the traced code or in the trampoline itself. |
| Which does raise the question of exactly how the trampoline is to be |
| protected. |
| |
| The key to answering this question is to note that trampoline code |
| never contains code that either directly or indirectly does a |
| voluntary context switch. |
| This code might be preempted, but it will never directly or indirectly |
| invoke \apik{schedule()}. |
| This suggests a variant of RCU having voluntary context switches and |
| idle execution as its only quiescent states. |
| This variant is Tasks RCU\@. |
| |
| Tasks RCU is unusual in having no read-side marking functions, which |
| is good given that its main use case has nowhere to put such markings. |
| Instead, calls to \co{schedule()} serve directly as quiescent states. |
| Updates can use \apik{synchronize_rcu_tasks()} to wait for all pre-existing |
| trampoline execution to complete, or they can use its asynchronous |
| counterpart, \apik{call_rcu_tasks()}. |
| There is also an \apik{rcu_barrier_tasks()} that waits for completion |
| of callbacks corresponding to all prior invocations of \co{call_rcu_tasks()}. |
| There is no \co{synchronize_rcu_tasks_expedited()} because there has |
| not yet been a request for it, though implementing a useful variant of |
| it would not be free of challenges. |
| |
| \QuickQuiz{ |
| In a kernel built with \co{CONFIG_PREEMPT_NONE=y}, won't |
| \co{synchronize_rcu()} wait for all trampolines, given |
| that preemption is disabled and that trampolines never |
| directly or indirectly invoke \co{schedule()}? |
| }\QuickQuizAnswer{ |
| You are quite right! |
| |
| In fact, in nonpreemptible kernels, \co{synchronize_rcu_tasks()} |
| is a wrapper around \co{synchronize_rcu()}. |
| }\QuickQuizEnd |
| |
| The ``Tasks RCU Rude'' column provides a more effective variant |
| of the toy implementation presented in |
| \cref{sec:defer:Toy Implementation}. |
| This variant causes each CPU to execute a context switch, |
| so that any voluntary context switch or any preemptible region of |
| code can serve as a quiescent state. |
| The Tasks RCU Rude variant uses the Linux-kernel workqueues facility to |
| force concurrent context switches, in contrast to the serial |
| CPU-by-CPU approach taken by the toy implementation. |
| The API mirrors that of Tasks RCU, including the lack of explicit |
| read-side markers. |
| |
| Finally, the ``Tasks RCU Trace'' column provides an RCU implementation |
| with functionality similar to that of SRCU, except with much faster |
| read-side markers.\footnote{ |
| And thus is unusual for the Tasks RCU family for having |
| explicit read-side markers!} |
| However, this speed is a consequence of the fact that these markers |
| do not execute memory-barrier instructions, which means that Tasks RCU |
| Trace grace periods must often send IPIs to all CPUs and must always |
| scan the entire task list, thus degrading real-time response and |
| consuming considerable CPU time. |
| Nevertheless, in the absence of readers, the resulting grace-period |
| latency is reasonably short, rivaling that of RCU\@. |
| |
| \subsubsection{RCU has Publish-Subscribe and Version-Maintenance APIs} |
| \label{sec:defer:RCU has Publish-Subscribe and Version-Maintenance APIs} |
| |
| Fortunately, the RCU publish-subscribe and version-maintenance |
| primitives shown in |
| \cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs} |
| apply to all of the variants of RCU discussed above. |
| This commonality can allow more code to be shared, and reduces API |
| proliferation. |
| The original purpose of the RCU publish-subscribe APIs was to |
| bury memory barriers into these APIs, so that Linux kernel |
| programmers could use RCU without needing to become expert on |
| the memory-ordering models of each of the 20+ CPU families |
| that Linux supports~\cite{Spraul01}. |
| |
| \begin{table*} |
| \renewcommand*{\arraystretch}{1.15} |
| \footnotesize |
| \centering\OneColumnHSpace{-.4in} |
| \ebresizewidth{ |
| \begin{tabular}{llp{2.2in}} |
| \toprule |
| Category & |
| Primitives & |
| Overhead \\ |
| \midrule |
| Pointer publish & |
| \tco{rcu_assign_pointer()} & |
| Memory barrier \\ |
| & |
| \tco{rcu_replace_pointer()} & |
| Memory barrier (two of them on Alpha) \\ |
| & |
| \tco{rcu_pointer_handoff()} & |
| Simple instructions \\ |
| & |
| \tco{RCU_INIT_POINTER()} & |
| Simple instructions \\ |
| & |
| \tco{RCU_POINTER_INITIALIZER()} & |
| Compile-time constant \\ |
| \midrule |
| Pointer subscribe (traversal) & |
| \tco{rcu_access_pointer()} & |
| Simple instructions \\ |
| & |
| \tco{rcu_dereference()} & |
| Simple instructions (memory barrier on Alpha) \\ |
| & |
| \tco{rcu_dereference_check()} & |
| Simple instructions (memory barrier on Alpha) \\ |
| & |
| \tco{rcu_dereference_protected()} & |
| Simple instructions \\ |
| & |
| \tco{rcu_dereference_raw()} & |
| Simple instructions (memory barrier on Alpha) \\ |
| & |
| \tco{rcu_dereference_raw_notrace()} & |
| Simple instructions (memory barrier on Alpha) \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{RCU Publish-Subscribe and Version Maintenance APIs} |
| \label{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs} |
| \end{table*} |
| |
| These primitives operate directly on pointers, and are useful for |
| creating RCU-protected linked data structures, such as RCU-protected |
| arrays and trees. |
| The special case of linked lists is handled by a separate set of |
| APIs described in |
| \cref{sec:defer:RCU has List-Processing APIs}. |
| |
| The first category publishes pointers to new data items. |
| The \apik{rcu_assign_pointer()} primitive ensures that any |
| prior initialization remains ordered before the assignment to the |
| pointer on weakly ordered machines. |
| The \apik{rcu_replace_pointer()} primitive updates the pointer just like |
| \co{rcu_assign_pointer()} does, but also returns the previous value, |
| just like \co{rcu_dereference_protected()} (see below) would, including |
| the lockdep expression. |
| This replacement is convenient when the updater must both publish a new |
| pointer and free the structure referenced by the old pointer. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Normally, any pointer subject to \co{rcu_dereference()} \emph{must} |
| always be updated using one of the pointer-publish functions in |
| \cref{tab:defer:RCU Publish-Subscribe and Version Maintenance APIs}, |
| for example, \co{rcu_assign_pointer()}. |
| |
| What is an exception to this rule? |
| }\QuickQuizAnswerB{ |
| One such exception is when a multi-element linked |
| data structure is initialized as a unit while inaccessible to other |
| CPUs, and then a single \co{rcu_assign_pointer()} is used |
| to plant a global pointer to this data structure. |
| The initialization-time pointer assignments need not use |
| \co{rcu_assign_pointer()}, though any such assignments that |
| happen after the structure is globally visible \emph{must} use |
| \co{rcu_assign_pointer()}. |
| |
| However, unless this initialization code is on an impressively hot |
| code-path, it is probably wise to use \co{rcu_assign_pointer()} |
| anyway, even though it is in theory unnecessary. |
| It is all too easy for a ``minor'' change to invalidate your cherished |
| assumptions about the initialization happening privately. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Are there any downsides to the fact that these traversal and update |
| primitives can be used with any of the RCU API family members? |
| }\QuickQuizAnswerE{ |
| It can sometimes be difficult for automated |
| code checkers such as ``sparse'' (or indeed for human beings) to |
| work out which type of RCU read-side critical section a given |
| RCU traversal primitive corresponds to. |
| For example, consider the code shown in |
| \cref{lst:defer:Diverse RCU Read-Side Nesting}. |
| |
| \begin{listing} |
| \begin{VerbatimL} |
| rcu_read_lock(); |
| preempt_disable(); |
| p = rcu_dereference(global_pointer); |
| |
| /* . . . */ |
| |
| preempt_enable(); |
| rcu_read_unlock(); |
| \end{VerbatimL} |
| \caption{Diverse RCU Read-Side Nesting} |
| \label{lst:defer:Diverse RCU Read-Side Nesting} |
| \end{listing} |
| |
| Is the \co{rcu_dereference()} primitive in a vanilla RCU critical |
| section or an RCU Sched critical section? |
| What would you have to do to figure this out? |
| |
| But perhaps after the consolidation of the RCU flavors in |
| the v4.20 Linux kernel we no longer need to care! |
| }\QuickQuizEndE |
| } |
| |
| The \apik{rcu_pointer_handoff()} primitive simply returns its sole argument, |
| but is useful to tooling checking for pointers being leaked from |
| RCU read-side critical sections. |
| Use of \co{rcu_pointer_handoff()} indicates to such tooling that protection |
| of the structure in question has been handed off from RCU to some other |
| mechanism, such as locking or reference counting. |
| |
| The \apik{RCU_INIT_POINTER()} macro can be used to initialize RCU-protected |
| pointers that have not yet been exposed to readers, or alternatively, |
| to set RCU-protected pointers to \co{NULL}. |
| In these restricted cases, the memory-barrier instructions provided by |
| \co{rcu_assign_pointer()} are not needed. |
| Similarly, \apik{RCU_POINTER_INITIALIZER()} provides a \GCC-style |
| structure initializer to allow easy initialization of RCU-protected |
| pointers in structures. |
| |
| The second category subscribes to pointers to data items, or, |
| alternatively, safely traverses RCU-protected pointers. |
| Again, simply loading these pointers using C-language accesses |
| could result in seeing pre-initialization garbage in the pointed-to data. |
| Similarly, loading these pointer by any means outside of an RCU |
| read-side critical section could result in the pointed-to object being |
| freed at any time. |
| However, if the pointer is merely to be tested and not dereferenced, |
| the freeing of the pointed-to object is not necessarily a problem. |
| In this case, \apik{rcu_access_pointer()} may be used. |
| Normally, however, RCU read-side protection is required, and so the |
| \apik{rcu_dereference()} primitive uses the Linux kernel's \co{lockdep} |
| facility~\cite{JonathanCorbet2006lockdep} to verify that this |
| \co{rcu_dereference()} invocation is under the protection of |
| \co{rcu_read_lock()}, \co{srcu_read_lock()}, or some other RCU read-side |
| marker. |
| In contrast, the \co{rcu_access_pointer()} primitive does not involve |
| \co{lockdep}, and thus will not provoke \co{lockdep} complaints when |
| used outside of an RCU read-side critical section. |
| |
| Another situation where protection is not required is when update-side code |
| accesses the RCU-protected pointer while holding the update-side lock. |
| The \apik{rcu_dereference_protected()} API member is provided for this |
| situation. |
| Its first parameter is the RCU-protected pointer, and the second |
| parameter takes a lockdep expression describing which locks must be |
| held in order for the access to be safe. |
| Code invoked both from readers and updaters can use |
| \apik{rcu_dereference_check()}, which also takes a lockdep expression, but |
| which may also be invoked from read-side code not holding the locks. |
| In some cases, the lockdep expressions can be very complex, for example, |
| when using fine-grained locking, any of a very large number of locks |
| might be held, and it might be quite difficult to work out which applies. |
| In these (hopefully rare) cases, \apik{rcu_dereference_raw()} provides |
| protection but does not check for being invoked within a reader or with |
| any particular lock being held. |
| The \apik{rcu_dereference_raw_notrace()} API member acts similarly, but |
| cannot be traced, and may therefore be safely used by tracing code. |
| |
| Although pretty much any linked structure can be accessed by manipulating |
| pointers, higher-level structures can be quite helpful. |
| The next section therefore looks at various sorts of RCU-protected |
| linked lists used by the Linux kernel. |
| |
| \subsubsection{RCU has List-Processing APIs} |
| \label{sec:defer:RCU has List-Processing APIs} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/Linux_list}} |
| \caption{Linux Circular Linked List (\tco{list})} |
| \label{fig:defer:Linux Circular Linked List (list)} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/Linux_list_abbr}} |
| \caption{Linux Linked List Abbreviated} |
| \label{fig:defer:Linux Linked List Abbreviated} |
| \end{figure} |
| |
| Although \co{rcu_assign_pointer()} and |
| \co{rcu_dereference()} can in theory be used to construct any |
| conceivable RCU-protected data structure, in practice it is often better |
| to use higher-level constructs. |
| Therefore, the \co{rcu_assign_pointer()} and |
| \co{rcu_dereference()} |
| primitives have been embedded in special RCU variants of Linux's |
| list-manipulation API\@. |
| Linux has four variants of doubly linked list, the circular |
| \co{struct list_head} and the linear |
| \co{struct hlist_head}/\co{struct hlist_node}, |
| \co{struct hlist_nulls_head}/\co{struct hlist_nulls_node}, and |
| \co{struct hlist_bl_head}/\co{struct hlist_bl_node} |
| pairs. |
| The former is laid out as shown in |
| \cref{fig:defer:Linux Circular Linked List (list)}, |
| where the green (leftmost) boxes represent the list header and the blue |
| (rightmost three) boxes represent the elements in the list. |
| This notation is cumbersome, and will therefore be abbreviated as shown in |
| \cref{fig:defer:Linux Linked List Abbreviated}, |
| which shows only the non-header (blue) elements. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/Linux_hlist}} |
| \caption{Linux Linear Linked List (\tco{hlist})} |
| \label{fig:defer:Linux Linear Linked List (hlist)} |
| \end{figure} |
| |
| Linux's \co{hlist}\footnote{ |
| The ``h'' stands for hashtable, in which it reduces memory |
| use by half compared to Linux's double-pointer circular |
| linked list.} |
| is a linear list, which means that |
| it needs only one pointer for the header rather than the two |
| required for the circular list, as shown in |
| \cref{fig:defer:Linux Linear Linked List (hlist)}. |
| Thus, use of \co{hlist} can halve the memory consumption for the hash-bucket |
| arrays of large hash tables. |
| As before, this notation is cumbersome, so \co{hlist} structures will |
| be abbreviated in the same way \co{list_head}-style lists are, as shown in |
| \cref{fig:defer:Linux Linked List Abbreviated}. |
| |
| A variant of Linux's \co{hlist}, named \co{hlist_nulls}, provides multiple |
| distinct \co{NULL} pointers, but otherwise uses the same layout as shown in |
| \cref{fig:defer:Linux Linear Linked List (hlist)}. |
| In this variant, a \co{->next} pointer having a zero low-order bit is |
| considered to be a pointer. |
| However, if the low-order bit is set to one, the upper bits identify |
| the type of \co{NULL} pointer. |
| This type of list is used to allow lockless readers to detect when a |
| node has been moved from one list to another. |
| For example, each bucket of a hash table might use its index to mark |
| its \co{NULL} pointer. |
| Should a reader encounter a \co{NULL} pointer not matching the index of |
| the bucket it started from, that reader knows that an element it was |
| traversing was moved to some other bucket during the traversal, taking |
| that reader with it. |
| The reader can use the \apik{is_a_nulls()} function (which returns \co{true} |
| if passed an \co{hlist_nulls} \co{NULL} pointer) to determine when |
| it reaches the end of a list, and the \apik{get_nulls_value()} function |
| (which returns its argument's \co{NULL}-pointer identifier) to fetch |
| the type of \co{NULL} pointer. |
| When \co{get_nulls_value()} returns an unexpected value, the reader |
| can take corrective action, for example, restarting its traversal from |
| the beginning. |
| |
| \QuickQuiz{ |
| But what if an \co{hlist_nulls} reader gets moved to some other |
| bucket and then back again? |
| }\QuickQuizAnswer{ |
| One way to handle this is to always move nodes to the beginning |
| of the destination bucket, ensuring that when the reader reaches |
| the end of the list having a matching \co{NULL} pointer, it will |
| have searched the entire list. |
| |
| Of course, if there are too many move operations in a hash table |
| with many elements per bucket, the reader might never reach the |
| end of a list. |
| One way of avoiding this in the common case is to keep hash |
| tables well-tuned, thus with short lists. |
| One way of detecting the problem and handling it is for the |
| reader to terminate the search after traversing some large |
| number of nodes, acquire the update-side lock, and redo the |
| search, but this might introduce deadlocks. |
| Another way of avoiding the problem entirely is for readers to |
| search within RCU read-side critical sections, and to wait for |
| an RCU grace period between successive updates. |
| An intermediate position might wait for an RCU grace period |
| every $N$ updates, for some suitable value of $N$. |
| }\QuickQuizEnd |
| |
| More information on \co{hlist_nulls} is available in the Linux-kernel |
| source tree, with helpful example code provided in the |
| \path{rculist_nulls.rst} file (\path{rculist_nulls.txt} in older kernels). |
| |
| Another variant of Linux's \co{hlist} incorporates bit-locking, |
| and is named \co{hlist_bl}. |
| This variant uses the same layout as shown in |
| \cref{fig:defer:Linux Linear Linked List (hlist)}, |
| but reserves the low-order bit of the head pointer (``first'' in the |
| figure) to lock the list. |
| This approach also reduces memory usage, as it allows what would otherwise |
| be a separate spinlock to be stored with the pointer itself. |
| |
| \begin{sidewaystable*}[htbp] |
| \rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.3} |
| \centering |
| \caption{RCU-Protected List APIs} |
| \label{tab:defer:RCU-Protected List APIs} |
| \footnotesize |
| \newlength{\cwa}\newlength{\cwb}\newlength{\cwc}\newlength{\cwd} |
| \IfNimbusAvail{ |
| \renewcommand{\ttdefault}{NimbusMonoN} |
| \setlength{\cwa}{1.9in}\setlength{\cwb}{2.1in} |
| \setlength{\cwc}{1.8in}\setlength{\cwd}{1.6in} |
| }{ |
| \setlength{\cwa}{1.95in}\setlength{\cwb}{2.15in} |
| \setlength{\cwc}{1.9in}\setlength{\cwd}{1.7in} |
| } |
| \ebresizewidthsw{ |
| \begin{tabular}{>{\raggedright\arraybackslash}p{\cwa} |
| >{\raggedright\arraybackslash}p{\cwb} |
| >{\raggedright\arraybackslash}p{\cwc} |
| >{\raggedright\arraybackslash}p{\cwd}} |
| \toprule |
| \pmb{\tco{list}}: Circular doubly linked list & |
| \pmb{\tco{hlist}}: Linear doubly linked list & |
| \pmb{\tco{hlist_nulls}}: Linear doubly linked list with marked |
| NULL pointer, with up to 31~bits of marking & |
| \pmb{\tco{hlist_bl}}: Linear doubly linked list with bit locking \\ |
| \midrule |
| \multicolumn{4}{l}{{\bf Structures}} \\ |
| \tco{struct list_head} & |
| \tco{struct}{\tt ~}\tco{hlist_head} ~~~~~~~~~~~~~~ |
| \tco{struct}{\tt ~}\tco{hlist_node} & |
| \tco{struct}{\tt ~}\tco{hlist_nulls_head} |
| \tco{struct}{\tt ~}\tco{hlist_nulls_node} & |
| \tco{struct}{\tt ~}\tco{hlist_bl_head} |
| \tco{struct}{\tt ~}\tco{hlist_bl_node} \\ |
| \multicolumn{4}{l}{{\bf Initialization}} \\ |
| & |
| \tco{INIT_LIST_HEAD_RCU()} & |
| & |
| \\ |
| \multicolumn{4}{l}{{\bf Full traversal}} \\ |
| \tco{list_for_each_entry_rcu()} |
| \tco{list_for_each_entry_lockless()} & |
| \tco{hlist_for_each_entry_rcu()} |
| \tco{hlist_for_each_entry_rcu_bh()} |
| \tco{hlist_for_each_entry_rcu_notrace()} & |
| \tco{hlist_nulls_for_each_entry_rcu()} |
| \tco{hlist_nulls_for_each_entry_safe()} & |
| \tco{hlist_bl_for_each_entry_rcu()} \\ |
| \multicolumn{4}{l}{{\bf Resume traversal}} \\ |
| \tco{list_for_each_entry_continue_rcu()} |
| \tco{list_for_each_entry_from_rcu()} & |
| \tco{hlist_for_each_entry_continue_rcu()} |
| \tco{hlist_for_each_entry_continue_rcu_bh()} |
| \tco{hlist_for_each_entry_from_rcu()} & |
| & |
| \\ |
| \multicolumn{4}{l}{{\bf Stepwise traversal}} \\ |
| \tco{list_entry_rcu()} |
| \tco{list_entry_lockless()} |
| \tco{list_first_or_null_rcu()} |
| \tco{list_next_rcu()} |
| \tco{list_next_or_null_rcu()} & |
| \multicolumn{1}{p{1.2in}}{\tco{hlist_first_rcu()} |
| \tco{hlist_next_rcu()} |
| \tco{hlist_pprev_rcu()}} & |
| \tco{hlist_nulls_first_rcu()} |
| \tco{hlist_nulls_next_rcu()} & |
| \tco{hlist_bl_first_rcu()} \\ |
| \multicolumn{4}{l}{{\bf Add}} \\ |
| \multicolumn{1}{p{1.2in}}{\tco{list_add_rcu()} |
| \tco{list_add_tail_rcu()}} & |
| \tco{hlist_add_before_rcu()} |
| \tco{hlist_add_behind_rcu()} |
| \tco{hlist_add_head_rcu()} |
| \tco{hlist_add_tail_rcu()} & |
| \tco{hlist_nulls_add_head_rcu()} & |
| \tco{hlist_bl_add_head_rcu()} |
| \tco{hlist_bl_set_first_rcu()} \\ |
| \multicolumn{4}{l}{{\bf Delete}} \\ |
| \tco{list_del_rcu()} & |
| \multicolumn{1}{p{1.2in}}{\tco{hlist_del_rcu()} |
| \tco{hlist_del_init_rcu()}} & |
| \tco{hlist_nulls_del_rcu()} |
| \tco{hlist_nulls_del_init_rcu()} & |
| \tco{hlist_bl_del_rcu()} |
| \tco{hlist_bl_del_init_rcu()} \\ |
| \multicolumn{4}{l}{{\bf Replace}} \\ |
| \tco{list_replace_rcu()} & |
| \tco{hlist_replace_rcu()} & |
| & |
| \\ |
| \multicolumn{4}{l}{{\bf Splice}} \\ |
| \tco{list_splice_init_rcu()} & |
| \tco{list_splice_tail_init_rcu()} & |
| & |
| \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \end{sidewaystable*} |
| |
| The API members for these linked-list variants are summarized in |
| \cref{tab:defer:RCU-Protected List APIs}. |
| More information is available in the \path{Documentation/RCU} |
| directory of the Linux-kernel source tree and at |
| Linux Weekly News~\cite{PaulEMcKenney2024RCUAPI}. |
| |
| However, the remainder of this section expands on the use of |
| \apik{list_replace_rcu()}, given that this API member gave RCU its name. |
| This API member is used to carry out more complex updates in which an |
| element in the middle of the list having multiple fields is atomically |
| updated, so that a given reader sees either the old set of values or |
| the new set of values, but not a mixture of the two sets. |
| For example, each node of a linked list might have integer fields |
| \co{->a}, \co{->b}, and~\co{->c}, and it might be necessary to update |
| a given node's fields from 5, 6, and~7 to 5, 2, and~3, respectively. |
| |
| The code implementing this atomic update is straightforward: |
| |
| \begin{fcvlabel}[ln:defer:Canonical RCU Replacement Example (2nd)] |
| \begin{VerbatimN}[samepage=true,commandchars=\\\[\],firstnumber=15] |
| q = kmalloc(sizeof(*p), GFP_KERNEL); \lnlbl[kmalloc] |
| *q = *p; \lnlbl[copy] |
| q->b = 2; \lnlbl[update1] |
| q->c = 3; \lnlbl[update2] |
| list_replace_rcu(&p->list, &q->list); \lnlbl[replace] |
| synchronize_rcu(); \lnlbl[sync_rcu] |
| kfree(p); \lnlbl[kfree] |
| \end{VerbatimN} |
| \end{fcvlabel} |
| |
| \begin{figure} |
| \centering |
| \IfEbookSize{ |
| \resizebox{2in}{!}{\includegraphics{defer/RCUReplacement}} |
| }{ |
| \resizebox{2.7in}{!}{\includegraphics{defer/RCUReplacement}} |
| } |
| \caption{RCU Replacement in Linked List} |
| \label{fig:defer:RCU Replacement in Linked List} |
| \end{figure} |
| |
| The following discussion walks through this code, using |
| \cref{fig:defer:RCU Replacement in Linked List} to illustrate |
| the state changes. |
| The triples in each element represent the values of fields \co{->a}, |
| \co{->b}, and~\co{->c}, respectively. |
| The red-shaded elements might be referenced by readers, |
| and because readers do not synchronize directly with updaters, |
| readers might run concurrently with this entire replacement process. |
| Please note that backwards pointers and the link from the tail to the |
| head are omitted for clarity. |
| |
| The initial state of the list, including the pointer \co{p}, |
| is the same as for the deletion example, as shown on the |
| first row of the figure. |
| |
| The following text describes how to replace the \co{5,6,7} element |
| with \co{5,2,3} in such a way that any given reader sees one of these |
| two values. |
| |
| \begin{fcvref}[ln:defer:Canonical RCU Replacement Example (2nd)] |
| \Clnref{kmalloc} allocates a replacement element, |
| resulting in the state as shown in the second row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| At this point, no reader can hold a reference to the newly allocated |
| element (as indicated by its green shading), and it is uninitialized |
| (as indicated by the question marks). |
| |
| \Clnref{copy} copies the old element to the new one, resulting in the |
| state as shown in the third row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| The newly allocated element still cannot be referenced by readers, but |
| it is now initialized. |
| |
| \Clnref{update1} updates \co{q->b} to the value ``2'', and |
| \clnref{update2} updates \co{q->c} to the value ``3'', |
| as shown on the fourth row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| Note that the newly allocated structure is still inaccessible to readers. |
| |
| Now, \clnref{replace} does the replacement, so that the new element is |
| finally visible to readers, and hence is shaded red, as shown on |
| the fifth row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| At this point, as shown below, we have two versions of the list. |
| Pre-existing readers might see the \co{5,6,7} element (which is |
| therefore now shaded yellow), but |
| new readers will instead see the \co{5,2,3} element. |
| But any given reader is guaranteed to see one set of values or the |
| other, not a mixture of the two. |
| |
| After the \co{synchronize_rcu()} on \clnref{sync_rcu} returns, |
| a grace period will have elapsed, and so all reads that started before the |
| \apik{list_replace_rcu()} will have completed. |
| In particular, any readers that might have been holding references |
| to the \co{5,6,7} element are guaranteed to have exited |
| their RCU read-side critical sections, and are thus prohibited from |
| continuing to hold a reference. |
| Therefore, there can no longer be any readers holding references |
| to the old element, as indicated its green shading in the sixth row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| As far as the readers are concerned, we are back to having a single version |
| of the list, but with the new element in place of the old. |
| |
| After the \apik{kfree()} on \clnref{kfree} completes, the list will |
| appear as shown on the final row of |
| \cref{fig:defer:RCU Replacement in Linked List}. |
| \end{fcvref} |
| |
| Despite the fact that RCU was named after the replacement case, |
| the vast majority of RCU usage within the Linux kernel relies on |
| the simple independent insertion and deletion, as was shown in |
| \cref{fig:defer:Multiple RCU Data-Structure Versions} in |
| \cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}. |
| |
| The next section looks at APIs that assist developers in debugging |
| their code that makes use of RCU\@. |
| |
| \subsubsection{RCU Has Diagnostic APIs} |
| \label{sec:defer:RCU Has Diagnostic APIs} |
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.15} |
| \footnotesize |
| \centering |
| \begin{tabular}{ll} |
| \toprule |
| Category & |
| Primitives \\ |
| \midrule |
| Mark RCU pointer & |
| \tco{__rcu} \\ |
| \midrule |
| Debug-object support & |
| \tco{init_rcu_head()} \\ |
| & \tco{destroy_rcu_head()} \\ |
| & \tco{init_rcu_head_on_stack()} \\ |
| & \tco{destroy_rcu_head_on_stack()} \\ |
| \midrule |
| Stall-warning control & |
| \tco{rcu_cpu_stall_reset()} \\ |
| \midrule |
| Callback checking & |
| \tco{rcu_head_init()} \\ |
| & \tco{rcu_head_after_call_rcu()} \\ |
| \midrule |
| lockdep support & |
| \tco{rcu_read_lock_held()} \\ |
| & \tco{rcu_read_lock_bh_held()} \\ |
| & \tco{rcu_read_lock_sched_held()} \\ |
| & \tco{srcu_read_lock_held()} \\ |
| & \tco{rcu_is_watching()} \\ |
| & \tco{RCU_LOCKDEP_WARN()} \\ |
| & \tco{RCU_NONIDLE()} \\ |
| & \tco{rcu_sleep_check()} \\ |
| \bottomrule |
| \end{tabular} |
| \caption{RCU Diagnostic APIs} |
| \label{tab:defer:RCU Diagnostic APIs} |
| \end{table} |
| |
| \Cref{tab:defer:RCU Diagnostic APIs} |
| shows RCU's diagnostic APIs. |
| |
| The \co{__rcu} tag marks an RCU-protected pointer, for example, |
| \qtco{struct foo __rcu *p;}. |
| Pointers that might be passed to \co{rcu_dereference()} can be marked, |
| but pointers holding values returned from \co{rcu_dereference()} |
| should not be. |
| Providing these markings on variables, structure fields, function |
| parameters, and return values allows the Linux kernel's \co{sparse} |
| tool to detect situations where RCU-protected pointers are |
| incorrectly accessed using plain C-language loads and stores. |
| |
| Debug-object support is automatic for any \apik{rcu_head} structures |
| that are part of a structure obtained from the Linux kernel's |
| memory allocators, but those building their own special-purpose |
| memory allocators can use \apik{init_rcu_head()} and \apik{destroy_rcu_head()} |
| at allocation and free time, respectively. |
| Those using \co{rcu_head} structures allocated on the function-call |
| stack (it happens!\@) may use \apik{init_rcu_head_on_stack()} |
| before first use and \apik{destroy_rcu_head_on_stack()} after last use, |
| but before returning from the function. |
| Debug-object support allows detection of bugs involving passing the |
| same \co{rcu_head} structure to \co{call_rcu()} and friends in |
| quick succession, which is the \co{call_rcu()} counterpart to the |
| infamous double-free class of memory-allocation bugs. |
| |
| Stall-warning control is provided by \apik{rcu_cpu_stall_reset()}, which |
| allows the caller to suppress RCU CPU stall warnings for the remainder |
| of the current grace period. |
| RCU CPU stall warnings help pinpoint situations where an RCU read-side |
| critical section runs for an excessive length of time, and it is useful |
| for things like kernel debuggers to be able to suppress them, for example, |
| when encountering a breakpoint. |
| |
| Callback checking is provided by \apik{rcu_head_init()} and |
| \apik{rcu_head_after_call_rcu()}. |
| The former is invoked on an \co{rcu_head} structure before it is passed |
| to \co{call_rcu()}, and then \co{rcu_head_after_call_rcu()} will |
| check to see if the callback has been invoked with the specified |
| function. |
| |
| Support for lockdep~\cite{JonathanCorbet2006lockdep} includes |
| \apik{rcu_read_lock_held()}, |
| \apik{rcu_read_lock_bh_held()}, |
| \apik{rcu_read_lock_sched_held()}, and |
| \apik{srcu_read_lock_held()}, |
| each of which returns \co{true} if invoked within the corresponding |
| type of RCU read-side critical section. |
| |
| \QuickQuiz{ |
| Why isn't there a \co{rcu_read_lock_tasks_held()} for Tasks RCU? |
| }\QuickQuizAnswer{ |
| Because Tasks RCU does not have read-side markers. |
| Instead, Tasks RCU read-side critical sections are |
| bounded by voluntary context switches. |
| }\QuickQuizEnd |
| |
| Because \co{rcu_read_lock()} cannot be used from the idle loop, |
| and because energy-efficiency concerns have caused the idle loop |
| to become quite ornate, \apik{rcu_is_watching()} returns \co{true} if |
| invoked in a context where use of \co{rcu_read_lock()} is legal. |
| Note again that \co{srcu_read_lock()} may be used from idle and |
| even offline CPUs, which means that \co{rcu_is_watching()} does not |
| apply to SRCU\@. |
| |
| \apik{RCU_LOCKDEP_WARN()} emits a warning if lockdep is enabled and if |
| its argument evaluates to \co{true}. |
| For example, \co{RCU_LOCKDEP_WARN(!rcu_read_lock_held())} would emit a |
| warning if invoked outside of an RCU read-side critical section. |
| |
| \apik{RCU_NONIDLE()} may be used to force RCU to watch when executing |
| the statement that is passed in as the sole argument. |
| For example, \co{RCU_NONIDLE(WARN_ON(!rcu_is_watching()))} |
| would never emit a warning. |
| However, changes in the 2020--2021 timeframe extend RCU's reach deeper |
| into the idle loop, which should greatly reduce or even eliminate the |
| need for \co{RCU_NONIDLE()}. |
| |
| Finally, \apik{rcu_sleep_check()} emits a warning if invoked within |
| an RCU, RCU-bh, or RCU-sched read-side critical section. |
| |
| \subsubsection{Where Can RCU's APIs Be Used?} |
| \label{sec:defer:Where Can RCU's APIs Be Used?} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUenvAPI}} |
| \caption{RCU API Usage Constraints} |
| \label{fig:defer:RCU API Usage Constraints} |
| \end{figure} |
| |
| \Cref{fig:defer:RCU API Usage Constraints} |
| shows which APIs may be used in which in-kernel environments. |
| The RCU read-side primitives may be used in any environment, including NMI, |
| the RCU mutation and asynchronous grace-period primitives may be used in any |
| environment other than NMI, and, finally, the RCU synchronous grace-period |
| primitives may be used only in process context. |
| The RCU list-traversal primitives include \apik{list_for_each_entry_rcu()}, |
| \apik{hlist_for_each_entry_rcu()}, etc. |
| Similarly, the RCU list-mutation primitives include |
| \apik{list_add_rcu()}, \apik{hlist_del_rcu()}, etc. |
| |
| Note that primitives from other families of RCU may be substituted, |
| for example, \co{srcu_read_lock()} may be used in any context |
| in which \co{rcu_read_lock()} may be used. |
| |
| \subsubsection{So, What \emph{is} RCU Really?} |
| \label{sec:defer:So; What is RCU Really?} |
| |
| At its core, RCU is nothing more nor less than an API that supports |
| publication and subscription for insertions, waiting for all RCU readers |
| to complete, and maintenance of multiple versions. |
| That said, it is possible to build higher-level constructs |
| on top of RCU, including the reader-writer-locking, reference-counting, |
| and existence-guarantee constructs listed in |
| \cref{sec:defer:RCU Usage}. |
| Furthermore, I have no doubt that the Linux community will continue to |
| find interesting new uses for RCU, |
| just as they do for any of a number of synchronization |
| primitives throughout the kernel. |
| |
| Of course, a more-complete view of RCU would also include |
| all of the things you can do with these APIs. |
| |
| However, for many people, a complete view of RCU must include sample |
| RCU implementations. |
| \Cref{chp:app:``Toy'' RCU Implementations} therefore presents a series |
| of ``toy'' RCU implementations of increasing complexity and capability, |
| though others might prefer the classic |
| ``User-Level Implementations of Read-Copy |
| Update''~\cite{MathieuDesnoyers2012URCU}. |
| For everyone else, the next section gives an overview of some RCU use cases. |