| % defer/rcufundamental.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \subsection{RCU Fundamentals} |
| \label{sec:defer:RCU Fundamentals} |
| \OriginallyPublished{Section}{sec:defer:RCU Fundamentals}{RCU Fundamentals}{Linux Weekly News}{PaulEMcKenney2007WhatIsRCUFundamentally} |
| |
| This section re-examines the ground covered in the previous section, but |
| independent of any particular example or use case. |
| People who prefer to live their lives very close to the actual code may |
| wish to skip the underlying fundamentals presented in this section. |
| |
| RCU is made up of three fundamental mechanisms, the first being |
| used for insertion, the second being used for deletion, and the third |
| being used to allow readers to tolerate concurrent insertions and deletions. |
| Section~\ref{sec:defer:Publish-Subscribe Mechanism} |
| describes the publish-subscribe mechanism used for insertion, |
| Section~\ref{sec:defer:Wait For Pre-Existing RCU Readers} |
describes how waiting for pre-existing RCU readers enables deletion,
| and |
| Section~\ref{sec:defer:Maintain Multiple Versions of Recently Updated Objects} |
| discusses how maintaining multiple versions of recently updated objects |
| permits concurrent insertions and deletions. |
| Finally, |
| Section~\ref{sec:defer:Summary of RCU Fundamentals} |
| summarizes RCU fundamentals. |
| |
| \subsubsection{Publish-Subscribe Mechanism} |
| \label{sec:defer:Publish-Subscribe Mechanism} |
| |
| Because RCU readers are not excluded by RCU updaters, an RCU-protected |
| data structure might change while a reader accesses it. |
| The accessed data item might be moved, removed, or replaced. |
| Because the data structure does not ``hold still'' for the reader, |
| each reader's access can be thought of as subscribing to the current |
| version of the RCU-protected data item. |
| For their part, updaters can be thought of as publishing new versions. |
| |
| % @@@ Merge usage section into "Which to Choose?" and suggest choices. |
| % @@@ Should "RCU Exercises" move to "RCU Rescues"? |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/pubsub}} |
| \caption{Publication/Subscription Constraints} |
| \label{fig:defer:Publication/Subscription Constraints} |
| \end{figure} |
| |
| Unfortunately, as laid out in |
| Section~\ref{sec:toolsoftrade:Shared-Variable Shenanigans} |
| and reiterated in |
| Section~\ref{sec:defer:Minimal Insertion and Deletion}, |
| it is unwise to use plain accesses for these publication and subscription |
| operations. |
| It is instead necessary to inform both the compiler and the CPU |
| of the need for care, as can be seen from |
| Figure~\ref{fig:defer:Publication/Subscription Constraints}, |
| which illustrates interactions between concurrent executions of |
| \co{ins_route()} (and its caller) and \co{read_gptr()} from |
| Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}. |
| |
| The \co{ins_route()} column from |
| Figure~\ref{fig:defer:Publication/Subscription Constraints} |
| shows \co{ins_route()}'s caller allocating a new \co{route} structure, |
| which then contains pre-initialization garbage. |
| The caller then initializes the newly allocated structure, and then |
| invokes \co{ins_route()} to publish a pointer to the new \co{route} |
| structure. |
| Publication does not affect the contents of the structure, which |
| therefore remain valid after publication. |
| |
| The \co{access_route()} column from this same figure shows |
| the pointer being subscribed to and dereferenced. |
| This dereference operation absolutely must see a valid \co{route} |
| structure rather than pre-initialization garbage because referencing |
| garbage could result in memory corruption, crashes, and hangs. |
| As noted earlier, avoiding such garbage means that the publish and |
| subscribe operations must inform both the compiler and the CPU of the |
| need to maintain the needed ordering. |
| |
| Publication is carried out by \co{rcu_assign_pointer()}, which ensures that |
| \co{ins_route()}'s caller's initialization is ordered before the actual |
| publication operation's store of the pointer. |
| In addition, \co{rcu_assign_pointer()} must be atomic in the sense that |
| concurrent readers see either the old value of the pointer or the new |
| value of the pointer, but not some mash-up of these two values. |
| These requirements are met by the C11 store-release operation, and in |
| fact in the Linux kernel, \co{rcu_assign_pointer()} is defined in terms |
| of \co{smp_store_release()}, which is similar to C11 store-release. |
| |
| Note that if concurrent updates are required, some sort of synchronization |
| mechanism will be required to mediate among multiple concurrent |
| \co{rcu_assign_pointer()} calls on the same pointer. |
| In the Linux kernel, locking is the mechanism of choice, but pretty |
| much any synchronization mechanism may be used. |
| An example of a particularly lightweight synchronization mechanism is |
| Chapter~\ref{chp:Data Ownership}'s data ownership: If each pointer is |
| owned by a particular thread, then that thread may execute |
| \co{rcu_assign_pointer()} on that pointer with no additional |
| synchronization overhead. |
| |
| \QuickQuiz{ |
| Wouldn't use of data ownership for RCU updaters mean that |
| the updates could use exactly the same sequence of instructions |
| as would the corresponding single-threaded code? |
| }\QuickQuizAnswer{ |
| Sometimes, for example, on TSO systems such as x86 or the IBM |
| mainframe where a store-release operation emits a single store |
| instruction. |
| However, weakly ordered systems must also emit a memory barrier |
| or perhaps a store-release instruction. |
| In addition, removing data requires quite a bit of additional |
| work because it is necessary to wait for pre-existing readers |
| before freeing the removed data. |
| }\QuickQuizEnd |
| |
Subscription is carried out by \co{rcu_dereference()}, which orders
the subscription operation's load from the pointer before the
dereference of that pointer.
| Similar to \co{rcu_assign_pointer()}, \co{rcu_dereference()} must be |
| atomic in the sense that the value loaded must be that from a single |
| store, for example, the compiler must not tear the load.\footnote{ |
| That is, the compiler must not break the load into multiple |
| smaller loads, as described under ``load tearing'' in |
| Section~\ref{sec:toolsoftrade:Shared-Variable Shenanigans}.} |
| Unfortunately, compiler support for \co{rcu_dereference()} is at best |
| a work in progress~\cite{PaulEMcKennneyConsumeP0190R4,PaulEMcKenney2017markconsumeP0462R1,JFBastien2018P0750R1consume}. |
| In the meantime, the Linux kernel relies on volatile loads, the details of |
| the various CPU architectures, coding |
| restrictions~\cite{PaulEMcKenney2014rcu-dereference}, |
| and, on DEC Alpha~\cite{ALPHA2002}, a memory-barrier instruction. |
| However, on other architectures, \co{rcu_dereference()} typically |
| emits a single load instruction, just as would the equivalent single-threaded |
| code. |
The coding restrictions are described in more detail in
Section~\ref{sec:memorder:Address- and Data-Dependency Difficulties};
however, the common case of field selection (\qtco{->}) works quite well.
| Software that does not require the ultimate in read-side performance |
| can instead use C11 acquire loads, which provide the needed ordering and |
| more, albeit at a cost. |
| It is hoped that lighter-weight compiler support for \co{rcu_dereference()} |
| will appear in due course. |
| |
| In short, use of \co{rcu_assign_pointer()} for publishing pointers and |
| use of \co{rcu_dereference()} for subscribing to them successfully avoids the |
| ``Not OK'' garbage loads depicted in |
| Figure~\ref{fig:defer:Publication/Subscription Constraints}. |
| These two primitives can therefore be used to add new data to linked |
| structures without disrupting concurrent readers. |
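The pairing of these two primitives can be sketched in portable C11\@.
The sketch below is illustrative rather than the Linux kernel's actual
implementation: the \co{gptr} pointer and \co{route} structure are modeled
on this chapter's running example, and the kernel's
\co{rcu_assign_pointer()} and \co{rcu_dereference()} are stood in for by a
C11 store-release and an acquire load, respectively (acquire being
stronger, and thus potentially more expensive, than the ordering the
kernel actually relies on):

```c
/* Hedged sketch: C11 stand-ins for RCU publication and subscription.
 * Names (gptr, struct route, *_sketch) follow the chapter's running
 * example; this is not the Linux kernel's implementation. */
#include <stdatomic.h>
#include <stdlib.h>

struct route {
	unsigned long addr;
	unsigned long iface;
};

static _Atomic(struct route *) gptr;	/* NULL until first publication. */

/* Publisher: initialize fully, then publish with a release store,
 * which orders the initialization before the pointer store. */
void ins_route_sketch(unsigned long addr, unsigned long iface)
{
	struct route *rp = malloc(sizeof(*rp));

	rp->addr = addr;	/* Initialization happens first... */
	rp->iface = iface;
	/* ...and the release store keeps it ordered before publication. */
	atomic_store_explicit(&gptr, rp, memory_order_release);
}

/* Subscriber: the acquire load orders the load of the pointer before
 * the subsequent dereference, so the reader never sees garbage. */
long read_gptr_sketch(void)
{
	struct route *rp = atomic_load_explicit(&gptr, memory_order_acquire);

	return rp ? (long)rp->iface : -1;	/* -1: no route published. */
}
```

Note that the release/acquire pairing forbids the ``Not OK'' garbage
loads, but says nothing about reclamation, which is the subject of the
next section.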
| |
| \QuickQuiz{ |
| But suppose that updaters are adding and removing multiple data |
| items from a linked list while a reader is iterating over that |
| same list. |
| Specifically, suppose that a list initially contains elements |
| A, B, and C, and that an updater removes element A and then |
| adds a new element D at the end of the list. |
The reader might well see \{A, B, C, D\}, when that sequence of
elements never actually existed!
| In what alternate universe would that qualify as ``not disrupting |
| concurrent readers''??? |
| }\QuickQuizAnswer{ |
| In the universe where an iterating reader is only required to |
| traverse elements that were present throughout the full duration |
| of the iteration. |
| In the example, that would be elements B and C. |
| Because elements A and D were each present for only part of the |
| iteration, the reader is permitted to iterate over them, but not |
| obliged to. |
| Note that this supports the common case where the reader is simply |
| looking up a single item, and does not know or care about the |
| presence or absence of other items. |
| |
| If stronger consistency is required, then higher-cost |
| synchronization mechanisms are required, for example, sequence |
| locking or reader-writer locking. |
| But if stronger consistency is \emph{not} required (and it very often |
| is not), then why pay the higher cost? |
| }\QuickQuizEnd |
| |
| Adding data to a linked structure without disrupting readers is a good thing, |
| as are the cases where this can be done with no added read-side cost compared |
| to single-threaded readers. |
| However, in most cases it is also necessary to remove data, and this is the |
| subject of the next section. |
| |
| \subsubsection{Wait For Pre-Existing RCU Readers} |
| \label{sec:defer:Wait For Pre-Existing RCU Readers} |
| |
| In its most basic form, RCU is a way of waiting for things to finish. |
| Of course, there are a great many other ways of waiting for things to |
| finish, including reference counts, reader-writer locks, events, and so on. |
| The great advantage of RCU is that it can wait for each of |
| (say) 20,000 different things without having to explicitly |
| track each and every one of them, and without having to worry about |
| the performance degradation, scalability limitations, complex deadlock |
| scenarios, and memory-leak hazards that are inherent in schemes |
| using explicit tracking. |
| |
| In RCU's case, each of the things waited on is called an |
| \emph{RCU read-side critical section}. |
| As hinted at in |
| Section~\ref{sec:defer:Toy Implementation}, an RCU read-side critical |
| section starts with an \co{rcu_read_lock()} primitive, and ends with a |
| corresponding \co{rcu_read_unlock()} primitive. |
RCU read-side critical sections can be nested and may contain pretty
much any code, as long as that code does not contain a quiescent state.
For example, within the Linux kernel, it is illegal to sleep within
an RCU read-side critical section because a context switch is a quiescent
state.\footnote{
| However, a special form of RCU called SRCU~\cite{PaulEMcKenney2006c} |
| does permit general sleeping in SRCU read-side critical sections.} |
| If you abide by these conventions, you can use RCU to wait for \emph{any} |
| pre-existing RCU read-side critical section to complete, and |
| \co{synchronize_rcu()} uses indirect means to do the actual |
| waiting~\cite{MathieuDesnoyers2012URCU,McKenney:2013:SDS:2483852.2483867}. |
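In the spirit of Section~\ref{sec:defer:Toy Implementation}'s toy
implementation, the read-side primitives and the waiting can be sketched
with a single shared counter.
This sketch is a deliberately naive stand-in: real implementations make
\co{rcu_read_lock()} and \co{rcu_read_unlock()} far cheaper, scale to many
CPUs, and wait only for \emph{pre-existing} readers, whereas this toy's
\co{toy_synchronize_rcu()} also waits for readers that start later:

```c
/* Toy counter-based sketch of RCU read-side primitives and waiting.
 * Hedged: a single shared counter neither scales nor distinguishes
 * pre-existing readers from later ones; real RCU does both. */
#include <stdatomic.h>

static atomic_int nreaders;	/* Count of readers currently in a
				 * read-side critical section. */

static void toy_rcu_read_lock(void)
{
	atomic_fetch_add(&nreaders, 1);	/* Mark critical-section entry. */
}

static void toy_rcu_read_unlock(void)
{
	atomic_fetch_sub(&nreaders, 1);	/* Mark critical-section exit. */
}

static void toy_synchronize_rcu(void)
{
	/* Spin until no reader remains in a critical section.  Unlike
	 * real RCU, this also waits for readers starting after this
	 * call, and can therefore be starved by a steady reader flux. */
	while (atomic_load(&nreaders) != 0)
		continue;
}
```

A single-threaded caller can thus enter and exit a critical section and
then invoke \co{toy_synchronize_rcu()}, which returns immediately because
no readers remain.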
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUGuaranteeFwd}} |
| \caption{RCU Reader and Later Grace Period} |
| \label{fig:defer:RCU Reader and Later Grace Period} |
| \end{figure} |
| |
| The relationship between an RCU read-side critical section and a later |
| RCU grace period is an if-then relationship, as illustrated by |
| Figure~\ref{fig:defer:RCU Reader and Later Grace Period}. |
| If any portion of a given critical section precedes the beginning of |
| a given grace period, then RCU guarantees that all of that critical |
| section will precede the end of that grace period. |
| In the figure, because \co{P0()}'s access to \co{y} precedes |
| \co{P1()}'s access to this same variable, it is guaranteed that |
| \co{P0()}'s access to \co{x} will precede \co{P1()}'s access. |
| In this case, if \co{y}'s final value is 2, then \co{x}'s |
| final value is guaranteed to also be 2. |
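This guarantee can plausibly be rendered as a Linux-kernel-memory-model
style litmus test.
The assignments below are inferred from the surrounding text rather than
copied from the figure, and the test name is illustrative: \co{P1()}
stores to \co{y}, waits for a grace period, then stores to \co{x}, while
\co{P0()}'s critical section stores to \co{x} and then \co{y}:

```c
C RCU-reader-and-later-grace-period

{}

P0(int *x, int *y)
{
	rcu_read_lock();
	WRITE_ONCE(*x, 1);
	WRITE_ONCE(*y, 1);
	rcu_read_unlock();
}

P1(int *x, int *y)
{
	WRITE_ONCE(*y, 2);
	synchronize_rcu();
	WRITE_ONCE(*x, 2);
}

exists (x=1 /\ y=2) (* Forbidden: y==2 implies x==2. *)
```

If \co{y}'s final value is 2, then \co{P0()}'s store to \co{y} preceded
\co{P1()}'s, hence part of the critical section preceded the grace
period's beginning, hence all of it preceded the grace period's end,
forcing \co{x}'s final value to also be 2.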
| |
| \QuickQuiz{ |
| What other final values of \co{x} and \co{y} are possible in |
| Figure~\ref{fig:defer:RCU Reader and Later Grace Period}? |
| }\QuickQuizAnswer{ |
| The \co{x == 2 && y == 2} possibility was called out in the text. |
| Given that \co{y == 2} implies \co{x == 2}, we know that |
| \co{x == 1 && y == 2} is forbidden. |
| The following discussion will show that both |
| \co{x == 1 && y == 1} and \co{x == 2 && y == 1} are possible. |
| }\QuickQuizEnd |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUGuaranteeRev}} |
| \caption{RCU Reader and Earlier Grace Period} |
| \label{fig:defer:RCU Reader and Earlier Grace Period} |
| \end{figure} |
| |
| The relationship between an RCU read-side critical section and an earlier |
| RCU grace period is also an if-then relationship, as illustrated by |
| Figure~\ref{fig:defer:RCU Reader and Earlier Grace Period}. |
| If any portion of a given critical section follows the end of |
| a given grace period, then RCU guarantees that all of that critical |
| section will follow the beginning of that grace period. |
| In the figure, because \co{P0()}'s access to \co{y} follows |
| \co{P1()}'s access to this same variable, it is guaranteed that |
| \co{P0()}'s access to \co{x} will follow \co{P1()}'s access. |
| In this case, if \co{y}'s final value is 1, then \co{x}'s |
| final value is guaranteed to also be 1. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/RCUGuaranteeMid}} |
| \caption{RCU Reader Within Grace Period} |
| \label{fig:defer:RCU Reader Within Grace Period} |
| \end{figure} |
| |
| Finally, as shown in |
| Figure~\ref{fig:defer:RCU Reader Within Grace Period}, |
| an RCU read-side critical section can be completely overlapped by |
| an RCU grace period. |
| In this case, \co{x}'s final value is 1 and \co{y}'s final value is 2. |
| |
| However, it cannot be the case that \co{x}'s final value is 2 and \co{y}'s |
| final value is 1. |
| This would mean that an RCU read-side critical section had completely |
| overlapped a grace period, which is forbidden. |
| RCU's wait-for-readers guarantee therefore has two parts: |
| (1)~If any part of a given RCU read-side critical section precedes |
| the beginning of a given grace period, then the entirety of that |
| critical section precedes the end of that grace period. |
| (2)~If any part of a given RCU read-side critical section follows |
| the end of a given grace period, then the entirety of that |
| critical section follows the beginning of that grace period. |
| This definition is sufficient for almost all RCU-based algorithms, but |
| for those wanting more, |
| simple executable formal models of RCU are available |
| as part of Linux kernel v4.17 and later, as discussed in |
| Section~\ref{sec:formal:Axiomatic Approaches and RCU}. |
| In addition, RCU's ordering properties are examined in much |
| greater detail in Section~\ref{sec:memorder:RCU}. |
| |
| Although RCU's wait-for-readers capability really is sometimes used to |
| order the assignment of values to variables as shown in |
| \crefrange{fig:defer:RCU Reader and Later Grace Period} |
| {fig:defer:RCU Reader Within Grace Period}, |
| it is more frequently used to safely free data elements removed from |
| a linked structure, as was done in |
| Section~\ref{sec:defer:Introduction to RCU}. |
| The general process is illustrated by the following pseudocode: |
| |
| \begin{enumerate} |
| \item Make a change, for example, remove an element from a linked list. |
| \item Wait for all pre-existing RCU read-side critical sections to |
| completely finish (for example, by using |
| \co{synchronize_rcu()}). |
| \item Clean up, for example, free the element that was replaced above. |
| \end{enumerate} |
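This three-step pattern can be sketched on a singly linked list.
The sketch below is hedged: \co{toy_synchronize_rcu()} is an empty
stand-in for a real wait-for-readers primitive, the helper names are
hypothetical, and a real updater would also use \co{rcu_assign_pointer()}
for publication and would typically hold a lock across the update:

```c
/* Hedged sketch of the remove/wait/free pattern on a linked list.
 * toy_synchronize_rcu() is a stand-in for synchronize_rcu(); in a
 * real implementation it waits for all pre-existing readers. */
#include <stdlib.h>

struct node {
	int key;
	struct node *next;
};

static struct node *head;	/* List head, initially empty. */

static void toy_synchronize_rcu(void)
{
	/* Stand-in: a real implementation waits for pre-existing readers. */
}

void insert_key(int key)
{
	struct node *p = malloc(sizeof(*p));

	p->key = key;
	p->next = head;
	head = p;	/* A real updater would use rcu_assign_pointer(). */
}

int remove_key(int key)	/* Returns 1 if found and freed, else 0. */
{
	struct node **pp = &head;
	struct node *p;

	for (p = head; p; pp = &p->next, p = p->next) {
		if (p->key == key) {
			*pp = p->next;		/* Step 1: unlink the element. */
			toy_synchronize_rcu();	/* Step 2: wait for readers. */
			free(p);		/* Step 3: reclaim. */
			return 1;
		}
	}
	return 0;
}
```

Note that readers traversing the list when \co{remove_key()} unlinks an
element might still hold a reference to it, which is exactly why the wait
in step~2 must precede the \co{free()} in step~3.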
| |
| \begin{figure}[tbp] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{defer/RCUGPorderingSummary}} |
| \caption{Summary of RCU Grace-Period Ordering Guarantees} |
| \label{fig:defer:Summary of RCU Grace-Period Ordering Guarantees} |
| \end{figure} |
| |
| This more abstract procedure requires a more abstract diagram than |
| \crefrange{fig:defer:RCU Reader and Later Grace Period} |
| {fig:defer:RCU Reader Within Grace Period}, |
| which are specific to a particular litmus test. |
After all, an RCU implementation must work correctly regardless of
| the form of the RCU updates and the RCU read-side critical sections. |
| Figure~\ref{fig:defer:Summary of RCU Grace-Period Ordering Guarantees} |
| fills this need, showing the four possible scenarios, with time |
| advancing from top to bottom within each scenario. |
| Within each scenario, an RCU reader is represented by the left-hand |
stack of boxes and an RCU updater by the right-hand stack.
| |
| In the first scenario, the reader starts execution before the |
| updater starts the removal, so it is possible that this reader |
| has a reference to the removed data element. |
| Therefore, the updater must not free this element until after the |
| reader completes. |
| In the second scenario, the reader does not start execution until |
| after the removal has completed. |
| The reader cannot possibly obtain a reference to the already-removed |
| data element, so this element may be freed before the reader completes. |
The third scenario is like the second, but illustrates that when the
reader cannot possibly obtain a reference to the element, it is still
permissible to defer the freeing of that element until after the
reader completes.
In the fourth and final scenario, the reader starts execution before
the updater starts removing the data element, but this element
is (incorrectly) freed before the reader completes.
| A correct RCU implementation will not allow this fourth scenario to |
| occur. |
| This diagram thus illustrates RCU's wait-for-readers functionality: |
| Given a grace period, each reader ends before the end of that grace |
| period, starts after the beginning of that grace period, or both, in |
| which case it is wholly contained within that grace period. |
| |
| Because RCU readers can make forward progress while updates |
| are in progress, different readers might disagree about the state |
| of the data structure, a topic taken up by the next section. |
| |
| \subsubsection{Maintain Multiple Versions of Recently Updated Objects} |
| \label{sec:defer:Maintain Multiple Versions of Recently Updated Objects} |
| |
| This section discusses how RCU accommodates synchronization-free readers |
| by maintaining multiple versions of data. |
| This discussion builds on the introduction of multiple versions by |
| Figure~\ref{fig:defer:Deletion With Concurrent Readers} |
| in |
| Section~\ref{sec:defer:Minimal Insertion and Deletion}, |
| in which readers running concurrently with \co{del_route()} |
| (see Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}) |
| might see the old \co{route} structure or an empty list, but either |
| way get a valid result. |
| Of course, a closer look at |
| Figure~\ref{fig:defer:Insertion With Concurrent Readers} |
| shows that calls to \co{ins_route()} can also result in concurrent |
| readers seeing different versions: Either the initial empty list |
| or the newly inserted \co{route} structure. |
| Note that both reference counting |
| (Section~\ref{sec:defer:Reference Counting}) |
| and hazard pointers |
| (Section~\ref{sec:defer:Hazard Pointers}) |
| can also cause concurrent readers to see different versions, but |
| RCU's lightweight readers make this more likely. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{defer/multver}} |
| \caption{Multiple RCU Data-Structure Versions} |
| \label{fig:defer:Multiple RCU Data-Structure Versions} |
| \end{figure} |
| |
| However, maintaining multiple versions can be even more surprising. |
| For example, consider |
| Figure~\ref{fig:defer:Multiple RCU Data-Structure Versions}, |
| in which a reader is traversing a linked list that is concurrently |
| updated.\footnote{ |
| RCU linked-list APIs may be found in |
| Section~\ref{sec:defer:RCU Linux-Kernel API}.} |
| In the first row of the figure, the reader is referencing data item~A, |
| and in the second row, it advances to~B, having thus far seen A followed by~B. |
| In the third row, an updater removes element~A and in the fourth row |
| an updater adds element~E to the end of the list. |
In the fifth and final row, the reader completes its traversal, having
seen elements A through E.
| |
| Except that there was no time at which such a list existed. |
| This situation might be even more surprising than that shown in |
| Figure~\ref{fig:defer:Deletion With Concurrent Readers}, |
| in which different concurrent readers see different versions. |
| In contrast, in |
| Figure~\ref{fig:defer:Multiple RCU Data-Structure Versions} |
| the reader sees a version that never actually existed! |
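This scenario can be reproduced in a single-threaded simulation that
interleaves the reader's and updater's steps by hand.
The element and function names below are hypothetical, and the pointer
assignments would be \co{rcu_dereference()} and \co{rcu_assign_pointer()}
in real concurrent code:

```c
/* Single-threaded simulation of the figure's five rows: a reader
 * paused mid-traversal observes the sequence A B C D E even though
 * no such list ever existed at any single time.  Names illustrative. */
#include <stddef.h>

struct elem {
	char name;
	struct elem *next;
};

static struct elem ea = { 'A' }, eb = { 'B' }, ec = { 'C' },
		   ed = { 'D' }, ee = { 'E' };
static struct elem *head;

void simulate(char *out)	/* Records the names the reader sees. */
{
	/* Initial list: A -> B -> C -> D. */
	ea.next = &eb;
	eb.next = &ec;
	ec.next = &ed;
	ed.next = NULL;
	ee.next = NULL;
	head = &ea;

	struct elem *reader = head;	/* Row 1: reader references A... */
	*out++ = reader->name;
	reader = reader->next;		/* Row 2: ...and advances to B. */
	*out++ = reader->name;

	head = ea.next;			/* Row 3: updater unlinks A (its
					 * free must be deferred). */
	ed.next = &ee;			/* Row 4: updater appends E. */

	for (reader = reader->next; reader; reader = reader->next)
		*out++ = reader->name;	/* Row 5: reader sees C, D, E. */
	*out = '\0';
}
```

Because the removed element~A still points at~B, the paused reader's
traversal remains safe, and it ends up seeing all five elements.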
| |
One way to resolve this strange situation is via weaker semantics.
| A reader traversal must encounter any data item that was present |
| during the full traversal (B, C, and~D), and might or might not |
| encounter data items that were present for only part of the |
| traversal (A and~E). |
Therefore, in this particular case, it is perfectly legitimate for
the reader traversal to encounter all five elements.
| If this outcome is problematic, another way to resolve this situation is |
| through use of stronger synchronization mechanisms, such as reader-writer |
| locking or clever use of timestamps or versioning. |
| Of course, stronger mechanisms will be more expensive, but then again |
| the engineering life is all about choices and tradeoffs. |
| |
| Strange though this situation might seem, it is entirely consistent with |
| the real world. |
| As we saw in |
| Section~\ref{sec:cpu:Overheads}, |
| the finite speed of light cannot be ignored within a computer system, |
| and it most certainly cannot be ignored outside of this system. |
| This in turn means that any data within the system representing state |
| in the real world outside of the system is always and forever outdated, |
| and thus inconsistent with the real world. |
| As a result, algorithms operating on real-world data must account for |
| inconsistent data. |
| In many cases, these algorithms are also perfectly capable of dealing |
| with inconsistencies within the system. |
| |
| The pre-BSD packet routing example laid out in |
| Section~\ref{sec:defer:Running Example} |
| is a case in point. |
The contents of a routing list are set by routing protocols, and these
| protocols feature significant delays (seconds or even minutes) to avoid |
| routing instabilities. |
Therefore, once a routing update reaches a given system, that system
might well have been sending packets the wrong way for quite some time.
| Sending a few more packets the wrong way for the few microseconds during |
| which the update is in flight is clearly not a problem because the same |
| higher-level protocol actions that deal with delayed routing updates |
| will also deal with internal inconsistencies. |
| |
| Nor is Internet routing the only situation tolerating inconsistencies. |
| To repeat, any algorithm in which data within a system tracks |
| outside-of-system state must tolerate inconsistencies, which includes |
| security policies (often set by committees of humans), storage configuration, |
| and WiFi access points, to say nothing of removable hardware such as |
| microphones, headsets, cameras, mice, printers, and much else besides. |
| Furthermore, the large number of Linux-kernel RCU API uses shown in |
| Figure~\ref{fig:defer:RCU Usage in the Linux Kernel}, |
| combined with the Linux kernel's heavy use of reference counting |
| and with increasing use of hazard pointers in other projects, demonstrates |
| that tolerance for such inconsistencies is more common than one might |
| imagine. |
| This is especially the case given that single-item lookups are much more |
| common than traversals: After all, (1)~concurrent updates are less likely |
| to affect a single-item lookup than they are a full traversal, and |
| (2)~an isolated single-item lookup cannot detect such inconsistencies. |
| |
| From a more theoretical viewpoint, there are even some special cases where |
| RCU readers can be considered to be fully ordered with updaters, despite |
| the fact that these readers might be executing the exact same sequence of |
| machine instructions that would be executed by a single-threaded program. |
| For example, referring back to |
| Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers} |
| on page~\pageref{lst:defer:Insertion and Deletion With Concurrent Readers}, |
| suppose that each reader thread invokes \co{access_route()} exactly |
| once during its lifetime, and that there is no other communication among |
| reader and updater threads. |
| \begin{fcvref}[ln:defer:Insertion and Deletion With Concurrent Readers] |
| Then each invocation of \co{access_route()} can be ordered after the |
| \co{ins_route()} invocation that produced the \co{route} structure |
| accessed by \clnref{access_rp} of the listing in \co{access_route()} |
| and ordered before any subsequent |
| \co{ins_route()} or \co{del_route()} invocation. |
| \end{fcvref} |
| |
| In summary, maintaining multiple versions is exactly what enables the |
| extremely low overheads of RCU readers, and as noted earlier, many |
| algorithms are unfazed by multiple versions. |
| However, there are algorithms that absolutely cannot handle multiple versions. |
| There are techniques for adapting such algorithms to |
| RCU~\cite{PaulEdwardMcKenneyPhD}, |
| but these are beyond the scope of this section. |
| |
| \paragraph{Exercises} |
| \label{sec:defer:Exercises} |
| |
| These examples assumed that a mutex was held across the entire |
| update operation, which would mean that there could be at most two |
| versions of the list active at a given time. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How would you modify the deletion example to permit more than two |
| versions of the list to be active? |
| }\QuickQuizAnswerB{ |
| One way of accomplishing this is as shown in |
| Listing~\ref{lst:defer:Concurrent RCU Deletion}. |
| |
| \begin{listing}[htbp] |
| \begin{VerbatimL} |
| spin_lock(&mylock); |
| p = search(head, key); |
| if (p == NULL) |
| spin_unlock(&mylock); |
| else { |
| list_del_rcu(&p->list); |
| spin_unlock(&mylock); |
| synchronize_rcu(); |
| kfree(p); |
| } |
| \end{VerbatimL} |
| \caption{Concurrent RCU Deletion} |
| \label{lst:defer:Concurrent RCU Deletion} |
| \end{listing} |
| |
| Note that this means that multiple concurrent deletions might be |
| waiting in \co{synchronize_rcu()}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| How many RCU versions of a given list can be |
| active at any given time? |
| }\QuickQuizAnswerE{ |
| That depends on the synchronization design. |
| If a semaphore protecting the update is held across the grace period, |
| then there can be at most two versions, the old and the new. |
| |
| However, suppose that only the search, the update, and the |
| \co{list_replace_rcu()} were protected by a lock, so that |
| the \co{synchronize_rcu()} was outside of that lock, similar |
| to the code shown in |
| Listing~\ref{lst:defer:Concurrent RCU Deletion}. |
| Suppose further that a large number of threads undertook an |
| RCU replacement at about the same time, and that readers |
| are also constantly traversing the data structure. |
| |
| Then the following sequence of events could occur, starting from |
| the end state of |
| Figure~\ref{fig:defer:Multiple RCU Data-Structure Versions}: |
| |
| \begin{enumerate} |
| \item Thread~A traverses the list, obtaining a reference to |
| Element~C. |
| \item Thread~B replaces Element~C with a new |
| Element~F, then waits for its \co{synchronize_rcu()} |
| call to return. |
| \item Thread~C traverses the list, obtaining a reference to |
| Element~F. |
| \item Thread~D replaces Element~F with a new |
| Element~G, then waits for its \co{synchronize_rcu()} |
| call to return. |
| \item Thread~E traverses the list, obtaining a reference to |
| Element~G. |
| \item Thread~F replaces Element~G with a new |
| Element~H, then waits for its \co{synchronize_rcu()} |
| call to return. |
| \item Thread~G traverses the list, obtaining a reference to |
| Element~H. |
| \item And the previous two steps repeat quickly with additional |
| new elements, so that all of them happen before any of |
| the \co{synchronize_rcu()} calls return. |
| \end{enumerate} |
| |
| Thus, there can be an arbitrary number of versions active, |
| limited only by memory and by how many updates could be completed |
| within a grace period. |
| But please note that data structures that are updated so frequently |
| are not likely to be good candidates for RCU. |
| Nevertheless, RCU can handle high update rates when necessary. |
| }\QuickQuizEndE |
| } |
| |
| \subsubsection{Summary of RCU Fundamentals} |
| \label{sec:defer:Summary of RCU Fundamentals} |
| |
| This section has described the three fundamental components of RCU-based |
| algorithms: |
| |
| \begin{enumerate} |
| \item a publish-subscribe mechanism for adding new data, |
| |
| \item a way of waiting for pre-existing RCU readers to finish |
| (see Section~\ref{sec:memorder:RCU} for more detail), |
| and |
| |
| \item a discipline of maintaining multiple versions to permit |
| change without harming or unduly delaying concurrent RCU readers. |
| \end{enumerate} |
| |
| \QuickQuiz{ |
| How can RCU updaters possibly delay RCU readers, given that |
| neither \co{rcu_read_lock()} nor \co{rcu_read_unlock()} |
| spin or block? |
| }\QuickQuizAnswer{ |
| The modifications undertaken by a given RCU updater will cause the |
| corresponding CPU to invalidate cache lines containing the data, |
| forcing the CPUs running concurrent RCU readers to incur expensive |
| cache misses. |
| (Can you design an algorithm that changes a data structure |
| \emph{without} |
| inflicting expensive cache misses on concurrent readers? |
| On subsequent readers?) |
| }\QuickQuizEnd |
| |
These three RCU components allow data to be updated in the face of concurrent
| readers that might be executing the same sequence of machine instructions |
| that would be used by a reader in a single-threaded implementation. |
| These RCU components can be combined in different ways to implement a |
| surprising variety of different types of RCU-based algorithms. |
| However, it is usually better to work at higher levels of abstraction. |
| To this end, the next section describes the Linux-kernel API, which |
| includes simple data structures such as lists. |