defer/rcuintro.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % defer/rcuintro.tex

 \subsection{Introduction to RCU}
 \label{sec:defer:Introduction to RCU}

 Suppose that you are writing a parallel real-time program that needs
 to access
 data that is subject to gradual change, perhaps due to changes in
 temperature, humidity, and barometric pressure.
 The real-time response constraints on this program are so severe that
 it is not permissible to spin or block, thus ruling out locking,
 nor is it permissible to use a retry loop, thus ruling out sequence locks.
 Fortunately, the temperature and pressure are normally controlled,
 so that a default hard-coded set of data is usually sufficient.

 However, the temperature, humidity, and pressure occasionally deviate too far
 from the defaults, and in such situations it is necessary to provide
 data that replaces the defaults.
 Because the temperature, humidity, and pressure change gradually,
 providing the updated values is not a matter of urgency, though
 it must happen within a few minutes.
 The program is to use a global pointer imaginatively named \co{gptr}
 that is normally \co{NULL}, which indicates that the default values
 are to be used.
 Otherwise, \co{gptr} points to a structure providing values imaginatively
 named \co{a}, \co{b}, and \co{c} that are to be
 used in the real-time calculations.

 How can we safely provide updated values when needed without impeding
 real-time readers?

 \begin{figure}[tb]
 \begin{center}
 \resizebox{3in}{!}{\includegraphics{defer/RCUListInsertClassic}}
 \end{center}
 \caption{Insertion With Concurrent Readers}
 \label{fig:defer:Insertion With Concurrent Readers}
 \end{figure}

 A classic approach is shown in
 Figure~\ref{fig:defer:Insertion With Concurrent Readers}.
 The first row shows the default state, with \co{gptr} equal to \co{NULL}.
 In the second row, we have allocated a structure which is uninitialized,
 as indicated by the question marks.
 In the third row, we have initialized the structure.
 Next, we assign \co{gptr} to reference this new element.\footnote{
 	On many computer systems, simple assignment is insufficient
 	due to interference from both the compiler and the CPU.
 	These issues will be covered in
 	Section~\ref{sec:defer:RCU Fundamentals}.}
 On modern general-purpose systems, this assignment is atomic in the
 sense that concurrent readers will see either a \co{NULL} pointer
 or a pointer to the new structure \co{p}, but not some mash-up
 containing bits from both values.
 Each reader is therefore guaranteed to either get the
 default value of \co{NULL} or to get the newly installed
 non-default values, but either way each reader will see
 a consistent result.
 Even better, readers need not use any expensive synchronization
 primitives, so this approach is quite suitable for real-time use.\footnote{
 	Again, on many computer systems, additional work is required
 	to prevent interference from the compiler, and, on DEC Alpha
 	systems, the CPU as well.
 	This will be covered in
 	Section~\ref{sec:defer:RCU Fundamentals}.}

 \begin{figure}[tb]
 \begin{center}
 \resizebox{3in}{!}{\includegraphics{defer/RCUListDeleteClassic}}
 \end{center}
 \caption{Deletion From Linked List With Concurrent Readers}
 \label{fig:defer:Deletion From Linked List With Concurrent Readers}
 \end{figure}

 But sooner or later, it will be necessary to remove data that is
 being referenced by concurrent readers.
 Let us move to a more complex example where we are removing an element
 from a linked list, as shown in
 Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.
 This list initially contains elements~\co{A}, \co{B}, and \co{C},
 and we need to remove element~\co{B}.
 First, we use \co{list_del()} to carry out the removal,\footnote{
 	And yet again, this approximates reality, which will be expanded
 	on in Section~\ref{sec:defer:RCU Fundamentals}.}
 at which point all new readers will see element~\co{B} as having been
 deleted from the list.
 However, there might be old readers still referencing this element.
 Once all these old readers have finished, we can safely free
 element~\co{B}, resulting in the situation shown at the bottom of
 the figure.

 But how can we tell when the readers are finished?

 It is tempting to consider a reference-counting scheme, but
 Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
 in
 Chapter~\ref{chp:Counting}
 shows that this can also result in long delays, just as can
 the locking and sequence-locking approaches that we already rejected.

 Let's consider the logical extreme where the readers do absolutely
 nothing to announce their presence.
 This approach clearly allows optimal performance for readers
 (after all, free is a very good price),
 but leaves open the question of how the updater can possibly
 determine when all the old readers are done.
 We clearly need some additional constraints if we are to provide
 a reasonable answer to this question.

 One constraint that fits well with some types of real-time operating
 systems (as well as some operating-system kernels) is to consider
 the case where threads are not subject to preemption.
 In such non-preemptible environments, each thread runs until it
 explicitly and voluntarily blocks.
 This means that an infinite loop without blocking will render a CPU
 useless for any other purpose from the start of the infinite loop
 onwards.\footnote{
 	In contrast, an infinite loop in a preemptible environment
 	might be preempted.
 	This infinite loop might still waste considerable CPU time,
 	but the CPU in question would nevertheless be able to do
 	other work.}
 Non-preemptibility also requires that threads be prohibited from blocking
 while holding spinlocks.
 Without this prohibition, all CPUs might be consumed by threads
 spinning attempting to acquire a spinlock held by a blocked thread.
 The spinning threads will not relinquish their CPUs until they acquire
 the lock, but the thread holding the lock cannot possibly release it
 until one of the spinning threads relinquishes a CPU.
 This is a classic deadlock situation.

 Let us impose this same constraint on reader threads traversing the
 linked list: such threads are not allowed to block until after
 completing their traversal.
 Returning to the second row of
 Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers},
 where the updater has just completed executing \co{list_del()},
 imagine that CPU~0 executes a context switch.
 Because readers are not permitted to block while traversing the linked
 list, we are guaranteed that all prior readers that might have been running on
 CPU~0 will have completed.
 Extending this line of reasoning to the other CPUs, once each CPU has
 been observed executing a context switch, we are guaranteed that all
 prior readers have completed, and that there are no longer any reader
 threads referencing element~\co{B}.
 The updater can then safely free element~\co{B}, resulting in the
 state shown at the bottom of
 Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.

 \begin{figure}[tb]
 \begin{center}
 \resizebox{3in}{!}{\includegraphics{defer/QSBRGracePeriod}}
 \end{center}
 \caption{Waiting for Pre-Existing Readers}
 \label{fig:defer:Waiting for Pre-Existing Readers}
 \end{figure}

 A schematic of this approach is shown in
 Figure~\ref{fig:defer:Waiting for Pre-Existing Readers},
 with time advancing from the top of the figure to the bottom.

 Although production-quality implementations of this approach can be
 quite complex, a toy implementatoin is exceedingly simple:

 \vspace{5pt}
 \begin{minipage}[t]{\columnwidth}
 \scriptsize
 \begin{verbatim}
   1 for_each_online_cpu(cpu)
   2   run_on(cpu);
 \end{verbatim}
 \end{minipage}
 \vspace{5pt}

 The \co{for_each_online_cpu()} primitive iterates over all CPUs, and
 the \co{run_on()} function causes the current thread to execute on the
 specified CPU, which forces the destination CPU to execute a context
 switch.
 Therefore, once the \co{for_each_online_cpu()} has completed, each CPU
 has executed a context switch, which in turn guarantees that
 all pre-existing reader threads have completed.

 Please note that this approach is \emph{not} production quality.
 Correct handling of a number of corner cases and the need for a number
 of powerful optimizations mean that production-quality implementations
 have significant additional complexity.
 In addition, RCU implementations for preemptible environments
 require that readers actually do something.
 However, this simple non-preemptible approach is conceptually complete,
 and forms a good initial basis for understanding the RCU fundamentals
 covered in the following section.
	% defer/rcuintro.tex

	\subsection{Introduction to RCU}
	\label{sec:defer:Introduction to RCU}

	Suppose that you are writing a parallel real-time program that needs
	to access
	data that is subject to gradual change, perhaps due to changes in
	temperature, humidity, and barometric pressure.
	The real-time response constraints on this program are so severe that
	it is not permissible to spin or block, thus ruling out locking,
	nor is it permissible to use a retry loop, thus ruling out sequence locks.
	Fortunately, the temperature and pressure are normally controlled,
	so that a default hard-coded set of data is usually sufficient.

	However, the temperature, humidity, and pressure occasionally deviate too far
	from the defaults, and in such situations it is necessary to provide
	data that replaces the defaults.
	Because the temperature, humidity, and pressure change gradually,
	providing the updated values is not a matter of urgency, though
	it must happen within a few minutes.
	The program is to use a global pointer imaginatively named \co{gptr}
	that is normally \co{NULL}, which indicates that the default values
	are to be used.
	Otherwise, \co{gptr} points to a structure providing values imaginatively
	named \co{a}, \co{b}, and \co{c} that are to be
	used in the real-time calculations.

	How can we safely provide updated values when needed without impeding
	real-time readers?

	\begin{figure}[tb]
	\begin{center}
	\resizebox{3in}{!}{\includegraphics{defer/RCUListInsertClassic}}
	\end{center}
	\caption{Insertion With Concurrent Readers}
	\label{fig:defer:Insertion With Concurrent Readers}
	\end{figure}

	A classic approach is shown in
	Figure~\ref{fig:defer:Insertion With Concurrent Readers}.
	The first row shows the default state, with \co{gptr} equal to \co{NULL}.
	In the second row, we have allocated a structure which is uninitialized,
	as indicated by the question marks.
	In the third row, we have initialized the structure.
	Next, we assign \co{gptr} to reference this new element.\footnote{
	On many computer systems, simple assignment is insufficient
	due to interference from both the compiler and the CPU.
	These issues will be covered in
	Section~\ref{sec:defer:RCU Fundamentals}.}
	On modern general-purpose systems, this assignment is atomic in the
	sense that concurrent readers will see either a \co{NULL} pointer
	or a pointer to the new structure \co{p}, but not some mash-up
	containing bits from both values.
	Each reader is therefore guaranteed to either get the
	default value of \co{NULL} or to get the newly installed
	non-default values, but either way each reader will see
	a consistent result.
	Even better, readers need not use any expensive synchronization
	primitives, so this approach is quite suitable for real-time use.\footnote{
	Again, on many computer systems, additional work is required
	to prevent interference from the compiler, and, on DEC Alpha
	systems, the CPU as well.
	This will be covered in
	Section~\ref{sec:defer:RCU Fundamentals}.}

	\begin{figure}[tb]
	\begin{center}
	\resizebox{3in}{!}{\includegraphics{defer/RCUListDeleteClassic}}
	\end{center}
	\caption{Deletion From Linked List With Concurrent Readers}
	\label{fig:defer:Deletion From Linked List With Concurrent Readers}
	\end{figure}

	But sooner or later, it will be necessary to remove data that is
	being referenced by concurrent readers.
	Let us move to a more complex example where we are removing an element
	from a linked list, as shown in
	Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.
	This list initially contains elements~\co{A}, \co{B}, and \co{C},
	and we need to remove element~\co{B}.
	First, we use \co{list_del()} to carry out the removal,\footnote{
	And yet again, this approximates reality, which will be expanded
	on in Section~\ref{sec:defer:RCU Fundamentals}.}
	at which point all new readers will see element~\co{B} as having been
	deleted from the list.
	However, there might be old readers still referencing this element.
	Once all these old readers have finished, we can safely free
	element~\co{B}, resulting in the situation shown at the bottom of
	the figure.

	But how can we tell when the readers are finished?

	It is tempting to consider a reference-counting scheme, but
	Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
	in
	Chapter~\ref{chp:Counting}
	shows that this can also result in long delays, just as can
	the locking and sequence-locking approaches that we already rejected.

	Let's consider the logical extreme where the readers do absolutely
	nothing to announce their presence.
	This approach clearly allows optimal performance for readers
	(after all, free is a very good price),
	but leaves open the question of how the updater can possibly
	determine when all the old readers are done.
	We clearly need some additional constraints if we are to provide
	a reasonable answer to this question.

	One constraint that fits well with some types of real-time operating
	systems (as well as some operating-system kernels) is to consider
	the case where threads are not subject to preemption.
	In such non-preemptible environments, each thread runs until it
	explicitly and voluntarily blocks.
	This means that an infinite loop without blocking will render a CPU
	useless for any other purpose from the start of the infinite loop
	onwards.\footnote{
	In contrast, an infinite loop in a preemptible environment
	might be preempted.
	This infinite loop might still waste considerable CPU time,
	but the CPU in question would nevertheless be able to do
	other work.}
	Non-preemptibility also requires that threads be prohibited from blocking
	while holding spinlocks.
	Without this prohibition, all CPUs might be consumed by threads
	spinning attempting to acquire a spinlock held by a blocked thread.
	The spinning threads will not relinquish their CPUs until they acquire
	the lock, but the thread holding the lock cannot possibly release it
	until one of the spinning threads relinquishes a CPU.
	This is a classic deadlock situation.

	Let us impose this same constraint on reader threads traversing the
	linked list: such threads are not allowed to block until after
	completing their traversal.
	Returning to the second row of
	Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers},
	where the updater has just completed executing \co{list_del()},
	imagine that CPU~0 executes a context switch.
	Because readers are not permitted to block while traversing the linked
	list, we are guaranteed that all prior readers that might have been running on
	CPU~0 will have completed.
	Extending this line of reasoning to the other CPUs, once each CPU has
	been observed executing a context switch, we are guaranteed that all
	prior readers have completed, and that there are no longer any reader
	threads referencing element~\co{B}.
	The updater can then safely free element~\co{B}, resulting in the
	state shown at the bottom of
	Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.

	\begin{figure}[tb]
	\begin{center}
	\resizebox{3in}{!}{\includegraphics{defer/QSBRGracePeriod}}
	\end{center}
	\caption{Waiting for Pre-Existing Readers}
	\label{fig:defer:Waiting for Pre-Existing Readers}
	\end{figure}

	A schematic of this approach is shown in
	Figure~\ref{fig:defer:Waiting for Pre-Existing Readers},
	with time advancing from the top of the figure to the bottom.

	Although production-quality implementations of this approach can be
	quite complex, a toy implementatoin is exceedingly simple:

	\vspace{5pt}
	\begin{minipage}[t]{\columnwidth}
	\scriptsize
	\begin{verbatim}
	1 for_each_online_cpu(cpu)
	2 run_on(cpu);
	\end{verbatim}
	\end{minipage}
	\vspace{5pt}

	The \co{for_each_online_cpu()} primitive iterates over all CPUs, and
	the \co{run_on()} function causes the current thread to execute on the
	specified CPU, which forces the destination CPU to execute a context
	switch.
	Therefore, once the \co{for_each_online_cpu()} has completed, each CPU
	has executed a context switch, which in turn guarantees that
	all pre-existing reader threads have completed.

	Please note that this approach is \emph{not} production quality.
	Correct handling of a number of corner cases and the need for a number
	of powerful optimizations mean that production-quality implementations
	have significant additional complexity.
	In addition, RCU implementations for preemptible environments
	require that readers actually do something.
	However, this simple non-preemptible approach is conceptually complete,
	and forms a good initial basis for understanding the RCU fundamentals
	covered in the following section.