blob: 0b4a51ccca85a6ae92a0e2054e407a5528219044 [file] [log] [blame]
% defer/rcuintro.tex
\subsection{Introduction to RCU}
\label{sec:defer:Introduction to RCU}
Suppose that you are writing a parallel real-time program that needs
to access
data that is subject to gradual change, perhaps due to changes in
temperature, humidity, and barometric pressure.
The real-time response constraints on this program are so severe that
it is not permissible to spin or block, thus ruling out locking,
nor is it permissible to use a retry loop, thus ruling out sequence locks.
Fortunately, the temperature and pressure are normally controlled,
so that a default hard-coded set of data is usually sufficient.
However, the temperature, humidity, and pressure occasionally deviate too far
from the defaults, and in such situations it is necessary to provide
data that replaces the defaults.
Because the temperature, humidity, and pressure change gradually,
providing the updated values is not a matter of urgency, though
it must happen within a few minutes.
The program is to use a global pointer imaginatively named \co{gptr}
that is normally \co{NULL}, which indicates that the default values
are to be used.
Otherwise, \co{gptr} points to a structure providing values imaginatively
named \co{a}, \co{b}, and \co{c} that are to be
used in the real-time calculations.
How can we safely provide updated values when needed without impeding
real-time readers?
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{defer/RCUListInsertClassic}}
\end{center}
\caption{Insertion With Concurrent Readers}
\label{fig:defer:Insertion With Concurrent Readers}
\end{figure}
A classic approach is shown in
Figure~\ref{fig:defer:Insertion With Concurrent Readers}.
The first row shows the default state, with \co{gptr} equal to \co{NULL}.
In the second row, we have allocated a structure which is uninitialized,
as indicated by the question marks.
In the third row, we have initialized the structure.
Next, we assign \co{gptr} to reference this new element.\footnote{
On many computer systems, simple assignment is insufficient
due to interference from both the compiler and the CPU.
These issues will be covered in
Section~\ref{sec:defer:RCU Fundamentals}.}
On modern general-purpose systems, this assignment is atomic in the
sense that concurrent readers will see either a \co{NULL} pointer
or a pointer to the new structure \co{p}, but not some mash-up
containing bits from both values.
Each reader is therefore guaranteed to either get the
default value of \co{NULL} or to get the newly installed
non-default values, but either way each reader will see
a consistent result.
Even better, readers need not use any expensive synchronization
primitives, so this approach is quite suitable for real-time use.\footnote{
Again, on many computer systems, additional work is required
to prevent interference from the compiler, and, on DEC Alpha
systems, the CPU as well.
This will be covered in
Section~\ref{sec:defer:RCU Fundamentals}.}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{defer/RCUListDeleteClassic}}
\end{center}
\caption{Deletion From Linked List With Concurrent Readers}
\label{fig:defer:Deletion From Linked List With Concurrent Readers}
\end{figure}
But sooner or later, it will be necessary to remove data that is
being referenced by concurrent readers.
Let us move to a more complex example where we are removing an element
from a linked list, as shown in
Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.
This list initially contains elements~\co{A}, \co{B}, and \co{C},
and we need to remove element~\co{B}.
First, we use \co{list_del()} to carry out the removal,\footnote{
And yet again, this approximates reality, which will be expanded
on in Section~\ref{sec:defer:RCU Fundamentals}.}
at which point all new readers will see element~\co{B} as having been
deleted from the list.
However, there might be old readers still referencing this element.
Once all these old readers have finished, we can safely free
element~\co{B}, resulting in the situation shown at the bottom of
the figure.
But how can we tell when the readers are finished?
It is tempting to consider a reference-counting scheme, but
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
in
Chapter~\ref{chp:Counting}
shows that this can also result in long delays, just as can
the locking and sequence-locking approaches that we already rejected.
Let's consider the logical extreme where the readers do absolutely
nothing to announce their presence.
This approach clearly allows optimal performance for readers
(after all, free is a very good price),
but leaves open the question of how the updater can possibly
determine when all the old readers are done.
We clearly need some additional constraints if we are to provide
a reasonable answer to this question.
One constraint that fits well with some types of real-time operating
systems (as well as some operating-system kernels) is to consider
the case where threads are not subject to preemption.
In such non-preemptible environments, each thread runs until it
explicitly and voluntarily blocks.
This means that an infinite loop without blocking will render a CPU
useless for any other purpose from the start of the infinite loop
onwards.\footnote{
In contrast, an infinite loop in a preemptible environment
might be preempted.
This infinite loop might still waste considerable CPU time,
but the CPU in question would nevertheless be able to do
other work.}
Non-preemptibility also requires that threads be prohibited from blocking
while holding spinlocks.
Without this prohibition, all CPUs might be consumed by threads
spinning attempting to acquire a spinlock held by a blocked thread.
The spinning threads will not relinquish their CPUs until they acquire
the lock, but the thread holding the lock cannot possibly release it
until one of the spinning threads relinquishes a CPU.
This is a classic deadlock situation.
Let us impose this same constraint on reader threads traversing the
linked list: such threads are not allowed to block until after
completing their traversal.
Returning to the second row of
Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers},
where the updater has just completed executing \co{list_del()},
imagine that CPU~0 executes a context switch.
Because readers are not permitted to block while traversing the linked
list, we are guaranteed that all prior readers that might have been running on
CPU~0 will have completed.
Extending this line of reasoning to the other CPUs, once each CPU has
been observed executing a context switch, we are guaranteed that all
prior readers have completed, and that there are no longer any reader
threads referencing element~\co{B}.
The updater can then safely free element~\co{B}, resulting in the
state shown at the bottom of
Figure~\ref{fig:defer:Deletion From Linked List With Concurrent Readers}.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{defer/QSBRGracePeriod}}
\end{center}
\caption{Waiting for Pre-Existing Readers}
\label{fig:defer:Waiting for Pre-Existing Readers}
\end{figure}
A schematic of this approach is shown in
Figure~\ref{fig:defer:Waiting for Pre-Existing Readers},
with time advancing from the top of the figure to the bottom.
Although production-quality implementations of this approach can be
quite complex, a toy implementatoin is exceedingly simple:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 for_each_online_cpu(cpu)
2 run_on(cpu);
\end{verbatim}
\end{minipage}
\vspace{5pt}
The \co{for_each_online_cpu()} primitive iterates over all CPUs, and
the \co{run_on()} function causes the current thread to execute on the
specified CPU, which forces the destination CPU to execute a context
switch.
Therefore, once the \co{for_each_online_cpu()} has completed, each CPU
has executed a context switch, which in turn guarantees that
all pre-existing reader threads have completed.
Please note that this approach is \emph{not} production quality.
Correct handling of a number of corner cases and the need for a number
of powerful optimizations mean that production-quality implementations
have significant additional complexity.
In addition, RCU implementations for preemptible environments
require that readers actually do something.
However, this simple non-preemptible approach is conceptually complete,
and forms a good initial basis for understanding the RCU fundamentals
covered in the following section.