% defer/rcuintro.tex
\subsection{Introduction to RCU}
\label{sec:defer:Introduction to RCU}
The approaches discussed in the preceding sections have provided
some scalability but decidedly non-ideal performance for the
Pre-BSD routing table.
Therefore, in the spirit of ``only those who have gone too far
know how far you can go'',\footnote{
With apologies to T.~S.~Eliot.}
we will go all the way, looking into algorithms in which concurrent
readers execute the same sequence of assembly language instructions as
would a single-threaded lookup, despite the presence of concurrent
updates.
Of course, this laudable goal might raise serious implementability
questions, but we cannot possibly succeed if we don't even try!
\subsubsection{Minimal Insertion and Deletion}
\label{sec:defer:Minimal Insertion and Deletion}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{defer/RCUListInsertClassic}}
\caption{Insertion With Concurrent Readers}
\label{fig:defer:Insertion With Concurrent Readers}
\end{figure}
To minimize implementability concerns, we focus on a minimal
data structure, which consists of a single global pointer that is either
\co{NULL} or references a single structure.
Interestingly enough, this simple data structure is used in
production~\cite{GeoffRomer2018C++DeferredReclamationP0561R4}.
A classic approach for insertion is shown in
Figure~\ref{fig:defer:Insertion With Concurrent Readers}.
The first row shows the initial state, with \co{gptr} equal to \co{NULL}.
In the second row, we have allocated a structure which is uninitialized,
as indicated by the question marks.
In the third row, we have initialized the structure.
We might hope to assign \co{gptr} to reference this new element
using a simple C-language assignment statement, resulting in the
state shown in the fourth and final row.
Unfortunately, a quick review of
Section~\ref{sec:toolsoftrade:Shared-Variable Shenanigans}
dashes these hopes.
Therefore, the updater cannot use a simple C-language assignment, but
must instead use \co{smp_store_release()} as shown in the figure,
or, as will be seen, \co{rcu_assign_pointer()}.
Similarly, one might hope that readers could use a single C-language
assignment statement to fetch the value of \co{gptr}, and be guaranteed
to either get the old value of \co{NULL} or to get the newly installed
pointer, but either way see a valid result.
Unfortunately, Section~\ref{sec:toolsoftrade:Shared-Variable Shenanigans}
also dashes these hopes.
To obtain this guarantee, readers must instead use \co{READ_ONCE()},
or, as will be seen, \co{rcu_dereference()}.
However, on most modern computer systems, each of these two primitives
can be implemented with a single load instruction, exactly the instruction
that would be used in single-threaded code.
Therefore, despite the serious implementability questions, it really
is possible to add new data to linked data structures while allowing
concurrent readers to execute the same sequence of machine instructions
that is required in single-threaded code.
This no-cost approach to concurrent reading provides excellent performance
and scalability, and also is eminently suitable for real-time use.
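The following minimal sketch illustrates this pattern, assuming the
Linux-kernel \co{smp_store_release()} and \co{READ_ONCE()} primitives;
the \co{publish_route()} and \co{fetch_route()} helper names and the
hard-coded field values are purely illustrative:

\begin{VerbatimN}[samepage=true]
struct route *gptr; /* Initially NULL. */

void publish_route(struct route *rp)
{
  rp->addr = 42;  /* Fully initialize... */
  rp->iface = 1;
  smp_store_release(&gptr, rp); /* ...then publish. */
}

struct route *fetch_route(void)
{
  /* A single load instruction on most systems. */
  return READ_ONCE(gptr);
}
\end{VerbatimN}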
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{defer/RCUListDeleteClassic}}
\caption{Deletion With Concurrent Readers}
\label{fig:defer:Deletion With Concurrent Readers}
\end{figure}
But sooner or later, it will be necessary to remove data that is
being referenced by concurrent readers.
As can be seen in
Figure~\ref{fig:defer:Deletion With Concurrent Readers},
the first step is easy.
Again taking the lessons from
Section~\ref{sec:toolsoftrade:Shared-Variable Shenanigans}
to heart, use \co{smp_store_release()} to \co{NULL} the pointer,\footnote{
Strictly speaking, \co{WRITE_ONCE()} suffices for a \co{NULL}
pointer, but \co{smp_store_release()} is absolutely required
when storing non-\co{NULL} pointers.
Or, as we will see, \co{rcu_assign_pointer()}.}
thus moving from the first row to the second in the figure.
At this point, pre-existing readers will see the old structure with
\co{->addr} of 42 and \co{->iface} of 1, but new readers will see
a \co{NULL} pointer, that is, concurrent readers can disagree on
the state, as indicated by the ``2 Versions'' in the figure.
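In code, this removal step is a single store, sketched here as a
hypothetical \co{remove_route()} helper that assumes a single updater:

\begin{VerbatimN}[samepage=true]
struct route *remove_route(void)
{
  struct route *old_rp = gptr;

  smp_store_release(&gptr, NULL); /* New readers see NULL. */
  /* Pre-existing readers might still reference *old_rp,
     so it cannot yet be freed: two versions coexist. */
  return old_rp;
}
\end{VerbatimN}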
\QuickQuiz{}
Why does
Figure~\ref{fig:defer:Deletion With Concurrent Readers}
use \co{smp_store_release()} given that it is storing
a \co{NULL} pointer?
Wouldn't \co{WRITE_ONCE()} work just as well in this case,
given that there is no structure initialization to order
against the store of the \co{NULL} pointer?
\QuickQuizAnswer{
Yes, it would.
Because a \co{NULL} pointer is being assigned, there is nothing
to order against, so no need for \co{smp_store_release()}.
In contrast, when assigning a non-\co{NULL} pointer, it is
necessary to use \co{smp_store_release()} in order to ensure
that initialization of the pointed-to structure is carried
out before assignment of the pointer.
In short, \co{WRITE_ONCE()} would work, and would
save a little bit of CPU time on some architectures.
However, as we will see, software-engineering concerns
will motivate use of a special \co{rcu_assign_pointer()}
that is quite similar to \co{smp_store_release()}.
} \QuickQuizEnd
\QuickQuiz{}
Readers running concurrently with each other and with the procedure
outlined in
Figure~\ref{fig:defer:Deletion With Concurrent Readers}
can disagree on the value of \co{gptr}.
Isn't that just a wee bit problematic???
\QuickQuizAnswer{
Not necessarily.
As hinted at in Sections~\ref{sec:cpu:Hardware Optimizations}
and~\ref{sec:cpu:Hardware Free Lunch?},
speed-of-light delays mean that a computer's data is always
stale compared to whatever external reality that data is intended
to model.
Real-world algorithms therefore absolutely must tolerate
inconsistencies between external reality and the in-computer
data reflecting that reality.
A significant fraction of those algorithms are also able to
tolerate some degree of inconsistency within the in-computer
data.
Section~\ref{sec:datastruct:RCU-Protected Hash Table Discussion}
discusses this point in more detail.
Please note that this need to tolerate inconsistent and stale
data is not limited to RCU.
It also applies to reference counting, hazard pointers, sequence
locks, and even to some locking use cases.
For example, if you compute some quantity while holding a lock,
but use that computed quantity after releasing that lock,
you might well be using stale data.
After all, the data from which that quantity was computed could
change arbitrarily as soon as the lock was released.
So yes, RCU readers can see stale and inconsistent data, but no,
this is not necessarily problematic.
} \QuickQuizEnd
We get back to a single version simply by waiting for all the
pre-existing readers to complete, as shown in row~3.
At that point, all the pre-existing readers are done, and no later
reader has a path to the old data item, so there can no longer be
any readers referencing it.
It may therefore be safely freed, as shown on row~4.
Thus, given a way to wait for pre-existing readers to complete,
it is possible to both add data to and remove data from a linked
data structure, despite the readers executing the same sequence
of machine instructions that would be appropriate for single-threaded
execution!
So perhaps going too far was not too far after all.
But how can we tell when all of the pre-existing readers have in
fact completed?
This question is the topic of the next section.
\subsubsection{Waiting for Readers}
\label{sec:defer:Waiting for Readers}
It is tempting to base reader waiting on reference counting, but
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
in
Chapter~\ref{chp:Counting}
shows that concurrent reference counting results in extreme overhead,
as we already saw in
Section~\ref{sec:defer:Reference Counting}.
Hazard pointers profoundly reduce this overhead, but, as we saw in
Section~\ref{sec:defer:Hazard Pointers}, not to zero.
A second approach observes that memory synchronization is expensive,
and therefore uses registers instead, namely each CPU's or thread's
program counter (PC), thus imposing no overhead on readers, at least
in the absence of concurrent updates.
The updater polls each relevant PC, and if that PC is not within read-side
code, then the corresponding CPU or thread is within a quiescent state,
in turn signalling the completion of any reader that might have access
to the newly removed data element.
Once all CPUs' or threads' PCs have been observed to be outside of any
reader, the grace period has completed.
Please note that this approach poses some serious challenges, including
memory ordering, functions that are \emph{sometimes} invoked from readers,
and ever-exciting code-motion optimizations.
Nevertheless, this approach is said to be used in
production~\cite{MikeAsh2015Apple}.
A third approach is to simply wait for a fixed period of time that is
long enough to comfortably exceed the lifetime of any reasonable
reader~\cite{Jacobson93,AjuJohn95}.
This can work quite well in hard real-time systems~\cite{YuxinRen2018RTRCU},
but in less exotic
settings, Murphy says that it is critically important to be prepared
even for unreasonably long-lived readers.
To see this, consider the consequences of failing to wait long enough:
A data item will be freed while the unreasonable reader is still
referencing it, and that item might well be immediately reallocated,
possibly even as a data item of some other type.
The unreasonable reader and the unwitting reallocator would then
be attempting to use the same memory for two very different purposes.
The ensuing mess will at best be exceedingly difficult to debug.
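A minimal sketch of this fixed-wait strategy follows, in which
\co{MAX_READER_LIFETIME} is a hypothetical bound (in seconds) that
Murphy will sooner or later falsify:

\begin{VerbatimN}[samepage=true]
void remove_route_fixed_wait(void)
{
  struct route *old_rp = gptr;

  smp_store_release(&gptr, NULL);
  sleep(MAX_READER_LIFETIME); /* Outwait reasonable readers. */
  free(old_rp); /* Fatal if an unreasonable reader remains. */
}
\end{VerbatimN}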
A fourth approach is to wait forever, secure in the knowledge that
doing so will outwait even the most unreasonable reader.
This approach is also called ``leaking memory'', and has a bad reputation
due to the fact that memory leaks often require untimely and
highly inconvenient reboots.
However, reputation notwithstanding, this is a viable strategy when
the update rate and the uptime are both sharply bounded.
For example, this approach could work well in a high-availability
cluster where systems were periodically crashed in order to ensure
that the cluster really remained highly available.\footnote{
The program that forces the periodic crashing is sometimes
known as a ``chaos monkey'':
\url{https://netflix.github.io/chaosmonkey/}.}
Leaking the memory is also a viable strategy in environments having
garbage collectors, in which case the garbage collector can be thought
of as plugging the leak~\cite{Kung80}.
However, if your environment lacks a garbage collector, read on!
A fifth approach avoids the periodic crashes in favor of periodically
``stopping the world'', as exemplified by the traditional stop-the-world
garbage collector.
This approach was also heavily used during the decades before
ubiquitous connectivity, when it was common practice to power systems
off at the end of each working day.
However, in today's always-connected always-on world, stopping the world
can gravely degrade response times, which has been one motivation for the
development of concurrent garbage collectors~\cite{DavidFBacon2003RTGC}.
Furthermore, we only need all pre-existing readers to complete, not to
complete all at the same time.
This observation leads to the sixth approach, which is stopping
one CPU or thread at a time.
This approach has the advantage of not degrading reader response times
at all, let alone gravely.
Furthermore, numerous applications already have states (termed
\emph{quiescent states}) that can be
reached only after all pre-existing readers are done.
In transaction-processing systems, the time between a pair of
successive transactions might be a quiescent state.
In reactive systems, the state between a pair of successive events
might be a quiescent state.
Within non-preemptive operating-systems kernels, a context switch can be
a quiescent state~\cite{McKenney98}.
Either way, once all CPUs and/or threads have passed through a quiescent
state, the system is said to have completed a \emph{grace period},
such that all pre-existing readers have completed, and it is now
safe to free any previously-removed data items.\footnote{
It is possible to do much more with RCU than simply defer
reclamation of memory, but deferred reclamation is an
excellent place to start.}
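For example, the QSBR flavor of the userspace RCU
library~\cite{MathieuDesnoyers2009URCU} has each reader thread
explicitly announce its quiescent states, as in the following sketch,
in which \co{handle_one_event()} is a hypothetical application
function that may read RCU-protected data:

\begin{VerbatimN}[samepage=true]
#include <urcu-qsbr.h>

void *reader_thread(void *arg)
{
  rcu_register_thread();
  for (;;) {
    handle_one_event(); /* May reference RCU-protected data. */
    rcu_quiescent_state(); /* Announce a quiescent state. */
  }
}
\end{VerbatimN}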
Within a non-preemptive operating-system kernel, for context switch to be
a valid quiescent state, readers must be prohibited from blocking while
referencing a given instance of the data structure obtained via the \co{gptr}
pointer shown in
Figures~\ref{fig:defer:Insertion With Concurrent Readers}
and~\ref{fig:defer:Deletion With Concurrent Readers}.
This no-blocking constraint is consistent with similar constraints
on pure spinlocks, where a CPU is forbidden from blocking while
holding a spinlock.
Without this prohibition, all CPUs might be consumed by threads
spinning while attempting to acquire a spinlock held by a blocked thread.
The spinning threads will not relinquish their CPUs until they acquire
the lock, but the thread holding the lock cannot possibly release it
until one of the spinning threads relinquishes a CPU.
This is a classic deadlock situation, and this deadlock is avoided
by prohibiting blocking while holding a spinlock.
Again, this same constraint is imposed on reader threads dereferencing
\co{gptr}: such threads are not allowed to block until after
they are done using the pointed-to data item.
Returning to the second row of
Figure~\ref{fig:defer:Deletion With Concurrent Readers},
where the updater has just completed executing the \co{WRITE_ONCE()},
imagine that CPU~0 executes a context switch.
Because readers are not permitted to block while traversing the linked
list, we are guaranteed that all prior readers that might have been running on
CPU~0 will have completed.
Extending this line of reasoning to the other CPUs, once each CPU has
been observed executing a context switch, we are guaranteed that all
prior readers have completed, and that there are no longer any reader
threads referencing the newly removed data element.
The updater can then safely free that data element, resulting in the
state shown at the bottom of
Figure~\ref{fig:defer:Deletion With Concurrent Readers}.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{defer/QSBRGracePeriod}}
\caption{QSBR: Waiting for Pre-Existing Readers}
\label{fig:defer:QSBR: Waiting for Pre-Existing Readers}
\end{figure}
This approach is termed \emph{quiescent state based reclamation}
(QSBR)~\cite{ThomasEHart2006a}.
A QSBR schematic is shown in
Figure~\ref{fig:defer:QSBR: Waiting for Pre-Existing Readers},
with time advancing from the top of the figure to the bottom.
CPU~1 does the \co{WRITE_ONCE()} that removes the current data
item (presumably having previously read the pointer value and
availed itself of appropriate synchronization), then waits
for readers.
This wait operation results in an immediate context switch, which is
a quiescent state, which in turn means that all prior reads on CPU~1
have completed.
Next, CPU~2 does a context switch, so that all readers on CPUs~1 and~2
are now known to have completed.
Finally, CPU~3 does a context switch.
At this point, all readers throughout the entire system are known to
have completed, so the grace period ends, permitting CPU~1 to free
the old data item.
\QuickQuiz{}
In Figure~\ref{fig:defer:QSBR: Waiting for Pre-Existing Readers},
the last of CPU~3's readers that could possibly have
access to the old data item ended before the grace period
even started!
So why would anyone bother waiting until CPU~3's later context
switch???
\QuickQuizAnswer{
Because that waiting is exactly what enables readers to use
the same sequence of instructions that is appropriate for
single-threaded situations.
In other words, waiting enables excellent read-side performance,
scalability, and real-time response.
} \QuickQuizEnd
\subsubsection{Toy Implementation}
\label{sec:defer:Toy Implementation}
Although production-quality QSBR implementations can be quite complex,
a toy non-preemptive Linux-kernel implementation is exceedingly simple:
\begin{VerbatimN}[samepage=true]
void synchronize_rcu(void)
{
  int cpu;

  for_each_online_cpu(cpu)
    sched_setaffinity(current->pid, cpumask_of(cpu));
}
\end{VerbatimN}
The \co{for_each_online_cpu()} primitive iterates over all CPUs, and
the \co{sched_setaffinity()} function causes the current thread to
execute on the specified CPU, which forces the destination CPU to execute
a context switch.
Therefore, once the \co{for_each_online_cpu()} has completed, each CPU
has executed a context switch, which in turn guarantees that
all pre-existing reader threads have completed.
\begin{listing}[tbp]
{ \scriptsize
\begin{verbbox}[\LstLineNo]
struct route *gptr;

int access_route(int (*f)(struct route *rp))
{
  int ret = -1;
  struct route *rp;

  rcu_read_lock();
  rp = rcu_dereference(gptr);
  if (rp)
    ret = f(rp);
  rcu_read_unlock();
  return ret;
}

struct route *ins_route(struct route *rp)
{
  struct route *old_rp;

  spin_lock(&route_lock);
  old_rp = gptr;
  rcu_assign_pointer(gptr, rp);
  spin_unlock(&route_lock);
  return old_rp;
}

int del_route(void)
{
  struct route *old_rp;

  spin_lock(&route_lock);
  old_rp = gptr;
  RCU_INIT_POINTER(gptr, NULL);
  spin_unlock(&route_lock);
  synchronize_rcu();
  free(old_rp);
  return !!old_rp;
}
\end{verbbox}
}
\centering
\theverbbox
\caption{Insertion and Deletion With Concurrent Readers}
\label{lst:defer:Insertion and Deletion With Concurrent Readers}
\end{listing}
Please note that this approach is \emph{not} production quality.
Correct handling of a number of corner cases and the need for a number
of powerful optimizations mean that production-quality implementations
have significant additional complexity.
In addition, RCU implementations for preemptible environments
require that readers actually do something, which in non-real-time
Linux-kernel environments can be as simple as defining
\co{rcu_read_lock()} and \co{rcu_read_unlock()} as \co{preempt_disable()}
and \co{preempt_enable()}, respectively.\footnote{
Some toy RCU implementations that handle preempted
read-side critical sections are shown in
Appendix~\ref{chp:app:``Toy'' RCU Implementations}.}
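In other words, the toy non-real-time definitions can be as simple as
the following, though the production Linux kernel's definitions add
debugging and tracing hooks:

\begin{VerbatimN}[samepage=true]
#define rcu_read_lock()   preempt_disable()
#define rcu_read_unlock() preempt_enable()
\end{VerbatimN}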
However, this simple non-preemptible approach is conceptually complete,
and demonstrates that it really is possible to provide read-side
synchronization at zero cost, even in the face of concurrent updates.
In fact,
Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}
shows how reading (\co{access_route()}),
Figure~\ref{fig:defer:Insertion With Concurrent Readers}'s
insertion (\co{ins_route()}) and
Figure~\ref{fig:defer:Deletion With Concurrent Readers}'s
deletion (\co{del_route()}) can
be implemented.
(A slightly more capable routing table is shown in
Section~\ref{sec:defer:RCU for Pre-BSD Routing}.)
\QuickQuiz{}
What is the point of \co{rcu_read_lock()} and \co{rcu_read_unlock()} in
Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}?
Why not just let the quiescent states speak for themselves?
\QuickQuizAnswer{
Recall that readers are not permitted to pass through a quiescent
state.
For example, within the Linux kernel, RCU readers are not permitted
to execute a context switch.
Use of \co{rcu_read_lock()} and \co{rcu_read_unlock()} enables
debug checks for improperly placed quiescent states, making it
easy to find bugs that would otherwise be difficult to find,
intermittent, and quite destructive.
} \QuickQuizEnd
\QuickQuiz{}
What is the point of \co{rcu_dereference()}, \co{rcu_assign_pointer()}
and \co{RCU_INIT_POINTER()} in
Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}?
Why not just use \co{READ_ONCE()}, \co{smp_store_release()}, and
\co{WRITE_ONCE()}, respectively?
\QuickQuizAnswer{
The RCU-specific APIs do have similar semantics to the suggested
replacements, but also enable static-analysis debugging checks
that complain if an RCU-specific API is invoked on a non-RCU
pointer and vice versa.
} \QuickQuizEnd
Referring back to
Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers},
note that \co{route_lock} is used to synchronize between concurrent updaters
invoking \co{ins_route()} and \co{del_route()}.
However, this lock is not acquired by readers invoking \co{access_route()}:
Readers are instead protected by the QSBR techniques described in this section.
Note that \co{ins_route()} simply returns the old value of \co{gptr}, which
Figure~\ref{fig:defer:Insertion With Concurrent Readers} assumed would
always be \co{NULL}.
This means that it is the caller's responsibility to figure out what to
do with a non-\co{NULL} value, a task complicated by the fact that
readers might still be referencing it for an indeterminate period of time.
Callers might use one of the following approaches:
\begin{enumerate}
\item Use \co{synchronize_rcu()} to safely free the pointed-to structure,
as sketched following this list.
Although this approach is correct from an RCU perspective, it
arguably has software-engineering leaky-API problems.
\item Trip an assertion if the returned pointer is non-\co{NULL}.
\item Pass the returned pointer to a later invocation of
\co{ins_route()} to restore the earlier value.
\end{enumerate}
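For example, the first approach might be implemented by something like
the following hypothetical \co{update_route()} wrapper:

\begin{VerbatimN}[samepage=true]
int update_route(struct route *new_rp)
{
  struct route *old_rp = ins_route(new_rp);

  if (old_rp) {
    synchronize_rcu(); /* Wait for pre-existing readers. */
    free(old_rp);      /* Then reclaim the old version. */
  }
  return !!old_rp;
}
\end{VerbatimN}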
In contrast, \co{del_route()} uses \co{synchronize_rcu()} and
\co{free()} to safely free the newly deleted data item.
This example shows one general approach to reading and updating
RCU-protected data structures; however, there is quite a variety
of use cases, several of which are covered in
Section~\ref{sec:defer:RCU Usage}.
In summary, it is in fact possible to create concurrent linked data
structures that can be traversed by readers executing the same sequence
of machine instructions that would be executed by single-threaded readers.
The next section summarizes RCU's high-level properties.
\subsubsection{RCU Properties}
\label{sec:defer:RCU Properties}
RCU achieves scalability
improvements by allowing reads to make useful forward progress
concurrently with updates, which are also allowed to make useful
forward progress.
This property enables RCU implementations to provide low-cost
or even no-cost readers,
in contrast with conventional synchronization primitives that
provide mutual exclusion, thus prohibiting useful concurrent forward
progress.
RCU delimits readers with \co{rcu_read_lock()} and \co{rcu_read_unlock()},
and ensures that each reader has a coherent view by
maintaining multiple versions of objects and using update-side primitives
such as \co{synchronize_rcu()} to ensure that objects are not
freed until all pre-existing read-side critical sections complete.
RCU defines and uses \co{rcu_assign_pointer()} and \co{rcu_dereference()}
to provide efficient and scalable mechanisms for publishing
and reading new versions of an object, respectively.
These mechanisms distribute the work among read and
update paths in such a way as to make read paths extremely fast, using
replication and weakening optimizations in a manner similar to
hazard pointers, but without the need for read-side retries.
In some cases, including \co{CONFIG_PREEMPT=n} Linux kernels,
RCU's read-side primitives have zero overhead.
\QuickQuiz{}
But doesn't Section~\ref{sec:defer:Sequence Locks}'s seqlock
also permit readers and updaters to get work done concurrently?
\QuickQuizAnswer{
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the \co{read_seqretry()}
primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently
with a seqlock updater will be discarded and redone.
So seqlock readers can \emph{run} concurrently with updaters,
but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in the presence
of concurrent RCU updaters.
However, both reference counters
(Section~\ref{sec:defer:Reference Counting})
and hazard pointers
(Section~\ref{sec:defer:Hazard Pointers})
really do permit useful concurrent forward progress for both
updaters and readers, just at somewhat greater cost.
Please see
Section~\ref{sec:defer:Which to Choose?}
for a comparison of these different solutions to the
deferred-reclamation problem.
} \QuickQuizEnd
But are these properties actually useful in practice?
This question is taken up by the next section.
\subsubsection{Practical Applicability}
\label{sec:defer:Practical Applicability}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{defer/linux-RCU}}
\caption{RCU Usage in the Linux Kernel}
\label{fig:defer:RCU Usage in the Linux Kernel}
\end{figure}
It turns out that RCU has been used in the Linux kernel since
October 2002~\cite{Torvalds2.5.43}.
Use of the RCU API has increased substantially since that time,
as can be seen in
Figure~\ref{fig:defer:RCU Usage in the Linux Kernel}.
In fact, code very similar to that in
Listing~\ref{lst:defer:Insertion and Deletion With Concurrent Readers}
could be used in the Linux kernel.
Prior to that, RCU was used in production in Sequent Computer Systems'
DYNIX/ptx operating system from the early 1990s~\cite{McKenney98},
and prior to that, a similar mechanism was used in IBM's
VM/XA, possibly as early as the mid-1980s~\cite{Hennessy89}.
Finally, as discussed in
Section~\ref{sec:defer:RCU Related Work},
mechanisms roughly similar to RCU have also been used
in academic and industrial-research projects as early as
1980~\cite{Kung80},
and more recently in production by a number of userspace
applications~\cite{MathieuDesnoyers2009URCU,MikeDay2013RCUqemu,GeoffRomer2018C++DeferredReclamationP0561R4}.
It is therefore safe to say that RCU enjoys wide practical applicability.
The minimal example discussed in this section is a good introduction to RCU.
However, effective use of RCU often requires that you think differently
about your problem.
It is therefore useful to examine RCU's fundamentals, a task taken up
by the following section.