blob: 6225c672e3e4aaa280580cb2942aa468cbc44ff2 [file] [log] [blame]
% defer/rcuusage.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\subsection{RCU Usage}
\label{sec:defer:RCU Usage}
\OriginallyPublished{sec:defer:RCU Usage}{RCU Usage}{Linux Weekly News}{PaulEMcKenney2008WhatIsRCUUsage}
This section answers the question ``What is RCU?'' from the viewpoint
of the uses to which RCU can be put.
Because RCU is most frequently used to replace some existing mechanism,
we look at it primarily in terms of its relationship to such mechanisms,
as listed in \cref{tab:defer:RCU Usage}
and as displayed in \cref{fig:defer:Relationships Between RCU Use Cases}.
Following the sections listed in this table,
\cref{sec:defer:RCU Usage Summary} provides a summary,
which is expanded on in the following sections and
elsewhere~\cite{PaulEMcKenney2021RCUMysteriesFundamental,PaulEMcKenney2022RCUMysteriesUseCases}.
\begin{table}
\renewcommand*{\arraystretch}{1.2}
\centering
\small
\begin{tabular}{ll}
\toprule
Mechanism RCU Replaces & Page \\
\midrule
RCU for pre-BSD routing &
\pageref{sec:defer:RCU for Pre-BSD Routing} \\
Wait for pre-existing things to finish &
\pageref{sec:defer:Wait for Pre-Existing Things to Finish} \\
Phased state change &
\pageref{sec:defer:Phased State Change} \\
Add-only list (publish/subscribe) &
\pageref{sec:defer:Add-Only List} \\
Type-safe memory &
\pageref{sec:defer:Type-Safe Memory} \\
Existence Guarantee &
\pageref{sec:defer:Existence Guarantee} \\
Light-weight garbage collector &
\pageref{sec:defer:Light-Weight Garbage Collector} \\
Delete-only list &
\pageref{sec:defer:Delete-Only List} \\
Quasi reader-writer lock &
\pageref{sec:defer:Quasi Reader-Writer Lock} \\
Quasi reference count &
\pageref{sec:defer:Quasi Reference Count} \\
Quasi multi-version concurrency control (MVCC) &
\pageref{sec:defer:Quasi Multi-Version Concurrency Control} \\
\bottomrule
\end{tabular}
\caption{RCU Usage}
\label{tab:defer:RCU Usage}
\end{table}
\subsubsection{RCU for Pre-BSD Routing}
\label{sec:defer:RCU for Pre-BSD Routing}
In contrast to the later sections, this section focuses on a very
specific use case for the purpose of comparison with other mechanisms.
\Cref{lst:defer:RCU Pre-BSD Routing Table Lookup,%
lst:defer:RCU Pre-BSD Routing Table Add/Delete}
show code for an RCU-protected Pre-BSD routing table
(\path{route_rcu.c}).
The former shows data structures and \co{route_lookup()},
and the latter shows \co{route_add()} and \co{route_del()}.
\begin{listing}
\input{CodeSamples/defer/route_rcu@lookup.fcv}
\caption{RCU Pre-BSD Routing Table Lookup}
\label{lst:defer:RCU Pre-BSD Routing Table Lookup}
\end{listing}
\begin{listing}
\input{CodeSamples/defer/route_rcu@add_del.fcv}
\caption{RCU Pre-BSD Routing Table Add/Delete}
\label{lst:defer:RCU Pre-BSD Routing Table Add/Delete}
\end{listing}
\begin{fcvref}[ln:defer:route_rcu:lookup]
In \cref{lst:defer:RCU Pre-BSD Routing Table Lookup},
\clnref{rh} adds the \co{->rh} field used by RCU reclamation,
\clnref{re_freed} adds the \co{->re_freed} use-after-free-check field,
\clnref{lock,unlock1,unlock2}
add RCU read-side protection,
and \clnref{chk_freed,abort} add the use-after-free check.
\end{fcvref}
\begin{fcvref}[ln:defer:route_rcu:add_del]
In \cref{lst:defer:RCU Pre-BSD Routing Table Add/Delete},
\clnref{add:lock,add:unlock,del:lock,%
del:unlock1,del:unlock2} add update-side locking,
\clnref{add:add_rcu,del:del_rcu} add RCU update-side protection,
\clnref{del:call_rcu} causes \co{route_cb()} to be invoked after
a grace period elapses,
and \clnrefrange{cb:b}{cb:e} define \co{route_cb()}.
This is minimal added code for a working concurrent implementation.
\end{fcvref}
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.17a/perf-rcu}}
\caption{Pre-BSD Routing Table Protected by RCU}
\label{fig:defer:Pre-BSD Routing Table Protected by RCU}
\end{figure}
\Cref{fig:defer:Pre-BSD Routing Table Protected by RCU}
shows the performance on the read-only workload.
RCU scales quite well, and offers nearly ideal performance.
However, this data was generated using the \co{RCU_SIGNAL}
flavor of userspace
RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU},
for which \co{rcu_read_lock()} and \co{rcu_read_unlock()}
generate a small amount of code.
What happens for the \IXacr{qsbr} flavor of RCU, which generates no code at all
for \co{rcu_read_lock()} and \co{rcu_read_unlock()}?
(See \cref{sec:defer:Introduction to RCU},
and especially
\cref{fig:defer:QSBR: Waiting for Pre-Existing Readers},
for a discussion of RCU QSBR\@.)
The answer to this is shown in
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR},
which shows that RCU QSBR's performance and scalability actually exceeds
that of the ideal synchronization-free workload.
\QuickQuizSeries{%
\QuickQuizB{
Wait, what???
How can RCU QSBR possibly be better than ideal?
Just what rubbish definition of ideal would fail to be the best
of all possible results???
}\QuickQuizAnswerB{
This is an excellent question, and the answer is that modern
CPUs and compilers are extremely complex.
But before getting into that, it is well worth noting that
RCU QSBR's performance advantage appears only in the
one-hardware-thread-per-core regime.
Once the system is fully loaded, RCU QSBR's performance drops
back to ideal.
The RCU variant of the \co{route_lookup()} search loop actually
has one more x86 instruction than does the sequential version,
namely the \co{lea} in the sequence
\co{cmp}, \co{je}, \co{mov}, \co{cmp}, \co{lea}, and \co{jne}.
This extra instruction is due to the \co{rcu_head} structure
at the beginning of the RCU variant's \co{route_entry} structure,
so that, unlike the sequential variant, the RCU variant's
\co{->re_next.next} pointer has a non-zero offset.
Back in the 1980s, this additional \co{lea} instruction might
have reliably resulted in the RCU variant being slower, but we
are now in the 21\textsuperscript{st} century, and the 1980s
are long gone.
But those of you who read
\cref{sec:cpu:Pipelined CPUs}
carefully already knew all of this!
These counter-intuitive results of course means that any
performance result on modern microprocessors must be subject to
some skepticism.
In theory, it really does not make sense to obtain performance
results that are better than ideal, but it really can happen
on modern microprocessors.
Such results can be thought of as similar to the celebrated
super-linear speedups (see
\cref{sec:SMPdesign:Beyond Partitioning}
for one such example), that is, of interest but also of limited
practical importance.
Nevertheless, one of the strengths of RCU is that its read-side
overhead is so low that tiny effects such as this one are visible
in real performance measurements.
\begin{figure}
\centering
% Run with initial rcu_head structure in route_entry moved down.
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.17a/perf-rcu-qsbr}}
\caption{Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial \tco{rcu_head}}
\label{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head}
\end{figure}
This raises the question as to what would happen if the
\co{rcu_head} structure were to be moved so that RCU's
\co{->re_next.next} pointer also had zero offset, just the
same as the sequential variant.
And the answer, as can be seen in
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head},
is that this causes RCU QSBR's performance to decrease to where
it is still very nearly ideal, but no longer super-ideal.
}\QuickQuizEndB
%
\QuickQuizE{
Given RCU QSBR's read-side performance, why bother with any
other flavor of userspace RCU?
}\QuickQuizAnswerE{
Because RCU QSBR places constraints on the overall application
that might not be tolerable,
for example, requiring that each and every thread in the
application regularly pass through a quiescent state.
Among other things, this means that RCU QSBR is not helpful
to library writers, who might be better served by other
flavors of userspace RCU~\cite{PaulMcKenney2013LWNURCU}.
}\QuickQuizEndE
}
\begin{figure}
\centering
% Run with initial rcu_head structure at beginning of route_entry structure.
% This need to run twice is the reason for the oddball directory.
% @@@ Make this generate both files with a single run?
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/hps.2019.12.02a/perf-rcu-qsbr}}
\caption{Pre-BSD Routing Table Protected by RCU QSBR}
\label{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR}
\end{figure}
\begin{figure*}
\centering
\IfTwoColumn{
\resizebox{5.5in}{!}{\includegraphics{defer/RCUusecases}}
}{
\resizebox{.96\textwidth}{!}{\includegraphics{defer/RCUusecases}}
% eb builds require .96
}
\caption{Relationships Between RCU Use Cases}
\label{fig:defer:Relationships Between RCU Use Cases}
\end{figure*}
Although Pre-BSD routing is an excellent RCU use case, it is worthwhile
looking at the relationships betweeen the wider spectrum of use cases
shown in
\cref{fig:defer:Relationships Between RCU Use Cases}.
This task is taken up by the following sections.
While reading these sections, please ask yourself which of these
use cases best describes Pre-BSD routing.
\subsubsection{Wait for Pre-Existing Things to Finish}
\label{sec:defer:Wait for Pre-Existing Things to Finish}
As noted in \cref{sec:defer:RCU Fundamentals}
an important component
of RCU is a way of waiting for RCU readers to finish.
One of
RCU's great strength is that it allows you to wait for each of
thousands of different things to finish without having to explicitly
track each and every one of them, and without incurring
the performance degradation, scalability limitations, complex deadlock
scenarios, and memory-leak hazards that are inherent in schemes that
use explicit tracking.
In this section, we will show how \co{synchronize_sched()}'s
read-side counterparts (which include anything that disables preemption,
along with hardware operations and
primitives that disable interrupts) permit you to interaction with
non-maskable interrupt
(NMI) handlers, which is quite difficult using locking.
This approach has been called ``Pure RCU''~\cite{PaulEdwardMcKenneyPhD},
and it is used in a few places in the Linux kernel.
The basic form of such ``Pure RCU'' designs is as follows:
\begin{enumerate}
\item Make a change, for example, to the way that the OS reacts to an NMI\@.
\item Wait for all pre-existing read-side critical sections to
completely finish (for example, by using the
\co{synchronize_sched()} primitive).\footnote{
In Linux kernel v5.1 and later, \co{synchronize_sched()} has
been subsumed into \co{synchronize_rcu()}.}
The key observation here is that subsequent RCU read-side critical
sections are guaranteed to see whatever change was made.
\item Clean up, for example, return status indicating that the
change was successfully made.
\end{enumerate}
The remainder of this section presents example code adapted from
the Linux kernel.
In this example, the \co{nmi_stop()} function in the now-defunct
oprofile facility uses \co{synchronize_sched()} to ensure that all
in-flight NMI notifications have completed before freeing the associated
resources.
A simplified version of this code is shown in
\cref{lst:defer:Using RCU to Wait for NMIs to Finish}.
\begin{listing}
\begin{fcvlabel}[ln:defer:Using RCU to Wait for NMIs to Finish]
\begin{VerbatimL}[commandchars=\\\@\$]
struct profile_buffer { \lnlbl@struct:b$
long size;
atomic_t entry[0];
}; \lnlbl@struct:e$
static struct profile_buffer *buf = NULL; \lnlbl@struct:buf$
void nmi_profile(unsigned long pcvalue) \lnlbl@nmi_profile:b$
{
struct profile_buffer *p = rcu_dereference(buf);\lnlbl@nmi_profile:rcu_deref$
if (p == NULL) \lnlbl@nmi_profile:if_NULL$
return; \lnlbl@nmi_profile:ret:a$
if (pcvalue >= p->size) \lnlbl@nmi_profile:if_oor$
return; \lnlbl@nmi_profile:ret:b$
atomic_inc(&p->entry[pcvalue]); \lnlbl@nmi_profile:inc$
} \lnlbl@nmi_profile:e$
void nmi_stop(void) \lnlbl@nmi_stop:b$
{
struct profile_buffer *p = buf; \lnlbl@nmi_stop:fetch$
if (p == NULL) \lnlbl@nmi_stop:if_NULL$
return; \lnlbl@nmi_stop:ret$
rcu_assign_pointer(buf, NULL); \lnlbl@nmi_stop:NULL$
synchronize_sched(); \lnlbl@nmi_stop:sync_sched$
kfree(p); \lnlbl@nmi_stop:kfree$
} \lnlbl@nmi_stop:e$
\end{VerbatimL}
\end{fcvlabel}
\caption{Using RCU to Wait for NMIs to Finish}
\label{lst:defer:Using RCU to Wait for NMIs to Finish}
\end{listing}
\begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:struct]
\Clnrefrange{b}{e} define a \co{profile_buffer} structure, containing a
size and an indefinite array of entries.
\Clnref{buf} defines a pointer to a profile buffer, which is
presumably initialized elsewhere to point to a dynamically allocated
region of memory.
\end{fcvref}
\begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:nmi_profile]
\Clnrefrange{b}{e} define the \co{nmi_profile()} function,
which is called from within an NMI handler.
As such, it cannot be preempted, nor can it be interrupted by a normal
interrupt handler, however, it is still subject to delays due to cache misses,
ECC errors, and cycle stealing by other hardware threads within the same
core.
\Clnref{rcu_deref} gets a local pointer to the profile buffer using the
\co{rcu_dereference()} primitive to ensure memory ordering on
DEC Alpha, and
\clnref{if_NULL,ret:a} exit from this function if there is no
profile buffer currently allocated, while \clnref{if_oor,ret:b}
exit from this function if the \co{pcvalue} argument
is out of range.
Otherwise, \clnref{inc} increments the profile-buffer entry indexed
by the \co{pcvalue} argument.
Note that storing the size with the buffer guarantees that the
range check matches the buffer, even if a large buffer is suddenly
replaced by a smaller one.
\end{fcvref}
\begin{fcvref}[ln:defer:Using RCU to Wait for NMIs to Finish:nmi_stop]
\Clnrefrange{b}{e} define the \co{nmi_stop()} function,
where the caller is responsible for mutual exclusion (for example,
holding the correct lock).
\Clnref{fetch} fetches a pointer to the profile buffer, and
\clnref{if_NULL,ret} exit the function if there is no buffer.
Otherwise, \clnref{NULL} \co{NULL}s out the profile-buffer pointer
(using the \co{rcu_assign_pointer()} primitive to maintain
memory ordering on weakly ordered machines),
and \clnref{sync_sched} waits for an RCU Sched grace period to elapse,
in particular, waiting for all non-preemptible regions of code,
including NMI handlers, to complete.
Once execution continues at \clnref{kfree}, we are guaranteed that
any instance of \co{nmi_profile()} that obtained a
pointer to the old buffer has returned.
It is therefore safe to free the buffer, in this case using the
\co{kfree()} primitive.
\end{fcvref}
\QuickQuiz{
Suppose that the \co{nmi_profile()} function was preemptible.
What would need to change to make this example work correctly?
}\QuickQuizAnswer{
One approach would be to use
\co{rcu_read_lock()} and \co{rcu_read_unlock()}
in \co{nmi_profile()}, and to replace the
\co{synchronize_sched()} with \co{synchronize_rcu()},
perhaps as shown in
\cref{lst:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
%
\begin{listing}
\begin{VerbatimL}
struct profile_buffer {
long size;
atomic_t entry[0];
};
static struct profile_buffer *buf = NULL;
void nmi_profile(unsigned long pcvalue)
{
struct profile_buffer *p;
rcu_read_lock();
p = rcu_dereference(buf);
if (p == NULL) {
rcu_read_unlock();
return;
}
if (pcvalue >= p->size) {
rcu_read_unlock();
return;
}
atomic_inc(&p->entry[pcvalue]);
rcu_read_unlock();
}
void nmi_stop(void)
{
struct profile_buffer *p = buf;
if (p == NULL)
return;
rcu_assign_pointer(buf, NULL);
synchronize_rcu();
kfree(p);
}
\end{VerbatimL}
\caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\label{lst:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\end{listing}
But why on earth would an NMI handler be preemptible???
}\QuickQuizEnd
In short, RCU makes it easy to dynamically switch among profile
buffers (you just \emph{try} doing this efficiently with atomic
operations, or at all with locking!).
This is a rare use of RCU in its pure form.
RCU is normally used at higher levels of abstraction, as
will be shown in the following sections.
\subsubsection{Phased State Change}
\label{sec:defer:Phased State Change}
\Cref{fig:defer:Phased State Change for Maintenance Operation}
shows a timeline for an example phased state change to efficiently
handle maintenance operations.
If there is no maintenance operation in progress, common-case operations
must proceed quickly, for example, without acquiring a reader-writer lock.
However, if there is a maintenance operation in progress, the common-case
operations must be undertaken carefully, taking into account added
complexities due to their running concurrently with that maintenance
operation.
This means that common-case operations will incur higher overhead during
maintenance operations, which is one reason that maintenance operations
are normally scheduled to take place during times of low load.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{defer/RCUphasedstatechange}}
\caption{Phased State Change for Maintenance Operation}
\label{fig:defer:Phased State Change for Maintenance Operation}
\end{figure}
In the figure, these apparently conflicting requirements are resolved by
having a prepare phase prior to the maintenance operation and a cleanup
phase after it, during which the common-case operations can proceed
either quickly or carefully.
\begin{listing}
\begin{fcvlabel}[ln:defer:Phased State Change for Maintenance Operations]
\begin{VerbatimL}[commandchars=\\\@\$]
bool be_careful;
void cco(void)
{
rcu_read_lock(); \lnlbl@rrl$
if (READ_ONCE(be_careful)) \lnlbl@if$
cco_carefully(); \lnlbl@careful$
else
cco_quickly(); \lnlbl@quick$
rcu_read_unlock(); \lnlbl@rul$
}
void maint(void)
{
WRITE_ONCE(be_careful, true); \lnlbl@tocareful$
synchronize_rcu(); \lnlbl@sr1$
do_maint(); \lnlbl@maint$
synchronize_rcu(); \lnlbl@sr2$
WRITE_ONCE(be_careful, false); \lnlbl@toquick$
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Phased State Change for Maintenance Operations}
\label{lst:defer:Phased State Change for Maintenance Operations}
\end{listing}
\begin{fcvref}[ln:defer:Phased State Change for Maintenance Operations]
Example pseudo-code for this phased state change is shown in
\cref{lst:defer:Phased State Change for Maintenance Operations}.
The common-case operations are carried out by \co{cco()} within an RCU
read-side critical section extending from \clnref{rrl} to \clnref{rul}.
Here, \clnref{if} checks a global \co{be_careful} flag, invoking
\co{cco_carefully()} or \co{cco_quickly()}, as indicated.
\end{fcvref}
\begin{fcvref}[ln:defer:Phased State Change for Maintenance Operations]
This allows the \co{maint()} function to set the \co{be_careful} flag
on \clnref{tocareful} and wait for an RCU grace period on \clnref{sr1}.
When control reaches \clnref{maint}, all \co{cco()} functions that saw a
\co{false} value of \co{be_careful} (and thus which might invoke
the \co{cco_quickly()} function) will have completed their operations,
so that all currently executing \co{cco()} functions will be invoking
\co{cco_carefully()}.
This means that it is safe for the \co{do_maint()} function to be
invoked.
\Clnref{sr2} then waits for all \co{cco()} functions that might have
run concurrently with \co{do_maint()} to complete, and finally
\clnref{toquick} sets the \co{be_careful} flag back to \co{false}.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
What is the point of the second call to \co{synchronize_rcu()}
in function
\co{maint()} in \cref{lst:defer:Phased State Change for Maintenance Operations}?
Isn't it OK for any \co{cco()} invocations in the clean-up
phase to invoke either \co{cco_carefully()} or \co{cco_quickly()}?
}\QuickQuizAnswerB{
The problem is that there is no ordering between the \co{cco()}
function's load from \co{be_careful} and any memory loads
executed by the \co{cco_quickly()} function.
Because there is no ordering, without that second call to
\co{syncrhonize_rcu()}, memory ordering could cause loads
in \co{cco_quickly()} to overlap with stores by \co{do_maint()}.
Another alternative would be to compensate for the removal of
that second call to \co{synchronize_rcu()} by changing the
\co{READ_ONCE()} to \co{smp_load_acquire()} and the
\co{WRITE_ONCE()} to \co{smp_store_release()}, thus restoring
the needed ordering.
}\QuickQuizEndB
\QuickQuizE{
How can you be sure that the code shown in
\co{maint()} in \cref{lst:defer:Phased State Change for Maintenance Operations}
really works?\@
}\QuickQuizAnswerE{
By one popular school of thought, you cannot.
But in this case, those willing to jump ahead to
\cref{chp:Formal Verification}
and
\cref{chp:Advanced Synchronization: Memory Ordering}
might find a couple of LKMM litmus tests to be interesting
(\path{C-RCU-phased-state-change-1.litmus} and
\path{C-RCU-phased-state-change-2.litmus}).
These tests could be argued to demonstrate that this code
and a variant of it really do work.
}\QuickQuizEndE
}% End of \QuickQuizSeries
Phased state change allows frequent operations to use light-weight
checks, without the need for expensive lock acquisitions or atomic
read-modify-write operations, and is used in the Linux kernel in the
guise of \co{rcu_sync}~\cite{OlegNesterov2013rcusync} to implement a
variant of reader-writer semaphores with lightweight readers.
Phased state change adds only a checked state variable
to the wait-to-finish use case
(\cref{sec:defer:Wait for Pre-Existing Things to Finish}),
thus also residing at a rather low level of abstraction.
\subsubsection{Add-Only List}
\label{sec:defer:Add-Only List}
Add-only data structures, exemplified by the add-only list, can be used
for a surprisingly common set of use cases, perhaps most commonly the
logging of changes.
Add-only data structures are a pure use of RCU's underlying
publish/subscribe mechanism.
An add-only variant of a pre-BSD routing table can be derived from
\cref{lst:defer:RCU Pre-BSD Routing Table Lookup,lst:defer:RCU Pre-BSD Routing Table Add/Delete}.
Because there is no deletion, the \co{route_del()} and \co{route_cb()}
functions may be dispensed with, along with the \co{->rh}
and \co{->re_freed} fields of the \co{route_entry} structure, the
\co{rcu_read_lock()}, the \co{rcu_read_unlock()} invocations in the
\co{route_lookup()} function, and all uses of the \co{->re_freed} field
in all remaining functions.
Of course, if there are many concurrent invocations of the \co{route_add()}
function, there will be heavy contention on \co{routelock}, and if lockless
techniques are used, heavy memory contention on \co{routelist}.
The usual way to avoid this contention is to use a concurrency-friendly
data structure such as a hash table (see \cref{chp:Data Structures}).
Alternatively, per-CPU data structures might be periodically merged
into a single global data structure.
On the other hand, if there is never any deletion, extended time periods
featuring many concurrent invocations of \co{route_add()} will eventually
consume all available memory.
Therefore, most RCU-protected data structures also implement deletion.
\subsubsection{Type-Safe Memory}
\label{sec:defer:Type-Safe Memory}
A number of lockless algorithms do not require that a given data
element keep the same identity through a given RCU read-side critical
section referencing it---but only if that data element retains the
same type.
In other words, these lockless algorithms can tolerate a given data
element being freed and reallocated as the same type of structure
while they are referencing it, but must prohibit a change in type.
This guarantee, called ``\IX{type-safe memory}'' in
academic literature~\cite{Cheriton96a},
is weaker than the \IXpl{existence guarantee} discussed
in \cref{sec:defer:Existence Guarantee},
and is therefore quite a bit harder to work with.
Type-safe memory algorithms in the Linux kernel make use of slab caches,
specially marking these caches with \co{SLAB_TYPESAFE_BY_RCU}
so that RCU is used when returning a freed-up
slab to system memory.
This use of RCU guarantees that any in-use element of
such a slab will remain in that slab, thus retaining its type,
for the duration of any pre-existing RCU read-side critical sections.
\QuickQuiz{
But what if there is an arbitrarily long series of RCU
read-side critical sections in multiple threads, so that at
any point in time there is at least one thread in the system
executing in an RCU read-side critical section?
Wouldn't that prevent any data from a \co{SLAB_TYPESAFE_BY_RCU}
slab ever being returned to the system, possibly resulting
in OOM events?
}\QuickQuizAnswer{
There could certainly be an arbitrarily long period of time
during which at least one thread is always in an RCU read-side
critical section.
However, the key words in the description in
\cref{sec:defer:Type-Safe Memory}
are ``in-use'' and ``pre-existing''.
Keep in mind that a given RCU read-side critical section is
conceptually only permitted to gain references to data elements
that were visible to readers during that critical section.
Furthermore, remember that a slab cannot be returned to the
system until all of its data elements have been freed, in fact,
the RCU grace period cannot start until after they have all been
freed.
Therefore, the slab cache need only wait for those RCU read-side
critical sections that started before the freeing of the last element
of the slab.
This in turn means that any RCU grace period that begins after
the freeing of the last element will do---the slab may be returned
to the system after that grace period ends.
}\QuickQuizEnd
It is important to note that \co{SLAB_TYPESAFE_BY_RCU} will
\emph{in no way}
prevent \co{kmem_cache_alloc()} from immediately reallocating
memory that was just now freed via \co{kmem_cache_free()}!
In fact, the \co{SLAB_TYPESAFE_BY_RCU}-protected data structure
just returned by \co{rcu_dereference()} might be freed and reallocated
an arbitrarily large number of times, even when under the protection
of \co{rcu_read_lock()}.
Instead, \co{SLAB_TYPESAFE_BY_RCU} operates by preventing
\co{kmem_cache_free()}
from returning a completely freed-up slab of data structures
to the system until after an RCU grace period elapses.
In short, although a given RCU read-side critical section might see a
given \co{SLAB_TYPESAFE_BY_RCU} data element being freed and reallocated
arbitrarily often, the element's type is guaranteed not to change until
that critical section has completed.
These algorithms therefore typically use a validation step that checks
to make sure that the newly referenced data structure really is the one
that was requested~\cite[Section~2.5]{LaninShasha1986TSM}.
These validation checks require that portions of the data structure
remain untouched by the free-reallocate process.
Such validation checks are usually very hard to get right, and can
hide subtle and difficult bugs.
Therefore, although type-safety-based lockless algorithms can be extremely
helpful in a very few difficult situations, you should instead use existence
guarantees where possible.
Simpler is after all almost always better!
On the other hand, type-safety-based lockless algorithms can
provide improved cache locality, and thus improved performance.
This improved cache locality is provided by the fact that such
algorithms can immediately reallocate a newly freed block of memory.
In contrast, algorithms based on existence guarantees must wait for
all pre-existing readers before reallocating memory, by which time
that memory may have been ejected from CPU caches.
As can be seen in \cref{fig:defer:Relationships Between RCU Use Cases},
RCU's type-safe-memory use case combines both the wait-to-finish
and publish-subscribe components, but in the Linux kernel also includes
the slab allocator's deferred reclamation specified by the
\co{SLAB_TYPESAFE_BY_RCU} flag.
\subsubsection{Existence Guarantee}
\label{sec:defer:Existence Guarantee}
Gamsa et al.~\cite{Gamsa99}
discuss \IXpl{existence guarantee} and describe how a mechanism
resembling RCU can be used to provide these existence guarantees
(see Section~5 on page~7 of the PDF), and
\cref{sec:locking:Lock-Based Existence Guarantees}
discusses how to guarantee existence via locking, along with the
ensuing disadvantages of doing so.
The effect is that if any RCU-protected data element is accessed
within an RCU read-side critical section, that data element is
guaranteed to remain in existence for the duration of that RCU
read-side critical section.
\begin{listing}
\begin{fcvlabel}[ln:defer:Existence Guarantees Enable Per-Element Locking]
\begin{VerbatimL}[commandchars=\\\@\$]
int delete(int key)
{
struct element *p;
int b;
b = hashfunction(key); \lnlbl@hash$
rcu_read_lock(); \lnlbl@rdlock$
p = rcu_dereference(hashtable[b]);
if (p == NULL || p->key != key) { \lnlbl@chkkey$
rcu_read_unlock(); \lnlbl@rdunlock1$
return 0; \lnlbl@ret_0:a$
}
spin_lock(&p->lock); \lnlbl@acq$
if (hashtable[b] == p && p->key == key) {\lnlbl@chkkey2$
rcu_read_unlock(); \lnlbl@rdunlock2$
rcu_assign_pointer(hashtable[b], NULL);\lnlbl@remove$
spin_unlock(&p->lock); \lnlbl@rel1$
synchronize_rcu(); \lnlbl@sync_rcu$
kfree(p); \lnlbl@kfree$
return 1; \lnlbl@ret_1$
}
spin_unlock(&p->lock); \lnlbl@rel2$
rcu_read_unlock(); \lnlbl@rdunlock3$
return 0; \lnlbl@ret_0:b$
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Existence Guarantees Enable Per-Element Locking}
\label{lst:defer:Existence Guarantees Enable Per-Element Locking}
\end{listing}
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
\Cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
demonstrates how RCU-based existence guarantees can enable
per-element locking via a function that deletes an element from
a hash table.
\Clnref{hash} computes a hash function, and \clnref{rdlock} enters an RCU
read-side critical section.
If \clnref{chkkey} finds that the corresponding bucket of the hash table is
empty or that the element present is not the one we wish to delete,
then \clnref{rdunlock1} exits the RCU read-side critical section and
\clnref{ret_0:a}
indicates failure.
\end{fcvref}
\QuickQuiz{
What if the element we need to delete is not the first element
of the list on
\clnrefr{ln:defer:Existence Guarantees Enable Per-Element Locking:chkkey} of
\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}?
}\QuickQuizAnswer{
As with the (bug-ridden)
\cref{lst:locking:Per-Element Locking Without Existence Guarantees (Buggy!)},
this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is again invited to adapt this example to a hash table with
full chaining.
Less energetic readers might wish to refer to
\cref{chp:Data Structures}.
}\QuickQuizEnd
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
Otherwise, \clnref{acq} acquires the update-side spinlock, and
\clnref{chkkey2} then checks that the element is still the one that we want.
If so, \clnref{rdunlock2} leaves the RCU read-side critical section,
\clnref{remove} removes it from the table, \clnref{rel1} releases
the lock, \clnref{sync_rcu} waits for all pre-existing RCU read-side critical
sections to complete, \clnref{kfree} frees the newly removed element,
and \clnref{ret_1} indicates success.
If the element is no longer the one we want, \clnref{rel2} releases
the lock, \clnref{rdunlock3} leaves the RCU read-side critical section,
and \clnref{ret_0:b} indicates failure to delete the specified key.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
Why is it OK to exit the RCU read-side critical section on
\clnref{rdunlock2} of
\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on \clnref{rel1}?
\end{fcvref}
}\QuickQuizAnswerB{
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
First, please note that the second check on \clnref{chkkey2} is
necessary because some other
CPU might have removed this element while we were waiting
to acquire the lock.
However, the fact that we were in an RCU read-side critical section
while acquiring the lock guarantees that this element could not
possibly have been re-allocated and re-inserted into this
hash table.
Furthermore, once we acquire the lock, the lock itself guarantees
the element's existence, so we no longer need to be in an
RCU read-side critical section.
% The question as to whether it is necessary to re-check the
% element's key is left as an exercise to the reader.
% A re-check is necessary if the key can mutate or if it is
% necessary to reject deleted entries (in cases where deletion
% is recorded by mutating the key.
\end{fcvref}
}\QuickQuizEndB
%
\QuickQuizM{
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
Why not exit the RCU read-side critical section on
\clnref{rdunlock3} of
\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on \clnref{rel2}?
\end{fcvref}
}\QuickQuizAnswerM{
Suppose we reverse the order of these two lines.
Then this code is vulnerable to the following sequence of
events:
\begin{enumerate}
\begin{fcvref}[ln:defer:Existence Guarantees Enable Per-Element Locking]
\item CPU~0 invokes \co{delete()}, and finds the element
to be deleted, executing through \clnref{rdunlock2}.
It has not yet actually deleted the element, but
is about to do so.
\item CPU~1 concurrently invokes \co{delete()}, attempting
to delete this same element.
However, CPU~0 still holds the lock, so CPU~1 waits
for it at \clnref{acq}.
\item CPU~0 executes \clnref{remove,rel1},
and blocks at \clnref{sync_rcu} waiting for CPU~1
to exit its RCU read-side critical section.
\item CPU~1 now acquires the lock, but the test on \clnref{chkkey2}
fails because CPU~0 has already removed the element.
CPU~1 now executes \clnref{rel2}
(which we switched with \clnref{rdunlock3}
for the purposes of this Quick Quiz)
and exits its RCU read-side critical section.
\item CPU~0 can now return from \co{synchronize_rcu()},
and thus executes \clnref{kfree}, sending the element to
the freelist.
\item CPU~1 now attempts to release a lock for an element
that has been freed, and, worse yet, possibly
reallocated as some other type of data structure.
This is a fatal memory-corruption error.
\end{fcvref}
\end{enumerate}
}\QuickQuizEndM
%
\QuickQuizE{
The RCU-based algorithm shown in
\cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
locks very similar to that in
\cref{lst:locking:Per-Element Locking With Lock-Based Existence Guarantees},
so why should the RCU-based approach be any better?
}\QuickQuizAnswerE{
\Cref{lst:defer:Existence Guarantees Enable Per-Element Locking}
replaces the per-element \co{spin_lock()} and \co{spin_unlock()}
shown in
\cref{lst:locking:Per-Element Locking With Lock-Based Existence Guarantees}
with a much cheaper \co{rcu_read_lock()} and \co{rcu_read_unlock()},
thus greatly improving both performance and scalability.
For more detail, please see
\cref{sec:datastruct:RCU-Protected Hash Table Performance}.
}\QuickQuizEndE
}
Alert readers will recognize this as only a slight variation on
the original wait-to-finish theme
(\cref{sec:defer:Wait for Pre-Existing Things to Finish}),
adding publish/subscribe, linked structures, a heap allocator
(typically), and deferred reclamation, as shown in
\cref{fig:defer:Relationships Between RCU Use Cases}.
They might also note the deadlock-immunity advantages over the lock-based
existence guarantees discussed in
\cref{sec:locking:Lock-Based Existence Guarantees}.
\subsubsection{Light-Weight Garbage Collector}
\label{sec:defer:Light-Weight Garbage Collector}
A not-uncommon exclamation made by people first learning about
RCU is ``RCU is sort of like a garbage
collector!''~\cite{MattKline2023LockBasedLocklessFresh}.
This exclamation has a large grain of truth, especially when using RCU
to implement non-blocking algorithms that rely on garbage collection,
but it can sometimes be misleading.
Perhaps the best way to think of the relationship between RCU
and automatic garbage collectors (GCs) is that RCU resembles
a GC in that the \emph{timing} of collection is automatically
determined, but that RCU differs from a GC in that:
\begin{enumerate*}[(1)]
\item The programmer must manually indicate when a given data
structure is eligible to be collected and
\item The programmer must manually mark the RCU read-side critical
sections where references might be held.
\end{enumerate*}
Despite these differences, the resemblance does go quite deep.
In fact, the first RCU-like mechanism I am aware of used a
reference-count-based garbage collector to handle the grace
periods~\cite{Kung80}, and the connection between RCU and
garbage collection has been noted more
recently~\cite{HarshalSheth2016goRCU}.
The light-weight garbage collector use case is very similar to
the existence-guarantee use case, adding only the desired non-blocking
algorithm to the mix.
This light-weight garbage collector use case can also be used in
conjunction with the existence guarantees described in the next section.
\subsubsection{Delete-Only List}
\label{sec:defer:Delete-Only List}
The delete-only list is the less-popular counterpart to the add-only list
covered in \cref{sec:defer:Add-Only List}, and can be thought of as the
existence-guarantee use case, but without the publish/subscribe component,
as shown in \cref{fig:defer:Relationships Between RCU Use Cases}.
A delete-only list can be used when the universe of possible members of
the list is known at initialization, and where members can be removed.
For example, elements of the list might represent hardware elements of
the system that are subject to failure, but cannot be repaired or
replaced without a reboot.
An delete-only variant of a pre-BSD routing table can be derived from
\cref{lst:defer:RCU Pre-BSD Routing Table Lookup,lst:defer:RCU Pre-BSD Routing Table Add/Delete}.
Because there is no addition, the \co{route_add()} function may be
dispensed with, or, alternatively, its use might be restricted to
initialization time.
In theory, the \co{route_lookup()} function can use a non-RCU iterator,
though in the Linux kernel this will result in complaints from debug code.
In addition, the incremental cost of an RCU iterator is usually
negligible.
As a result, delete-only situations typically use algorithms and data
structures that are designed for addition as well as deletion.
\subsubsection{Quasi Reader-Writer Lock}
\label{sec:defer:Quasi Reader-Writer Lock}
Perhaps the most common use of RCU within the Linux kernel is as
a replacement for reader-writer locking in read-intensive situations.
Nevertheless, this use of RCU was not immediately apparent to me
at the outset.
In fact, I chose to implement a lightweight reader-writer
lock~\cite{WilsonCHsieh92a}\footnote{
Similar to \co{brlock} in the 2.4 Linux kernel and to
\co{lglock} in more recent Linux kernels.}
before implementing a general-purpose RCU implementation
back in the early 1990s.
Each and every one of the uses I envisioned for the lightweight reader-writer
lock was instead implemented using RCU\@.
In fact, it was more than
three years before the lightweight reader-writer lock saw its first use.
Boy, did I feel foolish!
The key similarity between RCU and reader-writer locking is that
both have read-side critical sections that can execute concurrently.
In fact, in some cases, it is possible to mechanically substitute RCU API
members for the corresponding reader-writer lock API members.
But first, why bother?
Advantages of RCU include performance,
deadlock immunity, and realtime latency.
There are, of course, limitations to RCU, including the fact that
readers and updaters run concurrently, that low-priority RCU readers
can block high-priority threads waiting for a grace period to elapse,
and that grace-period latencies can extend for many milliseconds.
These advantages and limitations are discussed in the following paragraphs.
\paragraph{Performance}
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperf}}
\caption{Performance Advantage of RCU Over Reader-Writer Locking}
\label{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}
\end{figure}
The read-side performance advantages of Linux-kernel RCU over
reader-writer locking are shown in
\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking},
which was generated on a 448-CPU 2.10\,GHz Intel x86 system.
\QuickQuizSeries{%
\QuickQuizB{
WTF\@?
How the heck do you expect me to believe that RCU can have less
than a 300-picosecond overhead when the clock period at 2.10\,GHz
is almost 500\,picoseconds?
}\QuickQuizAnswerB{
First, consider that the inner loop used to
take this measurement is as follows:
\begin{VerbatimN}
for (i = nloops; i >= 0; i--) {
rcu_read_lock();
rcu_read_unlock();
}
\end{VerbatimN}
Next, consider the effective definitions of \co{rcu_read_lock()}
and \co{rcu_read_unlock()}:
\begin{VerbatimN}
#define rcu_read_lock() barrier()
#define rcu_read_unlock() barrier()
\end{VerbatimN}
These definitions constrain compiler code-movement optimizations
involving memory references, but emit no instructions in and
of themselves.
However, if the loop variable is maintained in a register,
the accesses to \co{i} will not count as memory references.
Furthermore, the compiler can do loop unrolling,
allowing the resulting code to ``execute'' multiple passes
through the loop body simply by incrementing \co{i} by
some value larger than the value 1.
So the ``measurement'' of 267 picoseconds is simply the fixed
overhead of the timing measurements divided by the number of
passes through the inner loop containing the calls
to \co{rcu_read_lock()} and \co{rcu_read_unlock()}, plus
the code to manipulate \co{i} divided by the loop-unrolling
factor.
And therefore, this measurement really is in error, in fact,
it exaggerates the overhead by an arbitrary number of orders
of magnitude.
After all, in terms of machine instructions emitted, the actual
overheads of \co{rcu_read_lock()} and of \co{rcu_read_unlock()}
are each precisely zero.
It is not just every day that a timing measurement of 267
picoseconds turns out to be an overestimate!
}\QuickQuizEndB
\QuickQuizM{
Didn't an earlier edition of this book show RCU read-side
overhead way down in the sub-picosecond range?
What happened???
}\QuickQuizAnswerM{
Excellent memory!!!
The overhead in some early releases was in fact roughly
100~femtoseconds.
What happened was that RCU usage spread more broadly through the
Linux kernel, including into code that takes page faults.
Back at that time, \co{rcu_read_lock()} and \co{rcu_read_unlock()}
were complete no-ops in \co{CONFIG_PREEMPT=n} kernels.
Unfortunately, that situation allowed the compiler to reorder
page-faulting memory accesses into RCU read-side critical
sections.
Of course, page faults can block, which destroys those critical
sections.
Nor was this a theoretical problem:
A failure actually manifested in 2019.
\ppl{Herbert}{Xu} tracked down this failure down and
\ppl{Linus}{Torvalds}
therefore queued a commit to upgrade \co{rcu_read_lock()} and
\co{rcu_read_unlock()} to unconditionally include a call to
\co{barrier()}~\cite{LinusTorvalds2019:RCUreader.barrier}.
And although \co{barrier()} emits no code, it does constrain
compiler optimizations.
And so the price of widespread RCU usage is slightly higher
\co{rcu_read_lock()} and \co{rcu_read_unlock()} overhead.
As such, Linux-kernel RCU has proven to be a victim of its
own success.
Of course, it is also the case that the older results were obtained
on a different system than were those shown in
\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}.
So which change had the most effect, Linus's commit or the change in
the system?
This question is left as an exercise to the reader.
}\QuickQuizEndM
\QuickQuizE{
Why is there such large variation for the \co{RCU} trace in
\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}?
}\QuickQuizAnswerE{
Keep in mind that this is a log-log plot, so those large-seeming
\co{RCU} variances in reality span only a few hundred picoseconds.
And that is such a short time that anything could cause it.
However, given that the variance decreases with both small and
large numbers of CPUs, one hypothesis is that the variation is
due to migrations from one CPU to another.
Yes, these measurements were taken with interrupts disabled, but
they were also taken within a guest OS, so that preemption was
still possible at the hypervisor level.
In addition, the system featured hyperthreading and a single
hardware thread running this RCU workload is able to consume
more than half of the core's resources.
Therefore, the overall throughput varies depending on how many
of a given guest OS's CPUs share cores.
Attempting to reduce these variations by running the guest OSes
at real-time priority (as suggested by Joel Fernandes) is left
as an exercise for the reader.
}\QuickQuizEndE
} % End of \QuickQuizSeries
Note that reader-writer locking is more than an order of magnitude slower
than RCU on a single CPU, and is more than \emph{four} orders of magnitude
slower on 192~CPUs.
In contrast, RCU scales quite well.
In both cases, the error bars cover the full range of the measurements
from 30~runs, with the line being the median.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperfPREEMPT}}
\caption{Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
\label{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
\end{figure}
A more moderate view may be obtained from a \co{CONFIG_PREEMPT} kernel,
though RCU still beats reader-writer locking by between a factor of seven
on a single CPU and by three orders of magnitude on 192~CPUs, as shown in
\cref{fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking},
which was generated on the same 448-CPU 2.10\,GHz x86 system.
Note the high variability of reader-writer locking at larger numbers of CPUs.
The error bars span the full range of data.
\QuickQuiz{
Given that the system had no fewer than 448~hardware threads,
why only 192~CPUs?
}\QuickQuizAnswer{
Because the script (\path{rcuscale.sh}) that generates this data
spawns a guest operating system for each set of points gathered,
and on this particular system, both \co{qemu} and KVM limit the
number of CPUs that may be configured into a given guest OS\@.
Yes, it would have been possible to run a few more CPUs, but
192 is a nice round number from a binary perspective, given
that 256 is infeasible.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/rwlockRCUperfwt}}
\caption{Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration, 192 CPUs}
\label{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}
\end{figure}
Of course, the low performance of reader-writer locking in
\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking,%
fig:defer:Performance Advantage of Preemptible RCU Over Reader-Writer Locking}
is exaggerated by the unrealistic zero-length critical sections.
The performance advantages of RCU decrease as the overhead of the critical
sections increase, as shown in
\cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration},
which was run on the same system as the previous plots.
Here, the y-axis represents the sum of the overhead of the read-side
primitives and that of the critical section and the x-axis represents
the critical-section overhead in nanoseconds.
But please note the logscale y~axis, which means that the small
separations between the traces still represent significant differences.
This figure shows non-preemptible RCU, but given that preemptible RCU's
read-side overhead is only about three nanoseconds, its plot would be
nearly identical to
\cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}.
\QuickQuiz{
Why the larger error ranges for the submicrosecond durations in
\cref{fig:defer:Comparison of RCU to Reader-Writer Locking as Function of Critical-Section Duration}?
}\QuickQuizAnswer{
Because smaller disturbances result in greater relative errors
for smaller measurements.
Also, the Linux kernel's \co{ndelay()} nanosecond-scale primitive
is (as of 2020) less accurate than is the \co{udelay()} primitive
used for the data for durations of a microsecond or more.
It is instructive to compare to the zero-length case shown in
\cref{fig:defer:Performance Advantage of RCU Over Reader-Writer Locking}.
}\QuickQuizEnd
There are three traces for reader-writer locking, with the upper trace
being for 100~CPUs, the next for 10~CPUs, and the lowest for 1~CPU\@.
The greater the number of CPUs and the shorter the critical sections,
the greater is RCU's performance advantage.
These performance advantages are underscored by the fact that 100-CPU
systems are no longer uncommon and that a number of system calls (and
thus any RCU read-side critical sections that they contain) complete
within microseconds.
In addition, as is discussed in the next paragraph,
RCU read-side primitives are almost entirely deadlock-immune.
\paragraph{Deadlock Immunity}
Although RCU offers significant performance advantages for
read-mostly workloads, one of the primary reasons for creating
RCU in the first place was in fact its immunity to read-side
deadlocks.
This immunity stems from the fact that
RCU read-side primitives do not block, spin, or even
do backwards branches, so that their execution time is deterministic.
It is therefore impossible for them to participate in a deadlock
cycle.
\QuickQuiz{
Is there an exception to this deadlock immunity, and if so,
what sequence of events could lead to deadlock?
}\QuickQuizAnswer{
One way to cause a deadlock cycle involving
RCU read-side primitives is via the following (illegal) sequence
of statements:
\begin{VerbatimU}
rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();
\end{VerbatimU}
The \co{synchronize_rcu()} cannot return until all
pre-existing RCU read-side critical sections complete, but
is enclosed in an RCU read-side critical section that cannot
complete until the \co{synchronize_rcu()} returns.
The result is a classic self-deadlock---you get the same
effect when attempting to write-acquire a reader-writer lock
while read-holding it.
Note that this self-deadlock scenario does not apply to
RCU QSBR, because the context switch performed by the
\co{synchronize_rcu()} would act as a quiescent state
for this CPU, allowing a grace period to complete.
However, this is if anything even worse, because data used
by the RCU read-side critical section might be freed as a
result of the grace period completing.
Plus Linux kernel's lockdep facility will yell at you.
In short, do not invoke synchronous RCU update-side primitives, which
are listed in
\cref{tab:defer:RCU Wait-to-Finish APIs},
from within an RCU read-side critical section.
In addition, within the Linux kernel, RCU uses the scheduler
and the scheduler uses RCU\@.
In some cases, both RCU and the scheduler must take care to
avoid deadlock.
}\QuickQuizEnd
An interesting consequence of RCU's read-side deadlock immunity is
that it is possible to unconditionally upgrade an RCU
reader to an RCU updater.
Attempting to do such an upgrade with reader-writer locking results
in deadlock.
A sample code fragment that does an RCU read-to-update upgrade follows:
\begin{VerbatimN}[samepage=true]
rcu_read_lock();
list_for_each_entry_rcu(p, &head, list_field) {
do_something_with(p);
if (need_update(p)) {
spin_lock(my_lock);
do_update(p);
spin_unlock(&my_lock);
}
}
rcu_read_unlock();
\end{VerbatimN}
Note that \co{do_update()} is executed under
the protection of the lock \emph{and} under RCU read-side protection.
Another interesting consequence of RCU's deadlock immunity is its
immunity to a large class of priority inversion problems.
For example, low-priority RCU readers cannot prevent a high-priority
RCU updater from acquiring the update-side lock.
Similarly, a low-priority RCU updater cannot prevent high-priority
RCU readers from entering an RCU read-side critical section.
\QuickQuiz{
Immunity to both deadlock and priority inversion???
Sounds too good to be true.
Why should I believe that this is even possible?
}\QuickQuizAnswer{
It really does work.
After all, if it didn't work, the Linux kernel would not run.
}\QuickQuizEnd
\paragraph{Realtime Latency}
Because RCU read-side primitives neither spin nor block, they offer
excellent realtime latencies.
In addition, as noted earlier, this means that they are
immune to priority inversion
involving the RCU read-side primitives and locks.
However, RCU is susceptible to more subtle priority-inversion scenarios,
for example, a high-priority process blocked waiting for an RCU
grace period to elapse can be blocked by low-priority RCU readers
in \rt\ kernels.
This can be solved by using RCU priority
boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}.
However, use of RCU priority boosting requires that \co{rcu_read_unlock()}
do deboosting, which entails acquiring scheduler locks.
Some care is therefore required within the scheduler and RCU to avoid
deadlocks, which as of the v5.15 Linux kernel requires RCU to avoid
invoking the scheduler while holding any of RCU's locks.
This in turn means that \co{rcu_read_unlock()} is not always lockless
when RCU priority boosting is enabled.
However, \co{rcu_read_unlock()} will still be lockless if its
critical section was not priority-boosted.
Furthermore, critical sections will not be priority boosted unless they
are preempted, or, in -rt kernels, they acquire non-raw spinlocks.
This means that \co{rcu_read_unlock()} will normally be lockless from the
perspective of the highest priority task running on any given CPU.
\paragraph{RCU Readers and Updaters Run Concurrently}
Because RCU readers never spin nor block, and because updaters are not
subject to any sort of rollback or abort semantics, RCU readers and
updaters really can run concurrently.
This means that RCU readers might access stale data, and might even
see inconsistencies, either of which can render conversion from reader-writer
locking to RCU non-trivial.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{defer/rwlockRCUupdate}}
\caption{Response Time of RCU vs.\@ Reader-Writer Locking}
\label{fig:defer:Response Time of RCU vs. Reader-Writer Locking}
\end{figure}
However, in a surprisingly large number of situations, inconsistencies and
stale data are not problems.
The classic example is the networking routing table.
Because routing updates can take considerable time to reach a given
system (seconds or even minutes), the system will have been sending
packets the wrong way for quite some time when the update arrives.
It is usually not a problem to continue sending updates the wrong
way for a few additional milliseconds.
Furthermore, because RCU updaters can make changes without waiting for
RCU readers to finish,
the RCU readers might well see the change more quickly than would
batch-fair
reader-writer-locking readers, as shown in
\cref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}.
This faster RCU response time has since been corroborated (if perhaps
a bit unrealistically) by Markov-model
analysis~\cite[Figures 3 and 5]{VishakhaRamani2023LockBasedLocklessFresh}.
\QuickQuiz{
But how many other algorithms really tolerate stale and
inconsistent data?
}\QuickQuizAnswer{
Quite a few!
Please keep in mind that the finite speed of light means that
data reaching a given computer system is at least slightly stale
at the time that it arrives, and extremely stale in the case
of astronomical data.
The finite speed of light also places a sharp limit on the
consistency of data arriving from different sources of via
different paths.
You might as well face the fact that the laws of physics
are incompatible with naive notions of perfect freshness and
consistency.
}\QuickQuizEnd
Once the update is received, the rwlock writer cannot proceed until the
last reader completes, and subsequent readers cannot proceed until the
writer completes.
However, these subsequent readers are guaranteed to see the new value,
as indicated by the green shading of the rightmost boxes.
In contrast, RCU readers and updaters do not block each other, which permits
the RCU readers to see the updated values sooner.
Of course, because their execution overlaps that of the RCU updater,
\emph{all} of the RCU readers might well see updated values, including
the three readers that started before the update.
Nevertheless only the green-shaded rightmost RCU readers
are \emph{guaranteed} to see the updated values.
Reader-writer locking and RCU simply provide different guarantees.
With reader-writer locking, any reader that begins after the writer begins
is guaranteed to see new values, and any reader that attempts to
begin while the writer is spinning might or might not see new values,
depending on the reader/writer preference of the rwlock implementation in
question.
In contrast, with RCU, any reader that begins after the updater completes
is guaranteed to see new values, and any reader that completes after the
updater begins might or might not see new values, depending on timing.
The key point here is that, although reader-writer locking does
indeed guarantee consistency within the confines of the computer system,
there are situations where this consistency comes at the price of
increased \emph{inconsistency} with the outside world, courtesy of
the finite speed of light and the non-zero size of atoms.
In other words, reader-writer locking obtains internal consistency at the
price of silently stale data with respect to the outside world.
Note that if a value is computed while read-holding a reader-writer
lock, and then that value is used after that lock is released, then
this reader-writer-locking use case is using stale data.
After all, the quantities that this value is based on could change
at any time after that lock is released.
This sort of reader-writer-locking use case is often easy to
convert to RCU, as will be shown in
\cref{lst:defer:Converting Reader-Writer Locking to RCU: Data,%
lst:defer:Converting Reader-Writer Locking to RCU: Search,%
lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
and the accompanying text.
\paragraph{Low-Priority RCU Readers Can Block High-Priority Reclaimers}
In Realtime RCU~\cite{DinakarGuniguntala2008IBMSysJ} or
SRCU~\cite{PaulEMcKenney2006c},
a preempted reader will prevent a grace period from completing, even if
a high-priority task is blocked waiting for that grace period to complete.
Realtime RCU can avoid this problem by substituting \co{call_rcu()}
for \co{synchronize_rcu()} or by using RCU priority
boosting~\cite{PaulEMcKenney2007BoostRCU,DinakarGuniguntala2008IBMSysJ}.
It might someday be necessary to augment SRCU and RCU Tasks Trace with
priority boosting, but not before a clear real-world need is demonstrated.
\QuickQuiz{
If Tasks RCU Trace might someday be priority boosted, why
not also Tasks RCU and Tasks RCU Rude?
}\QuickQuizAnswer{
Maybe, but these are less likely.
In the case of Tasks RCU, recall that the quiescent state is
a voluntary context switch.
Thus, all tasks not blocked after a voluntary context switch
might need to be boosted, and the mechanics of deboosting would
not likely be at all pretty.
In the case of Tasks RCU Rude, as was the case with the old
RCU Sched, any preemptible region of code is a quiescent state.
Thus, the only tasks that might need boosting are those currently
running with preemption disabled.
But boosting the priority of a preemption-disabled task has no
effect.
It therefore seems doubly unlikely that priority boosting will
ever be introduced to Tasks RCU Rude, at least in its current
form.
}\QuickQuizEnd
\paragraph{RCU Grace Periods Extend for Many Milliseconds}
With the exception of userspace
RCU~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCU},
expedited grace periods, and several of the ``toy''
RCU implementations described in
\cref{chp:app:``Toy'' RCU Implementations},
RCU grace periods extend milliseconds.
Although there are a number of techniques to render such long
delays harmless, including use of the asynchronous interfaces
(\co{call_rcu()} and \co{call_rcu_bh()}) or of the polling interfaces
(\co{get_state_synchronize_rcu()}, \co{start_poll_synchronize_rcu()},
and \co{poll_state_synchronize_rcu()}), this situation is a major reason
for the rule of thumb that RCU be used in read-mostly situations.
As noted in \cref{sec:defer:RCU Linux-Kernel API}, within the Linux
kernel, shorter grace periods may be obtained via expedited grace
periods, for example, by invoking \co{synchronize_rcu_expedited()}
instead of \co{synchronize_rcu()}.
Expedited grace periods can reduce delays to as little as a few tens of
microseconds, albeit at the expense of higher CPU utilization and IPIs.
The added IPIs can be especially unwelcome in some real-time workloads.
\paragraph{Code:
Reader-Writer Locking vs.\@ RCU}
In the best case, the conversion from reader-writer locking to RCU
is quite simple, as shown in
\cref{lst:defer:Converting Reader-Writer Locking to RCU: Data,%
lst:defer:Converting Reader-Writer Locking to RCU: Search,%
lst:defer:Converting Reader-Writer Locking to RCU: Deletion},
all taken from
Wikipedia~\cite{WikipediaRCU}.
\begin{listing*}
{ \scriptsize
\begin{verbbox}
1 struct el { 1 struct el {
2 struct list_head lp; 2 struct list_head lp;
3 long key; 3 long key;
4 spinlock_t mutex; 4 spinlock_t mutex;
5 int data; 5 int data;
6 /* Other data fields */ 6 /* Other data fields */
7 }; 7 };
8 DEFINE_RWLOCK(listmutex); 8 DEFINE_SPINLOCK(listmutex);
9 LIST_HEAD(head); 9 LIST_HEAD(head);
\end{verbbox}
}
\hspace*{0.9in}\OneColumnHSpace{-0.5in}
\IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
\caption{Converting Reader-Writer Locking to RCU\@:
Data}
\label{lst:defer:Converting Reader-Writer Locking to RCU: Data}
\end{listing*}
\begin{listing*}
{ \scriptsize
\begin{verbbox}
1 int search(long key, int *result) 1 int search(long key, int *result)
2 { 2 {
3 struct el *p; 3 struct el *p;
4 4
5 read_lock(&listmutex); 5 rcu_read_lock();
6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry_rcu(p, &head, lp) {
7 if (p->key == key) { 7 if (p->key == key) {
8 *result = p->data; 8 *result = p->data;
9 read_unlock(&listmutex); 9 rcu_read_unlock();
10 return 1; 10 return 1;
11 } 11 }
12 } 12 }
13 read_unlock(&listmutex); 13 rcu_read_unlock();
14 return 0; 14 return 0;
15 } 15 }
\end{verbbox}
}
\hspace*{0.9in}\OneColumnHSpace{-0.5in}
\IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
\caption{Converting Reader-Writer Locking to RCU\@:
Search}
\label{lst:defer:Converting Reader-Writer Locking to RCU: Search}
\end{listing*}
\begin{listing*}
{ \scriptsize
\begin{verbbox}
1 int delete(long key) 1 int delete(long key)
2 { 2 {
3 struct el *p; 3 struct el *p;
4 4
5 write_lock(&listmutex); 5 spin_lock(&listmutex);
6 list_for_each_entry(p, &head, lp) { 6 list_for_each_entry(p, &head, lp) {
7 if (p->key == key) { 7 if (p->key == key) {
8 list_del(&p->lp); 8 list_del_rcu(&p->lp);
9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);
10 synchronize_rcu();
10 kfree(p); 11 kfree(p);
11 return 1; 12 return 1;
12 } 13 }
13 } 14 }
14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);
15 return 0; 16 return 0;
16 } 17 }
\end{verbbox}
}
\hspace*{0.9in}\OneColumnHSpace{-0.5in}
\IfEbookSize{\hspace*{-1.05in}}{}\theverbbox
\caption{Converting Reader-Writer Locking to RCU\@:
Deletion}
\label{lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
\end{listing*}
However, the transformation is not always this straightforward.
This is because neither the \co{spin_lock()} nor the
\co{synchronize_rcu()} in
\cref{lst:defer:Converting Reader-Writer Locking to RCU: Deletion}
exclude the readers in
\cref{lst:defer:Converting Reader-Writer Locking to RCU: Search}.
First, the \co{spin_lock()} does not interact in any way with
\co{rcu_read_lock()} and \co{rcu_read_unlock()}, thus not excluding them.
Second, although both \co{write_lock()} and \co{synchronize_rcu()}
wait for pre-existing readers, only \co{write_lock()} prevents
subsequent readers from commencing.\footnote{
Kudos to whoever pointed this out to Paul.}
Thus, \co{synchronize_rcu()} cannot exclude readers.
Nevertheless, a great many situations using reader-writer locking can
be converted to RCU\@.
More-elaborate cases of replacing reader-writer locking with RCU
may be found
elsewhere~\cite{NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk}.
\paragraph{Semantics:
Reader-Writer Locking vs.\@ RCU}
Expanding on the previous section, reader-writer locking semantics can
be roughly and informally summarized by the following three temporal
constraints:
\begin{enumerate}
\item Write-side acquisitions wait for any read-holders to release
the lock.
\item Writer-side acquisitions wait for any write-holder to release
the lock.
\item Read-side acquisitions wait for any write-holder to release
the lock.
\end{enumerate}
RCU dispenses entirely with constraint~\#3 and weakens the other two
as follows:
\begin{enumerate}
\item Writers wait for any pre-existing read-holders before progressing
to the destructive phase of their update (usually the freeing of
memory).
\item Writers synchronize with each other as needed.
\end{enumerate}
It is of course this weakening that permits RCU implementations to attain
excellent performance and scalability.
It also allows RCU to implement the aforementioned unconditional
read-to-write upgrade that is so attractive and so deadlock-prone in
reader-writer locking.
Code using RCU can compensate for this weakening in a surprisingly large
number of ways, but most commonly by imposing spatial constraints:
\begin{enumerate}
\item New data is placed in newly allocated memory.
\item Old data is freed, but only after:
\begin{enumerate}
\item That data has been unlinked so as to be inaccessible
to later readers, and
\item A subsequent RCU grace period has elapsed.
\end{enumerate}
\end{enumerate}
Of course, there are some reader-writer-locking use cases for which
RCU's weakened semantics are inappropriate, but experience in the Linux
kernel indicates that more than 80\pct\ of reader-writer locks can in fact
be replaced by RCU\@.
For example, a common reader-writer-locking use case computes some value
while holding the lock and then uses that value after releasing that lock.
This use case results in stale data, and therefore often accommodates
RCU's weaker semantics.
\begin{listing}
\input{CodeSamples/defer/singleton@get.fcv}
\caption{RCU Singleton Get}
\label{lst:defer:Singleton Get}
\end{listing}
\begin{listing}
\input{CodeSamples/defer/singleton@set.fcv}
\caption{RCU Singleton Set}
\label{lst:defer:Singleton Set}
\end{listing}
\begin{fcvref}[ln:defer:singleton:get]
This interaction of temporal and spatial constraints is illustrated
by the RCU singleton data structure illustrated in
\cref{fig:defer:Insertion With Concurrent Readers,fig:defer:Deletion With Concurrent Readers}.
This structure is defined on \clnrefrange{myconfig.b}{myconfig.e} of
\cref{lst:defer:Singleton Get}, and contains two integer fields,
\co{->a} and \co{->b} (\path{singleton.c}).
The current instance of this structure is referenced by the \co{curconfig}
pointer defined on \clnref{myconfig.e}.
\end{fcvref}
\begin{fcvref}[ln:defer:singleton:get]
The fields of the current structure are passed back through the
\co{cur_a} and \co{cur_b} parameters to the \co{get_config()} function
defined on \clnrefrange{get_config.b}{get_config.e}.
These two fields can be slightly out of date, but they absolutely must
be consistent with each other.
The \co{get_config()} function provides this consistency
within the RCU read-side critical section starting on
\clnref{rrl} and ending on either \clnref{rrul1} or \clnref{rrul2},
which provides the needed temporal synchronization.
\Clnref{rderef} fetches the pointer to the current \co{myconfig} structure.
This structure will be used regardless of any concurrent changes due
to calls to the \co{set_config()} function, thus providing the needed
spatial synchronization.
If \clnref{nullchk} determines that the \co{curconfig} pointer was
\co{NULL}, \clnref{retfail} returns failure.
Otherwise, \clnref{copya,copyb} copy out the \co{->a} and \co{->b} fields
and \clnref{retsuccess} returns success.
These \co{->a} and \co{->b} fields are from the same \co{myconfig}
structure, and the RCU read-side critical section prevents this structure
from being freed, thus guaranteeing that these two fields are consistent
with each other.
\end{fcvref}
\begin{fcvref}[ln:defer:singleton:set]
The structure is updated by the \co{set_config()} function shown in
\cref{lst:defer:Singleton Set}.
\Clnrefrange{allocinit.b}{allocinit.e} allocate and initialize a
new \co{myconfig} structure.
\Clnref{xchg} atomically exchanges a pointer to this new structure with
the pointer to the old structure in \co{curconfig}, while also providing
full memory ordering both before and after the \co{xchg()} operation,
thus providing the needed updater/reader spatial synchronization on
the one hand and the needed updater/updater synchronization on the other.
If \clnref{if} determines that the pointer to the old structure was
in fact non-\co{NULL}, \clnref{sr} waits for a grace period (thus providing
the needed reader/updater temporal synchronization) and \clnref{free}
frees the old structure, safe in the knowledge that there are no
longer any readers still referencing it.
\end{fcvref}
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\includegraphics{defer/RCUspacetime}}
\caption{RCU Spatial/Temporal Synchronization}
\label{fig:defer:RCU Spatial/Temporal Synchronization}
\end{figure*}
\Cref{fig:defer:RCU Spatial/Temporal Synchronization} shows an abbreviated
representation of \co{get_config()} on the left and right and a similarly
abbreviated representation of \co{set_config()} in the middle.
Time advances from top to bottom, and the address space of the objects
referenced by \co{curconfig} advances from left to right.
The boxes with comma-separated numbers each represent a \co{myconfig}
structure, with the constraint that \co{->b} is the square of \co{->a}.
Each blue dash-dotted arrow represents an interaction with the old structure
(on the left, containing ``5,25'') and each green dashed arrow represents
an interaction with the new structure (on the right, containing ``9,81'').
The black dotted arrows represent temporal relationships between RCU readers
on the left and right and the RCU grace period at center, with each
arrow pointing from an older event to a newer event.
The call to \co{synchronize_rcu()} followed the leftmost \co{rcu_read_lock()},
and therefore that \co{synchronize_rcu()} invocation must not return
until after the corresponding \co{rcu_read_unlock()}.
In contrast, the call to \co{synchronize_rcu()} precedes the
rightmost \co{rcu_read_lock()}, which allows the return from that same
\co{synchronize_rcu()} to ignore the corresponding \co{rcu_read_unlock()}.
These temporal relationships prevent the \co{myconfig} structures from
being freed while RCU readers are still accessing them.
The two horizontal grey dashed lines represent the period of time during
which different readers get different results, however, each reader
will see one and only one of the two objects.
All readers that end before the first horizontal line will see the
leftmost \co{myconfig} structure, and all readers that start after the
second horizontal line will see the rightmost structure.
Between the two lines, that is, during the grace period, different
readers might see different objects, but as long as each reader
loads the \co{curconfig} pointer only once, each reader will see
a consistent view of its \co{myconfig} structure.
\QuickQuiz{
But doesn't the RCU grace period start sometime after the
call to \co{synchronize_rcu()} rather than in the middle
of that \co{xchg()} statement?
}\QuickQuizAnswer{
Which grace period, exactly?
The updater is required to wait for at least one grace
period that starts at or some time after the removal,
in this case, the \co{xchg()}.
So in
\cref{fig:defer:RCU Spatial/Temporal Synchronization},
the indicated grace period starts as early as theoretically
possible and extends to the return from \co{synchronize_rcu()}.
This is a perfectly legal grace period corresponding to the
change carried out by that \co{xchg()} statement.
}\QuickQuizEnd
In short, when operating on a suitable linked data structure, RCU
combines temporal and spatial synchronization in order to approximate
reader-writer locking, with RCU read-side critical sections acting as
the reader-writer-locking reader, as shown in
\cref{fig:defer:Relationships Between RCU Use Cases,fig:defer:RCU Spatial/Temporal Synchronization}.
RCU's temporal synchronization is provided by the read-side markers,
for example, \co{rcu_read_lock()} and \co{rcu_read_unlock()}, as well as
the update-side grace-period primitives, for example, \co{synchronize_rcu()}
or \co{call_rcu()}.
The spatial synchronization is provided by the read-side
\co{rcu_dereference()} family of primitives, each of which subscribes
to a version published by \co{rcu_assign_pointer()}.\footnote{
Preferably with both \co{rcu_dereference()} and
\co{rcu_assign_pointer()} being embedded in higher-level APIs.}
RCU's combining of temporal and spatial synchronization contrasts to
the schemes presented in
\cref{sec:SMPdesign:Code Locking,sec:SMPdesign:Data Locking,sec:locking:Inefficiency},
in which temporal and spatial synchronization are provided separately by
locking and by static data-structure layout, respectively.
\QuickQuiz{
Is RCU the only synchronization mechanism that combines temporal
and spatial synchronization in this way?
}\QuickQuizAnswer{
Not at all.
Hazard pointers can be considered to combine temporal and spatial
synchronization in a similar manner.
Referring to
\cref{lst:defer:Hazard-Pointer Recording and Clearing},
the \co{hp_record()} function's acquisition of a reference
provides both spatial and temporal synchronization, subscribing
to a version and marking the start of a reference, respectively.
This function therefore combines the effects of RCU's
\co{rcu_read_lock()} and \co{rcu_dereference()}.
Referring now to
\cref{lst:defer:Hazard-Pointer Scanning and Freeing},
the \co{hp_clear()} function's release of a reference provides
temporal synchronization marking the end of a reference, and is
thus similar to RCU's \co{rcu_read_unlock()}.
The \co{hazptr_free_later()} function's retiring of a
hazard-pointer-protected object provides temporal synchronization,
similar to RCU's \co{call_rcu()}.
The primitives used to mutate a hazard-pointer-protected
structure provide spatial synchronization, similar to RCU's
\co{rcu_assign_pointer()}.
Alternatively, one could instead come at hazard pointers by
analogy with reference counting.
And, by doing so, learn that reference counting also combines
temporal and spatial synchronization as described above for
hazard pointers.
}\QuickQuizEnd
\subsubsection{Quasi Reference Count}
\label{sec:defer:Quasi Reference Count}
Because grace periods are not allowed to complete while
there is an RCU read-side critical section in progress,
the RCU read-side primitives may be used as a restricted
reference-counting mechanism.
For example, consider the following code fragment:
\begin{VerbatimN}
rcu_read_lock(); /* acquire reference. */
p = rcu_dereference(head);
/* do something with p. */
rcu_read_unlock(); /* release reference. */
\end{VerbatimN}
The combination of the \co{rcu_read_lock()} and \co{rcu_dereference()}
primitives can be thought of as acquiring a reference to \co{p},
because a grace period starting after the \co{rcu_dereference()}
assignment to \co{p} cannot possibly end until after we reach the matching
\co{rcu_read_unlock()}.
This reference-counting scheme is restricted in that it is forbidden
to wait for RCU grace periods within RCU read-side critical sections,
and also forbidden to hand off an RCU read-side critical section's
references from one task to another.
Regardless of these restrictions,
the following code can safely delete \co{p}:
\begin{VerbatimN}
spin_lock(&mylock);
p = head;
rcu_assign_pointer(head, NULL);
spin_unlock(&mylock);
/* Wait for all references to be released. */
synchronize_rcu();
kfree(p);
\end{VerbatimN}
The assignment to \co{head} prevents any future references
to \co{p} from being acquired, and the \co{synchronize_rcu()}
waits for any previously acquired references to be released.
\QuickQuiz{
But wait!
This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking!
What gives?
}\QuickQuizAnswer{
This is an effect of the Law of Toy Examples:
Beyond a certain point, the code fragments look the same.
The only difference is in how we think about the code.
For example, what does an \co{atomic_inc()} operation do?
It might be acquiring another explicit reference to an object
to which we already have a reference, it might be incrementing
an often-read/seldom-updated statistical counter, it might
be checking into an HPC-style barrier, or any of a number of
other things.
However, these differences can be extremely important.
For but one example of the importance, consider that if we think
of RCU as a restricted reference counting scheme, we would never
be fooled into thinking that the updates would exclude the RCU
read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement
for reader-writer locking, for example, when you are replacing
reader-writer locking with RCU\@.
}\QuickQuizEnd
Of course, RCU can also be combined with traditional reference counting,
as discussed in
\cref{sec:together:Refurbish Reference Counting}.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refcntRCUperf}}
\caption{Performance of RCU vs.\@ Reference Counting}
\label{fig:defer:Performance of RCU vs. Reference Counting}
\end{figure}
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refRCUperfPREEMPT}}
\caption{Performance of Preemptible RCU vs.\@ Reference Counting}
\label{fig:defer:Performance of Preemptible RCU vs. Reference Counting}
\end{figure}
But why bother?
Again, part of the answer is performance, as shown in
\cref{fig:defer:Performance of RCU vs. Reference Counting,%
fig:defer:Performance of Preemptible RCU vs. Reference Counting},
again showing data taken on a 448-CPU 2.1\,GHz Intel x86 system
for non-preemptible and preemptible Linux-kernel RCU, respectively.
Non-preemptible RCU's advantage over
\IXalt{reference counting}{reference count} ranges from
more than an order of magnitude at one CPU up to about four orders of
magnitude at 192~CPUs.
Preemptible RCU's advantage ranges from about a factor of three at
one CPU up to about three orders of magnitude at 192~CPUs.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/defer/data/rcuscale.hps.2020.05.28a/refRCUperfwt}}
\caption{Response Time of RCU vs.\@ Reference Counting, 192 CPUs}
\label{fig:defer:Response Time of RCU vs. Reference Counting}
\end{figure}
However, as with reader-writer locking, the performance advantages of
RCU are most pronounced for short-duration critical sections and for
large numbers of CPUs, as shown in
\cref{fig:defer:Response Time of RCU vs. Reference Counting}
for the same system.
In addition, as with reader-writer locking, many system calls (and thus
any RCU read-side critical sections that they contain) complete in
a few microseconds.
Although traditional reference counters are usually associated with a
specific data structure, or perhaps a specific group of data structures,
this approach does have some disadvantages.
For example, maintaining a single global reference counter for a large
variety of data structures typically results in bouncing the cache line
containing the reference count.
As we saw in
\crefrange{fig:defer:Performance of RCU vs. Reference Counting}{fig:defer:Response Time of RCU vs. Reference Counting},
such cache-line bouncing can severely degrade performance.
In contrast, RCU's lightweight \co{rcu_read_lock()},
\co{rcu_dereference()}, and \co{rcu_read_unlock()} read-side primitives
permit extremely frequent read-side usage with negligible performance
degradation.
Except that the calls to \co{rcu_dereference()} are not doing anything
specific to acquire a reference to the pointed-to object.
The heavy lifting is instead done by the \co{rcu_read_lock()} and
\co{rcu_read_unlock()} primitives and their interactions with RCU
grace periods.
And ignoring those calls to \co{rcu_dereference()} permits RCU to be
thought of as a ``bulk reference-counting'' mechanism, where each call
to \co{rcu_read_lock()} obtains a reference on each and every RCU-protected
object, and with little or no overhead.
However, the restrictions that go with RCU can be quite onerous.
For example, in many cases, the Linux-kernel prohibition against
sleeping while in an RCU read-side critical section would defeat the
entire purpose.
Such cases might be better served by the hazard pointers mechanism
described in \cref{sec:defer:Hazard Pointers}.
Cases where code rarely sleeps have been handled by using RCU as a
reference count in the common non-sleeping case and by bridging
to an explicit reference counter when sleeping is necessary.
Alternatively, situations where a reference must be held by a single task
across a section of code that sleeps may be accommodated with Sleepable
RCU (SRCU)~\cite{PaulEMcKenney2006c}.
This fails to cover the not-uncommon situation where a reference is ``passed''
from one task to another, for example, when a reference is acquired
when starting an I/O and released in the corresponding completion
interrupt handler.
Again, such cases might be better handled by explicit reference counters
or by hazard pointers.
Of course, SRCU brings restrictions of its own, namely that the
return value from \co{srcu_read_lock()} be passed into the
corresponding \co{srcu_read_unlock()}, and that no SRCU primitives
be invoked from hardware interrupt handlers or from \IXacrf{nmi}
handlers.
The jury is still out as to how much of a problem is presented by
this restriction, and as to how it can best be handled.
However, in the common case where references are held within the confines
of a single CPU or task, RCU can be used as high-performance and highly
scalable reference-counting mechanism.
As shown in \cref{fig:defer:Relationships Between RCU Use Cases},
quasi reference counts add RCU readers as individual or bulk
reference counts, possibly also bridging to reference counters
in corner cases.
\subsubsection{Quasi Multi-Version Concurrency Control}
\label{sec:defer:Quasi Multi-Version Concurrency Control}
RCU can also be thought of as a simplified multi-version concurrency
control (MVCC) mechanism with weak consistency criteria.
The multi-version aspects were touched upon in
\cref{sec:defer:Maintain Multiple Versions of Recently Updated Objects}.
However, in its native form, RCU provides version consistency only
within a given RCU-protected data element.
Nevertheless, there are situations where consistency and fresh data are
required across multiple data elements.
Fortunately, there are a number of approaches that avoid inconsistency
and stale data, including the following:
\begin{enumerate}
\item Enclose RCU readers within sequence-locking readers, forcing
the RCU readers to be retried should an update occur,
as described in
\cref{sec:together:Correlated Data Elements}
and
\cref{sec:together:Atomic Move}.
\item Place the data that must be consistent into a single element
of a linked data structure, and refrain from updating those
fields within any element visible to RCU readers.
RCU readers gaining a reference to any such element are then
guaranteed to see consistent values.
See \cref{sec:together:Correlated Fields} for additional details.
\item Use a per-element lock that guards a ``deleted'' flag to allow
RCU readers to reject stale
data~\cite{PaulEdwardMcKenneyPhD,Arcangeli03}.
\item Provide an existence flag that is referenced by all data elements
whose update is to appear atomic to RCU
readers~\cite{PaulEMcKennneyAtomicTreeN4037,PaulEMcKennneyAtomicTreeCPPCON2014,PaulEMcKenneyIssaquahUpdate2015,PaulEMcKenney2016IssaquahACMApp,PaulEMcKenney2016IssaquahCPPCON}.
\item Use one of a wide range of counter-based
methods~\cite{PaulEMcKenney2008cyclicRCU,PaulEMcKenney2010cyclicRCU,PaulEMcKenney2011cyclicparallelRCU,PaulEMcKenney2014cyclicRCU,Matveev:2015:RLS:2815400.2815406,Kim:2019:MSR:3297858.3304040}.
In these approaches, updaters maintain a version number and maintain links
to old versions of a given piece of data.
Readers take a snapshot of the current version number, and, if necessary,
traverse the links to find a version consistent with that snapshot.
\end{enumerate}
In short, when using RCU to approximate multi-version concurrency control,
you only pay for the level of consistency that you actually need.
As shown in \cref{fig:defer:Relationships Between RCU Use Cases},
quasi multi-version concurrency control is based on existence guarantees,
adding read-side snapshot operations and constraints on readers and
writers, the exact form of the constraint being dictated by the
consistency requirements, as summarized above.
\subsubsection{RCU Usage Summary}
\label{sec:defer:RCU Usage Summary}
At its core, RCU is nothing more nor less than an API that provides:
\begin{enumerate}
\item A publish-subscribe mechanism for adding new data,
\item A way of waiting for pre-existing RCU readers to finish, and
\item A discipline of maintaining multiple versions to permit change
without harming or unduly delaying concurrent RCU readers.
\end{enumerate}
That said, it is possible to build higher-level constructs on top of RCU,
including the various use cases described in the earlier sections.
Furthermore, I have no doubt that new use cases will continue to be
found for RCU, as well as for any of a number of other synchronization
primitives.
And so it is that RCU's use cases are conceptually more complex than
is RCU itself, as hinted on
\cpageref{sec:defer:Mysteries RCU Use Cases}.
\QuickQuiz{
Which of these use cases best describes the Pre-BSD routing
example in
\cref{sec:defer:RCU for Pre-BSD Routing}?
}\QuickQuizAnswer{
Pre-BSD routing could be argued to fit into either
quasi reader-writer lock, quasi reference count, or
quasi multi-version concurrency control.
The code is the same either way.
This is similar to things like \co{atomic_inc()}, another tool
that can be put to a great many uses.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{defer/RCUApplicability}}
\caption{RCU Areas of Applicability}
\label{fig:defer:RCU Areas of Applicability}
\end{figure}
In the meantime,
\cref{fig:defer:RCU Areas of Applicability}
shows some rough rules of thumb on where RCU is most helpful.
As shown in the blue box in the upper-right corner of the figure, RCU
works best if you have read-mostly data where stale and inconsistent
data is permissible (but see below for more information on stale and
inconsistent data).
The canonical example of this case in the Linux kernel is routing tables.
Because it may have taken many seconds or even minutes for the
routing updates to propagate across the Internet, the system
has been sending packets the wrong way for quite some time.
Having some small probability of continuing to send some of them the wrong
way for a few more milliseconds is almost never a problem.
If you have a read-mostly workload where consistent data is required,
RCU works well, as shown by the green ``read-mostly, need consistent data''
box.
One example of this case is the Linux kernel's mapping from user-level
System-V semaphore IDs to the corresponding in-kernel data structures.
Semaphores tend to be used far more frequently than they are created
and destroyed, so this mapping is read-mostly.
However, it would be erroneous to perform a semaphore operation on
a semaphore that has already been deleted.
This need for consistency is handled by using the lock in the
in-kernel semaphore data structure, along with a ``deleted''
flag that is set when deleting a semaphore.
If a user ID maps to an in-kernel data structure with the
``deleted'' flag set, the data structure is ignored, so that
the user ID is flagged as invalid.
Although this requires that the readers acquire a lock for the
data structure representing the semaphore itself,
it allows them to dispense with locking for the
mapping data structure.
The readers therefore locklessly
traverse the tree used to map from ID to data structure,
which in turn greatly improves performance, scalability, and
real-time response.
As indicated by the yellow ``read-write'' box, RCU can also be useful
for read-write
workloads where consistent data is required, although usually in
conjunction with a number of other synchronization primitives.
For example, the directory-entry cache in recent Linux kernels uses RCU in
conjunction with sequence locks, per-CPU locks, and per-data-structure
locks to allow lockless traversal of pathnames in the common case.
Although RCU can be very beneficial in this read-write case, the
corresponding code is often more complex than that of the read-mostly
cases.
Finally, as indicated by the red box in the lower-left corner
of the figure, update-mostly workloads requiring consistent
data are rarely good places to use RCU, though there are some
exceptions~\cite{MathieuDesnoyers2012URCU}.
For example, as noted in
\cref{sec:defer:Type-Safe Memory},
within the Linux kernel, the \co{SLAB_TYPESAFE_BY_RCU}
slab-allocator flag provides type-safe memory to RCU readers, which can
greatly simplify \IXacrl{nbs} and other lockless
algorithms.
In addition, if the rare readers are on critical code paths on real-time
systems, use of RCU for those readers might provide real-time response
benefits that more than make up for the increased update-side overhead,
as discussed in \cref{sec:advsync:The Role of RCU}.
In short, RCU is an API that includes a publish-subscribe mechanism for
adding new data, a way of waiting for pre-existing RCU readers to finish,
and a discipline of maintaining multiple versions to allow updates to
avoid harming or unduly delaying concurrent RCU readers.
This RCU API is best suited for read-mostly situations, especially if
stale and inconsistent data can be tolerated by the application.