blob: 8846825f550ac5fefb94ddacee1968cefbd5dd96 [file] [log] [blame]
% defer/rcurelated.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\subsection{RCU Related Work}
\label{sec:defer:RCU Related Work}
\OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2014ReadMostly}
\OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2015ReadMostly}
The first known mention of anything resembling RCU took the form of a bug
report from
\ppl{Donald}{Knuth}~\cite[page 413 of Fundamental Algorithms]{Knuth73}
against \ppl{Joseph}{Weizenbaum}'s SLIP list-processing facility for
FORTRAN~\cite{Weizenbaum:1963:SLP:367593.367617}.
Knuth was justified in reporting the bug, as SLIP had no notion of
any sort of grace-period guarantee.
The first known non-bug-report mention of anything resembling RCU appeared
in \pplsur{H. T.}{Kung}'s and \pplsur{Philip L.}{Lehman}'s landmark
paper~\cite{Kung80}.
There was some additional use of this technique in
academia~\cite{Manber82,Manber84,BarbaraLiskov1988ArgusCACM,Pugh90,Andrews91textbook,Pu95a,Cowan96a,Rastogi:1997:LPV:645923.671017,Gamsa99},
but much of the work in this area was instead carried out by
practitioners~\cite{RichardRashid87a,Hennessy89,Jacobson93,AjuJohn95,Slingwine95,Slingwine97,Slingwine98,McKenney98}.
\QuickQuiz{
Garbage collectors?
Passive serialization?
System reference points?
Quiescent states?
Aging?
Generations?
Why on earth couldn't the knuckleheads working on these early
papers bring themselves to agree on a common terminology???
}\QuickQuizAnswer{
There were multiple independent inventions of mechanisms
vaguely resembling RCU\@.
Each group of inventors was unaware of the others, so each
made up its own terminology as a matter of course.
And the different terminology made it quite difficult for
any one group to find any of the others.
Sorry, but life is like that sometimes!
}\QuickQuizEnd
By the year 2000, the initiative had passed to open-source projects,
most notably the Linux kernel
community~\cite{RustyRussell2000a,RustyRussell2000b,McKenney01b,McKenney01a,McKenney02a,Arcangeli03}.\footnote{
A list of citations with well over 200 entries may be found in
\co{bib/RCU.bib} in the {\LaTeX} source for this book.}
RCU was accepted into the Linux kernel in late 2002, with many subsequent
improvements for scalability, robustness, real-time response, energy
efficiency, and specialized use cases.
As of 2023, Linux-kernel RCU is still under active development.
\QuickQuiz{
Why didn't Kung's and Lehman's paper result in immediate use
of RCU?
}\QuickQuizAnswer{
One reason is that Kung and Lehman were simply ahead of their
time.
Another reason was that their approach, ground-breaking though
it was, did not take a number of software-engineering and
performance issues into account.
To see that they were ahead of their time, consider that three
years after their paper was published, Paul was working on a
PDP-11 system running BSD 2.8.
This system lacked any sort of automatic configuration, which
meant that any hardware modification, including adding a new
disk drive, required hand-editing and rebuilding the kernel.
Furthermore, this was a single-CPU system, which meant that
full-system synchronization was a simple matter of disabling
interrupts.
Fast-forward a number of years, and multicore systems permitting
runtime changes in hardware configuration were commonplace.
This meant that the hardware configuration data that was implicitly
represented in 1980s kernel source code was now a mutable
data structure that was accessed on every I/O\@.
Such data structures rarely change, but could change at any time.
And this read-mostly property applies to many other new-age
data structures, including those concerning networking (rare in
the 1980s), security policies (physical locks in the 1980s),
software configuration (immutable at runtime in the 1980s),
and much else besides.
There was thus much more opportunity for RCU to demonstrate its
benefits in the 1990s and 2000s than there was in the 1980s.
Kung's and Lehman's software-engineering sins included failing
to mark readers (thus presenting debugging difficulties),
failing to provide a clean RCU API (thus tying their mechanism
to a specific data structure), and failing to allow for any
post-grace-period operation other than freeing memory (thus
disallowing a number of RCU use cases).
Kung and Lehman presented two garbage-collection strategies.
The first waited for all processes running at a given time
to terminate, which represented another software-engineering
sin that ruled out their mechanism's use in software that
runs indefinitely.
The second used per-object reference counting, which greatly
complicates their read-side code (thus representing yet
another software-engineering sin), and, on modern hardware,
results in severe cache-miss overhead (thus representing a
performance sin, see for example
\cref{fig:defer:Performance of RCU vs. Reference Counting,fig:defer:Performance of Preemptible RCU vs. Reference Counting}).
Despite this long list of software-engineering and performance
sins, Kung's and Lehman's paper remains a truly impressive piece
of work, especially considering that much of the later work
(both independent and not) committed these same sins, plus others
as well.
}\QuickQuizEnd
However, in the mid 2010s, there was a welcome upsurge in RCU research
and development across a number of communities and
institutions~\cite{FransKaashoek2015ParallelOSHistory}.
\Cref{sec:defer:RCU Uses} describes uses of RCU,
\cref{sec:defer:RCU Implementations} describes RCU implementations
(as well as work that both creates and uses an implementation),
and finally,
\cref{sec:defer:RCU Validation} describes verification and validation
of RCU and its uses.
\subsubsection{RCU Uses}
\label{sec:defer:RCU Uses}
\ppl{Phil}{Howard} and \ppl{Jon}{Walpole} of Portland State University
(PSU) have
applied RCU to red-black
trees~\cite{PhilHowardPhD,PhilHoward2011RCUTMRBTree} combined with updates
synchronized using software transactional memory.
\ppl{Josh}{Triplett} and \ppl{Jon}{Walpole} (again of PSU)
applied RCU to resizable
hash tables~\cite{JoshTriplettPhD,Triplett:2011:RPHash,JonCorbet2014RCUhash1,JonCorbet2014RCUhash2}.
Other RCU-protected resizable hash tables have been created by
\ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash} and by
\ppl{Mathieu}{Desnoyers}~\cite{PaulMcKenney2013LWNURCUhash}.
\ppl{Austin}{Clements}, \ppl{Frans}{Kaashoek}, and \ppl{Nickolai}{Zeldovich}
of MIT created an RCU-optimized balanced binary tree
(Bonsai)~\cite{AustinClements2012RCULinux:mmapsem}, and applied this
tree to the Linux kernel's VM subsystem in order to reduce read-side
contention on the Linux kernel's \co{mmap_sem}.
This work resulted in order-of-magnitude speedups and scalability up to
at least 80 CPUs for a microbenchmark featuring large numbers of minor
page faults.
This is similar to a patch developed earlier by
\ppl{Peter}{Zijlstra}~\cite{PeterZijlstra2014SpeculativePageFault}, and both
were limited by the fact that, at the time, filesystem data structures
were not safe for RCU readers.
\pplsur{Austin}{Clements} et al.\ avoided this limitation by
optimizing the page-fault
path for anonymous pages only.
More recently, filesystem data structures have been made safe for RCU
readers~\cite{JonathanCorbet2010dcacheRCU,JonathanCorbet2011dcacheRCUbug},
so perhaps this work can be implemented for all page types, not just
anonymous pages---\ppl{Peter}{Zijlstra} has, in fact, recently prototyped
exactly this, and \ppl{Laurent}{Dufour} \ppl{Michel}{Lespinasse} have
continued work along these lines.
For their part, \ppl{Matthew}{Wilcox} and \ppl{Liam}{Howlett} are working
towards use of RCU to enable fine-grained locking of and lockless access
to other memory-management data structures.
\ppl{Yandong}{Mao} and \ppl{Robert}{Morris} of MIT and \ppl{Eddie}{Kohler} of
Harvard University created another RCU-protected tree named
Masstree~\cite{Mao:2012:CCF:2168836.2168855} that combines ideas from B+
trees and tries.
Although this tree is about 2.5x slower than an RCU-protected hash table,
it supports operations on key ranges, unlike hash tables.
In addition, Masstree supports efficient storage of objects with long
shared key prefixes and, furthermore, provides persistence via logging
to mass storage.
The paper notes that Masstree's performance rivals that of memcached, even
given that Masstree is persistently storing updates and memcached is not.
The paper also compares Masstree's performance to the persistent
datastores MongoDB, VoltDB, and Redis, reporting significant performance
advantages for Masstree, in some cases exceeding two orders of magnitude.
Another paper~\cite{Tu:2013:STM:2517349.2522713}, by \ppl{Stephen}{Tu},
\ppl{Wenting}{Zheng}, \ppl{Barbara}{Liskov}, and \ppl{Samuel}{Madden}
of MIT and \pplsur{Eddie}{Kohler},
applies Masstree to an in-memory database named Silo, achieving 700K
transactions per second (42M transactions per minute) on a well-known
transaction-processing benchmark.
Interestingly enough, Silo guarantees linearizability without incurring
the overhead of grace periods while holding locks.
\ppl{Maya}{Arbel} and \ppl{Hagit}{Attiya} of Technion took a more rigorous
approach~\cite{MayaArbel2014RCUtree} to an RCU-protected search tree that,
like Masstree, allows concurrent updates.
This paper includes a proof of correctness, including proof that all
operations on this tree are \IX{linearizable}.
Unfortunately, this implementation achieves linearizability by incurring
the full latency of grace-period waits while holding locks, which degrades
scalability of update-only workloads.
One way around this problem is to abandon
linearizability~\cite{AndreasHaas2012FIFOisnt,PaulEMcKennneyAtomicTreeN4037},
however, Arbel and Attiya instead created an RCU variant that reduces
low-end grace-period latency.
Of course, nothing comes for free, and this RCU variant appears to hit
a scalability limit at about 32 CPUs.
Although there is much to be said for dropping linearizability, thus
gaining both performance and scalability, it is very good to see academics
experimenting with alternative RCU implementations.
\subsubsection{RCU Implementations}
\label{sec:defer:RCU Implementations}
\ppl{Timothy}{Harris} created a time-based user-space
RCU~\cite{TimothyLHarris2001} that improves on those created previously
by Jacobson~\cite{Jacobson93} and John~\cite{AjuJohn95}.
These prior two time-based approaches each assume a sharp upper bound on
reader duration, which can work correctly in hard real-time systems.
In non-real-time systems, this type of approach is subject to failure
when readers are interrupted, preempted, or otherwise delayed.
However, the fact that such a failure-prone implementation would be
independently invented twice shows the depth of the need for RCU-like
mechanisms.
\ppl{Timothy}{Harris} improves upon these two earlier efforts by
requiring each reader to take a snapshot of a global timebase before
starting its read-side traversal.
Freeing a reader-visible object is then deferred until all processes'
reader snapshots indicate a time following that of the removal of that object.
However, global timebases can be expensive and inaccurate on some systems.
\ppl{Keir}{Fraser} created a user-space RCU named \IXacr{ebr} for use in
non-blocking synchronization and software transactional
memory~\cite{KeirAnthonyFraserPhD,UCAM-CL-TR-579,KeirFraser2007withoutLocks}.
This work improves on that of \ppl{Timothy}{Harris} by replacing the
global clock with a software counter, thus eliminating much of the
expense and all of the inaccuracy associated with commodity-system
global clocks of that time.
Interestingly enough, this work cites Linux-kernel RCU on the one hand,
but also inspired the name \acr{qsbr} for the original non-preemptible
Linux-kernel RCU implementation.
\ppl{Mathieu}{Desnoyers} created a user-space RCU for use in
tracing~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyersPhD,MathieuDesnoyers2012URCU,PaulMcKenney2013LWNURCU,PaulMcKenney2013LWNURCUhash,PaulMcKenney2013LWNURCUhashAPI,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUatomicop,PaulMcKenney2013LWNURCUmenagerie,PaulMcKenney2013LWNURCUAPI,PaulMcKenney2013LWNURCUlist,PaulMcKenney2013LWNURCUmenagerieRCU},
which has seen use in a number of projects~\cite{MikeDay2013RCUqemu}.
Researchers at Charles University in Prague have also been
working on RCU implementations, including dissertations by
\ppl{Andrej}{Podzimek}~\cite{AndrejPodzimek2010masters} and
\ppl{Adam}{Hraska}~\cite{AdamHraska2013RCUHelenOS}.
\ppl{Yujie}{Liu} (Lehigh University), \ppl{Victor}{Luchangco} (Oracle Labs), and
\ppl{Michael}{Spear} (also Lehigh)~\cite{Liu:2013:MSA:2549695.2549732}
pressed scalable non-zero indicators
(SNZI)~\cite{FaithEllen:2007:SNZI} into service as a grace-period
mechanism.
The intended use is to implement software transactional memory
(see \cref{sec:future:Transactional Memory}), which
imposes linearizability requirements, which in turn seems to
limit scalability.
RCU-like mechanisms are also finding their way into Java.
\pplsur{KC}{Sivaramakrishnan} et al.~\cite{Sivaramakrishnan:2012:ERB:2258996.2259005}
use an RCU-like mechanism to eliminate the read barriers that are
otherwise required when interacting with Java's garbage collector,
resulting in significant performance improvements.
\ppl{Ran}{Liu}, \ppl{Heng}{Zhang}, and \ppl{Haibo}{Chen} of
Shanghai Jiao Tong University
created a specialized variant of RCU that they used for an optimized
``passive reader-writer lock''~\cite{RanLiu2014PassiveRWLock}, similar to
those created by \ppl{Gautham}{Shenoy}~\cite{GauthamShenoy2006RCUrwlock} and
\ppl{Srivatsa}{Bhat}~\cite{SrivatsaSBhat2014RCUrwlock}.
The Liu et al.\ paper is interesting from a number of
perspectives~\cite{PaulEMcKenney2014ReadMostly}.
\ppl{Mike}{Ash} posted~\cite{MikeAsh2015Apple} a description of an RCU-like
primitive in Apple's Objective-C runtime.
This approach identifies read-side critical sections via designated
code ranges, thus qualifying as another method of achieving
zero read-side overhead, albeit one that poses some interesting
practical challenges for large read-side critical sections that
span multiple functions.
\ppl{Pedro}{Ramalhete} and \ppl{Andreia}{Correia}~\cite{PedroRmalhete2015PoorMansRCU}
produced ``Poor Man's RCU'', which, despite using a pair of
\IXhpl{reader-writer}{lock}, manages to provide \IXalt{lock-free}{lock free}
\IXpl{forward-progress guarantee} to
readers~\cite{PaulEMcKenney2015ReadMostly}.
\ppl{Maya}{Arbel} and \ppl{Adam}{Morrison}~\cite{Arbel:2015:PRR:2858788.2688518}
produced ``Predicate RCU'', which works hard to reduce grace-period
duration in order to efficiently support algorithms that hold
update-side locks across grace periods.
This results in reduced batching of updates into grace periods
and reduced scalability, but does succeed in providing short
grace periods.
\QuickQuiz{
Why not just drop the lock before waiting for the grace
period, or using something like \co{call_rcu()}
instead of waiting for a grace period?
}\QuickQuizAnswer{
The authors wished to support \IX{linearizable} tree
operations, so that concurrent additions to, deletions
from, and searches of the tree would appear to execute
in some globally agreed-upon order.
In their search trees, this requires holding locks
across grace periods.
(It is probably better to drop linearizability as a
requirement in most cases, but linearizability is a
surprisingly popular (and costly!\@) requirement.)
}\QuickQuizEnd
\ppl{Alexander}{Matveev} (MIT), \ppl{Nir}{Shavit} (MIT and Tel-Aviv University),
\ppl{Pascal}{Felber} (University of Neuch\^{a}tel), and \ppl{Patrick}{Marlier} (also
University of Neuch\^{a}tel)~\cite{Matveev:2015:RLS:2815400.2815406}
produced an RCU-like mechanism that can be thought of as
software transactional memory that explicitly marks
read-only transactions.
Their use cases require holding locks across grace periods, which limits
scalability~\cite{PaulEMcKenney2015ReadMostly,PaulEMcKenney2015ReadMostlySidebar}.
This appears to be the first academic RCU-related work to
make good use of the \co{rcutorture} test suite, and also the
first to have submitted a performance improvement to Linux-kernel
RCU, which was accepted into v4.4.
\ppl{Alexander}{Matveev}'s RLU was followed up by MV-RLU from
\ppl{Jaeho}{Kim} et al.~\cite{Kim:2019:MSR:3297858.3304040}.
This work improves scalability over RLU by permitting multiple concurrent
updates, by avoiding holding locks across grace periods, and by using
asynchronous grace periods, for example, \co{call_rcu()} instead of
\co{synchronize_rcu()}.
This paper also made some interesting performance-evaluation choices that
are discussed further in
\cref{sec:future:Deferred Reclamation}
on
\cpageref{sec:future:Deferred Reclamation}.
\ppl{Adam}{Belay} et al.~created an RCU implementation that guards the
data structures used by TCP/IP's address-resolution protocol (ARP)
in their IX operating system~\cite{Belay:2016:IOS:3014162.2997641}.
\ppl{Geoff}{Romer} and \ppl{Andrew}{Hunter} (both at Google) proposed
a cell-based API for RCU
protection of singleton data structures for inclusion in the
C++ standard~\cite{GeoffRomer2018C++DeferredReclamationP0561R4}.
\ppl{Dimitrios}{Siakavaras} et al.~have applied
HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees},
\ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color
graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring},
and
\ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention
locking on \IXacr{numa} systems.
\ppl{Alex}{Kogan} et al.~applied RCU to the construction of range locking
for scalable address spaces~\cite{AlexKogan2020RCUrangelocks}.
On June 17, 2023, the ISO C++ Standards committee voted RCU into
C++26~\cite{PaulEMcKennney2023C++26RCU4}.
Production uses of RCU are listed in
\cref{sec:defer:Production Uses of RCU}.
\subsubsection{RCU Validation}
\label{sec:defer:RCU Validation}
In early 2017, it is commonly recognized that almost any bug is a potential
security exploit, so validation and verification are first-class concerns.
Researchers at Stony Brook University have produced an RCU-aware data-race
detector~\cite{AbhinavDuggal2010Masters,JustinSeyster2012PhD,Seyster:2011:RFA:2075416.2075425}.
\ppl{Alexey}{Gotsman} of IMDEA, \ppl{Noam}{Rinetzky} of Tel Aviv University,
and \ppl{Hongseok}{Yang} of the University of Oxford have published a
paper~\cite{AlexeyGotsman2012VerifyGraceExtended} expressing the formal
semantics of RCU in terms of separation logic, and have continued with
other aspects of concurrency.
\ppl{Joseph}{Tassarotti} (Carnegie-Mellon University), \ppl{Derek}{Dreyer} (Max
Planck Institute for Software Systems), and \ppl{Viktor}{Vafeiadis}
(also MPI-SWS)~\cite{JosephTassarotti2015RCUproof}
produced a manual formal proof of correctness of the \IXacrf{qsbr}
variant of userspace
RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}.
\ppl{Lihao}{Liang} (University of Oxford), \pplmdl{Paul E.}{McKenney} (IBM),
\ppl{Daniel}{Kroening}, and \ppl{Tom}{Melham}
(both also Oxford)~\cite{LihaoLiang2016VerifyTreeRCU}
used the \IXacrf{cbmc}~\cite{EdmundClarke2004CBMC}
to produce a mechanical proof of correctness of a significant portion
of Linux-kernel Tree RCU\@.
\ppl{Lance}{Roy}~\cite{LanceRoy2017CBMC-SRCU} used CBMC to produce a similar
proof of correctness for a significant portion of Linux-kernel
sleepable RCU (SRCU)~\cite{PaulEMcKenney2006c}.
Finally, \ppl{Michalis}{Kokologiannakis} and \ppl{Konstantinos}{Sagonas}
(National Technical University of
Athens)~\cite{MichalisKokologiannakis2017NidhuggRCU,MichalisKokologiannakis2019RCUstatelessModelCheck}
used the Nighugg tool~\cite{CarlLeonardsson2014Nidhugg}
to produce a mechanical proof of correctness of a somewhat larger
portion of Linux-kernel Tree RCU\@.
None of these efforts located any bugs other than bugs injected into
RCU specifically to test the verification tools.
In contrast,
\ppl{Alex}{Groce} (Oregon State University), \ppl{Iftekhar}{Ahmed},
\ppl{Carlos}{Jensen} (both also OSU), and \pplmdl{Paul E.}{McKenney}
(IBM)~\cite{Groce:2015:VMC:2916135.2916190}
automatically mutated Linux-kernel RCU's source code to test the
coverage of the \co{rcutorture} test suite.
The effort found several holes in this suite's coverage, one of which
was hiding a real bug (since fixed) in Tiny RCU\@.
With some luck, all of this validation work will eventually result in
more and better tools for validating concurrent code.