defer/rcurelated.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % defer/rcurelated.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \subsection{RCU Related Work}
 \label{sec:defer:RCU Related Work}
 \OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2014ReadMostly}
 \OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2015ReadMostly}

 The first known mention of anything resembling RCU took the form of a bug
 report from
 \ppl{Donald}{Knuth}~\cite[page 413 of Fundamental Algorithms]{Knuth73}
 against \ppl{Joseph}{Weizenbaum}'s SLIP list-processing facility for
 FORTRAN~\cite{Weizenbaum:1963:SLP:367593.367617}.
 Knuth was justified in reporting the bug, as SLIP had no notion of
 any sort of grace-period guarantee.

 The first known non-bug-report mention of anything resembling RCU appeared
 in \pplsur{H. T.}{Kung}'s and \pplsur{Philip L.}{Lehman}'s landmark
 paper~\cite{Kung80}.
 There was some additional use of this technique in
 academia~\cite{Manber82,Manber84,BarbaraLiskov1988ArgusCACM,Pugh90,Andrews91textbook,Pu95a,Cowan96a,Rastogi:1997:LPV:645923.671017,Gamsa99},
 but much of the work in this area was instead carried out by
 practitioners~\cite{RichardRashid87a,Hennessy89,Jacobson93,AjuJohn95,Slingwine95,Slingwine97,Slingwine98,McKenney98}.

 \QuickQuiz{
 	Garbage collectors?
 	Passive serialization?
 	System reference points?
 	Quiescent states?
 	Aging?
 	Generations?
 	Why on earth couldn't the knuckleheads working on these early
 	papers bring themselves to agree on a common terminology???
 }\QuickQuizAnswer{
 	There were multiple independent inventions of mechanisms
 	vaguely resembling RCU\@.
 	Each group of inventors was unaware of the others, so each
 	made up its own terminology as a matter of course.
 	And the different terminology made it quite difficult for
 	any one group to find any of the others.

 	Sorry, but life is like that sometimes!
 }\QuickQuizEnd

 By the year 2000, the initiative had passed to open-source projects,
 most notably the Linux kernel
 community~\cite{RustyRussell2000a,RustyRussell2000b,McKenney01b,McKenney01a,McKenney02a,Arcangeli03}.\footnote{
 	A list of citations with well over 200 entries may be found in
 	\co{bib/RCU.bib} in the {\LaTeX} source for this book.}
 RCU was accepted into the Linux kernel in late 2002, with many subsequent
 improvements for scalability, robustness, real-time response, energy
 efficiency, and specialized use cases.
 As of 2023, Linux-kernel RCU is still under active development.

 \QuickQuiz{
 	Why didn't Kung's and Lehman's paper result in immediate use
 	of RCU?
 }\QuickQuizAnswer{
 	One reason is that Kung and Lehman were simply ahead of their
 	time.
 	Another reason was that their approach, ground-breaking though
 	it was, did not take a number of software-engineering and
 	performance issues into account.

 	To see that they were ahead of their time, consider that three
 	years after their paper was published, Paul was working on a
 	PDP-11 system running BSD 2.8.
 	This system lacked any sort of automatic configuration, which
 	meant that any hardware modification, including adding a new
 	disk drive, required hand-editing and rebuilding the kernel.
 	Furthermore, this was a single-CPU system, which meant that
 	full-system synchronization was a simple matter of disabling
 	interrupts.

 	Fast-forward a number of years, and multicore systems permitting
 	runtime changes in hardware configuration were commonplace.
 	This meant that the hardware configuration data that was implicitly
 	represented in 1980s kernel source code was now a mutable
 	data structure that was accessed on every I/O\@.
 	Such data structures rarely change, but could change at any time.
 	And this read-mostly property applies to many other new-age
 	data structures, including those concerning networking (rare in
 	the 1980s), security policies (physical locks in the 1980s),
 	software configuration (immutable at runtime in the 1980s),
 	and much else besides.
 	There was thus much more opportunity for RCU to demonstrate its
 	benefits in the 1990s and 2000s than there was in the 1980s.

 	Kung's and Lehman's software-engineering sins included failing
 	to mark readers (thus presenting debugging difficulties),
 	failing to provide a clean RCU API (thus tying their mechanism
 	to a specific data structure), and failing to allow for any
 	post-grace-period operation other than freeing memory (thus
 	disallowing a number of RCU use cases).

 	Kung and Lehman presented two garbage-collection strategies.
 	The first waited for all processes running at a given time
 	to terminate, which represented another software-engineering
 	sin that ruled out their mechanism's use in software that
 	runs indefinitely.
 	The second used per-object reference counting, which greatly
 	complicates their read-side code (thus representing yet
 	another software-engineering sin), and, on modern hardware,
 	results in severe cache-miss overhead (thus representing a
 	performance sin, see for example
 	\cref{fig:defer:Performance of RCU vs. Reference Counting,fig:defer:Performance of Preemptible RCU vs. Reference Counting}).

 	Despite this long list of software-engineering and performance
 	sins, Kung's and Lehman's paper remains a truly impressive piece
 	of work, especially considering that much of the later work
 	(both independent and not) committed these same sins, plus others
 	as well.
 }\QuickQuizEnd

 However, in the mid 2010s, there was a welcome upsurge in RCU research
 and development across a number of communities and
 institutions~\cite{FransKaashoek2015ParallelOSHistory}.
 \Cref{sec:defer:RCU Uses} describes uses of RCU,
 \cref{sec:defer:RCU Implementations} describes RCU implementations
 (as well as work that both creates and uses an implementation),
 and finally,
 \cref{sec:defer:RCU Validation} describes verification and validation
 of RCU and its uses.

 \subsubsection{RCU Uses}
 \label{sec:defer:RCU Uses}

 \ppl{Phil}{Howard} and \ppl{Jon}{Walpole} of Portland State University
 (PSU) have
 applied RCU to red-black
 trees~\cite{PhilHowardPhD,PhilHoward2011RCUTMRBTree} combined with updates
 synchronized using software transactional memory.
 \ppl{Josh}{Triplett} and \ppl{Jon}{Walpole} (again of PSU)
 applied RCU to resizable
 hash tables~\cite{JoshTriplettPhD,Triplett:2011:RPHash,JonCorbet2014RCUhash1,JonCorbet2014RCUhash2}.
 Other RCU-protected resizable hash tables have been created by
 \ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash} and by
 \ppl{Mathieu}{Desnoyers}~\cite{PaulMcKenney2013LWNURCUhash}.

 \ppl{Austin}{Clements}, \ppl{Frans}{Kaashoek}, and \ppl{Nickolai}{Zeldovich}
 of MIT created an RCU-optimized balanced binary tree
 (Bonsai)~\cite{AustinClements2012RCULinux:mmapsem}, and applied this
 tree to the Linux kernel's VM subsystem in order to reduce read-side
 contention on the Linux kernel's \co{mmap_sem}.
 This work resulted in order-of-magnitude speedups and scalability up to
 at least 80 CPUs for a microbenchmark featuring large numbers of minor
 page faults.
 This is similar to a patch developed earlier by
 \ppl{Peter}{Zijlstra}~\cite{PeterZijlstra2014SpeculativePageFault}, and both
 were limited by the fact that, at the time, filesystem data structures
 were not safe for RCU readers.
 \pplsur{Austin}{Clements} et al.\ avoided this limitation by
 optimizing the page-fault
 path for anonymous pages only.
 More recently, filesystem data structures have been made safe for RCU
 readers~\cite{JonathanCorbet2010dcacheRCU,JonathanCorbet2011dcacheRCUbug},
 so perhaps this work can be implemented for all page types, not just
 anonymous pages---\ppl{Peter}{Zijlstra} has, in fact, recently prototyped
 exactly this, and \ppl{Laurent}{Dufour} \ppl{Michel}{Lespinasse} have
 continued work along these lines.
 For their part, \ppl{Matthew}{Wilcox} and \ppl{Liam}{Howlett} are working
 towards use of RCU to enable fine-grained locking of and lockless access
 to other memory-management data structures.

 \ppl{Yandong}{Mao} and \ppl{Robert}{Morris} of MIT and \ppl{Eddie}{Kohler} of
 Harvard University created another RCU-protected tree named
 Masstree~\cite{Mao:2012:CCF:2168836.2168855} that combines ideas from B+
 trees and tries.
 Although this tree is about 2.5x slower than an RCU-protected hash table,
 it supports operations on key ranges, unlike hash tables.
 In addition, Masstree supports efficient storage of objects with long
 shared key prefixes and, furthermore, provides persistence via logging
 to mass storage.

 The paper notes that Masstree's performance rivals that of memcached, even
 given that Masstree is persistently storing updates and memcached is not.
 The paper also compares Masstree's performance to the persistent
 datastores MongoDB, VoltDB, and Redis, reporting significant performance
 advantages for Masstree, in some cases exceeding two orders of magnitude.
 Another paper~\cite{Tu:2013:STM:2517349.2522713}, by \ppl{Stephen}{Tu},
 \ppl{Wenting}{Zheng}, \ppl{Barbara}{Liskov}, and \ppl{Samuel}{Madden}
 of MIT and \pplsur{Eddie}{Kohler},
 applies Masstree to an in-memory database named Silo, achieving 700K
 transactions per second (42M transactions per minute) on a well-known
 transaction-processing benchmark.
 Interestingly enough, Silo guarantees linearizability without incurring
 the overhead of grace periods while holding locks.

 \ppl{Maya}{Arbel} and \ppl{Hagit}{Attiya} of Technion took a more rigorous
 approach~\cite{MayaArbel2014RCUtree} to an RCU-protected search tree that,
 like Masstree, allows concurrent updates.
 This paper includes a proof of correctness, including proof that all
 operations on this tree are \IX{linearizable}.
 Unfortunately, this implementation achieves linearizability by incurring
 the full latency of grace-period waits while holding locks, which degrades
 scalability of update-only workloads.
 One way around this problem is to abandon
 linearizability~\cite{AndreasHaas2012FIFOisnt,PaulEMcKennneyAtomicTreeN4037},
 however, Arbel and Attiya instead created an RCU variant that reduces
 low-end grace-period latency.
 Of course, nothing comes for free, and this RCU variant appears to hit
 a scalability limit at about 32 CPUs.
 Although there is much to be said for dropping linearizability, thus
 gaining both performance and scalability, it is very good to see academics
 experimenting with alternative RCU implementations.

 \subsubsection{RCU Implementations}
 \label{sec:defer:RCU Implementations}

 \ppl{Timothy}{Harris} created a time-based user-space
 RCU~\cite{TimothyLHarris2001} that improves on those created previously
 by Jacobson~\cite{Jacobson93} and John~\cite{AjuJohn95}.
 These prior two time-based approaches each assume a sharp upper bound on
 reader duration, which can work correctly in hard real-time systems.
 In non-real-time systems, this type of approach is subject to failure
 when readers are interrupted, preempted, or otherwise delayed.
 However, the fact that such a failure-prone implementation would be
 independently invented twice shows the depth of the need for RCU-like
 mechanisms.
 \ppl{Timothy}{Harris} improves upon these two earlier efforts by
 requiring each reader to take a snapshot of a global timebase before
 starting its read-side traversal.
 Freeing a reader-visible object is then deferred until all processes'
 reader snapshots indicate a time following that of the removal of that object.
 However, global timebases can be expensive and inaccurate on some systems.

 \ppl{Keir}{Fraser} created a user-space RCU named \IXacr{ebr} for use in
 non-blocking synchronization and software transactional
 memory~\cite{KeirAnthonyFraserPhD,UCAM-CL-TR-579,KeirFraser2007withoutLocks}.
 This work improves on that of \ppl{Timothy}{Harris} by replacing the
 global clock with a software counter, thus eliminating much of the
 expense and all of the inaccuracy associated with commodity-system
 global clocks of that time.
 Interestingly enough, this work cites Linux-kernel RCU on the one hand,
 but also inspired the name \acr{qsbr} for the original non-preemptible
 Linux-kernel RCU implementation.

 \ppl{Mathieu}{Desnoyers} created a user-space RCU for use in
 tracing~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyersPhD,MathieuDesnoyers2012URCU,PaulMcKenney2013LWNURCU,PaulMcKenney2013LWNURCUhash,PaulMcKenney2013LWNURCUhashAPI,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUatomicop,PaulMcKenney2013LWNURCUmenagerie,PaulMcKenney2013LWNURCUAPI,PaulMcKenney2013LWNURCUlist,PaulMcKenney2013LWNURCUmenagerieRCU},
 which has seen use in a number of projects~\cite{MikeDay2013RCUqemu}.

 Researchers at Charles University in Prague have also been
 working on RCU implementations, including dissertations by
 \ppl{Andrej}{Podzimek}~\cite{AndrejPodzimek2010masters} and
 \ppl{Adam}{Hraska}~\cite{AdamHraska2013RCUHelenOS}.

 \ppl{Yujie}{Liu} (Lehigh University), \ppl{Victor}{Luchangco} (Oracle Labs), and
 \ppl{Michael}{Spear} (also Lehigh)~\cite{Liu:2013:MSA:2549695.2549732}
 pressed scalable non-zero indicators
 (SNZI)~\cite{FaithEllen:2007:SNZI} into service as a grace-period
 mechanism.
 The intended use is to implement software transactional memory
 (see \cref{sec:future:Transactional Memory}), which
 imposes linearizability requirements, which in turn seems to
 limit scalability.

 RCU-like mechanisms are also finding their way into Java.
 \pplsur{KC}{Sivaramakrishnan} et al.~\cite{Sivaramakrishnan:2012:ERB:2258996.2259005}
 use an RCU-like mechanism to eliminate the read barriers that are
 otherwise required when interacting with Java's garbage collector,
 resulting in significant performance improvements.

 \ppl{Ran}{Liu}, \ppl{Heng}{Zhang}, and \ppl{Haibo}{Chen} of
 Shanghai Jiao Tong University
 created a specialized variant of RCU that they used for an optimized
 ``passive reader-writer lock''~\cite{RanLiu2014PassiveRWLock}, similar to
 those created by \ppl{Gautham}{Shenoy}~\cite{GauthamShenoy2006RCUrwlock} and
 \ppl{Srivatsa}{Bhat}~\cite{SrivatsaSBhat2014RCUrwlock}.
 The Liu et al.\ paper is interesting from a number of
 perspectives~\cite{PaulEMcKenney2014ReadMostly}.

 \ppl{Mike}{Ash} posted~\cite{MikeAsh2015Apple} a description of an RCU-like
 primitive in Apple's Objective-C runtime.
 This approach identifies read-side critical sections via designated
 code ranges, thus qualifying as another method of achieving
 zero read-side overhead, albeit one that poses some interesting
 practical challenges for large read-side critical sections that
 span multiple functions.

 \ppl{Pedro}{Ramalhete} and \ppl{Andreia}{Correia}~\cite{PedroRmalhete2015PoorMansRCU}
 produced ``Poor Man's RCU'', which, despite using a pair of
 \IXhpl{reader-writer}{lock}, manages to provide \IXalt{lock-free}{lock free}
 \IXpl{forward-progress guarantee} to
 readers~\cite{PaulEMcKenney2015ReadMostly}.

 \ppl{Maya}{Arbel} and \ppl{Adam}{Morrison}~\cite{Arbel:2015:PRR:2858788.2688518}
 produced ``Predicate RCU'', which works hard to reduce grace-period
 duration in order to efficiently support algorithms that hold
 update-side locks across grace periods.
 This results in reduced batching of updates into grace periods
 and reduced scalability, but does succeed in providing short
 grace periods.

 \QuickQuiz{
 	Why not just drop the lock before waiting for the grace
 	period, or using something like \co{call_rcu()}
 	instead of waiting for a grace period?
 }\QuickQuizAnswer{
 	The authors wished to support \IX{linearizable} tree
 	operations, so that concurrent additions to, deletions
 	from, and searches of the tree would appear to execute
 	in some globally agreed-upon order.
 	In their search trees, this requires holding locks
 	across grace periods.
 	(It is probably better to drop linearizability as a
 	requirement in most cases, but linearizability is a
 	surprisingly popular (and costly!\@) requirement.)
 }\QuickQuizEnd

 \ppl{Alexander}{Matveev} (MIT), \ppl{Nir}{Shavit} (MIT and Tel-Aviv University),
 \ppl{Pascal}{Felber} (University of Neuch\^{a}tel), and \ppl{Patrick}{Marlier} (also
 University of Neuch\^{a}tel)~\cite{Matveev:2015:RLS:2815400.2815406}
 produced an RCU-like mechanism that can be thought of as
 software transactional memory that explicitly marks
 read-only transactions.
 Their use cases require holding locks across grace periods, which limits
 scalability~\cite{PaulEMcKenney2015ReadMostly,PaulEMcKenney2015ReadMostlySidebar}.
 This appears to be the first academic RCU-related work to
 make good use of the \co{rcutorture} test suite, and also the
 first to have submitted a performance improvement to Linux-kernel
 RCU, which was accepted into v4.4.

 \ppl{Alexander}{Matveev}'s RLU was followed up by MV-RLU from
 \ppl{Jaeho}{Kim} et al.~\cite{Kim:2019:MSR:3297858.3304040}.
 This work improves scalability over RLU by permitting multiple concurrent
 updates, by avoiding holding locks across grace periods, and by using
 asynchronous grace periods, for example, \co{call_rcu()} instead of
 \co{synchronize_rcu()}.
 This paper also made some interesting performance-evaluation choices that
 are discussed further in
 \cref{sec:future:Deferred Reclamation}
 on
 \cpageref{sec:future:Deferred Reclamation}.

 \ppl{Adam}{Belay} et al.~created an RCU implementation that guards the
 data structures used by TCP/IP's address-resolution protocol (ARP)
 in their IX operating system~\cite{Belay:2016:IOS:3014162.2997641}.

 \ppl{Geoff}{Romer} and \ppl{Andrew}{Hunter} (both at Google) proposed
 a cell-based API for RCU
 protection of singleton data structures for inclusion in the
 C++ standard~\cite{GeoffRomer2018C++DeferredReclamationP0561R4}.

 \ppl{Dimitrios}{Siakavaras} et al.~have applied
 HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees},
 \ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color
 graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring},
 and
 \ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention
 locking on \IXacr{numa} systems.

 \ppl{Alex}{Kogan} et al.~applied RCU to the construction of range locking
 for scalable address spaces~\cite{AlexKogan2020RCUrangelocks}.

 On June 17, 2023, the ISO C++ Standards committee voted RCU into
 C++26~\cite{PaulEMcKennney2023C++26RCU4}.

 Production uses of RCU are listed in
 \cref{sec:defer:Production Uses of RCU}.

 \subsubsection{RCU Validation}
 \label{sec:defer:RCU Validation}

 In early 2017, it is commonly recognized that almost any bug is a potential
 security exploit, so validation and verification are first-class concerns.

 Researchers at Stony Brook University have produced an RCU-aware data-race
 detector~\cite{AbhinavDuggal2010Masters,JustinSeyster2012PhD,Seyster:2011:RFA:2075416.2075425}.
 \ppl{Alexey}{Gotsman} of IMDEA, \ppl{Noam}{Rinetzky} of Tel Aviv University,
 and \ppl{Hongseok}{Yang} of the University of Oxford have published a
 paper~\cite{AlexeyGotsman2012VerifyGraceExtended} expressing the formal
 semantics of RCU in terms of separation logic, and have continued with
 other aspects of concurrency.

 \ppl{Joseph}{Tassarotti} (Carnegie-Mellon University), \ppl{Derek}{Dreyer} (Max
 Planck Institute for Software Systems), and \ppl{Viktor}{Vafeiadis}
 (also MPI-SWS)~\cite{JosephTassarotti2015RCUproof}
 produced a manual formal proof of correctness of the \IXacrf{qsbr}
 variant of userspace
 RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}.
 \ppl{Lihao}{Liang} (University of Oxford), \pplmdl{Paul E.}{McKenney} (IBM),
 \ppl{Daniel}{Kroening}, and \ppl{Tom}{Melham}
 (both also Oxford)~\cite{LihaoLiang2016VerifyTreeRCU}
 used the \IXacrf{cbmc}~\cite{EdmundClarke2004CBMC}
 to produce a mechanical proof of correctness of a significant portion
 of Linux-kernel Tree RCU\@.
 \ppl{Lance}{Roy}~\cite{LanceRoy2017CBMC-SRCU} used CBMC to produce a similar
 proof of correctness for a significant portion of Linux-kernel
 sleepable RCU (SRCU)~\cite{PaulEMcKenney2006c}.
 Finally, \ppl{Michalis}{Kokologiannakis} and \ppl{Konstantinos}{Sagonas}
 (National Technical University of
 Athens)~\cite{MichalisKokologiannakis2017NidhuggRCU,MichalisKokologiannakis2019RCUstatelessModelCheck}
 used the Nighugg tool~\cite{CarlLeonardsson2014Nidhugg}
 to produce a mechanical proof of correctness of a somewhat larger
 portion of Linux-kernel Tree RCU\@.

 None of these efforts located any bugs other than bugs injected into
 RCU specifically to test the verification tools.
 In contrast,
 \ppl{Alex}{Groce} (Oregon State University), \ppl{Iftekhar}{Ahmed},
 \ppl{Carlos}{Jensen} (both also OSU), and \pplmdl{Paul E.}{McKenney}
 (IBM)~\cite{Groce:2015:VMC:2916135.2916190}
 automatically mutated Linux-kernel RCU's source code to test the
 coverage of the \co{rcutorture} test suite.
 The effort found several holes in this suite's coverage, one of which
 was hiding a real bug (since fixed) in Tiny RCU\@.

 With some luck, all of this validation work will eventually result in
 more and better tools for validating concurrent code.