| % defer/rcurelated.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \subsection{RCU Related Work} |
| \label{sec:defer:RCU Related Work} |
| \OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2014ReadMostly} |
| \OriginallyPublished{sec:defer:RCU Related Work}{RCU Related Work}{Linux Weekly News}{PaulEMcKenney2015ReadMostly} |
| |
| The first known mention of anything resembling RCU took the form of a bug |
| report from |
| \ppl{Donald}{Knuth}~\cite[page 413 of Fundamental Algorithms]{Knuth73} |
| against \ppl{Joseph}{Weizenbaum}'s SLIP list-processing facility for |
| FORTRAN~\cite{Weizenbaum:1963:SLP:367593.367617}. |
| Knuth was justified in reporting the bug, as SLIP had no notion of |
| any sort of grace-period guarantee. |
| |
| The first known non-bug-report mention of anything resembling RCU appeared |
| in \pplsur{H. T.}{Kung}'s and \pplsur{Philip L.}{Lehman}'s landmark |
| paper~\cite{Kung80}. |
| There was some additional use of this technique in |
| academia~\cite{Manber82,Manber84,BarbaraLiskov1988ArgusCACM,Pugh90,Andrews91textbook,Pu95a,Cowan96a,Rastogi:1997:LPV:645923.671017,Gamsa99}, |
| but much of the work in this area was instead carried out by |
| practitioners~\cite{RichardRashid87a,Hennessy89,Jacobson93,AjuJohn95,Slingwine95,Slingwine97,Slingwine98,McKenney98}. |
| |
| \QuickQuiz{ |
| Garbage collectors? |
| Passive serialization? |
| System reference points? |
| Quiescent states? |
| Aging? |
| Generations? |
| Why on earth couldn't the knuckleheads working on these early |
| papers bring themselves to agree on a common terminology??? |
| }\QuickQuizAnswer{ |
| There were multiple independent inventions of mechanisms |
| vaguely resembling RCU\@. |
| Each group of inventors was unaware of the others, so each |
| made up its own terminology as a matter of course. |
| And the different terminology made it quite difficult for |
| any one group to find any of the others. |
| |
| Sorry, but life is like that sometimes! |
| }\QuickQuizEnd |
| |
| By the year 2000, the initiative had passed to open-source projects, |
| most notably the Linux kernel |
| community~\cite{RustyRussell2000a,RustyRussell2000b,McKenney01b,McKenney01a,McKenney02a,Arcangeli03}.\footnote{ |
| A list of citations with well over 200 entries may be found in |
| \co{bib/RCU.bib} in the {\LaTeX} source for this book.} |
| RCU was accepted into the Linux kernel in late 2002, with many subsequent |
| improvements for scalability, robustness, real-time response, energy |
| efficiency, and specialized use cases. |
| As of 2023, Linux-kernel RCU is still under active development. |
| |
| \QuickQuiz{ |
| Why didn't Kung's and Lehman's paper result in immediate use |
| of RCU? |
| }\QuickQuizAnswer{ |
| One reason is that Kung and Lehman were simply ahead of their |
| time. |
| Another reason was that their approach, ground-breaking though |
| it was, did not take a number of software-engineering and |
| performance issues into account. |
| |
| To see that they were ahead of their time, consider that three |
| years after their paper was published, Paul was working on a |
| PDP-11 system running BSD 2.8. |
| This system lacked any sort of automatic configuration, which |
| meant that any hardware modification, including adding a new |
| disk drive, required hand-editing and rebuilding the kernel. |
| Furthermore, this was a single-CPU system, which meant that |
| full-system synchronization was a simple matter of disabling |
| interrupts. |
| |
| Fast-forward a number of years, and multicore systems permitting |
| runtime changes in hardware configuration were commonplace. |
| This meant that the hardware configuration data that was implicitly |
| represented in 1980s kernel source code was now a mutable |
| data structure that was accessed on every I/O\@. |
| Such data structures rarely change, but could change at any time. |
| And this read-mostly property applies to many other new-age |
| data structures, including those concerning networking (rare in |
| the 1980s), security policies (physical locks in the 1980s), |
| software configuration (immutable at runtime in the 1980s), |
| and much else besides. |
| There was thus much more opportunity for RCU to demonstrate its |
| benefits in the 1990s and 2000s than there was in the 1980s. |
| |
| Kung's and Lehman's software-engineering sins included failing |
| to mark readers (thus presenting debugging difficulties), |
| failing to provide a clean RCU API (thus tying their mechanism |
| to a specific data structure), and failing to allow for any |
| post-grace-period operation other than freeing memory (thus |
| disallowing a number of RCU use cases). |
| |
| Kung and Lehman presented two garbage-collection strategies. |
| The first waited for all processes running at a given time |
| to terminate, which represented another software-engineering |
| sin that ruled out their mechanism's use in software that |
| runs indefinitely. |
| The second used per-object reference counting, which greatly |
| complicates their read-side code (thus representing yet |
| another software-engineering sin), and, on modern hardware, |
| results in severe cache-miss overhead (thus representing a |
| performance sin, see for example |
| \cref{fig:defer:Performance of RCU vs. Reference Counting,fig:defer:Performance of Preemptible RCU vs. Reference Counting}). |
| |
| Despite this long list of software-engineering and performance |
| sins, Kung's and Lehman's paper remains a truly impressive piece |
| of work, especially considering that much of the later work |
| (both independent and not) committed these same sins, plus others |
| as well. |
| }\QuickQuizEnd |
| |
| However, in the mid 2010s, there was a welcome upsurge in RCU research |
| and development across a number of communities and |
| institutions~\cite{FransKaashoek2015ParallelOSHistory}. |
| \Cref{sec:defer:RCU Uses} describes uses of RCU, |
| \cref{sec:defer:RCU Implementations} describes RCU implementations |
| (as well as work that both creates and uses an implementation), |
| and finally, |
| \cref{sec:defer:RCU Validation} describes verification and validation |
| of RCU and its uses. |
| |
| \subsubsection{RCU Uses} |
| \label{sec:defer:RCU Uses} |
| |
| \ppl{Phil}{Howard} and \ppl{Jon}{Walpole} of Portland State University |
| (PSU) have |
| applied RCU to red-black |
| trees~\cite{PhilHowardPhD,PhilHoward2011RCUTMRBTree} combined with updates |
| synchronized using software transactional memory. |
| \ppl{Josh}{Triplett} and \ppl{Jon}{Walpole} (again of PSU) |
| applied RCU to resizable |
| hash tables~\cite{JoshTriplettPhD,Triplett:2011:RPHash,JonCorbet2014RCUhash1,JonCorbet2014RCUhash2}. |
| Other RCU-protected resizable hash tables have been created by |
| \ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash} and by |
| \ppl{Mathieu}{Desnoyers}~\cite{PaulMcKenney2013LWNURCUhash}. |
| |
| \ppl{Austin}{Clements}, \ppl{Frans}{Kaashoek}, and \ppl{Nickolai}{Zeldovich} |
| of MIT created an RCU-optimized balanced binary tree |
| (Bonsai)~\cite{AustinClements2012RCULinux:mmapsem}, and applied this |
| tree to the Linux kernel's VM subsystem in order to reduce read-side |
| contention on the Linux kernel's \co{mmap_sem}. |
| This work resulted in order-of-magnitude speedups and scalability up to |
| at least 80 CPUs for a microbenchmark featuring large numbers of minor |
| page faults. |
| This is similar to a patch developed earlier by |
| \ppl{Peter}{Zijlstra}~\cite{PeterZijlstra2014SpeculativePageFault}, and both |
| were limited by the fact that, at the time, filesystem data structures |
| were not safe for RCU readers. |
| \pplsur{Austin}{Clements} et al.\ avoided this limitation by |
| optimizing the page-fault |
| path for anonymous pages only. |
| More recently, filesystem data structures have been made safe for RCU |
| readers~\cite{JonathanCorbet2010dcacheRCU,JonathanCorbet2011dcacheRCUbug}, |
| so perhaps this work can be implemented for all page types, not just |
| anonymous pages---\ppl{Peter}{Zijlstra} has, in fact, recently prototyped |
| exactly this, and \ppl{Laurent}{Dufour} \ppl{Michel}{Lespinasse} have |
| continued work along these lines. |
| For their part, \ppl{Matthew}{Wilcox} and \ppl{Liam}{Howlett} are working |
| towards use of RCU to enable fine-grained locking of and lockless access |
| to other memory-management data structures. |
| |
| \ppl{Yandong}{Mao} and \ppl{Robert}{Morris} of MIT and \ppl{Eddie}{Kohler} of |
| Harvard University created another RCU-protected tree named |
| Masstree~\cite{Mao:2012:CCF:2168836.2168855} that combines ideas from B+ |
| trees and tries. |
| Although this tree is about 2.5x slower than an RCU-protected hash table, |
| it supports operations on key ranges, unlike hash tables. |
| In addition, Masstree supports efficient storage of objects with long |
| shared key prefixes and, furthermore, provides persistence via logging |
| to mass storage. |
| |
| The paper notes that Masstree's performance rivals that of memcached, even |
| given that Masstree is persistently storing updates and memcached is not. |
| The paper also compares Masstree's performance to the persistent |
| datastores MongoDB, VoltDB, and Redis, reporting significant performance |
| advantages for Masstree, in some cases exceeding two orders of magnitude. |
| Another paper~\cite{Tu:2013:STM:2517349.2522713}, by \ppl{Stephen}{Tu}, |
| \ppl{Wenting}{Zheng}, \ppl{Barbara}{Liskov}, and \ppl{Samuel}{Madden} |
| of MIT and \pplsur{Eddie}{Kohler}, |
| applies Masstree to an in-memory database named Silo, achieving 700K |
| transactions per second (42M transactions per minute) on a well-known |
| transaction-processing benchmark. |
| Interestingly enough, Silo guarantees linearizability without incurring |
| the overhead of grace periods while holding locks. |
| |
| \ppl{Maya}{Arbel} and \ppl{Hagit}{Attiya} of Technion took a more rigorous |
| approach~\cite{MayaArbel2014RCUtree} to an RCU-protected search tree that, |
| like Masstree, allows concurrent updates. |
| This paper includes a proof of correctness, including proof that all |
| operations on this tree are \IX{linearizable}. |
| Unfortunately, this implementation achieves linearizability by incurring |
| the full latency of grace-period waits while holding locks, which degrades |
| scalability of update-only workloads. |
| One way around this problem is to abandon |
| linearizability~\cite{AndreasHaas2012FIFOisnt,PaulEMcKennneyAtomicTreeN4037}, |
| however, Arbel and Attiya instead created an RCU variant that reduces |
| low-end grace-period latency. |
| Of course, nothing comes for free, and this RCU variant appears to hit |
| a scalability limit at about 32 CPUs. |
| Although there is much to be said for dropping linearizability, thus |
| gaining both performance and scalability, it is very good to see academics |
| experimenting with alternative RCU implementations. |
| |
| \subsubsection{RCU Implementations} |
| \label{sec:defer:RCU Implementations} |
| |
| \ppl{Timothy}{Harris} created a time-based user-space |
| RCU~\cite{TimothyLHarris2001} that improves on those created previously |
| by Jacobson~\cite{Jacobson93} and John~\cite{AjuJohn95}. |
| These prior two time-based approaches each assume a sharp upper bound on |
| reader duration, which can work correctly in hard real-time systems. |
| In non-real-time systems, this type of approach is subject to failure |
| when readers are interrupted, preempted, or otherwise delayed. |
| However, the fact that such a failure-prone implementation would be |
| independently invented twice shows the depth of the need for RCU-like |
| mechanisms. |
| \ppl{Timothy}{Harris} improves upon these two earlier efforts by |
| requiring each reader to take a snapshot of a global timebase before |
| starting its read-side traversal. |
| Freeing a reader-visible object is then deferred until all processes' |
| reader snapshots indicate a time following that of the removal of that object. |
| However, global timebases can be expensive and inaccurate on some systems. |
| |
| \ppl{Keir}{Fraser} created a user-space RCU named \IXacr{ebr} for use in |
| non-blocking synchronization and software transactional |
| memory~\cite{KeirAnthonyFraserPhD,UCAM-CL-TR-579,KeirFraser2007withoutLocks}. |
| This work improves on that of \ppl{Timothy}{Harris} by replacing the |
| global clock with a software counter, thus eliminating much of the |
| expense and all of the inaccuracy associated with commodity-system |
| global clocks of that time. |
| Interestingly enough, this work cites Linux-kernel RCU on the one hand, |
| but also inspired the name \acr{qsbr} for the original non-preemptible |
| Linux-kernel RCU implementation. |
| |
| \ppl{Mathieu}{Desnoyers} created a user-space RCU for use in |
| tracing~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyersPhD,MathieuDesnoyers2012URCU,PaulMcKenney2013LWNURCU,PaulMcKenney2013LWNURCUhash,PaulMcKenney2013LWNURCUhashAPI,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUatomicop,PaulMcKenney2013LWNURCUmenagerie,PaulMcKenney2013LWNURCUAPI,PaulMcKenney2013LWNURCUlist,PaulMcKenney2013LWNURCUmenagerieRCU}, |
| which has seen use in a number of projects~\cite{MikeDay2013RCUqemu}. |
| |
| Researchers at Charles University in Prague have also been |
| working on RCU implementations, including dissertations by |
| \ppl{Andrej}{Podzimek}~\cite{AndrejPodzimek2010masters} and |
| \ppl{Adam}{Hraska}~\cite{AdamHraska2013RCUHelenOS}. |
| |
| \ppl{Yujie}{Liu} (Lehigh University), \ppl{Victor}{Luchangco} (Oracle Labs), and |
| \ppl{Michael}{Spear} (also Lehigh)~\cite{Liu:2013:MSA:2549695.2549732} |
| pressed scalable non-zero indicators |
| (SNZI)~\cite{FaithEllen:2007:SNZI} into service as a grace-period |
| mechanism. |
| The intended use is to implement software transactional memory |
| (see \cref{sec:future:Transactional Memory}), which |
| imposes linearizability requirements, which in turn seems to |
| limit scalability. |
| |
| RCU-like mechanisms are also finding their way into Java. |
| \pplsur{KC}{Sivaramakrishnan} et al.~\cite{Sivaramakrishnan:2012:ERB:2258996.2259005} |
| use an RCU-like mechanism to eliminate the read barriers that are |
| otherwise required when interacting with Java's garbage collector, |
| resulting in significant performance improvements. |
| |
| \ppl{Ran}{Liu}, \ppl{Heng}{Zhang}, and \ppl{Haibo}{Chen} of |
| Shanghai Jiao Tong University |
| created a specialized variant of RCU that they used for an optimized |
| ``passive reader-writer lock''~\cite{RanLiu2014PassiveRWLock}, similar to |
| those created by \ppl{Gautham}{Shenoy}~\cite{GauthamShenoy2006RCUrwlock} and |
| \ppl{Srivatsa}{Bhat}~\cite{SrivatsaSBhat2014RCUrwlock}. |
| The Liu et al.\ paper is interesting from a number of |
| perspectives~\cite{PaulEMcKenney2014ReadMostly}. |
| |
| \ppl{Mike}{Ash} posted~\cite{MikeAsh2015Apple} a description of an RCU-like |
| primitive in Apple's Objective-C runtime. |
| This approach identifies read-side critical sections via designated |
| code ranges, thus qualifying as another method of achieving |
| zero read-side overhead, albeit one that poses some interesting |
| practical challenges for large read-side critical sections that |
| span multiple functions. |
| |
| \ppl{Pedro}{Ramalhete} and \ppl{Andreia}{Correia}~\cite{PedroRmalhete2015PoorMansRCU} |
| produced ``Poor Man's RCU'', which, despite using a pair of |
| \IXhpl{reader-writer}{lock}, manages to provide \IXalt{lock-free}{lock free} |
| \IXpl{forward-progress guarantee} to |
| readers~\cite{PaulEMcKenney2015ReadMostly}. |
| |
| \ppl{Maya}{Arbel} and \ppl{Adam}{Morrison}~\cite{Arbel:2015:PRR:2858788.2688518} |
| produced ``Predicate RCU'', which works hard to reduce grace-period |
| duration in order to efficiently support algorithms that hold |
| update-side locks across grace periods. |
| This results in reduced batching of updates into grace periods |
| and reduced scalability, but does succeed in providing short |
| grace periods. |
| |
| \QuickQuiz{ |
| Why not just drop the lock before waiting for the grace |
| period, or using something like \co{call_rcu()} |
| instead of waiting for a grace period? |
| }\QuickQuizAnswer{ |
| The authors wished to support \IX{linearizable} tree |
| operations, so that concurrent additions to, deletions |
| from, and searches of the tree would appear to execute |
| in some globally agreed-upon order. |
| In their search trees, this requires holding locks |
| across grace periods. |
| (It is probably better to drop linearizability as a |
| requirement in most cases, but linearizability is a |
| surprisingly popular (and costly!\@) requirement.) |
| }\QuickQuizEnd |
| |
| \ppl{Alexander}{Matveev} (MIT), \ppl{Nir}{Shavit} (MIT and Tel-Aviv University), |
| \ppl{Pascal}{Felber} (University of Neuch\^{a}tel), and \ppl{Patrick}{Marlier} (also |
| University of Neuch\^{a}tel)~\cite{Matveev:2015:RLS:2815400.2815406} |
| produced an RCU-like mechanism that can be thought of as |
| software transactional memory that explicitly marks |
| read-only transactions. |
| Their use cases require holding locks across grace periods, which limits |
| scalability~\cite{PaulEMcKenney2015ReadMostly,PaulEMcKenney2015ReadMostlySidebar}. |
| This appears to be the first academic RCU-related work to |
| make good use of the \co{rcutorture} test suite, and also the |
| first to have submitted a performance improvement to Linux-kernel |
| RCU, which was accepted into v4.4. |
| |
| \ppl{Alexander}{Matveev}'s RLU was followed up by MV-RLU from |
| \ppl{Jaeho}{Kim} et al.~\cite{Kim:2019:MSR:3297858.3304040}. |
| This work improves scalability over RLU by permitting multiple concurrent |
| updates, by avoiding holding locks across grace periods, and by using |
| asynchronous grace periods, for example, \co{call_rcu()} instead of |
| \co{synchronize_rcu()}. |
| This paper also made some interesting performance-evaluation choices that |
| are discussed further in |
| \cref{sec:future:Deferred Reclamation} |
| on |
| \cpageref{sec:future:Deferred Reclamation}. |
| |
| \ppl{Adam}{Belay} et al.~created an RCU implementation that guards the |
| data structures used by TCP/IP's address-resolution protocol (ARP) |
| in their IX operating system~\cite{Belay:2016:IOS:3014162.2997641}. |
| |
| \ppl{Geoff}{Romer} and \ppl{Andrew}{Hunter} (both at Google) proposed |
| a cell-based API for RCU |
| protection of singleton data structures for inclusion in the |
| C++ standard~\cite{GeoffRomer2018C++DeferredReclamationP0561R4}. |
| |
| \ppl{Dimitrios}{Siakavaras} et al.~have applied |
| HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees}, |
| \ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color |
| graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring}, |
| and |
| \ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention |
| locking on \IXacr{numa} systems. |
| |
| \ppl{Alex}{Kogan} et al.~applied RCU to the construction of range locking |
| for scalable address spaces~\cite{AlexKogan2020RCUrangelocks}. |
| |
| On June 17, 2023, the ISO C++ Standards committee voted RCU into |
| C++26~\cite{PaulEMcKennney2023C++26RCU4}. |
| |
| Production uses of RCU are listed in |
| \cref{sec:defer:Production Uses of RCU}. |
| |
| \subsubsection{RCU Validation} |
| \label{sec:defer:RCU Validation} |
| |
| In early 2017, it is commonly recognized that almost any bug is a potential |
| security exploit, so validation and verification are first-class concerns. |
| |
| Researchers at Stony Brook University have produced an RCU-aware data-race |
| detector~\cite{AbhinavDuggal2010Masters,JustinSeyster2012PhD,Seyster:2011:RFA:2075416.2075425}. |
| \ppl{Alexey}{Gotsman} of IMDEA, \ppl{Noam}{Rinetzky} of Tel Aviv University, |
| and \ppl{Hongseok}{Yang} of the University of Oxford have published a |
| paper~\cite{AlexeyGotsman2012VerifyGraceExtended} expressing the formal |
| semantics of RCU in terms of separation logic, and have continued with |
| other aspects of concurrency. |
| |
| \ppl{Joseph}{Tassarotti} (Carnegie-Mellon University), \ppl{Derek}{Dreyer} (Max |
| Planck Institute for Software Systems), and \ppl{Viktor}{Vafeiadis} |
| (also MPI-SWS)~\cite{JosephTassarotti2015RCUproof} |
| produced a manual formal proof of correctness of the \IXacrf{qsbr} |
| variant of userspace |
| RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}. |
| \ppl{Lihao}{Liang} (University of Oxford), \pplmdl{Paul E.}{McKenney} (IBM), |
| \ppl{Daniel}{Kroening}, and \ppl{Tom}{Melham} |
| (both also Oxford)~\cite{LihaoLiang2016VerifyTreeRCU} |
| used the \IXacrf{cbmc}~\cite{EdmundClarke2004CBMC} |
| to produce a mechanical proof of correctness of a significant portion |
| of Linux-kernel Tree RCU\@. |
| \ppl{Lance}{Roy}~\cite{LanceRoy2017CBMC-SRCU} used CBMC to produce a similar |
| proof of correctness for a significant portion of Linux-kernel |
| sleepable RCU (SRCU)~\cite{PaulEMcKenney2006c}. |
| Finally, \ppl{Michalis}{Kokologiannakis} and \ppl{Konstantinos}{Sagonas} |
| (National Technical University of |
| Athens)~\cite{MichalisKokologiannakis2017NidhuggRCU,MichalisKokologiannakis2019RCUstatelessModelCheck} |
| used the Nighugg tool~\cite{CarlLeonardsson2014Nidhugg} |
| to produce a mechanical proof of correctness of a somewhat larger |
| portion of Linux-kernel Tree RCU\@. |
| |
| None of these efforts located any bugs other than bugs injected into |
| RCU specifically to test the verification tools. |
| In contrast, |
| \ppl{Alex}{Groce} (Oregon State University), \ppl{Iftekhar}{Ahmed}, |
| \ppl{Carlos}{Jensen} (both also OSU), and \pplmdl{Paul E.}{McKenney} |
| (IBM)~\cite{Groce:2015:VMC:2916135.2916190} |
| automatically mutated Linux-kernel RCU's source code to test the |
| coverage of the \co{rcutorture} test suite. |
| The effort found several holes in this suite's coverage, one of which |
| was hiding a real bug (since fixed) in Tiny RCU\@. |
| |
| With some luck, all of this validation work will eventually result in |
| more and better tools for validating concurrent code. |