% SMPdesign/SMPdesign.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}{qqzSMPdesign}
%
\Epigraph{Divide and rule.}{Philip II of Macedon}
This chapter describes how to design software to take advantage of
modern commodity multicore systems by using idioms, or
``design patterns''~\cite{Alexander79,GOF95,SchmidtStalRohnertBuschmann2000v2Textbook},
to balance performance, scalability, and response time.
Correctly partitioned problems lead to simple, scalable, and
high-performance solutions, while poorly partitioned problems result
in slow and complex solutions.
This chapter will help you design partitioning into your code, with
some discussion of batching and weakening as well.
The word ``design'' is very important:
You should partition first, batch second, weaken third, and code fourth.
Changing this order often leads to poor performance and scalability
along with great frustration.\footnote{
That other great dodge around the Laws of Physics, read-only
replication, is covered in \cref{chp:Deferred Processing}.}
This chapter will also look at some specific problems, including:
\begin{enumerate}
\item Constraints on the classic Dining Philosophers problem requiring
that all the philosophers be able to dine concurrently.
\label{sec:SMPdesign:Problems Dining Philosophers}
\item Lock-based double-ended queue implementations that provide
concurrency between operations on both ends of a given queue
when there are many elements in the queue, but still work
correctly when the queue contains only a few elements.
(Or, for that matter, no elements.)
\label{sec:SMPdesign:Problems Double-Ended Queue}
\item Summarizing the rough quality of a concurrent algorithm with only
a few numbers.
\label{sec:SMPdesign:Problems Quality Assessment}
\item Selecting the right granularity of partitioning.
\label{sec:SMPdesign:Problems Granularity}
\item Concurrent designs for applications that do not fully partition.
\label{sec:SMPdesign:Problems Parallel Fastpath}
\item Obtaining more than 2x speedup from two CPUs.
\label{sec:SMPdesign:Problems Maze}
\end{enumerate}
To this end, \cref{sec:SMPdesign:Partitioning Exercises}
presents partitioning exercises,
\cref{sec:SMPdesign:Design Criteria} reviews partitionability
design criteria,
\cref{sec:SMPdesign:Synchronization Granularity}
discusses synchronization granularity selection,
\cref{sec:SMPdesign:Parallel Fastpath}
overviews important parallel-fastpath design patterns
that provide speed and scalability on common-case fastpaths while using
simpler less-scalable ``slow path'' fallbacks for unusual situations,
and finally
\cref{sec:SMPdesign:Beyond Partitioning}
takes a brief look beyond partitioning.
\input{SMPdesign/partexercises}
\input{SMPdesign/criteria}
\section{Synchronization Granularity}
\label{sec:SMPdesign:Synchronization Granularity}
%
\epigraph{Doing little things well is a step toward doing big things better.}
{Harry F.~Banks}
\Cref{fig:SMPdesign:Design Patterns and Lock Granularity}
gives a pictorial view of different levels of synchronization granularity,
each of which is described in one of the following sections.
These sections focus primarily on locking, but similar granularity
issues arise with all forms of synchronization.
\begin{figure}
\centering
\resizebox{1.2in}{!}{\includegraphics{SMPdesign/LockGranularity}}
\caption{Design Patterns and Lock Granularity}
\label{fig:SMPdesign:Design Patterns and Lock Granularity}
\end{figure}
\subsection{Sequential Program}
\label{sec:SMPdesign:Sequential Program}
If the program runs fast enough on a single processor, and
has no interactions with other processes, threads, or interrupt
handlers, you should
remove the synchronization primitives and spare yourself their
overhead and complexity.
Some years back, there were those who would argue that \IXr{Moore's Law}
would eventually force all programs into this category.
However, as can be seen in
\cref{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs},
the exponential increase in single-threaded performance halted in
about 2003.
Therefore,
increasing performance will increasingly require parallelism.\footnote{
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS for
older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for taking this approach is that the newer CPUs'
ability to retire multiple instructions per clock is typically
limited by memory-system performance.}
Given that back in 2006 Paul typed the first version of this sentence
on a dual-core laptop, and further given that many of the graphs added
in 2020 were generated on a system with 56~hardware threads per socket,
parallelism is well and truly here.
It is also important to note that Ethernet bandwidth is continuing to
grow, as shown in
\cref{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance}.
This growth will continue to motivate multithreaded servers in order to
handle the communications load.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}}
\caption{MIPS/Clock-Frequency Trend for Intel CPUs}
\label{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/CPUvsEnet}}
\caption{Ethernet Bandwidth vs.\@ Intel x86 CPU Performance}
\label{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance}
\end{figure}
Please note that this does \emph{not} mean that you should code each
and every program in a multi-threaded manner.
Again, if a program runs quickly enough on a single processor,
spare yourself the overhead and complexity of SMP synchronization
primitives.
The simplicity of the hash-table lookup code in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}
underscores this point.\footnote{
The examples in this section are taken from Hart et
al.~\cite{ThomasEHart2006a}, adapted for clarity
by gathering related code from multiple files.}
A key point is that speedups due to parallelism are normally
limited to the number of CPUs.
In contrast, speedups due to sequential optimizations, for example,
careful choice of data structure, can be arbitrarily large.
\QuickQuiz{
What should you do to validate a hash table?
}\QuickQuizAnswer{
Quite a bit, actually.
See \cref{sec:datastruct:RCU-Protected Hash Table Validation}
for a good starting point.
}\QuickQuizEnd
On the other hand, if you are not in this happy situation, read on!
\begin{listing}
\begin{VerbatimL}[commandchars=\\\@\$]
struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;

	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			return (cur->key == key);
		}
		cur = cur->next;
	}
	return 0;
}
\end{VerbatimL}
\caption{Sequential-Program Hash Table Search}
\label{lst:SMPdesign:Sequential-Program Hash Table Search}
\end{listing}
% ./test_hash_null.exe 1000 0/100 1 1024 1
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418
\subsection{Code Locking}
\label{sec:SMPdesign:Code Locking}
\IXh{Code}{locking} is quite simple due to the fact that it uses only
global locks.\footnote{
If your program instead has locks in data structures,
or, in the case of Java, uses classes with synchronized
instances, you are instead using ``data locking'', described
in \cref{sec:SMPdesign:Data Locking}.}
It is especially
easy to retrofit an existing program to use code locking in
order to run it on a multiprocessor.
If the program has only a single shared resource, code locking
will even give optimal performance.
However, many of the larger and more complex programs
require much of the execution to
occur in \IXpl{critical section}, which in turn causes code locking
to sharply limit their scalability.
Therefore, you should use code locking on programs that spend
only a small fraction of their execution time in critical sections or
for which only modest scaling is required.
In addition, programs that primarily use the more scalable approaches
described in later sections often use code locking to handle rare error
cases or significant state transitions.
In these cases, code locking will provide a relatively simple
program that is very similar to its sequential counterpart,
as can be seen in
\cref{lst:SMPdesign:Code-Locking Hash Table Search}.
However, note that the simple return of the comparison in
\co{hash_search()} in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}
has now become three statements due to the need to release the
lock before returning.
\begin{listing}
\begin{fcvlabel}[ln:SMPdesign:Code-Locking Hash Table Search]
\begin{VerbatimL}[commandchars=\\\@\$]
spinlock_t hash_lock;

struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;
	int retval;

	spin_lock(&hash_lock);			\lnlbl@acq$
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			spin_unlock(&hash_lock);	\lnlbl@rel1$
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&hash_lock);		\lnlbl@rel2$
	return 0;
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Code-Locking Hash Table Search}
\label{lst:SMPdesign:Code-Locking Hash Table Search}
\end{listing}
\begin{fcvref}[ln:SMPdesign:Code-Locking Hash Table Search]
Note that the \co{hash_lock} acquisition and release statements on
\clnref{acq,rel1,rel2} are mediating ownership of the hash table among
the CPUs wishing to concurrently access that hash table.
Another way of looking at this is that \co{hash_lock} is partitioning
time, thus giving each requesting CPU its own partition of time during
which it owns this hash table.
In addition, in a well-designed algorithm, there should be ample partitions
of time during which no CPU owns this hash table.
\end{fcvref}
\QuickQuiz{
``Partitioning time''?
Isn't that an odd turn of phrase?
}\QuickQuizAnswer{
Perhaps so.
But in the next section we will be partitioning space (that is,
address space) as well as time.
This nomenclature will permit us to partition spacetime, as
opposed to (say) partitioning space but segmenting time.
}\QuickQuizEnd
Unfortunately, code locking is particularly prone to ``\IX{lock contention}'',
where multiple CPUs need to acquire the lock concurrently.
SMP programmers who have taken care of groups of small children
(or groups of older people who are acting like children) will immediately
recognize the danger of having only one of something,
as illustrated in \cref{fig:SMPdesign:Lock Contention}.
% ./test_hash_codelock.exe 1000 0/100 1 1024 1
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 164.115 max = 170.388 min = 161.659 std = 3.21857
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 181.17 max = 198.4 min = 162.459 std = 15.8585
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 167.651 max = 189.014 min = 162.144 std = 10.6819
% ./test_hash_codelock.exe 1000 0/100 1 1024 2
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 378.481 max = 385.971 min = 374.235 std = 4.05934
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 753.414 max = 1015.28 min = 377.734 std = 294.942
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 502.737 max = 980.924 min = 374.406 std = 239.383
One solution to this problem, named ``data locking'', is described
in the next section.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{cartoons/r-2014-Data-one-fighting}}
\caption{Lock Contention}
\ContributedBy{fig:SMPdesign:Lock Contention}{Melissa Broussard}
\end{figure}
\subsection{Data Locking}
\label{sec:SMPdesign:Data Locking}
Many data structures may be partitioned,
with each partition of the data structure having its own lock.
Then the \IXpl{critical section} for each part of the data structure
can execute in parallel,
although only one instance of the critical section for a given
part could be executing at a given time.
You should use data locking when contention must
be reduced, and where synchronization overhead is not
limiting speedups.
Data locking reduces contention by distributing the instances
of the overly-large critical section across multiple data structures,
for example, maintaining per-hash-bucket critical sections in a
hash table, as shown in
\cref{lst:SMPdesign:Data-Locking Hash Table Search}.
The increased scalability again results in a slight increase in complexity
in the form of an additional data structure, the \co{struct bucket}.
\begin{listing}
\begin{VerbatimL}
struct hash_table
{
	long nbuckets;
	struct bucket **buckets;
};

struct bucket {
	spinlock_t bucket_lock;
	node_t *list_head;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct bucket *bp;
	struct node *cur;
	int retval;

	bp = h->buckets[key % h->nbuckets];
	spin_lock(&bp->bucket_lock);
	cur = bp->list_head;
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			spin_unlock(&bp->bucket_lock);
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&bp->bucket_lock);
	return 0;
}
\end{VerbatimL}
\caption{Data-Locking Hash Table Search}
\label{lst:SMPdesign:Data-Locking Hash Table Search}
\end{listing}
In contrast with the contentious situation
shown in \cref{fig:SMPdesign:Lock Contention},
data locking helps promote harmony, as illustrated by
\cref{fig:SMPdesign:Data Locking}---and in parallel programs,
this \emph{almost} always translates into increased performance and
scalability.
For this reason, data locking was heavily used by Sequent in its
kernels~\cite{Beck85,Inman85,Garg90,Dove90,McKenney92b,McKenney92a,McKenney93}.
Another way of looking at this is to think of each \co{->bucket_lock}
as mediating ownership not of the entire hash table as was done for code
locking, but only for the bucket corresponding to that \co{->bucket_lock}.
Each lock still partitions time, but the per-bucket-locking technique
also partitions the address space, so that the overall technique can be
said to partition spacetime.
If the number of buckets is large enough, this partitioning of space
should with high probability permit a given CPU immediate access to
a given hash bucket.
\begin{figure}
\centering
\resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Data-many-happy}}
\caption{Data Locking}
\ContributedBy{fig:SMPdesign:Data Locking}{Melissa Broussard}
\end{figure}
% ./test_hash_spinlock.exe 1000 0/100 1 1024 1
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 158.118 max = 162.404 min = 156.199 std = 2.19391
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 157.717 max = 162.446 min = 156.415 std = 2.36662
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 158.369 max = 164.75 min = 156.501 std = 3.19454
% ./test_hash_spinlock.exe 1000 0/100 1 1024 2
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 223.426 max = 422.948 min = 167.858 std = 100.136
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 235.462 max = 507.134 min = 167.466 std = 135.836
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 305.807 max = 481.685 min = 167.939 std = 132.589
However, as those who have taken care of small children can again attest,
even providing enough to go around is no guarantee of tranquillity.
The analogous situation can arise in SMP programs.
For example, the Linux kernel maintains a cache of files and directories
(called ``dcache'').
Each entry in this cache has its own lock, but the entries corresponding
to the root directory and its direct descendants are much more likely to
be traversed than are more obscure entries.
This can result in many CPUs contending for the locks of these popular
entries, resulting in a situation not unlike that
shown in \cref{fig:SMPdesign:Data and Skew}.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{cartoons/r-2014-Data-many-fighting}}
\caption{Data Locking and Skew}
\ContributedBy{fig:SMPdesign:Data and Skew}{Melissa Broussard}
\end{figure}
In many cases, algorithms can be designed to reduce the incidence of
data skew, and in some cases eliminate it entirely
(for example, in the Linux kernel's
dcache~\cite{McKenney04a,JonathanCorbet2010dcacheRCU,NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk,NeilBrown2015PathnameSymlinks}).
Data locking is often used for partitionable data structures such as
hash tables, as well as in situations where multiple entities are each
represented by an instance of a given data structure.
The Linux-kernel task list is an example of the latter, each task
structure having its own \co{alloc_lock} and \co{pi_lock}.
A key challenge with data locking on dynamically allocated structures
is ensuring that the structure remains in existence while the lock is
being acquired~\cite{Gamsa99}.
The code in
\cref{lst:SMPdesign:Data-Locking Hash Table Search}
finesses this challenge by placing the locks in the statically allocated
hash buckets, which are never freed.
However, this trick would not work if the hash table were resizeable,
so that the locks were now dynamically allocated.
In this case, there would need to be some means to prevent the hash
bucket from being freed during the time that its lock was being acquired.
\QuickQuiz{
What are some ways of preventing a structure from being freed while
its lock is being acquired?
}\QuickQuizAnswer{
Here are a few possible solutions to this
\emph{\IX{existence guarantee}} problem:
\begin{enumerate}
\item Provide a statically allocated lock that is held while
the per-structure lock is being acquired, which is an
example of hierarchical locking (see
\cref{sec:SMPdesign:Hierarchical Locking}).
Of course, using a single global lock for this purpose
can result in unacceptably high levels of lock contention,
dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing
the structure's address to select the lock to be acquired,
as described in \cref{chp:Locking}.
Given a hash function of sufficiently high quality, this
avoids the scalability limitations of the single global
lock, but in read-mostly situations, the lock-acquisition
overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing
them, so that a structure cannot be deallocated while being
referenced.
This works very well, removing the existence-guarantee
burden (and much else besides) from the developer's
shoulders, but imposes the overhead of garbage collection
on the program.
Although garbage-collection technology has advanced
considerably in the past few decades, its overhead
may be unacceptably high for some applications.
In addition, some applications require that the developer
exercise more control over the layout and placement of
data structures than is permitted by most garbage collected
environments.
\item As a special case of a garbage collector, use a global
reference counter, or a global array of reference counters.
These have strengths and limitations similar to those
called out above for locks.
\item Use \emph{\IXpl{hazard pointer}}~\cite{MagedMichael04a}, which
can be thought of as an inside-out reference count.
Hazard-pointer-based algorithms maintain a per-thread list of
pointers, so that the appearance of a given pointer on
any of these lists acts as a reference to the corresponding
structure.
Hazard pointers are starting to see significant production use
(see \cref{sec:defer:Production Uses of Hazard Pointers}).
\item Use transactional memory
(TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95},
so that each reference and
modification to the data structure in question is
performed atomically.
Although TM has engendered much excitement in recent years,
and seems likely to be of some use in production software,
developers should exercise some
caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM},
particularly in performance-critical code.
In particular, existence guarantees require that the
transaction covers the full path from a global reference
to the data elements being updated.
For more on TM, including ways to overcome some of its
weaknesses by combining it with other synchronization
mechanisms, see
\cref{sec:future:Transactional Memory,sec:future:Hardware Transactional Memory}.
\item Use RCU, which can be thought of as an extremely lightweight
approximation to a garbage collector.
Updaters are not permitted to free RCU-protected
data structures that RCU readers might still be referencing.
RCU is most heavily used for read-mostly data structures,
and is discussed at length in
\cref{sec:defer:Read-Copy Update (RCU)}.
\end{enumerate}
For more on providing existence guarantees, see
\cref{chp:Locking,chp:Deferred Processing}.
}\QuickQuizEnd
\subsection{Data Ownership}
\label{sec:SMPdesign:Data Ownership}
Data ownership partitions a given data structure over the threads
or CPUs, so that each thread/CPU accesses its subset of the data
structure without any synchronization overhead whatsoever.
However, if one thread wishes to access some other thread's data,
the first thread is unable to do so directly.
Instead, the first thread must communicate with the second thread,
so that the second thread performs the operation on behalf of the
first, or, alternatively, migrates the data to the first thread.
Data ownership might seem arcane, but it is used very frequently:
\begin{enumerate}
\item Any variables accessible by only one CPU or thread
(such as {\tt auto} variables in C
and C++) are owned by that CPU or thread.
\item An instance of a user interface owns the corresponding
user's context.
It is very common for applications interacting with parallel
database engines to be written as if they were entirely
sequential programs.
Such applications own the user interface and the user's current
action.
Explicit parallelism is thus confined to the
database engine itself.
\item Parametric simulations are often trivially parallelized
by granting each thread ownership of a particular region
of the parameter space.
There are also computing frameworks designed for this
type of problem~\cite{BOINC2008}.
\end{enumerate}
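As a minimal illustration of the first item above (this sketch is not
taken from the book's code samples, and all of its names are
hypothetical), each thread can be given sole ownership of one element
of a statically allocated array, so that the owner's accesses need no
synchronization whatsoever:
\begin{VerbatimL}
#define NR_THREADS 4

struct owned_counter {
	unsigned long count;	/* Touched only by the owning thread. */
	char pad[56];		/* Crude padding to reduce false sharing. */
};

struct owned_counter counters[NR_THREADS];

/* Called only by thread my_id, so no locks or atomics are needed. */
void inc_my_count(int my_id)
{
	counters[my_id].count++;
}
\end{VerbatimL}
Any other thread wanting the value of a given counter must ask the
owning thread for it, for example, by enqueueing a request that the
owner services, in keeping with the data-ownership discipline
described above.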
If there is significant sharing, communication between the threads
or CPUs can result in significant complexity and overhead.
Furthermore, if the most-heavily used data happens to be that owned
by a single CPU, that CPU will be a ``\IX{hot spot}'', sometimes with
results resembling that shown in \cref{fig:SMPdesign:Data and Skew}.
However, in situations where no sharing is required, data ownership
achieves ideal performance, and with code that can be as simple
as the sequential-program case shown in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}.
Such situations are often referred to as ``\IX{embarrassingly
parallel}'', and, in the best case, resemble the situation
previously shown in \cref{fig:SMPdesign:Data Locking}.
% ./test_hash_null.exe 1000 0/100 1 1024 1
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418
% ./test_hash_null.exe 1000 0/100 1 1024 2
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 45.4526 max = 46.4281 min = 45.1954 std = 0.487791
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 46.0238 max = 49.2861 min = 45.1852 std = 1.63127
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 46.6858 max = 52.6278 min = 45.1761 std = 2.97102
Another important instance of data ownership occurs when the data
is read-only, in which case,
all threads can ``own'' it via replication.
Where data locking partitions both the address space (with one
hash bucket per partition) and time (using per-bucket locks), data
ownership partitions only the address space.
The reason that data ownership need not partition time is because a
given thread or CPU is assigned permanent ownership of a given
address-space partition.
\QuickQuiz{
But won't system boot and shutdown (or application startup
and shutdown) be partitioning time, even for data ownership?
}\QuickQuizAnswer{
You can indeed think in these terms.
And if you are working on a persistent data store where state
survives shutdown, thinking in these terms might even be useful.
}\QuickQuizEnd
Data ownership will be presented in more detail in
\cref{chp:Data Ownership}.
\subsection{Locking Granularity and Performance}
\label{sec:SMPdesign:Locking Granularity and Performance}
This section looks at locking granularity and performance from
a mathematical synchronization\-/efficiency viewpoint.
Readers who are uninspired by mathematics might choose to skip
this section.
The approach is to use a crude queueing model for the \IX{efficiency} of
synchronization mechanisms that operate on a single shared global
variable, based on an M/M/1 queue.
M/M/1 queueing models are based on an exponentially distributed
``inter-arrival rate'' $\lambda$ and an exponentially distributed
``service rate'' $\mu$.
The inter-arrival rate $\lambda$ can be thought of as the average
number of synchronization operations per second that the system
would process if the synchronization were free, in other words,
$\lambda$ is an inverse measure of the overhead of each non-synchronization
unit of work.
For example, if each unit of work was a transaction, and if each transaction
took one millisecond to process, excluding synchronization overhead,
then $\lambda$ would be 1,000~transactions per second.
The service rate $\mu$ is defined similarly, but for the average
number of synchronization operations per second that the system
would process if the overhead of each transaction was zero, and
ignoring the fact that CPUs must wait on each other to complete
their synchronization operations, in other words, $\mu$ can be roughly
thought of as an inverse measure of the per-operation synchronization
overhead in the absence of contention.
For example, suppose that each transaction's synchronization operation
involves an atomic increment instruction, and that a computer system is
able to do a private-variable atomic increment every 5~nanoseconds on
each CPU
(see \cref{fig:count:Atomic Increment Scalability on x86}).\footnote{
Of course, if there are 8 CPUs all incrementing the same
shared variable, then each CPU must wait at least 35~nanoseconds
for each of the other CPUs to do its increment before consuming
an additional 5~nanoseconds doing its own increment.
In fact, the wait will be longer due to the need
to move the variable from one CPU to another.}
The value of $\mu$ is therefore about 200,000,000 atomic increments
per second.
Of course, the value of $\lambda$ increases as increasing numbers of CPUs
increment a shared variable because each CPU is capable of processing
transactions independently (again, ignoring synchronization):
\begin{equation}
\lambda = n \lambda_0
\end{equation}
Here, $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing
capability of a single CPU\@.
Note that the expected time for a single CPU to execute a single transaction
in the absence of contention is $1 / \lambda_0$.
Because the CPUs have to ``wait in line'' behind each other to get their
chance to increment the single shared variable, we can use the M/M/1
queueing-model expression for the expected total waiting time:
\begin{equation}
T = \frac{1}{\mu - \lambda}
\end{equation}
Substituting the above value of $\lambda$:
\begin{equation}
T = \frac{1}{\mu - n \lambda_0}
\end{equation}
Now, the efficiency is just the ratio of the time required to process
a transaction in absence of synchronization ($1 / \lambda_0$)
to the time required including synchronization ($T + 1 / \lambda_0$):
\begin{equation}
e = \frac{1 / \lambda_0}{T + 1 / \lambda_0}
\end{equation}
Substituting the above value for $T$ and simplifying:
\begin{equation}
e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)}
\end{equation}
But the value of $\mu / \lambda_0$ is just the ratio of the time required
to process the transaction (absent synchronization overhead) to that of
the synchronization overhead itself (absent contention).
If we call this ratio $f$, we have:
\begin{equation}
e = \frac{f - n}{f - (n - 1)}
\end{equation}
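As a quick worked example of this formula, consider $f = 10$.
Five CPUs ($n = 5$) yield:
\begin{equation}
e = \frac{10 - 5}{10 - 4} = \frac{5}{6} \approx 0.83
\end{equation}
But $n = 9$ drops the efficiency to $1/2$, and $n = 10$ drives it all
the way to zero, the waiting time $T$ having become unbounded.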
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{SMPdesign/synceff}}
\caption{Synchronization Efficiency}
\label{fig:SMPdesign:Synchronization Efficiency}
\end{figure}
\Cref{fig:SMPdesign:Synchronization Efficiency} plots the synchronization
efficiency $e$ as a function of the number of CPUs/threads $n$ for
a few values of the overhead ratio $f$.
For example, again using the 5-nanosecond atomic increment, the $f=10$
line corresponds to each CPU attempting an atomic increment every
50~nanoseconds, and the $f=100$ line corresponds to each CPU attempting
an atomic increment every 500~nanoseconds, which in turn corresponds to
some hundreds (perhaps thousands) of instructions.
Given that each trace drops off sharply with increasing numbers of
CPUs or threads, we can conclude that
synchronization mechanisms based on
atomic manipulation of a single global shared variable will not
scale well if used heavily on current commodity hardware.
This is an abstract mathematical depiction of the forces leading
to the parallel counting algorithms that were discussed in
\cref{chp:Counting}.
Your real-world mileage may differ.
Nevertheless, the concept of efficiency is useful, even in cases
having little or no formal synchronization.
Consider for example a matrix multiply, in which the columns of one
matrix are multiplied (via ``dot product'') by the rows of another,
resulting in an entry in a third matrix.
Because none of these operations conflict, it is possible to partition
the columns of the first matrix among a group of threads, with each thread
computing the corresponding columns of the result matrix.
The threads can therefore operate entirely independently, with no
synchronization overhead whatsoever, as is done in
\path{matmul.c}.
One might therefore expect a perfect efficiency of 1.0.
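Before looking at the measured results, consider the following
simplified sketch of this approach.
It is an illustration only, not a transcription of \path{matmul.c},
whose details differ:
Each thread owns a contiguous range of columns of the result matrix,
so that, apart from thread creation and the final join, no
synchronization is needed.
\begin{VerbatimL}
#include <pthread.h>

#define N 1024
#define NTHREADS 8

static double a[N][N], b[N][N], c[N][N];

struct partition {
	int col_lo;	/* First result column owned by this thread. */
	int col_hi;	/* One past the last owned column. */
};

static void *matmul_worker(void *arg)
{
	struct partition *p = arg;
	int i, j, k;

	/* Each thread writes only its own columns of c: no locking. */
	for (j = p->col_lo; j < p->col_hi; j++)
		for (i = 0; i < N; i++) {
			double sum = 0.0;

			for (k = 0; k < N; k++)
				sum += a[i][k] * b[k][j];
			c[i][j] = sum;
		}
	return NULL;
}

void matmul(void)
{
	pthread_t tid[NTHREADS];
	struct partition part[NTHREADS];
	int t;

	for (t = 0; t < NTHREADS; t++) {
		part[t].col_lo = t * N / NTHREADS;
		part[t].col_hi = (t + 1) * N / NTHREADS;
		pthread_create(&tid[t], NULL, matmul_worker, &part[t]);
	}
	for (t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);
}
\end{VerbatimL}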
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/data/hps.2020.03.30a/matmuleff}}
\caption{Matrix Multiply Efficiency}
\label{fig:SMPdesign:Matrix Multiply Efficiency}
\end{figure}
However,
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.3, even when running
single-threaded, and drops sharply as more threads are added.\footnote{
In contrast to the smooth traces of
\cref{fig:SMPdesign:Synchronization Efficiency},
the wide error bars and jagged traces of
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
give evidence of its real-world nature.}
The 128-by-128 matrix does better, but still fails to demonstrate
much performance increase with added threads.
The 256-by-256 matrix does scale reasonably well, but only up to a handful
of CPUs.
The 512-by-512 matrix multiply's efficiency is measurably less
than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix
multiply deviates noticeably from perfection at a few tens of threads.
Nevertheless, this figure clearly demonstrates the performance and
scalability benefits of batching:
If you must incur synchronization overhead, you may as well get your
money's worth, which is the solution to the problem of deciding on
granularity of synchronization put forth on
\cpageref{sec:SMPdesign:Problems Granularity}.
\QuickQuiz{
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
have efficiency of exactly 1.0 when running on one thread?
}\QuickQuizAnswer{
The \path{matmul.c} program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
}\QuickQuizEnd
Given these inefficiencies,
it is worthwhile to look into more-scalable approaches
such as the data locking described in
\cref{sec:SMPdesign:Data Locking}
or the parallel-fastpath approach discussed in the next section.
\QuickQuizSeries{%
\QuickQuizB{
How are data-parallel techniques going to help with matrix
multiply?
It is \emph{already} data parallel!!!
}\QuickQuizAnswerB{
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the
size of the partitions.
For example, if you split a 64-by-64 matrix multiply across
64 threads, each thread gets only 64 floating-point multiplies.
The cost of a floating-point multiply is minuscule compared to
the overhead of thread creation, and cache-miss overhead
also plays a role in spoiling the theoretically perfect scalability
(and also in making the traces so jagged).
The full 448~hardware threads would require a matrix with
hundreds of thousands of rows and columns to attain good
scalability, but by that point GPGPUs become quite attractive,
especially from a price/performance viewpoint.
Moral:
If you have a parallel program with variable input,
always include a check for the input size being too small to
be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful
to incur the overhead required to spawn a thread, now is it?
}\QuickQuizEndB
%
\QuickQuizE{
What did you do to validate this matrix multiply algorithm?
}\QuickQuizAnswerE{
For this simple approach, very little.
However, the validation of production-quality matrix multiply
requires great care and attention.
Some cases require careful handling of floating-point rounding
errors, others involve complex sparse-matrix data structures,
and still others make use of special-purpose arithmetic hardware
such as vector units or GPGPUs.
Adequate tests for handling of floating-point rounding errors
can be especially challenging.
}\QuickQuizEndE
}
\section{Parallel Fastpath}
\label{sec:SMPdesign:Parallel Fastpath}
%
\epigraph{There are two ways of meeting difficulties:
You alter the difficulties, or you alter yourself to meet them.}
{Phyllis Bottome}
Fine-grained (and therefore \emph{usually} higher-performance)
designs are typically more complex than are coarser-grained designs.
In many cases, most of the overhead is incurred by a small fraction
of the code~\cite{Knuth73}.
So why not focus effort on that small fraction?
This is the idea behind the parallel-fastpath design pattern, to aggressively
parallelize the common-case code path without incurring the complexity
that would be required to aggressively parallelize the entire algorithm.
You must understand not only the specific algorithm you wish
to parallelize, but also the workload that the algorithm will
be subjected to.
Great creativity and design effort is often required to construct
a parallel fastpath.
Parallel fastpath combines different patterns (one for the
fastpath, one elsewhere) and is therefore a template pattern.
The following instances of parallel
fastpath occur often enough to warrant their own patterns,
as depicted in \cref{fig:SMPdesign:Parallel-Fastpath Design Patterns}:
\begin{figure}
\centering
\resizebox{2.3in}{!}{\includegraphics{SMPdesign/ParallelFastpath}}
% \includegraphics{SMPdesign/ParallelFastpath}
\caption{Parallel-Fastpath Design Patterns}
\label{fig:SMPdesign:Parallel-Fastpath Design Patterns}
\end{figure}
\begin{enumerate}
\item Reader/Writer Locking
(described below in \cref{sec:SMPdesign:Reader/Writer Locking}).
\item Read-copy update (RCU), which may be used as a high-performance
replacement for reader/writer locking, is introduced in
\cref{sec:defer:Read-Copy Update (RCU)}.
Other alternatives include hazard pointers
(\cref{sec:defer:Hazard Pointers})
and sequence locking (\cref{sec:defer:Sequence Locks}).
These alternatives will not be discussed further in this chapter.
\item Hierarchical Locking~(\cite{McKenney95b}), which is touched upon
in \cref{sec:SMPdesign:Hierarchical Locking}.
\item Resource Allocator Caches~(\cite{McKenney95b,McKenney93}).
See \cref{sec:SMPdesign:Resource Allocator Caches}
for more detail.
\end{enumerate}
\subsection{Reader/Writer Locking}
\label{sec:SMPdesign:Reader/Writer Locking}
If synchronization overhead is negligible (for example, if the program
uses coarse-grained parallelism with large \IXpl{critical section}), and if
only a small fraction of the critical sections modify data, then allowing
multiple readers to proceed in parallel can greatly increase scalability.
Writers exclude both readers and each other.
There are many implementations of reader-writer locking, including
the POSIX implementation described in
\cref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
\Cref{lst:SMPdesign:Reader-Writer-Locking Hash Table Search}
shows how the hash search might be implemented using reader-writer locking.
\begin{listing}
\begin{VerbatimL}[commandchars=\\\@\$]
rwlock_t hash_lock;

struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;
	int retval;

	read_lock(&hash_lock);
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			read_unlock(&hash_lock);
			return retval;
		}
		cur = cur->next;
	}
	read_unlock(&hash_lock);
	return 0;
}
\end{VerbatimL}
\caption{Reader-Writer-Locking Hash Table Search}
\label{lst:SMPdesign:Reader-Writer-Locking Hash Table Search}
\end{listing}
Reader/writer locking is a simple instance of asymmetric locking.
Snaman~\cite{Snaman87} describes a more ornate six-mode
asymmetric locking design used in several clustered systems.
Locking in general and reader-writer locking in particular are described
extensively in
\cref{chp:Locking}.
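\Cref{lst:SMPdesign:Reader-Writer-Locking Hash Table Search} shows
only the read path.
For completeness, a minimal sketch of a corresponding insertion is
shown below.
It is not taken from Hart et al.'s code, and it assumes
\co{write_lock()} and \co{write_unlock()} counterparts to the
read-side primitives, along with bucket lists sorted by key, as
implied by the search's greater-than-or-equal comparison:
\begin{VerbatimL}
int hash_insert(struct hash_table *h, node_t *np)
{
	struct node **cpp;

	write_lock(&hash_lock);
	cpp = &h->buckets[np->key % h->nbuckets];
	while (*cpp != NULL && (*cpp)->key < np->key)
		cpp = &(*cpp)->next;	/* Keep the list sorted. */
	if (*cpp != NULL && (*cpp)->key == np->key) {
		write_unlock(&hash_lock);
		return 0;	/* Key already present. */
	}
	np->next = *cpp;
	*cpp = np;
	write_unlock(&hash_lock);
	return 1;
}
\end{VerbatimL}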
\subsection{Hierarchical Locking}
\label{sec:SMPdesign:Hierarchical Locking}
The idea behind hierarchical locking is to have a coarse-grained lock
that is held only long enough to work out which fine-grained lock
to acquire.
\Cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
shows how our hash-table search might be adapted to do hierarchical
locking, but also shows the great weakness of this approach:
We have paid the overhead of acquiring a second lock, but we only
hold it for a short time.
In this case, the data-locking approach would be simpler
and likely perform better.
\begin{listing}
\begin{fcvlabel}[ln:SMPdesign:Hierarchical-Locking Hash Table Search]
\begin{VerbatimL}[commandchars=\\\@\$]
struct hash_table
{
	long nbuckets;
	struct bucket **buckets;
};

struct bucket {
	spinlock_t bucket_lock;
	node_t *list_head;
};

typedef struct node {
	spinlock_t node_lock;
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct bucket *bp;
	struct node *cur;
	int retval;

	bp = h->buckets[key % h->nbuckets];
	spin_lock(&bp->bucket_lock);
	cur = bp->list_head;
	while (cur != NULL) {
		if (cur->key >= key) {
			spin_lock(&cur->node_lock);
			spin_unlock(&bp->bucket_lock);
			retval = (cur->key == key);	\lnlbl@retval$
			spin_unlock(&cur->node_lock);
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&bp->bucket_lock);
	return 0;
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Hierarchical-Locking Hash Table Search}
\label{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
\end{listing}
\QuickQuiz{
In what situation would hierarchical locking work well?
}\QuickQuizAnswer{
If the comparison on
\clnrefr{ln:SMPdesign:Hierarchical-Locking Hash Table Search:retval} of
\cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
were replaced by a much heavier-weight operation,
then releasing \co{bp->bucket_lock} \emph{might} reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of \co{cur->node_lock}.
}\QuickQuizEnd
\subsection{Resource Allocator Caches}
\label{sec:SMPdesign:Resource Allocator Caches}
This section presents a simplified schematic of a parallel fixed-block-size
memory allocator.
More detailed descriptions may be found in the
literature~\cite{McKenney92a,McKenney93,Bonwick01slab,McKenney01e,JasonEvans2011jemalloc,ChrisKennelly2020tcmalloc}
or in the Linux kernel~\cite{Torvalds2.6kernel}.
\subsubsection{Parallel Resource Allocation Problem}
The basic problem facing a parallel memory allocator is the tension
between the need to provide extremely fast memory allocation and
freeing in the common case and the need to efficiently distribute
memory in the face of unfavorable allocation and freeing patterns.
To see this tension, consider a straightforward application of
data ownership to this problem---simply carve up memory so that
each CPU owns its share.
For example, suppose that a system with 12~CPUs has 64~gigabytes
of memory, such as the laptop I am using right now.
We could simply assign each CPU a five-gigabyte region of memory,
and allow each CPU to allocate from its own region, without the need
for locking and its complexities and overheads.
Unfortunately, this scheme fails when CPU~0 only allocates memory and
CPU~1 only frees it, as happens in simple producer-consumer workloads.
The other extreme, \IXh{code}{locking}, suffers from excessive
\IX{lock contention}
and overhead~\cite{McKenney93}.
\subsubsection{Parallel Fastpath for Resource Allocation}
\label{sec:SMPdesign:Parallel Fastpath for Resource Allocation}
The commonly used solution uses parallel fastpath with each CPU
owning a modest cache of blocks, and with a large code-locked
shared pool for additional blocks.
To prevent any given CPU from monopolizing the memory blocks,
we place a limit on the number of blocks that can be in each CPU's
cache.
In a two-CPU system, the flow of memory blocks will be as shown
in \cref{fig:SMPdesign:Allocator Cache Schematic}:
When a given CPU is trying to free a block when its pool is full,
it sends blocks to the global pool, and, similarly, when that CPU
is trying to allocate a block when its pool is empty, it retrieves
blocks from the global pool.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/allocatorcache}}
\caption{Allocator Cache Schematic}
\label{fig:SMPdesign:Allocator Cache Schematic}
\end{figure}
\subsubsection{Data Structures}
The actual data structures for a ``toy'' implementation of allocator
caches are shown in
\cref{lst:SMPdesign:Allocator-Cache Data Structures}
(``\path{smpalloc.c}'').
The ``Global Pool'' of \cref{fig:SMPdesign:Allocator Cache Schematic}
is implemented by \co{globalmem} of type \co{struct globalmempool},
and the two CPU pools by the per-thread variable \co{perthreadmem} of
type \co{struct perthreadmempool}.
Both of these data structures have arrays of pointers to blocks
in their \co{pool} fields, which are filled from index zero upwards.
Thus, if \co{globalmem.pool[3]} is \co{NULL}, then the remainder of
the array from index 4 up must also be \co{NULL}.
The \co{cur} fields contain the index of the highest-numbered full
element of the \co{pool} array, or $-1$ if all elements are empty.
All elements from \co{globalmem.pool[0]} through
\co{globalmem.pool[globalmem.cur]} must be full, and all the rest
must be empty.\footnote{
Both pool sizes (\co{TARGET_POOL_SIZE} and
\co{GLOBAL_POOL_SIZE}) are unrealistically small, but this small
size makes it easier to single-step the program in order to get
a feel for its operation.}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@data_struct.fcv}
\caption{Allocator-Cache Data Structures}
\label{lst:SMPdesign:Allocator-Cache Data Structures}
\end{listing}
The operation of the pool data structures is illustrated by
\cref{fig:SMPdesign:Allocator Pool Schematic},
with the six boxes representing the array of pointers making up
the \co{pool} field, and the number preceding them representing
the \co{cur} field.
The shaded boxes represent non-\co{NULL} pointers, while the empty
boxes represent \co{NULL} pointers.
An important, though potentially confusing, invariant of this
data structure is that the \co{cur} field is always one
smaller than the number of non-\co{NULL} pointers.
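This invariant can also be stated in code, as in the following
illustrative check, which is not part of \path{smpalloc.c} and which
must be invoked with the global pool's lock held when running
multithreaded:
\begin{VerbatimL}
#include <assert.h>

/* Illustrative only: verify the global pool's invariants. */
void check_globalmem_invariant(void)
{
	int i;
	int nfull = 0;

	for (i = 0; i < GLOBAL_POOL_SIZE; i++)
		if (globalmem.pool[i] != NULL)
			nfull++;
	/* cur is one smaller than the number of non-NULL pointers... */
	assert(globalmem.cur == nfull - 1);
	/* ...and those pointers are packed at the low-index end. */
	for (i = 0; i <= globalmem.cur; i++)
		assert(globalmem.pool[i] != NULL);
	for (i = globalmem.cur + 1; i < GLOBAL_POOL_SIZE; i++)
		assert(globalmem.pool[i] == NULL);
}
\end{VerbatimL}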
\begin{figure}
\centering
\resizebox{2.6in}{!}{\includegraphics{SMPdesign/AllocatorPool}}
\caption{Allocator Pool Schematic}
\label{fig:SMPdesign:Allocator Pool Schematic}
\end{figure}
\subsubsection{Allocation Function}
\begin{fcvref}[ln:SMPdesign:smpalloc:alloc]
The allocation function \co{memblock_alloc()} may be seen in
\cref{lst:SMPdesign:Allocator-Cache Allocator Function}.
\Clnref{pick} picks up the current thread's per-thread pool,
and \clnref{chk:empty} checks to see if it is empty.
If so, \clnrefrange{ack}{rel} attempt to refill it
from the global pool
under the spinlock acquired on \clnref{ack} and released on \clnref{rel}.
\Clnrefrange{loop:b}{loop:e} move blocks from the global
to the per-thread pool until
either the local pool reaches its target size (half full) or
the global pool is exhausted, and \clnref{set} sets the per-thread pool's
count to the proper value.
In either case, \clnref{chk:notempty} checks for the per-thread
pool still being
empty, and if not, \clnrefrange{rem:b}{rem:e} remove a block and return it.
Otherwise, \clnref{ret:NULL} tells the sad tale of memory exhaustion.
\end{fcvref}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@alloc.fcv}
\caption{Allocator-Cache Allocator Function}
\label{lst:SMPdesign:Allocator-Cache Allocator Function}
\end{listing}
\subsubsection{Free Function}
\begin{fcvref}[ln:SMPdesign:smpalloc:free]
\Cref{lst:SMPdesign:Allocator-Cache Free Function} shows
the memory-block free function.
\Clnref{get} gets a pointer to this thread's pool, and
\clnref{chk:full} checks to see if this per-thread pool is full.
If so, \clnrefrange{acq}{empty:e} empty half of the per-thread pool
into the global pool,
with \clnref{acq,rel} acquiring and releasing the spinlock.
\Clnrefrange{loop:b}{loop:e} implement the loop moving blocks
from the local to the
global pool, and \clnref{set} sets the per-thread pool's count to the proper
value.
In either case, \clnref{place} then places the newly freed block into the
per-thread pool.
\end{fcvref}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@free.fcv}
\caption{Allocator-Cache Free Function}
\label{lst:SMPdesign:Allocator-Cache Free Function}
\end{listing}
\QuickQuiz{
Doesn't this resource-allocator design resemble that of
the approximate limit counters covered in
\cref{sec:count:Approximate Limit Counters}?
}\QuickQuizAnswer{
Indeed it does!
We are used to thinking of allocating and freeing memory,
but the algorithms in
\cref{sec:count:Approximate Limit Counters}
are taking very similar actions to allocate and free
``count''.
}\QuickQuizEnd
\subsubsection{Performance}
Rough performance results\footnote{
This data was not collected in a statistically meaningful way,
and therefore should be viewed with great skepticism and suspicion.
Good data-collection and -reduction practice is discussed
in \cref{chp:Validation}.
That said, repeated runs gave similar results, and these results
match more careful evaluations of similar algorithms.}
are shown in
\cref{fig:SMPdesign:Allocator Cache Performance},
running on a dual-core Intel x86 running at 1\,GHz (4300 bogomips per CPU)
with at most six blocks allowed in each CPU's cache.
In this micro-benchmark,
each thread repeatedly allocates a group of blocks and then frees all
the blocks in that group, with
the number of blocks in the group being the ``allocation run length''
displayed on the x-axis.
The y-axis shows the number of successful allocation/free pairs per
microsecond---failed allocations are not counted.
The ``X''s are from a two-thread run, while the ``+''s are from a
single-threaded run.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/smpalloc}}
\caption{Allocator Cache Performance}
\label{fig:SMPdesign:Allocator Cache Performance}
\end{figure}
Note that run lengths up to six scale linearly and give excellent performance,
while run lengths greater than six show poor performance and almost always
also show \emph{negative} scaling.
It is therefore quite important to size \co{TARGET_POOL_SIZE}
sufficiently large,
which fortunately is usually quite easy to do in actual
practice~\cite{McKenney01e}, especially given today's large memories.
For example, in most systems, it is quite reasonable to set
\co{TARGET_POOL_SIZE} to 100, in which case allocations and frees
are guaranteed to be confined to per-thread pools at least 99\pct\ of
the time.
As can be seen from the figure, the situations where the common-case
data-ownership applies (run lengths up to six) provide greatly improved
performance compared to the cases where locks must be acquired.
Avoiding synchronization in the common case will be a recurring theme through
this book.
\QuickQuizSeries{%
\QuickQuizB{
In \cref{fig:SMPdesign:Allocator Cache Performance},
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
Why?
}\QuickQuizAnswerB{
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
}\QuickQuizEndB
%
\QuickQuizE{
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-thread target
pool size $s$ of three, number of threads $n$ equal to two,
and assuming that the per-thread pools are initially
empty with none of the memory in use, what is the smallest allocation
run length $m$ at which failures can occur?
(Recall that each thread repeatedly allocates $m$ blocks of memory,
and then frees the $m$ blocks of memory.)
Alternatively, given $n$ threads each with pool size $s$, and
where each thread repeatedly first allocates $m$ blocks of memory
and then frees those $m$ blocks, how large must the global pool
size be?
\emph{Note:}
Obtaining the correct answer will require you to
examine the \path{smpalloc.c} source code, and very likely
single-step it as well.
You have been warned!
}\QuickQuizAnswerE{
This solution is adapted from one put forward by Alexey Roytman.
It is based on the following definitions:
\begin{description}
\item[$g$] Number of blocks globally available.
\item[$i$] Number of blocks left in the initializing thread's
per-thread pool.
(This is one reason you needed to look at the code!)
\item[$m$] Allocation/free run length.
\item[$n$] Number of threads, excluding the initialization thread.
\item[$p$] Per-thread maximum block consumption, including
both the blocks actually allocated and the blocks
remaining in the per-thread pool.
\end{description}
The values $g$, $m$, and $n$ are given.
The value for $p$ is $m$ rounded up to the next multiple of $s$,
as follows:
\begin{equation}
p = s \left \lceil \frac{m}{s} \right \rceil
\label{sec:SMPdesign:p}
\end{equation}
The value for $i$ is as follows:
\begin{equation}
i = \left \{
\begin{array}{l}
g \pmod{2 s} = 0: 2 s \\
g \pmod{2 s} \ne 0: g \pmod{2 s}
\end{array}
\right .
\label{sec:SMPdesign:i}
\end{equation}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/smpalloclim}}
\caption{Allocator Cache Run-Length Analysis}
\label{fig:SMPdesign:Allocator Cache Run-Length Analysis}
\end{figure}
The relationships between these quantities are shown in
\cref{fig:SMPdesign:Allocator Cache Run-Length Analysis}.
The global pool is shown on the top of this figure, and
the ``extra'' initializer thread's per-thread pool and
per-thread allocations are the left-most pair of boxes.
The initializer thread has no blocks allocated, but has
$i$ blocks stranded in its per-thread pool.
The rightmost two pairs of boxes are the per-thread pools and
per-thread allocations of threads holding the maximum possible
number of blocks, while the second-from-left pair of boxes
represents the thread currently trying to allocate.
The total number of blocks is $g$, and adding up the per-thread
allocations and per-thread pools, we see that the global pool
contains $g-i-p(n-1)$ blocks.
If the allocating thread is to be successful, it needs at least
$m$ blocks in the global pool, in other words:
\begin{equation}
g - i - p(n - 1) \ge m
\label{sec:SMPdesign:g-vs-m}
\end{equation}
The question has $g=40$, $s=3$, and $n=2$.
\Cref{sec:SMPdesign:i} gives $i=4$, and
\cref{sec:SMPdesign:p} gives $p=18$ for $m=18$
and $p=21$ for $m=19$.
Plugging these into \cref{sec:SMPdesign:g-vs-m}
shows that $m=18$ will not overflow, but that $m=19$ might
well do so.
The presence of $i$ could be considered to be a bug.
After all, why allocate memory only to have it stranded in
the initialization thread's cache?
One way of fixing this would be to provide a \co{memblock_flush()}
function that flushed the current thread's pool into the
global pool.
The initialization thread could then invoke this function
after freeing all of the blocks.
}\QuickQuizEndE
}
\subsubsection{Validation}
Validation of this simple allocator spawns a specified number of threads,
with each thread repeatedly allocating a specified number of memory
blocks and then deallocating them.
This simple regimen suffices to exercise both the per-thread caches
and the global pool, as can be seen in
\cref{fig:SMPdesign:Allocator Cache Performance}.
Much more aggressive validation is required for memory allocators that
are to be used in production.
The test suites for tcmalloc~\cite{ChrisKennelly2020tcmalloc} and
jemalloc~\cite{JasonEvans2011jemalloc} are instructive, as are the
tests for the Linux kernel's memory allocator.
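A hedged sketch of the per-thread test loop follows.
It is illustrative only:
The actual test code lives in \path{smpalloc.c} and differs in detail,
and both the \co{memblock_free()} name and the \co{MAX_RUN_LENGTH}
bound are assumptions rather than quotations from that file.
\begin{VerbatimL}
#define MAX_RUN_LENGTH 64	/* Hypothetical bound on the run length. */

void *memblock_test(void *arg)
{
	void *blocks[MAX_RUN_LENGTH];
	int runlength = *(int *)arg;
	int i;

	for (;;) {
		for (i = 0; i < runlength; i++)
			blocks[i] = memblock_alloc();
		for (i = 0; i < runlength; i++)
			if (blocks[i] != NULL)	/* Real tests count failures. */
				memblock_free(blocks[i]);
	}
	return NULL;	/* Not reached. */
}
\end{VerbatimL}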
\subsubsection{Real-World Design}
The toy parallel resource allocator was quite simple, but real-world
designs expand on this approach in a number of ways.
First, real-world allocators are required to handle a wide range
of allocation sizes, as opposed to the single size shown in this
toy example.
One popular way to do this is to offer a fixed set of sizes, spaced
so as to balance external and internal \IX{fragmentation}, such as in
the late-1980s BSD memory allocator~\cite{McKusick88}.
Doing this would mean that the ``globalmem'' variable would need
to be replicated on a per-size basis, and that the associated
lock would similarly be replicated, resulting in \IXh{data}{locking}
rather than the toy program's code locking.
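A hedged sketch of this per-size replication appears below.
The size-class policy, the structure layout, and all of the names are
illustrative assumptions rather than the layout used by any production
allocator (or by \path{smpalloc.c}):
\begin{VerbatimL}
#define NR_SIZE_CLASSES 8	/* For example, 16-byte through 2048-byte blocks. */

struct sized_globalpool {
	spinlock_t mutex;	/* Data locking: covers this size class only. */
	int cur;		/* As in the toy allocator's global pool. */
	void *pool[GLOBAL_POOL_SIZE];
};

struct sized_globalpool sized_globalmem[NR_SIZE_CLASSES];

/* Map a request size to its (power-of-two) size class. */
int size_to_class(unsigned long size)
{
	int c = 0;
	unsigned long classsize = 16;

	while (classsize < size && c < NR_SIZE_CLASSES - 1) {
		classsize <<= 1;
		c++;
	}
	return c;
}
\end{VerbatimL}
Allocation and freeing would then index \co{sized_globalmem} with the
result of \co{size_to_class()}, acquiring only that class's
\co{mutex}, which is precisely the data-locking pattern of
\cref{sec:SMPdesign:Data Locking}.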
Second, production-quality systems must be able to repurpose memory,
meaning that they must be able to coalesce blocks into larger structures,
such as pages~\cite{McKenney93}.
This coalescing will also need to be protected by a lock, which again
could be replicated on a per-size basis.
Third, coalesced memory must be returned to the underlying memory
system, and pages of memory must also be allocated from the underlying
memory system.
The locking required at this level will depend on that of the underlying
memory system, but could well be code locking.
Code locking can often be tolerated at this level, because this
level is so infrequently reached in well-designed systems~\cite{McKenney01e}.
Concurrent userspace allocators face similar
challenges~\cite{ChrisKennelly2020tcmalloc,JasonEvans2011jemalloc}.
Despite this real-world design's greater complexity, the underlying
idea is the same---repeated application of parallel fastpath,
as shown in
\cref{fig:app:questions:Schematic of Real-World Parallel Allocator}.
And ``parallel fastpath'' is one of the solutions to the non-partitionable
application problem put forth on
\cpageref{sec:SMPdesign:Problems Parallel Fastpath}.
\begin{table}
\rowcolors{1}{}{lightgray}
\small
\centering
\renewcommand*{\arraystretch}{1.25}
\begin{tabularx}{\twocolumnwidth}{ll>{\raggedright\arraybackslash}X}
\toprule
Level & Locking & Purpose \\
\midrule
Per-thread pool & Data ownership & High-speed allocation \\
Global block pool & Data locking & Distributing blocks among threads \\
Coalescing & Data locking & Combining blocks into pages \\
System memory & Code locking & Memory from/to system \\
\bottomrule
\end{tabularx}
\caption{Schematic of Real-World Parallel Allocator}
\label{fig:app:questions:Schematic of Real-World Parallel Allocator}
\end{table}
\input{SMPdesign/beyond}
\QuickQuizAnswersChp{qqzSMPdesign}