% SMPdesign/SMPdesign.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}{qqzSMPdesign}
%
\Epigraph{Divide and rule.}{Philip II of Macedon}
This chapter describes how to design software to take advantage of
modern commodity multicore systems by using idioms, or
``design patterns''~\cite{Alexander79,GOF95,SchmidtStalRohnertBuschmann2000v2Textbook},
to balance performance, scalability, and response time.
Correctly partitioned problems lead to simple, scalable, and
high-performance solutions, while poorly partitioned problems result
in slow and complex solutions.
This chapter will help you design partitioning into your code, with
some discussion of batching and weakening as well.
The word ``design'' is very important:
You should partition first, batch second, weaken third, and code fourth.
Changing this order often leads to poor performance and scalability
along with great frustration.\footnote{
That other great dodge around the Laws of Physics, read-only
replication, is covered in \cref{chp:Deferred Processing}.}
This chapter will also look at some specific problems, including:
\begin{enumerate}
\item Constraints on the classic Dining Philosophers problem requiring
that all the philosophers be able to dine concurrently.
\label{sec:SMPdesign:Problems Dining Philosophers}
\item Lock-based double-ended queue implementations that provide
concurrency between operations on both ends of a given queue
when there are many elements in the queue, but still work
correctly when the queue contains only a few elements.
(Or, for that matter, no elements.)
\label{sec:SMPdesign:Problems Double-Ended Queue}
\item Summarizing the rough quality of a concurrent algorithm with only
a few numbers.
\label{sec:SMPdesign:Problems Quality Assessment}
\item Selecting the right granularity of partitioning.
\label{sec:SMPdesign:Problems Granularity}
\item Concurrent designs for applications that do not fully partition.
\label{sec:SMPdesign:Problems Parallel Fastpath}
\item Obtaining more than 2x speedup from two CPUs.
\label{sec:SMPdesign:Problems Maze}
\end{enumerate}
To this end, \cref{sec:SMPdesign:Partitioning Exercises}
presents partitioning exercises,
\cref{sec:SMPdesign:Design Criteria} reviews partitionability
design criteria,
\cref{sec:SMPdesign:Synchronization Granularity}
discusses synchronization granularity selection,
\cref{sec:SMPdesign:Parallel Fastpath}
overviews important parallel-fastpath design patterns
that provide speed and scalability on common-case fastpaths while using
simpler less-scalable ``slow path'' fallbacks for unusual situations,
and finally
\cref{sec:SMPdesign:Beyond Partitioning}
takes a brief look beyond partitioning.
\input{SMPdesign/partexercises}
\input{SMPdesign/criteria}
\section{Synchronization Granularity}
\label{sec:SMPdesign:Synchronization Granularity}
%
\epigraph{Doing little things well is a step toward doing big things better.}
{Harry F.~Banks}
\Cref{fig:SMPdesign:Design Patterns and Lock Granularity}
gives a pictorial view of different levels of synchronization granularity,
each of which is described in one of the following sections.
These sections focus primarily on locking, but similar granularity
issues arise with all forms of synchronization.
\begin{figure}
\centering
\resizebox{1.2in}{!}{\includegraphics{SMPdesign/LockGranularity}}
\caption{Design Patterns and Lock Granularity}
\label{fig:SMPdesign:Design Patterns and Lock Granularity}
\end{figure}
\subsection{Sequential Program}
\label{sec:SMPdesign:Sequential Program}
If the program runs fast enough on a single processor, and
has no interactions with other processes, threads, or interrupt
handlers, you should
remove the synchronization primitives and spare yourself their
overhead and complexity.
Some years back, there were those who would argue that \IXr{Moore's Law}
would eventually force all programs into this category.
However, as can be seen in
\cref{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs},
the exponential increase in single-threaded performance halted in
about 2003.
Therefore,
increasing performance will increasingly require parallelism.\footnote{
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS for
older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for taking this approach is that the newer CPUs'
ability to retire multiple instructions per clock is typically
limited by memory-system performance.}
Given that back in 2006 Paul typed the first version of this sentence
on a dual-core laptop, and further given that many of the graphs added
in 2020 were generated on a system with 56~hardware threads per socket,
parallelism is well and truly here.
It is also important to note that Ethernet bandwidth is continuing to
grow, as shown in
\cref{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance}.
This growth will continue to motivate multithreaded servers in order to
handle the communications load.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}}
\caption{MIPS/Clock-Frequency Trend for Intel CPUs}
\label{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/CPUvsEnet}}
\caption{Ethernet Bandwidth vs.\@ Intel x86 CPU Performance}
\label{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance}
\end{figure}
Please note that this does \emph{not} mean that you should code each
and every program in a multi-threaded manner.
Again, if a program runs quickly enough on a single processor,
spare yourself the overhead and complexity of SMP synchronization
primitives.
The simplicity of the hash-table lookup code in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}
underscores this point.\footnote{
The examples in this section are taken from Hart et
al.~\cite{ThomasEHart2006a}, adapted for clarity
by gathering related code from multiple files.}
A key point is that speedups due to parallelism are normally
limited to the number of CPUs.
In contrast, speedups due to sequential optimizations, for example,
careful choice of data structure, can be arbitrarily large.
\QuickQuiz{
What should you do to validate a hash table?
}\QuickQuizAnswer{
Quite a bit, actually.
See \cref{sec:datastruct:RCU-Protected Hash Table Validation}
for a good starting point.
}\QuickQuizEnd
On the other hand, if you are not in this happy situation, read on!
\begin{listing}
\begin{VerbatimL}[commandchars=\\\@\$]
struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;

	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			return (cur->key == key);
		}
		cur = cur->next;
	}
	return 0;
}
\end{VerbatimL}
\caption{Sequential-Program Hash Table Search}
\label{lst:SMPdesign:Sequential-Program Hash Table Search}
\end{listing}
% ./test_hash_null.exe 1000 0/100 1 1024 1
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418
\subsection{Code Locking}
\label{sec:SMPdesign:Code Locking}
\IXh{Code}{locking} is quite simple due to the fact that it uses only
global locks.\footnote{
If your program instead has locks in data structures,
or, in the case of Java, uses classes with synchronized
instances, you are instead using ``data locking'', described
in \cref{sec:SMPdesign:Data Locking}.}
It is especially
easy to retrofit an existing program to use code locking in
order to run it on a multiprocessor.
If the program has only a single shared resource, code locking
will even give optimal performance.
However, many of the larger and more complex programs
require much of the execution to
occur in \IXpl{critical section}, which in turn causes code locking
to sharply limit their scalability.
Therefore, you should use code locking on programs that spend
only a small fraction of their execution time in critical sections or
for which only modest scaling is required.
In addition, programs that primarily use the more scalable approaches
described in later sections often use code locking to handle rare error
cases or significant state transitions.
In these cases, code locking will provide a relatively simple
program that is very similar to its sequential counterpart,
as can be seen in
\cref{lst:SMPdesign:Code-Locking Hash Table Search}.
However, note that the simple return of the comparison in
\co{hash_search()} in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}
has now become three statements due to the need to release the
lock before returning.
\begin{listing}
\begin{fcvlabel}[ln:SMPdesign:Code-Locking Hash Table Search]
\begin{VerbatimL}[commandchars=\\\@\$]
spinlock_t hash_lock;

struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;
	int retval;

	spin_lock(&hash_lock);			\lnlbl@acq$
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			spin_unlock(&hash_lock);	\lnlbl@rel1$
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&hash_lock);		\lnlbl@rel2$
	return 0;
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Code-Locking Hash Table Search}
\label{lst:SMPdesign:Code-Locking Hash Table Search}
\end{listing}
\begin{fcvref}[ln:SMPdesign:Code-Locking Hash Table Search]
Note that the \co{hash_lock} acquisition and release statements on
\clnref{acq,rel1,rel2} are mediating ownership of the hash table among
the CPUs wishing to concurrently access that hash table.
Another way of looking at this is that \co{hash_lock} is partitioning
time, thus giving each requesting CPU its own partition of time during
which it owns this hash table.
In addition, in a well-designed algorithm, there should be ample partitions
of time during which no CPU owns this hash table.
\end{fcvref}
\QuickQuiz{
``Partitioning time''?
Isn't that an odd turn of phrase?
}\QuickQuizAnswer{
Perhaps so.
But in the next section we will be partitioning space (that is,
address space) as well as time.
This nomenclature will permit us to partition spacetime, as
opposed to (say) partitioning space but segmenting time.
}\QuickQuizEnd
Unfortunately, code locking is particularly prone to ``\IX{lock contention}'',
where multiple CPUs need to acquire the lock concurrently.
SMP programmers who have taken care of groups of small children
(or groups of older people who are acting like children) will immediately
recognize the danger of having only one of something,
as illustrated in \cref{fig:SMPdesign:Lock Contention}.
% ./test_hash_codelock.exe 1000 0/100 1 1024 1
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 164.115 max = 170.388 min = 161.659 std = 3.21857
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 181.17 max = 198.4 min = 162.459 std = 15.8585
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_codelock.exe: avg = 167.651 max = 189.014 min = 162.144 std = 10.6819
% ./test_hash_codelock.exe 1000 0/100 1 1024 2
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 378.481 max = 385.971 min = 374.235 std = 4.05934
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 753.414 max = 1015.28 min = 377.734 std = 294.942
% ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_codelock.exe: avg = 502.737 max = 980.924 min = 374.406 std = 239.383
One solution to this problem, named ``data locking'', is described
in the next section.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{cartoons/r-2014-Data-one-fighting}}
\caption{Lock Contention}
\ContributedBy{fig:SMPdesign:Lock Contention}{Melissa Broussard}
\end{figure}
\subsection{Data Locking}
\label{sec:SMPdesign:Data Locking}
Many data structures may be partitioned,
with each partition of the data structure having its own lock.
Then the \IXpl{critical section} for each part of the data structure
can execute in parallel,
although only one instance of the critical section for a given
part could be executing at a given time.
You should use data locking when contention must
be reduced, and where synchronization overhead is not
limiting speedups.
Data locking reduces contention by distributing the instances
of the overly-large critical section across multiple data structures,
for example, maintaining per-hash-bucket critical sections in a
hash table, as shown in
\cref{lst:SMPdesign:Data-Locking Hash Table Search}.
The increased scalability again results in a slight increase in complexity
in the form of an additional data structure, the \co{struct bucket}.
\begin{listing}
\begin{VerbatimL}
struct hash_table
{
	long nbuckets;
	struct bucket **buckets;
};

struct bucket {
	spinlock_t bucket_lock;
	node_t *list_head;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct bucket *bp;
	struct node *cur;
	int retval;

	bp = h->buckets[key % h->nbuckets];
	spin_lock(&bp->bucket_lock);
	cur = bp->list_head;
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			spin_unlock(&bp->bucket_lock);
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&bp->bucket_lock);
	return 0;
}
\end{VerbatimL}
\caption{Data-Locking Hash Table Search}
\label{lst:SMPdesign:Data-Locking Hash Table Search}
\end{listing}
In contrast with the contentious situation
shown in \cref{fig:SMPdesign:Lock Contention},
data locking helps promote harmony, as illustrated by
\cref{fig:SMPdesign:Data Locking}---and in parallel programs,
this \emph{almost} always translates into increased performance and
scalability.
For this reason, data locking was heavily used by Sequent in its
kernels~\cite{Beck85,Inman85,Garg90,Dove90,McKenney92b,McKenney92a,McKenney93}.
Another way of looking at this is to think of each \co{->bucket_lock}
as mediating ownership not of the entire hash table as was done for code
locking, but only for the bucket corresponding to that \co{->bucket_lock}.
Each lock still partitions time, but the per-bucket-locking technique
also partitions the address space, so that the overall technique can be
said to partition spacetime.
If the number of buckets is large enough, this partitioning of space
should with high probability permit a given CPU immediate access to
a given hash bucket.
\begin{figure}
\centering
\resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Data-many-happy}}
\caption{Data Locking}
\ContributedBy{fig:SMPdesign:Data Locking}{Melissa Broussard}
\end{figure}
% ./test_hash_spinlock.exe 1000 0/100 1 1024 1
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 158.118 max = 162.404 min = 156.199 std = 2.19391
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 157.717 max = 162.446 min = 156.415 std = 2.36662
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_spinlock.exe: avg = 158.369 max = 164.75 min = 156.501 std = 3.19454
% ./test_hash_spinlock.exe 1000 0/100 1 1024 2
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 223.426 max = 422.948 min = 167.858 std = 100.136
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 235.462 max = 507.134 min = 167.466 std = 135.836
% ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_spinlock.exe: avg = 305.807 max = 481.685 min = 167.939 std = 132.589
However, as those who have taken care of small children can again attest,
even providing enough to go around is no guarantee of tranquillity.
The analogous situation can arise in SMP programs.
For example, the Linux kernel maintains a cache of files and directories
(called ``dcache'').
Each entry in this cache has its own lock, but the entries corresponding
to the root directory and its direct descendants are much more likely to
be traversed than are more obscure entries.
This can result in many CPUs contending for the locks of these popular
entries, resulting in a situation not unlike that
shown in \cref{fig:SMPdesign:Data and Skew}.
\begin{figure}
\centering
\resizebox{\twocolumnwidth}{!}{\includegraphics{cartoons/r-2014-Data-many-fighting}}
\caption{Data Locking and Skew}
\ContributedBy{fig:SMPdesign:Data and Skew}{Melissa Broussard}
\end{figure}
In many cases, algorithms can be designed to reduce the incidence of
data skew, and in some cases eliminate it entirely
(for example, in the Linux kernel's
dcache~\cite{McKenney04a,JonathanCorbet2010dcacheRCU,NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk,NeilBrown2015PathnameSymlinks}).
Data locking is often used for partitionable data structures such as
hash tables, as well as in situations where multiple entities are each
represented by an instance of a given data structure.
The Linux-kernel task list is an example of the latter, each task
structure having its own \co{alloc_lock} and \co{pi_lock}.
A key challenge with data locking on dynamically allocated structures
is ensuring that the structure remains in existence while the lock is
being acquired~\cite{Gamsa99}.
The code in
\cref{lst:SMPdesign:Data-Locking Hash Table Search}
finesses this challenge by placing the locks in the statically allocated
hash buckets, which are never freed.
However, this trick would not work if the hash table were resizeable,
so that the locks were now dynamically allocated.
In this case, there would need to be some means to prevent the hash
bucket from being freed during the time that its lock was being acquired.
\QuickQuiz{
What are some ways of preventing a structure from being freed while
its lock is being acquired?
}\QuickQuizAnswer{
Here are a few possible solutions to this
\emph{\IX{existence guarantee}} problem:
\begin{enumerate}
\item Provide a statically allocated lock that is held while
the per-structure lock is being acquired, which is an
example of hierarchical locking (see
\cref{sec:SMPdesign:Hierarchical Locking}).
Of course, using a single global lock for this purpose
can result in unacceptably high levels of lock contention,
dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing
the structure's address to select the lock to be acquired,
as described in \cref{chp:Locking}.
Given a hash function of sufficiently high quality, this
avoids the scalability limitations of the single global
lock, but in read-mostly situations, the lock-acquisition
overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing
them, so that a structure cannot be deallocated while being
referenced.
This works very well, removing the existence-guarantee
burden (and much else besides) from the developer's
shoulders, but imposes the overhead of garbage collection
on the program.
Although garbage-collection technology has advanced
considerably in the past few decades, its overhead
may be unacceptably high for some applications.
In addition, some applications require that the developer
exercise more control over the layout and placement of
data structures than is permitted by most garbage collected
environments.
\item As a special case of a garbage collector, use a global
reference counter, or a global array of reference counters.
These have strengths and limitations similar to those
called out above for locks.
\item Use \emph{\IXpl{hazard pointer}}~\cite{MagedMichael04a}, which
can be thought of as an inside-out reference count.
Hazard-pointer-based algorithms maintain a per-thread list of
pointers, so that the appearance of a given pointer on
any of these lists acts as a reference to the corresponding
structure.
Hazard pointers are starting to see significant production use
(see \cref{sec:defer:Production Uses of Hazard Pointers}).
\item Use transactional memory
(TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95},
so that each reference and
modification to the data structure in question is
performed atomically.
Although TM has engendered much excitement in recent years,
and seems likely to be of some use in production software,
developers should exercise some
caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM},
particularly in performance-critical code.
In particular, existence guarantees require that the
transaction covers the full path from a global reference
to the data elements being updated.
For more on TM, including ways to overcome some of its
weaknesses by combining it with other synchronization
mechanisms, see
\cref{sec:future:Transactional Memory,sec:future:Hardware Transactional Memory}.
\item Use RCU, which can be thought of as an extremely lightweight
approximation to a garbage collector.
Updaters are not permitted to free RCU-protected
data structures that RCU readers might still be referencing.
RCU is most heavily used for read-mostly data structures,
and is discussed at length in
\cref{sec:defer:Read-Copy Update (RCU)}.
\end{enumerate}
For more on providing existence guarantees, see
\cref{chp:Locking,chp:Deferred Processing}.
}\QuickQuizEnd
\subsection{Data Ownership}
\label{sec:SMPdesign:Data Ownership}
Data ownership partitions a given data structure over the threads
or CPUs, so that each thread/CPU accesses its subset of the data
structure without any synchronization overhead whatsoever.
However, if one thread wishes to access some other thread's data,
the first thread is unable to do so directly.
Instead, the first thread must communicate with the second thread,
so that the second thread performs the operation on behalf of the
first, or, alternatively, migrates the data to the first thread.
Data ownership might seem arcane, but it is used very frequently:
\begin{enumerate}
\item Any variables accessible by only one CPU or thread
(such as {\tt auto} variables in C
and C++) are owned by that CPU or thread.
\item An instance of a user interface owns the corresponding
user's context.
It is very common for applications interacting with parallel
database engines to be written as if they were entirely
sequential programs.
Such applications own the user interface and the user's current
action.
Explicit parallelism is thus confined to the
database engine itself.
\item Parametric simulations are often trivially parallelized
by granting each thread ownership of a particular region
of the parameter space.
There are also computing frameworks designed for this
type of problem~\cite{BOINC2008}.
\end{enumerate}
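As a minimal illustration of the first item above (this sketch is not
taken from the book's code samples, and all of its names are
hypothetical), each thread can be given sole ownership of one element
of a statically allocated array, so that the owner's accesses need no
synchronization whatsoever:
\begin{VerbatimL}
#define NR_THREADS 4

struct owned_counter {
	unsigned long count;	/* Touched only by the owning thread. */
	char pad[56];		/* Crude padding to reduce false sharing. */
};

struct owned_counter counters[NR_THREADS];

/* Called only by thread my_id, so no locks or atomics are needed. */
void inc_my_count(int my_id)
{
	counters[my_id].count++;
}
\end{VerbatimL}
Any other thread wanting the value of a given counter must ask the
owning thread for it, for example, by enqueueing a request that the
owner services, in keeping with the data-ownership discipline
described above.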
If there is significant sharing, communication between the threads
or CPUs can result in significant complexity and overhead.
Furthermore, if the most-heavily used data happens to be that owned
by a single CPU, that CPU will be a ``\IX{hot spot}'', sometimes with
results resembling that shown in \cref{fig:SMPdesign:Data and Skew}.
However, in situations where no sharing is required, data ownership
achieves ideal performance, and with code that can be as simple
as the sequential-program case shown in
\cref{lst:SMPdesign:Sequential-Program Hash Table Search}.
Such situations are often referred to as ``\IX{embarrassingly
parallel}'', and, in the best case, resemble the situation
previously shown in \cref{fig:SMPdesign:Data Locking}.
% ./test_hash_null.exe 1000 0/100 1 1024 1
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1
% ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418
% ./test_hash_null.exe 1000 0/100 1 1024 2
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 45.4526 max = 46.4281 min = 45.1954 std = 0.487791
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 46.0238 max = 49.2861 min = 45.1852 std = 1.63127
% ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2
% ./test_hash_null.exe: avg = 46.6858 max = 52.6278 min = 45.1761 std = 2.97102
Another important instance of data ownership occurs when the data
is read-only, in which case,
all threads can ``own'' it via replication.
Where data locking partitions both the address space (with one
hash bucket per partition) and time (using per-bucket locks), data
ownership partitions only the address space.
The reason that data ownership need not partition time is because a
given thread or CPU is assigned permanent ownership of a given
address-space partition.
\QuickQuiz{
But won't system boot and shutdown (or application startup
and shutdown) be partitioning time, even for data ownership?
}\QuickQuizAnswer{
You can indeed think in these terms.
And if you are working on a persistent data store where state
survives shutdown, thinking in these terms might even be useful.
}\QuickQuizEnd
Data ownership will be presented in more detail in
\cref{chp:Data Ownership}.
\subsection{Locking Granularity and Performance}
\label{sec:SMPdesign:Locking Granularity and Performance}
This section looks at locking granularity and performance from
a mathematical synchronization\-/efficiency viewpoint.
Readers who are uninspired by mathematics might choose to skip
this section.
The approach is to use a crude queueing model for the \IX{efficiency} of
synchronization mechanisms that operate on a single shared global
variable, based on an M/M/1 queue.
M/M/1 queueing models are based on an exponentially distributed
``inter-arrival rate'' $\lambda$ and an exponentially distributed
``service rate'' $\mu$.
The inter-arrival rate $\lambda$ can be thought of as the average
number of synchronization operations per second that the system
would process if the synchronization were free, in other words,
$\lambda$ is an inverse measure of the overhead of each non-synchronization
unit of work.
For example, if each unit of work was a transaction, and if each transaction
took one millisecond to process, excluding synchronization overhead,
then $\lambda$ would be 1,000~transactions per second.
The service rate $\mu$ is defined similarly, but for the average
number of synchronization operations per second that the system
would process if the overhead of each transaction was zero, and
ignoring the fact that CPUs must wait on each other to complete
their synchronization operations, in other words, $\mu$ can be roughly
thought of as an inverse measure of the per-operation synchronization
overhead in the absence of contention.
For example, suppose that each transaction's synchronization operation
involves an atomic increment instruction, and that a computer system is
able to do a private-variable atomic increment every 5~nanoseconds on
each CPU
(see \cref{fig:count:Atomic Increment Scalability on x86}).\footnote{
Of course, if there are 8 CPUs all incrementing the same
shared variable, then each CPU must wait at least 35~nanoseconds
for each of the other CPUs to do its increment before consuming
an additional 5~nanoseconds doing its own increment.
In fact, the wait will be longer due to the need
to move the variable from one CPU to another.}
The value of $\mu$ is therefore about 200,000,000 atomic increments
per second.
Of course, the value of $\lambda$ increases as increasing numbers of CPUs
increment a shared variable because each CPU is capable of processing
transactions independently (again, ignoring synchronization):
\begin{equation}
\lambda = n \lambda_0
\end{equation}
Here, $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing
capability of a single CPU\@.
Note that the expected time for a single CPU to execute a single transaction
in the absence of contention is $1 / \lambda_0$.
Because the CPUs have to ``wait in line'' behind each other to get their
chance to increment the single shared variable, we can use the M/M/1
queueing-model expression for the expected total waiting time:
\begin{equation}
T = \frac{1}{\mu - \lambda}
\end{equation}
Substituting the above value of $\lambda$:
\begin{equation}
T = \frac{1}{\mu - n \lambda_0}
\end{equation}
Now, the efficiency is just the ratio of the time required to process
a transaction in absence of synchronization ($1 / \lambda_0$)
to the time required including synchronization ($T + 1 / \lambda_0$):
\begin{equation}
e = \frac{1 / \lambda_0}{T + 1 / \lambda_0}
\end{equation}
Substituting the above value for $T$ and simplifying:
\begin{equation}
e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)}
\end{equation}
But the value of $\mu / \lambda_0$ is just the ratio of the time required
to process the transaction (absent synchronization overhead) to that of
the synchronization overhead itself (absent contention).
If we call this ratio $f$, we have:
\begin{equation}
e = \frac{f - n}{f - (n - 1)}
\end{equation}
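As a quick worked example of this formula, consider $f = 10$.
Five CPUs ($n = 5$) yield:
\begin{equation}
e = \frac{10 - 5}{10 - 4} = \frac{5}{6} \approx 0.83
\end{equation}
But $n = 9$ drops the efficiency to $1/2$, and $n = 10$ drives it all
the way to zero, the waiting time $T$ having become unbounded.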
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{SMPdesign/synceff}}
\caption{Synchronization Efficiency}
\label{fig:SMPdesign:Synchronization Efficiency}
\end{figure}
\Cref{fig:SMPdesign:Synchronization Efficiency} plots the synchronization
efficiency $e$ as a function of the number of CPUs/threads $n$ for
a few values of the overhead ratio $f$.
For example, again using the 5-nanosecond atomic increment, the $f=10$
line corresponds to each CPU attempting an atomic increment every
50~nanoseconds, and the $f=100$ line corresponds to each CPU attempting
an atomic increment every 500~nanoseconds, which in turn corresponds to
some hundreds (perhaps thousands) of instructions.
Given that each trace drops off sharply with increasing numbers of
CPUs or threads, we can conclude that
synchronization mechanisms based on
atomic manipulation of a single global shared variable will not
scale well if used heavily on current commodity hardware.
This is an abstract mathematical depiction of the forces leading
to the parallel counting algorithms that were discussed in
\cref{chp:Counting}.
Your real-world mileage may differ.
Nevertheless, the concept of efficiency is useful, even in cases
having little or no formal synchronization.
Consider for example a matrix multiply, in which the columns of one
matrix are multiplied (via ``dot product'') by the rows of another,
resulting in an entry in a third matrix.
Because none of these operations conflict, it is possible to partition
the columns of the first matrix among a group of threads, with each thread
computing the corresponding columns of the result matrix.
The threads can therefore operate entirely independently, with no
synchronization overhead whatsoever, as is done in
\path{matmul.c}.
One might therefore expect a perfect efficiency of 1.0.
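Before looking at the measured results, consider the following
simplified sketch of this approach.
It is an illustration only, not a transcription of \path{matmul.c},
whose details differ:
Each thread owns a contiguous range of columns of the result matrix,
so that, apart from thread creation and the final join, no
synchronization is needed.
\begin{VerbatimL}
#include <pthread.h>

#define N 1024
#define NTHREADS 8

static double a[N][N], b[N][N], c[N][N];

struct partition {
	int col_lo;	/* First result column owned by this thread. */
	int col_hi;	/* One past the last owned column. */
};

static void *matmul_worker(void *arg)
{
	struct partition *p = arg;
	int i, j, k;

	/* Each thread writes only its own columns of c: no locking. */
	for (j = p->col_lo; j < p->col_hi; j++)
		for (i = 0; i < N; i++) {
			double sum = 0.0;

			for (k = 0; k < N; k++)
				sum += a[i][k] * b[k][j];
			c[i][j] = sum;
		}
	return NULL;
}

void matmul(void)
{
	pthread_t tid[NTHREADS];
	struct partition part[NTHREADS];
	int t;

	for (t = 0; t < NTHREADS; t++) {
		part[t].col_lo = t * N / NTHREADS;
		part[t].col_hi = (t + 1) * N / NTHREADS;
		pthread_create(&tid[t], NULL, matmul_worker, &part[t]);
	}
	for (t = 0; t < NTHREADS; t++)
		pthread_join(tid[t], NULL);
}
\end{VerbatimL}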
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/data/hps.2020.03.30a/matmuleff}}
\caption{Matrix Multiply Efficiency}
\label{fig:SMPdesign:Matrix Multiply Efficiency}
\end{figure}
However,
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.3, even when running
single-threaded, and drops sharply as more threads are added.\footnote{
In contrast to the smooth traces of
\cref{fig:SMPdesign:Synchronization Efficiency},
the wide error bars and jagged traces of
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
give evidence of its real-world nature.}
The 128-by-128 matrix does better, but still fails to demonstrate
much performance increase with added threads.
The 256-by-256 matrix does scale reasonably well, but only up to a handful
of CPUs.
The 512-by-512 matrix multiply's efficiency is measurably less
than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix
multiply deviates noticeably from perfection at a few tens of threads.
Nevertheless, this figure clearly demonstrates the performance and
scalability benefits of batching:
If you must incur synchronization overhead, you may as well get your
money's worth, which is the solution to the problem of deciding on
granularity of synchronization put forth on
\cpageref{sec:SMPdesign:Problems Granularity}.
\QuickQuiz{
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
\cref{fig:SMPdesign:Matrix Multiply Efficiency}
have efficiency of exactly 1.0 when running on one thread?
}\QuickQuizAnswer{
The \path{matmul.c} program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
}\QuickQuizEnd
Given these inefficiencies,
it is worthwhile to look into more-scalable approaches
such as the data locking described in
\cref{sec:SMPdesign:Data Locking}
or the parallel-fastpath approach discussed in the next section.
\QuickQuizSeries{%
\QuickQuizB{
How are data-parallel techniques going to help with matrix
multiply?
It is \emph{already} data parallel!!!
}\QuickQuizAnswerB{
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the
size of the partitions.
For example, if you split a 64-by-64 matrix multiply across
64 threads, each thread gets only 64 floating-point multiplies.
The cost of a floating-point multiply is minuscule compared to
the overhead of thread creation, and cache-miss overhead
also plays a role in spoiling the theoretically perfect scalability
(and also in making the traces so jagged).
The full 448~hardware threads would require a matrix with
hundreds of thousands of rows and columns to attain good
scalability, but by that point GPGPUs become quite attractive,
especially from a price/performance viewpoint.
Moral:
If you have a parallel program with variable input,
always include a check for the input size being too small to
be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful
to incur the overhead required to spawn a thread, now is it?
}\QuickQuizEndB
%
\QuickQuizE{
What did you do to validate this matrix multiply algorithm?
}\QuickQuizAnswerE{
For this simple approach, very little.
However, the validation of production-quality matrix multiply
requires great care and attention.
Some cases require careful handling of floating-point rounding
errors, others involve complex sparse-matrix data structures,
and still others make use of special-purpose arithmetic hardware
such as vector units or GPGPUs.
Adequate tests for handling of floating-point rounding errors
can be especially challenging.
}\QuickQuizEndE
}
\section{Parallel Fastpath}
\label{sec:SMPdesign:Parallel Fastpath}
%
\epigraph{There are two ways of meeting difficulties:
You alter the difficulties, or you alter yourself to meet them.}
{Phyllis Bottome}
Fine-grained (and therefore \emph{usually} higher-performance)
designs are typically more complex than are coarser-grained designs.
In many cases, most of the overhead is incurred by a small fraction
of the code~\cite{Knuth73}.
So why not focus effort on that small fraction?
This is the idea behind the parallel-fastpath design pattern, to aggressively
parallelize the common-case code path without incurring the complexity
that would be required to aggressively parallelize the entire algorithm.
You must understand not only the specific algorithm you wish
to parallelize, but also the workload that the algorithm will
be subjected to.
Great creativity and design effort is often required to construct
a parallel fastpath.
Parallel fastpath combines different patterns (one for the
fastpath, one elsewhere) and is therefore a template pattern.
The following instances of parallel
fastpath occur often enough to warrant their own patterns,
as depicted in \cref{fig:SMPdesign:Parallel-Fastpath Design Patterns}:
\begin{figure}
\centering
\resizebox{2.3in}{!}{\includegraphics{SMPdesign/ParallelFastpath}}
% \includegraphics{SMPdesign/ParallelFastpath}
\caption{Parallel-Fastpath Design Patterns}
\label{fig:SMPdesign:Parallel-Fastpath Design Patterns}
\end{figure}
\begin{enumerate}
\item Reader/Writer Locking
(described below in \cref{sec:SMPdesign:Reader/Writer Locking}).
\item Read-copy update (RCU), which may be used as a high-performance
replacement for reader/writer locking, is introduced in
\cref{sec:defer:Read-Copy Update (RCU)}.
Other alternatives include hazard pointers
(\cref{sec:defer:Hazard Pointers})
and sequence locking (\cref{sec:defer:Sequence Locks}).
These alternatives will not be discussed further in this chapter.
\item Hierarchical Locking~(\cite{McKenney95b}), which is touched upon
in \cref{sec:SMPdesign:Hierarchical Locking}.
\item Resource Allocator Caches~(\cite{McKenney95b,McKenney93}).
See \cref{sec:SMPdesign:Resource Allocator Caches}
for more detail.
\end{enumerate}
\subsection{Reader/Writer Locking}
\label{sec:SMPdesign:Reader/Writer Locking}
If synchronization overhead is negligible (for example, if the program
uses coarse-grained parallelism with large \IXpl{critical section}), and if
only a small fraction of the critical sections modify data, then allowing
multiple readers to proceed in parallel can greatly increase scalability.
Writers exclude both readers and each other.
There are many implementations of reader-writer locking, including
the POSIX implementation described in
\cref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
\Cref{lst:SMPdesign:Reader-Writer-Locking Hash Table Search}
shows how the hash search might be implemented using reader-writer locking.
\begin{listing}
\begin{VerbatimL}[commandchars=\\\@\$]
rwlock_t hash_lock;

struct hash_table
{
	long nbuckets;
	struct node **buckets;
};

typedef struct node {
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct node *cur;
	int retval;

	read_lock(&hash_lock);
	cur = h->buckets[key % h->nbuckets];
	while (cur != NULL) {
		if (cur->key >= key) {
			retval = (cur->key == key);
			read_unlock(&hash_lock);
			return retval;
		}
		cur = cur->next;
	}
	read_unlock(&hash_lock);
	return 0;
}
\end{VerbatimL}
\caption{Reader-Writer-Locking Hash Table Search}
\label{lst:SMPdesign:Reader-Writer-Locking Hash Table Search}
\end{listing}
Reader/writer locking is a simple instance of asymmetric locking.
Snaman~\cite{Snaman87} describes a more ornate six-mode
asymmetric locking design used in several clustered systems.
Locking in general and reader-writer locking in particular are described
extensively in
\cref{chp:Locking}.
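\Cref{lst:SMPdesign:Reader-Writer-Locking Hash Table Search} shows
only the read path.
For completeness, a minimal sketch of a corresponding insertion is
shown below.
It is not taken from Hart et al.'s code, and it assumes
\co{write_lock()} and \co{write_unlock()} counterparts to the
read-side primitives, along with bucket lists sorted by key, as
implied by the search's greater-than-or-equal comparison:
\begin{VerbatimL}
int hash_insert(struct hash_table *h, node_t *np)
{
	struct node **cpp;

	write_lock(&hash_lock);
	cpp = &h->buckets[np->key % h->nbuckets];
	while (*cpp != NULL && (*cpp)->key < np->key)
		cpp = &(*cpp)->next;	/* Keep the list sorted. */
	if (*cpp != NULL && (*cpp)->key == np->key) {
		write_unlock(&hash_lock);
		return 0;	/* Key already present. */
	}
	np->next = *cpp;
	*cpp = np;
	write_unlock(&hash_lock);
	return 1;
}
\end{VerbatimL}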
\subsection{Hierarchical Locking}
\label{sec:SMPdesign:Hierarchical Locking}
The idea behind hierarchical locking is to have a coarse-grained lock
that is held only long enough to work out which fine-grained lock
to acquire.
\Cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
shows how our hash-table search might be adapted to do hierarchical
locking, but also shows the great weakness of this approach:
We have paid the overhead of acquiring a second lock, but we only
hold it for a short time.
In this case, the data-locking approach would be simpler
and likely perform better.
\begin{listing}
\begin{fcvlabel}[ln:SMPdesign:Hierarchical-Locking Hash Table Search]
\begin{VerbatimL}[commandchars=\\\@\$]
struct hash_table
{
	long nbuckets;
	struct bucket **buckets;
};

struct bucket {
	spinlock_t bucket_lock;
	node_t *list_head;
};

typedef struct node {
	spinlock_t node_lock;
	unsigned long key;
	struct node *next;
} node_t;

int hash_search(struct hash_table *h, long key)
{
	struct bucket *bp;
	struct node *cur;
	int retval;

	bp = h->buckets[key % h->nbuckets];
	spin_lock(&bp->bucket_lock);
	cur = bp->list_head;
	while (cur != NULL) {
		if (cur->key >= key) {
			spin_lock(&cur->node_lock);
			spin_unlock(&bp->bucket_lock);
			retval = (cur->key == key);	\lnlbl@retval$
			spin_unlock(&cur->node_lock);
			return retval;
		}
		cur = cur->next;
	}
	spin_unlock(&bp->bucket_lock);
	return 0;
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Hierarchical-Locking Hash Table Search}
\label{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
\end{listing}
\QuickQuiz{
In what situation would hierarchical locking work well?
}\QuickQuizAnswer{
If the comparison on
\clnrefr{ln:SMPdesign:Hierarchical-Locking Hash Table Search:retval} of
\cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search}
were replaced by a much heavier-weight operation,
then releasing \co{bp->bucket_lock} \emph{might} reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of \co{cur->node_lock}.
}\QuickQuizEnd
\subsection{Resource Allocator Caches}
\label{sec:SMPdesign:Resource Allocator Caches}
This section presents a simplified schematic of a parallel fixed-block-size
memory allocator.
More detailed descriptions may be found in the
literature~\cite{McKenney92a,McKenney93,Bonwick01slab,McKenney01e,JasonEvans2011jemalloc,ChrisKennelly2020tcmalloc}
or in the Linux kernel~\cite{Torvalds2.6kernel}.
\subsubsection{Parallel Resource Allocation Problem}
The basic problem facing a parallel memory allocator is the tension
between the need to provide extremely fast memory allocation and
freeing in the common case and the need to efficiently distribute
memory in the face of unfavorable allocation and freeing patterns.
To see this tension, consider a straightforward application of
data ownership to this problem---simply carve up memory so that
each CPU owns its share.
For example, suppose that a system with 12~CPUs has 64~gigabytes
of memory, such as the laptop I am using right now.
We could simply assign each CPU a five-gigabyte region of memory,
and allow each CPU to allocate from its own region, without the need
for locking and its complexities and overheads.
Unfortunately, this scheme fails when CPU~0 only allocates memory and
CPU~1 only frees it, as happens in simple producer-consumer workloads.
The other extreme, \IXh{code}{locking}, suffers from excessive
\IX{lock contention}
and overhead~\cite{McKenney93}.
\subsubsection{Parallel Fastpath for Resource Allocation}
\label{sec:SMPdesign:Parallel Fastpath for Resource Allocation}
The commonly used solution uses parallel fastpath with each CPU
owning a modest cache of blocks, and with a large code-locked
shared pool for additional blocks.
To prevent any given CPU from monopolizing the memory blocks,
we place a limit on the number of blocks that can be in each CPU's
cache.
In a two-CPU system, the flow of memory blocks will be as shown
in \cref{fig:SMPdesign:Allocator Cache Schematic}:
When a given CPU is trying to free a block when its pool is full,
it sends blocks to the global pool, and, similarly, when that CPU
is trying to allocate a block when its pool is empty, it retrieves
blocks from the global pool.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/allocatorcache}}
\caption{Allocator Cache Schematic}
\label{fig:SMPdesign:Allocator Cache Schematic}
\end{figure}
\subsubsection{Data Structures}
The actual data structures for a ``toy'' implementation of allocator
caches are shown in
\cref{lst:SMPdesign:Allocator-Cache Data Structures}
(``\path{smpalloc.c}'').
The ``Global Pool'' of \cref{fig:SMPdesign:Allocator Cache Schematic}
is implemented by \co{globalmem} of type \co{struct globalmempool},
and the two CPU pools by the per-thread variable \co{perthreadmem} of
type \co{struct perthreadmempool}.
Both of these data structures have arrays of pointers to blocks
in their \co{pool} fields, which are filled from index zero upwards.
Thus, if \co{globalmem.pool[3]} is \co{NULL}, then the remainder of
the array from index 4 up must also be \co{NULL}.
The \co{cur} fields contain the index of the highest-numbered full
element of the \co{pool} array, or $-1$ if all elements are empty.
All elements from \co{globalmem.pool[0]} through
\co{globalmem.pool[globalmem.cur]} must be full, and all the rest
must be empty.\footnote{
Both pool sizes (\co{TARGET_POOL_SIZE} and
\co{GLOBAL_POOL_SIZE}) are unrealistically small, but this small
size makes it easier to single-step the program in order to get
a feel for its operation.}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@data_struct.fcv}
\caption{Allocator-Cache Data Structures}
\label{lst:SMPdesign:Allocator-Cache Data Structures}
\end{listing}
The operation of the pool data structures is illustrated by
\cref{fig:SMPdesign:Allocator Pool Schematic},
with the six boxes representing the array of pointers making up
the \co{pool} field, and the number preceding them representing
the \co{cur} field.
The shaded boxes represent non-\co{NULL} pointers, while the empty
boxes represent \co{NULL} pointers.
An important, though potentially confusing, invariant of this
data structure is that the \co{cur} field is always one
smaller than the number of non-\co{NULL} pointers.
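This invariant can also be stated in code, as in the following
illustrative check, which is not part of \path{smpalloc.c} and which
must be invoked with the global pool's lock held when running
multithreaded:
\begin{VerbatimL}
#include <assert.h>

/* Illustrative only: verify the global pool's invariants. */
void check_globalmem_invariant(void)
{
	int i;
	int nfull = 0;

	for (i = 0; i < GLOBAL_POOL_SIZE; i++)
		if (globalmem.pool[i] != NULL)
			nfull++;
	/* cur is one smaller than the number of non-NULL pointers... */
	assert(globalmem.cur == nfull - 1);
	/* ...and those pointers are packed at the low-index end. */
	for (i = 0; i <= globalmem.cur; i++)
		assert(globalmem.pool[i] != NULL);
	for (i = globalmem.cur + 1; i < GLOBAL_POOL_SIZE; i++)
		assert(globalmem.pool[i] == NULL);
}
\end{VerbatimL}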
\begin{figure}
\centering
\resizebox{2.6in}{!}{\includegraphics{SMPdesign/AllocatorPool}}
\caption{Allocator Pool Schematic}
\label{fig:SMPdesign:Allocator Pool Schematic}
\end{figure}
\subsubsection{Allocation Function}
\begin{fcvref}[ln:SMPdesign:smpalloc:alloc]
The allocation function \co{memblock_alloc()} may be seen in
\cref{lst:SMPdesign:Allocator-Cache Allocator Function}.
\Clnref{pick} picks up the current thread's per-thread pool,
and \clnref{chk:empty} checks to see if it is empty.
If so, \clnrefrange{ack}{rel} attempt to refill it
from the global pool
under the spinlock acquired on \clnref{ack} and released on \clnref{rel}.
\Clnrefrange{loop:b}{loop:e} move blocks from the global
to the per-thread pool until
either the local pool reaches its target size (half full) or
the global pool is exhausted, and \clnref{set} sets the per-thread pool's
count to the proper value.
In either case, \clnref{chk:notempty} checks for the per-thread
pool still being
empty, and if not, \clnrefrange{rem:b}{rem:e} remove a block and return it.
Otherwise, \clnref{ret:NULL} tells the sad tale of memory exhaustion.
\end{fcvref}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@alloc.fcv}
\caption{Allocator-Cache Allocator Function}
\label{lst:SMPdesign:Allocator-Cache Allocator Function}
\end{listing}
\subsubsection{Free Function}
\begin{fcvref}[ln:SMPdesign:smpalloc:free]
\Cref{lst:SMPdesign:Allocator-Cache Free Function} shows
the memory-block free function.
\Clnref{get} gets a pointer to this thread's pool, and
\clnref{chk:full} checks to see if this per-thread pool is full.
If so, \clnrefrange{acq}{empty:e} empty half of the per-thread pool
into the global pool,
with \clnref{acq,rel} acquiring and releasing the spinlock.
\Clnrefrange{loop:b}{loop:e} implement the loop moving blocks
from the local to the
global pool, and \clnref{set} sets the per-thread pool's count to the proper
value.
In either case, \clnref{place} then places the newly freed block into the
per-thread pool.
\end{fcvref}
\begin{listing}
\input{CodeSamples/SMPdesign/smpalloc@free.fcv}
\caption{Allocator-Cache Free Function}
\label{lst:SMPdesign:Allocator-Cache Free Function}
\end{listing}
\QuickQuiz{
Doesn't this resource-allocator design resemble that of
the approximate limit counters covered in
\cref{sec:count:Approximate Limit Counters}?
}\QuickQuizAnswer{
Indeed it does!
We are used to thinking of allocating and freeing memory,
but the algorithms in
\cref{sec:count:Approximate Limit Counters}
are taking very similar actions to allocate and free
``count''.
}\QuickQuizEnd
\subsubsection{Performance}
Rough performance results\footnote{
This data was not collected in a statistically meaningful way,
and therefore should be viewed with great skepticism and suspicion.
Good data-collection and -reduction practice is discussed
in \cref{chp:Validation}.
That said, repeated runs gave similar results, and these results
match more careful evaluations of similar algorithms.}
are shown in
\cref{fig:SMPdesign:Allocator Cache Performance},
running on a dual-core Intel x86 running at 1\,GHz (4300 bogomips per CPU)
with at most six blocks allowed in each CPU's cache.
In this micro-benchmark,
each thread repeatedly allocates a group of blocks and then frees all
the blocks in that group, with
the number of blocks in the group being the ``allocation run length''
displayed on the x-axis.
The y-axis shows the number of successful allocation/free pairs per
microsecond---failed allocations are not counted.
The ``X''s are from a two-thread run, while the ``+''s are from a
single-threaded run.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/smpalloc}}
\caption{Allocator Cache Performance}
\label{fig:SMPdesign:Allocator Cache Performance}
\end{figure}
Note that run lengths up to six scale linearly and give excellent performance,
while run lengths greater than six show poor performance and almost always
also show \emph{negative} scaling.
It is therefore quite important to size \co{TARGET_POOL_SIZE}
sufficiently large,
which fortunately is usually quite easy to do in actual
practice~\cite{McKenney01e}, especially given today's large memories.
For example, in most systems, it is quite reasonable to set
\co{TARGET_POOL_SIZE} to 100, in which case allocations and frees
are guaranteed to be confined to per-thread pools at least 99\pct\ of
the time.
As can be seen from the figure, the situations where the common-case
data-ownership applies (run lengths up to six) provide greatly improved
performance compared to the cases where locks must be acquired.
Avoiding synchronization in the common case will be a recurring theme through
this book.
\QuickQuizSeries{%
\QuickQuizB{
In \cref{fig:SMPdesign:Allocator Cache Performance},
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
Why?
}\QuickQuizAnswerB{
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
}\QuickQuizEndB
%
\QuickQuizE{
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-thread target
pool size $s$ of three, number of threads $n$ equal to two,
and assuming that the per-thread pools are initially
empty with none of the memory in use, what is the smallest allocation
run length $m$ at which failures can occur?
(Recall that each thread repeatedly allocates $m$ blocks of memory,
and then frees the $m$ blocks of memory.)
Alternatively, given $n$ threads each with pool size $s$, and
where each thread repeatedly first allocates $m$ blocks of memory
and then frees those $m$ blocks, how large must the global pool
size be?
\emph{Note:}
Obtaining the correct answer will require you to
examine the \path{smpalloc.c} source code, and very likely
single-step it as well.
You have been warned!
}\QuickQuizAnswerE{
This solution is adapted from one put forward by Alexey Roytman.
It is based on the following definitions:
\begin{description}
\item[$g$] Number of blocks globally available.
\item[$i$] Number of blocks left in the initializing thread's
per-thread pool.
(This is one reason you needed to look at the code!)
\item[$m$] Allocation/free run length.
\item[$n$] Number of threads, excluding the initialization thread.
\item[$p$] Per-thread maximum block consumption, including
both the blocks actually allocated and the blocks
remaining in the per-thread pool.
\end{description}
The values $g$, $m$, and $n$ are given.
The value for $p$ is $m$ rounded up to the next multiple of $s$,
as follows:
\begin{equation}
p = s \left \lceil \frac{m}{s} \right \rceil
\label{sec:SMPdesign:p}
\end{equation}
The value for $i$ is as follows:
\begin{equation}
i = \left \{
\begin{array}{l}
g \pmod{2 s} = 0: 2 s \\
g \pmod{2 s} \ne 0: g \pmod{2 s}
\end{array}
\right .
\label{sec:SMPdesign:i}
\end{equation}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/smpalloclim}}
\caption{Allocator Cache Run-Length Analysis}
\label{fig:SMPdesign:Allocator Cache Run-Length Analysis}
\end{figure}
The relationships between these quantities are shown in
\cref{fig:SMPdesign:Allocator Cache Run-Length Analysis}.
The global pool is shown on the top of this figure, and
the ``extra'' initializer thread's per-thread pool and
per-thread allocations are the left-most pair of boxes.
The initializer thread has no blocks allocated, but has
$i$ blocks stranded in its per-thread pool.
The rightmost two pairs of boxes are the per-thread pools and
per-thread allocations of threads holding the maximum possible
number of blocks, while the second-from-left pair of boxes
represents the thread currently trying to allocate.
The total number of blocks is $g$, and adding up the per-thread
allocations and per-thread pools, we see that the global pool
contains $g-i-p(n-1)$ blocks.
If the allocating thread is to be successful, it needs at least
$m$ blocks in the global pool, in other words:
\begin{equation}
g - i - p(n - 1) \ge m
\label{sec:SMPdesign:g-vs-m}
\end{equation}
The question has $g=40$, $s=3$, and $n=2$.
\Cref{sec:SMPdesign:i} gives $i=4$, and
\cref{sec:SMPdesign:p} gives $p=18$ for $m=18$
and $p=21$ for $m=19$.
Plugging these into \cref{sec:SMPdesign:g-vs-m}
shows that $m=18$ will not overflow, but that $m=19$ might
well do so.
The presence of $i$ could be considered to be a bug.
After all, why allocate memory only to have it stranded in
the initialization thread's cache?
One way of fixing this would be to provide a \co{memblock_flush()}
function that flushed the current thread's pool into the
global pool.
The initialization thread could then invoke this function
after freeing all of the blocks.
}\QuickQuizEndE
}
\subsubsection{Validation}
Validation of this simple allocator spawns a specified number of threads,
with each thread repeatedly allocating a specified number of memory
blocks and then deallocating them.
This simple regimen suffices to exercise both the per-thread caches
and the global pool, as can be seen in
\cref{fig:SMPdesign:Allocator Cache Performance}.
Much more aggressive validation is required for memory allocators that
are to be used in production.
The test suites for tcmalloc~\cite{ChrisKennelly2020tcmalloc} and
jemalloc~\cite{JasonEvans2011jemalloc} are instructive, as are the
tests for the Linux kernel's memory allocator.
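A hedged sketch of the per-thread test loop follows.
It is illustrative only:
The actual test code lives in \path{smpalloc.c} and differs in detail,
and both the \co{memblock_free()} name and the \co{MAX_RUN_LENGTH}
bound are assumptions rather than quotations from that file.
\begin{VerbatimL}
#define MAX_RUN_LENGTH 64	/* Hypothetical bound on the run length. */

void *memblock_test(void *arg)
{
	void *blocks[MAX_RUN_LENGTH];
	int runlength = *(int *)arg;
	int i;

	for (;;) {
		for (i = 0; i < runlength; i++)
			blocks[i] = memblock_alloc();
		for (i = 0; i < runlength; i++)
			if (blocks[i] != NULL)	/* Real tests count failures. */
				memblock_free(blocks[i]);
	}
	return NULL;	/* Not reached. */
}
\end{VerbatimL}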
\subsubsection{Real-World Design}
The toy parallel resource allocator was quite simple, but real-world
designs expand on this approach in a number of ways.
First, real-world allocators are required to handle a wide range
of allocation sizes, as opposed to the single size shown in this
toy example.
One popular way to do this is to offer a fixed set of sizes, spaced
so as to balance external and internal \IX{fragmentation}, such as in
the late-1980s BSD memory allocator~\cite{McKusick88}.
Doing this would mean that the ``globalmem'' variable would need
to be replicated on a per-size basis, and that the associated
lock would similarly be replicated, resulting in \IXh{data}{locking}
rather than the toy program's code locking.
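A hedged sketch of this per-size replication appears below.
The size-class policy, the structure layout, and all of the names are
illustrative assumptions rather than the layout used by any production
allocator (or by \path{smpalloc.c}):
\begin{VerbatimL}
#define NR_SIZE_CLASSES 8	/* For example, 16-byte through 2048-byte blocks. */

struct sized_globalpool {
	spinlock_t mutex;	/* Data locking: covers this size class only. */
	int cur;		/* As in the toy allocator's global pool. */
	void *pool[GLOBAL_POOL_SIZE];
};

struct sized_globalpool sized_globalmem[NR_SIZE_CLASSES];

/* Map a request size to its (power-of-two) size class. */
int size_to_class(unsigned long size)
{
	int c = 0;
	unsigned long classsize = 16;

	while (classsize < size && c < NR_SIZE_CLASSES - 1) {
		classsize <<= 1;
		c++;
	}
	return c;
}
\end{VerbatimL}
Allocation and freeing would then index \co{sized_globalmem} with the
result of \co{size_to_class()}, acquiring only that class's
\co{mutex}, which is precisely the data-locking pattern of
\cref{sec:SMPdesign:Data Locking}.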
Second, production-quality systems must be able to repurpose memory,
meaning that they must be able to coalesce blocks into larger structures,
such as pages~\cite{McKenney93}.
This coalescing will also need to be protected by a lock, which again
could be replicated on a per-size basis.
Third, coalesced memory must be returned to the underlying memory
system, and pages of memory must also be allocated from the underlying
memory system.
The locking required at this level will depend on that of the underlying
memory system, but could well be code locking.
Code locking can often be tolerated at this level, because this
level is so infrequently reached in well-designed systems~\cite{McKenney01e}.
Concurrent userspace allocators face similar
challenges~\cite{ChrisKennelly2020tcmalloc,JasonEvans2011jemalloc}.
Despite this real-world design's greater complexity, the underlying
idea is the same---repeated application of parallel fastpath,
as shown in
\cref{fig:app:questions:Schematic of Real-World Parallel Allocator}.
And ``parallel fastpath'' is one of the solutions to the non-partitionable
application problem put forth on
\cpageref{sec:SMPdesign:Problems Parallel Fastpath}.
\begin{table}
\rowcolors{1}{}{lightgray}
\small
\centering
\renewcommand*{\arraystretch}{1.25}
\begin{tabularx}{\twocolumnwidth}{ll>{\raggedright\arraybackslash}X}
\toprule
Level & Locking & Purpose \\
\midrule
Per-thread pool & Data ownership & High-speed allocation \\
Global block pool & Data locking & Distributing blocks among threads \\
Coalescing & Data locking & Combining blocks into pages \\
System memory & Code locking & Memory from/to system \\
\bottomrule
\end{tabularx}
\caption{Schematic of Real-World Parallel Allocator}
\label{fig:app:questions:Schematic of Real-World Parallel Allocator}
\end{table}
\input{SMPdesign/beyond}
\QuickQuizAnswersChp{qqzSMPdesign}