| % SMPdesign/SMPdesign.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}{qqzSMPdesign} |
| % |
| \Epigraph{Divide and rule.}{Philip II of Macedon} |
| |
| This chapter describes how to design software to take advantage of |
| modern commodity multicore systems by using idioms, or |
| ``design patterns''~\cite{Alexander79,GOF95,SchmidtStalRohnertBuschmann2000v2Textbook}, |
| to balance performance, scalability, and response time. |
| Correctly partitioned problems lead to simple, scalable, and |
| high-performance solutions, while poorly partitioned problems result |
| in slow and complex solutions. |
| This chapter will help you design partitioning into your code, with |
| some discussion of batching and weakening as well. |
| The word ``design'' is very important: |
| You should partition first, batch second, weaken third, and code fourth. |
| Changing this order often leads to poor performance and scalability |
| along with great frustration.\footnote{ |
| That other great dodge around the Laws of Physics, read-only |
| replication, is covered in \cref{chp:Deferred Processing}.} |
| |
| This chapter will also look at some specific problems, including: |
| |
| \begin{enumerate} |
| \item Constraints on the classic Dining Philosophers problem requiring |
that all the philosophers be able to dine concurrently.
| \label{sec:SMPdesign:Problems Dining Philosophers} |
| \item Lock-based double-ended queue implementations that provide |
| concurrency between operations on both ends of a given queue |
| when there are many elements in the queue, but still work |
| correctly when the queue contains only a few elements. |
| (Or, for that matter, no elements.) |
| \label{sec:SMPdesign:Problems Double-Ended Queue} |
| \item Summarizing the rough quality of a concurrent algorithm with only |
| a few numbers. |
| \label{sec:SMPdesign:Problems Quality Assessment} |
| \item Selecting the right granularity of partitioning. |
| \label{sec:SMPdesign:Problems Granularity} |
| \item Concurrent designs for applications that do not fully partition. |
| \label{sec:SMPdesign:Problems Parallel Fastpath} |
| \item Obtaining more than 2x speedup from two CPUs. |
| \label{sec:SMPdesign:Problems Maze} |
| \end{enumerate} |
| |
| To this end, \cref{sec:SMPdesign:Partitioning Exercises} |
| presents partitioning exercises, |
| \cref{sec:SMPdesign:Design Criteria} reviews partitionability |
| design criteria, |
| \cref{sec:SMPdesign:Synchronization Granularity} |
| discusses synchronization granularity selection, |
| \cref{sec:SMPdesign:Parallel Fastpath} |
| overviews important parallel-fastpath design patterns |
| that provide speed and scalability on common-case fastpaths while using |
simpler, less-scalable ``slow path'' fallbacks for unusual situations,
| and finally |
| \cref{sec:SMPdesign:Beyond Partitioning} |
| takes a brief look beyond partitioning. |
| |
| \input{SMPdesign/partexercises} |
| |
| \input{SMPdesign/criteria} |
| |
| \section{Synchronization Granularity} |
| \label{sec:SMPdesign:Synchronization Granularity} |
| % |
| \epigraph{Doing little things well is a step toward doing big things better.} |
| {Harry F.~Banks} |
| |
| \Cref{fig:SMPdesign:Design Patterns and Lock Granularity} |
| gives a pictorial view of different levels of synchronization granularity, |
| each of which is described in one of the following sections. |
| These sections focus primarily on locking, but similar granularity |
| issues arise with all forms of synchronization. |
| |
| \begin{figure} |
| \centering |
| \resizebox{1.2in}{!}{\includegraphics{SMPdesign/LockGranularity}} |
| \caption{Design Patterns and Lock Granularity} |
| \label{fig:SMPdesign:Design Patterns and Lock Granularity} |
| \end{figure} |
| |
| \subsection{Sequential Program} |
| \label{sec:SMPdesign:Sequential Program} |
| |
| If the program runs fast enough on a single processor, and |
| has no interactions with other processes, threads, or interrupt |
| handlers, you should |
| remove the synchronization primitives and spare yourself their |
| overhead and complexity. |
| Some years back, there were those who would argue that \IXr{Moore's Law} |
| would eventually force all programs into this category. |
| However, as can be seen in |
| \cref{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs}, |
| the exponential increase in single-threaded performance halted in |
| about 2003. |
| Therefore, |
| increasing performance will increasingly require parallelism.\footnote{ |
| This plot shows clock frequencies for newer CPUs theoretically |
| capable of retiring one or more instructions per clock, and MIPS for |
| older CPUs requiring multiple clocks to execute even the |
| simplest instruction. |
| The reason for taking this approach is that the newer CPUs' |
| ability to retire multiple instructions per clock is typically |
| limited by memory-system performance.} |
| Given that back in 2006 Paul typed the first version of this sentence |
| on a dual-core laptop, and further given that many of the graphs added |
| in 2020 were generated on a system with 56~hardware threads per socket, |
| parallelism is well and truly here. |
| It is also important to note that Ethernet bandwidth is continuing to |
| grow, as shown in |
| \cref{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance}. |
| This growth will continue to motivate multithreaded servers in order to |
| handle the communications load. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}} |
| \caption{MIPS/Clock-Frequency Trend for Intel CPUs} |
| \label{fig:SMPdesign:Clock-Frequency Trend for Intel CPUs} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/CPUvsEnet}} |
| \caption{Ethernet Bandwidth vs.\@ Intel x86 CPU Performance} |
| \label{fig:SMPdesign:Ethernet Bandwidth vs. Intel x86 CPU Performance} |
| \end{figure} |
| |
| Please note that this does \emph{not} mean that you should code each |
| and every program in a multi-threaded manner. |
| Again, if a program runs quickly enough on a single processor, |
| spare yourself the overhead and complexity of SMP synchronization |
| primitives. |
| The simplicity of the hash-table lookup code in |
| \cref{lst:SMPdesign:Sequential-Program Hash Table Search} |
| underscores this point.\footnote{ |
| The examples in this section are taken from Hart et |
| al.~\cite{ThomasEHart2006a}, adapted for clarity |
| by gathering related code from multiple files.} |
| A key point is that speedups due to parallelism are normally |
| limited to the number of CPUs. |
| In contrast, speedups due to sequential optimizations, for example, |
| careful choice of data structure, can be arbitrarily large. |
| |
| \QuickQuiz{ |
| What should you do to validate a hash table? |
| }\QuickQuizAnswer{ |
| Quite a bit, actually. |
| |
| See \cref{sec:datastruct:RCU-Protected Hash Table Validation} |
| for a good starting point. |
| }\QuickQuizEnd |
| |
| On the other hand, if you are not in this happy situation, read on! |
| |
| \begin{listing} |
| \begin{VerbatimL}[commandchars=\\\@\$] |
| struct hash_table |
| { |
| long nbuckets; |
| struct node **buckets; |
| }; |
| |
| typedef struct node { |
| unsigned long key; |
| struct node *next; |
| } node_t; |
| |
| int hash_search(struct hash_table *h, long key) |
| { |
| struct node *cur; |
| |
| cur = h->buckets[key % h->nbuckets]; |
| while (cur != NULL) { |
| if (cur->key >= key) { |
| return (cur->key == key); |
| } |
| cur = cur->next; |
| } |
| return 0; |
| } |
| \end{VerbatimL} |
| \caption{Sequential-Program Hash Table Search} |
| \label{lst:SMPdesign:Sequential-Program Hash Table Search} |
| \end{listing} |
| |
| % ./test_hash_null.exe 1000 0/100 1 1024 1 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418 |
| |
| \subsection{Code Locking} |
| \label{sec:SMPdesign:Code Locking} |
| |
\IXh{Code}{locking} is quite simple because it uses only
global locks.\footnote{
If your program instead has locks in data structures,
or, in the case of Java, uses classes with synchronized
instances, you are using ``data locking'', described
in \cref{sec:SMPdesign:Data Locking}.}
| It is especially |
| easy to retrofit an existing program to use code locking in |
| order to run it on a multiprocessor. |
| If the program has only a single shared resource, code locking |
| will even give optimal performance. |
However, many of the larger and more complex programs
require much of the execution to
occur in \IXpl{critical section}, which in turn causes code locking
to sharply limit their scalability.
| |
Therefore, you should use code locking on programs that spend
only a small fraction of their execution time in critical sections or
for which only modest scaling is required.
| In addition, programs that primarily use the more scalable approaches |
| described in later sections often use code locking to handle rare error |
| cases or significant state transitions. |
| In these cases, code locking will provide a relatively simple |
| program that is very similar to its sequential counterpart, |
| as can be seen in |
| \cref{lst:SMPdesign:Code-Locking Hash Table Search}. |
| However, note that the simple return of the comparison in |
| \co{hash_search()} in |
| \cref{lst:SMPdesign:Sequential-Program Hash Table Search} |
| has now become three statements due to the need to release the |
| lock before returning. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:SMPdesign:Code-Locking Hash Table Search] |
| \begin{VerbatimL}[commandchars=\\\@\$] |
| spinlock_t hash_lock; |
| |
| struct hash_table |
| { |
| long nbuckets; |
| struct node **buckets; |
| }; |
| |
| typedef struct node { |
| unsigned long key; |
| struct node *next; |
| } node_t; |
| |
| int hash_search(struct hash_table *h, long key) |
| { |
| struct node *cur; |
| int retval; |
| |
| spin_lock(&hash_lock); \lnlbl@acq$ |
| cur = h->buckets[key % h->nbuckets]; |
| while (cur != NULL) { |
| if (cur->key >= key) { |
| retval = (cur->key == key); |
| spin_unlock(&hash_lock); \lnlbl@rel1$ |
| return retval; |
| } |
| cur = cur->next; |
| } |
| spin_unlock(&hash_lock); \lnlbl@rel2$ |
| return 0; |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Code-Locking Hash Table Search} |
| \label{lst:SMPdesign:Code-Locking Hash Table Search} |
| \end{listing} |
| |
| \begin{fcvref}[ln:SMPdesign:Code-Locking Hash Table Search] |
| Note that the \co{hash_lock} acquisition and release statements on |
| \clnref{acq,rel1,rel2} are mediating ownership of the hash table among |
| the CPUs wishing to concurrently access that hash table. |
| Another way of looking at this is that \co{hash_lock} is partitioning |
| time, thus giving each requesting CPU its own partition of time during |
| which it owns this hash table. |
| In addition, in a well-designed algorithm, there should be ample partitions |
| of time during which no CPU owns this hash table. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| ``Partitioning time''? |
| Isn't that an odd turn of phrase? |
| }\QuickQuizAnswer{ |
| Perhaps so. |
| |
| But in the next section we will be partitioning space (that is, |
| address space) as well as time. |
| This nomenclature will permit us to partition spacetime, as |
| opposed to (say) partitioning space but segmenting time. |
| }\QuickQuizEnd |
| |
| Unfortunately, code locking is particularly prone to ``\IX{lock contention}'', |
| where multiple CPUs need to acquire the lock concurrently. |
| SMP programmers who have taken care of groups of small children |
| (or groups of older people who are acting like children) will immediately |
| recognize the danger of having only one of something, |
| as illustrated in \cref{fig:SMPdesign:Lock Contention}. |
| |
| % ./test_hash_codelock.exe 1000 0/100 1 1024 1 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_codelock.exe: avg = 164.115 max = 170.388 min = 161.659 std = 3.21857 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_codelock.exe: avg = 181.17 max = 198.4 min = 162.459 std = 15.8585 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_codelock.exe: avg = 167.651 max = 189.014 min = 162.144 std = 10.6819 |
| |
| % ./test_hash_codelock.exe 1000 0/100 1 1024 2 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_codelock.exe: avg = 378.481 max = 385.971 min = 374.235 std = 4.05934 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_codelock.exe: avg = 753.414 max = 1015.28 min = 377.734 std = 294.942 |
| % ./test_hash_codelock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_codelock.exe: avg = 502.737 max = 980.924 min = 374.406 std = 239.383 |
| |
| One solution to this problem, named ``data locking'', is described |
| in the next section. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{cartoons/r-2014-Data-one-fighting}} |
| \caption{Lock Contention} |
| \ContributedBy{fig:SMPdesign:Lock Contention}{Melissa Broussard} |
| \end{figure} |
| |
| \subsection{Data Locking} |
| \label{sec:SMPdesign:Data Locking} |
| |
| Many data structures may be partitioned, |
| with each partition of the data structure having its own lock. |
| Then the \IXpl{critical section} for each part of the data structure |
| can execute in parallel, |
| although only one instance of the critical section for a given |
| part could be executing at a given time. |
| You should use data locking when contention must |
| be reduced, and where synchronization overhead is not |
| limiting speedups. |
| Data locking reduces contention by distributing the instances |
| of the overly-large critical section across multiple data structures, |
| for example, maintaining per-hash-bucket critical sections in a |
| hash table, as shown in |
| \cref{lst:SMPdesign:Data-Locking Hash Table Search}. |
| The increased scalability again results in a slight increase in complexity |
| in the form of an additional data structure, the \co{struct bucket}. |
| |
| \begin{listing} |
| \begin{VerbatimL} |
| struct hash_table |
| { |
| long nbuckets; |
| struct bucket **buckets; |
| }; |
| |
| struct bucket { |
| spinlock_t bucket_lock; |
| node_t *list_head; |
| }; |
| |
| typedef struct node { |
| unsigned long key; |
| struct node *next; |
| } node_t; |
| |
| int hash_search(struct hash_table *h, long key) |
| { |
| struct bucket *bp; |
| struct node *cur; |
| int retval; |
| |
| bp = h->buckets[key % h->nbuckets]; |
| spin_lock(&bp->bucket_lock); |
| cur = bp->list_head; |
| while (cur != NULL) { |
| if (cur->key >= key) { |
| retval = (cur->key == key); |
| spin_unlock(&bp->bucket_lock); |
| return retval; |
| } |
| cur = cur->next; |
| } |
| spin_unlock(&bp->bucket_lock); |
| return 0; |
| } |
| \end{VerbatimL} |
| \caption{Data-Locking Hash Table Search} |
| \label{lst:SMPdesign:Data-Locking Hash Table Search} |
| \end{listing} |
| |
| In contrast with the contentious situation |
| shown in \cref{fig:SMPdesign:Lock Contention}, |
| data locking helps promote harmony, as illustrated by |
| \cref{fig:SMPdesign:Data Locking}---and in parallel programs, |
| this \emph{almost} always translates into increased performance and |
| scalability. |
| For this reason, data locking was heavily used by Sequent in its |
| kernels~\cite{Beck85,Inman85,Garg90,Dove90,McKenney92b,McKenney92a,McKenney93}. |
| |
| Another way of looking at this is to think of each \co{->bucket_lock} |
| as mediating ownership not of the entire hash table as was done for code |
| locking, but only for the bucket corresponding to that \co{->bucket_lock}. |
| Each lock still partitions time, but the per-bucket-locking technique |
| also partitions the address space, so that the overall technique can be |
| said to partition spacetime. |
| If the number of buckets is large enough, this partitioning of space |
| should with high probability permit a given CPU immediate access to |
| a given hash bucket. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.4in}{!}{\includegraphics{cartoons/r-2014-Data-many-happy}} |
| \caption{Data Locking} |
| \ContributedBy{fig:SMPdesign:Data Locking}{Melissa Broussard} |
| \end{figure} |
| |
| % ./test_hash_spinlock.exe 1000 0/100 1 1024 1 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_spinlock.exe: avg = 158.118 max = 162.404 min = 156.199 std = 2.19391 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_spinlock.exe: avg = 157.717 max = 162.446 min = 156.415 std = 2.36662 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_spinlock.exe: avg = 158.369 max = 164.75 min = 156.501 std = 3.19454 |
| |
| % ./test_hash_spinlock.exe 1000 0/100 1 1024 2 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_spinlock.exe: avg = 223.426 max = 422.948 min = 167.858 std = 100.136 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_spinlock.exe: avg = 235.462 max = 507.134 min = 167.466 std = 135.836 |
| % ./test_hash_spinlock.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_spinlock.exe: avg = 305.807 max = 481.685 min = 167.939 std = 132.589 |
| |
| However, as those who have taken care of small children can again attest, |
| even providing enough to go around is no guarantee of tranquillity. |
| The analogous situation can arise in SMP programs. |
| For example, the Linux kernel maintains a cache of files and directories |
| (called ``dcache''). |
| Each entry in this cache has its own lock, but the entries corresponding |
| to the root directory and its direct descendants are much more likely to |
| be traversed than are more obscure entries. |
| This can result in many CPUs contending for the locks of these popular |
| entries, resulting in a situation not unlike that |
| shown in \cref{fig:SMPdesign:Data and Skew}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{\twocolumnwidth}{!}{\includegraphics{cartoons/r-2014-Data-many-fighting}} |
| \caption{Data Locking and Skew} |
| \ContributedBy{fig:SMPdesign:Data and Skew}{Melissa Broussard} |
| \end{figure} |
| |
In many cases, algorithms can be designed to reduce the incidence of
data skew, and in some cases eliminate it entirely
| (for example, in the Linux kernel's |
| dcache~\cite{McKenney04a,JonathanCorbet2010dcacheRCU,NeilBrown2015PathnameLookup,NeilBrown2015RCUwalk,NeilBrown2015PathnameSymlinks}). |
| Data locking is often used for partitionable data structures such as |
| hash tables, as well as in situations where multiple entities are each |
| represented by an instance of a given data structure. |
| The Linux-kernel task list is an example of the latter, each task |
| structure having its own \co{alloc_lock} and \co{pi_lock}. |
| |
| A key challenge with data locking on dynamically allocated structures |
| is ensuring that the structure remains in existence while the lock is |
| being acquired~\cite{Gamsa99}. |
| The code in |
| \cref{lst:SMPdesign:Data-Locking Hash Table Search} |
| finesses this challenge by placing the locks in the statically allocated |
| hash buckets, which are never freed. |
However, this trick would not work if the hash table were resizable,
in which case the locks would themselves be dynamically allocated.
| In this case, there would need to be some means to prevent the hash |
| bucket from being freed during the time that its lock was being acquired. |
| |
| \QuickQuiz{ |
| What are some ways of preventing a structure from being freed while |
| its lock is being acquired? |
| }\QuickQuizAnswer{ |
| Here are a few possible solutions to this |
| \emph{\IX{existence guarantee}} problem: |
| |
| \begin{enumerate} |
| \item Provide a statically allocated lock that is held while |
| the per-structure lock is being acquired, which is an |
| example of hierarchical locking (see |
| \cref{sec:SMPdesign:Hierarchical Locking}). |
| Of course, using a single global lock for this purpose |
| can result in unacceptably high levels of lock contention, |
| dramatically reducing performance and scalability. |
| \item Provide an array of statically allocated locks, hashing |
| the structure's address to select the lock to be acquired, |
| as described in \cref{chp:Locking}. |
| Given a hash function of sufficiently high quality, this |
| avoids the scalability limitations of the single global |
| lock, but in read-mostly situations, the lock-acquisition |
| overhead can result in unacceptably degraded performance. |
| \item Use a garbage collector, in software environments providing |
| them, so that a structure cannot be deallocated while being |
| referenced. |
| This works very well, removing the existence-guarantee |
| burden (and much else besides) from the developer's |
| shoulders, but imposes the overhead of garbage collection |
| on the program. |
| Although garbage-collection technology has advanced |
| considerably in the past few decades, its overhead |
| may be unacceptably high for some applications. |
| In addition, some applications require that the developer |
| exercise more control over the layout and placement of |
| data structures than is permitted by most garbage collected |
| environments. |
| \item As a special case of a garbage collector, use a global |
| reference counter, or a global array of reference counters. |
| These have strengths and limitations similar to those |
| called out above for locks. |
| \item Use \emph{\IXpl{hazard pointer}}~\cite{MagedMichael04a}, which |
| can be thought of as an inside-out reference count. |
| Hazard-pointer-based algorithms maintain a per-thread list of |
| pointers, so that the appearance of a given pointer on |
| any of these lists acts as a reference to the corresponding |
| structure. |
| Hazard pointers are starting to see significant production use |
| (see \cref{sec:defer:Production Uses of Hazard Pointers}). |
| \item Use transactional memory |
| (TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95}, |
| so that each reference and |
| modification to the data structure in question is |
| performed atomically. |
| Although TM has engendered much excitement in recent years, |
| and seems likely to be of some use in production software, |
| developers should exercise some |
| caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM}, |
| particularly in performance-critical code. |
| In particular, existence guarantees require that the |
| transaction covers the full path from a global reference |
| to the data elements being updated. |
| For more on TM, including ways to overcome some of its |
| weaknesses by combining it with other synchronization |
| mechanisms, see |
| \cref{sec:future:Transactional Memory,sec:future:Hardware Transactional Memory}. |
| \item Use RCU, which can be thought of as an extremely lightweight |
| approximation to a garbage collector. |
| Updaters are not permitted to free RCU-protected |
| data structures that RCU readers might still be referencing. |
| RCU is most heavily used for read-mostly data structures, |
| and is discussed at length in |
| \cref{sec:defer:Read-Copy Update (RCU)}. |
| \end{enumerate} |
| |
| For more on providing existence guarantees, see |
| \cref{chp:Locking,chp:Deferred Processing}. |
| }\QuickQuizEnd |
| |
| \subsection{Data Ownership} |
| \label{sec:SMPdesign:Data Ownership} |
| |
| Data ownership partitions a given data structure over the threads |
| or CPUs, so that each thread/CPU accesses its subset of the data |
| structure without any synchronization overhead whatsoever. |
| However, if one thread wishes to access some other thread's data, |
| the first thread is unable to do so directly. |
| Instead, the first thread must communicate with the second thread, |
| so that the second thread performs the operation on behalf of the |
| first, or, alternatively, migrates the data to the first thread. |
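
For a rough sketch of data ownership in action, see
\cref{lst:SMPdesign:Data-Ownership Sketch}.
This sketch is not taken from this book's \path{CodeSamples} tree,
and its names and sizes are arbitrary.
Each thread exclusively owns one element of \co{results[]}, so no
synchronization is required while the threads run, and each
\co{pthread_join()} orders the corresponding owner's final update
before the parent thread's read.

\begin{listing}
\begin{VerbatimL}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Slot t is owned solely by thread t. */
unsigned long results[NTHREADS];

void *worker(void *arg)
{
        long t = (long)arg;
        unsigned long i;

        /* Only thread t touches results[t], so
         * plain C-language accesses suffice. */
        for (i = 0; i < 1000 * 1000; i++)
                results[t] += i % (t + 2);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long t;

        for (t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL,
                               worker, (void *)t);
        for (t = 0; t < NTHREADS; t++) {
                /* Joining hands ownership of
                 * results[t] back to the parent. */
                pthread_join(tid[t], NULL);
                printf("%ld: %lu\n", t, results[t]);
        }
        return 0;
}
\end{VerbatimL}
\caption{Data-Ownership Sketch}
\label{lst:SMPdesign:Data-Ownership Sketch}
\end{listing}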
| |
| Data ownership might seem arcane, but it is used very frequently: |
| \begin{enumerate} |
\item Any variables accessible by only one CPU or thread
(such as {\tt auto} variables in C
and C++) are owned by that CPU or thread.
| \item An instance of a user interface owns the corresponding |
| user's context. |
| It is very common for applications interacting with parallel |
| database engines to be written as if they were entirely |
| sequential programs. |
Such applications own the user interface and the user's current
action.
| Explicit parallelism is thus confined to the |
| database engine itself. |
| \item Parametric simulations are often trivially parallelized |
| by granting each thread ownership of a particular region |
| of the parameter space. |
| There are also computing frameworks designed for this |
| type of problem~\cite{BOINC2008}. |
| \end{enumerate} |
| |
| If there is significant sharing, communication between the threads |
| or CPUs can result in significant complexity and overhead. |
| Furthermore, if the most-heavily used data happens to be that owned |
| by a single CPU, that CPU will be a ``\IX{hot spot}'', sometimes with |
| results resembling that shown in \cref{fig:SMPdesign:Data and Skew}. |
| However, in situations where no sharing is required, data ownership |
| achieves ideal performance, and with code that can be as simple |
| as the sequential-program case shown in |
| \cref{lst:SMPdesign:Sequential-Program Hash Table Search}. |
| Such situations are often referred to as ``\IX{embarrassingly |
| parallel}'', and, in the best case, resemble the situation |
| previously shown in \cref{fig:SMPdesign:Data Locking}. |
| |
| % ./test_hash_null.exe 1000 0/100 1 1024 1 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 96.2913 max = 98.2337 min = 90.4095 std = 2.95314 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 91.5592 max = 97.3315 min = 89.9885 std = 2.88925 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 1 |
| % ./test_hash_null.exe: avg = 93.3568 max = 106.162 min = 89.8828 std = 6.40418 |
| |
| % ./test_hash_null.exe 1000 0/100 1 1024 2 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_null.exe: avg = 45.4526 max = 46.4281 min = 45.1954 std = 0.487791 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_null.exe: avg = 46.0238 max = 49.2861 min = 45.1852 std = 1.63127 |
| % ./test_hash_null.exe: nmilli: 1000 update/total: 0/100 nelements: 1 nbuckets: 1024 nthreads: 2 |
| % ./test_hash_null.exe: avg = 46.6858 max = 52.6278 min = 45.1761 std = 2.97102 |
| |
| Another important instance of data ownership occurs when the data |
| is read-only, in which case, |
| all threads can ``own'' it via replication. |
| |
Where data locking partitions both the address space (with one
hash bucket per partition) and time (using per-bucket locks), data
ownership partitions only the address space.
Data ownership need not partition time because a
given thread or CPU is assigned permanent ownership of a given
address-space partition.
| |
| \QuickQuiz{ |
| But won't system boot and shutdown (or application startup |
| and shutdown) be partitioning time, even for data ownership? |
| }\QuickQuizAnswer{ |
| You can indeed think in these terms. |
| |
| And if you are working on a persistent data store where state |
| survives shutdown, thinking in these terms might even be useful. |
| }\QuickQuizEnd |
| |
| Data ownership will be presented in more detail in |
| \cref{chp:Data Ownership}. |
| |
| \subsection{Locking Granularity and Performance} |
| \label{sec:SMPdesign:Locking Granularity and Performance} |
| |
| This section looks at locking granularity and performance from |
| a mathematical synchronization\-/efficiency viewpoint. |
| Readers who are uninspired by mathematics might choose to skip |
| this section. |
| |
The approach is to use a crude queueing model for the \IX{efficiency} of
synchronization mechanisms that operate on a single shared global
variable, based on an M/M/1 queue.
M/M/1 queueing models are based on an exponentially distributed
``inter-arrival rate'' $\lambda$ and an exponentially distributed
``service rate'' $\mu$.
| The inter-arrival rate $\lambda$ can be thought of as the average |
| number of synchronization operations per second that the system |
| would process if the synchronization were free, in other words, |
| $\lambda$ is an inverse measure of the overhead of each non-synchronization |
| unit of work. |
| For example, if each unit of work was a transaction, and if each transaction |
| took one millisecond to process, excluding synchronization overhead, |
| then $\lambda$ would be 1,000~transactions per second. |
| |
The service rate $\mu$ is defined similarly, but for the average
number of synchronization operations per second that the system
would process if the overhead of each transaction were zero, and
ignoring the fact that CPUs must wait on each other to complete
their synchronization operations.
In other words, $\mu$ is an inverse measure of the per-operation
synchronization overhead in the absence of contention.
| For example, suppose that each transaction's synchronization operation |
| involves an atomic increment instruction, and that a computer system is |
| able to do a private-variable atomic increment every 5~nanoseconds on |
| each CPU |
| (see \cref{fig:count:Atomic Increment Scalability on x86}).\footnote{ |
Of course, if there are 8 CPUs all incrementing the same
shared variable, then each CPU must wait at least 35~nanoseconds
for the other seven CPUs to each do its increment before consuming
an additional 5~nanoseconds doing its own increment.
| In fact, the wait will be longer due to the need |
| to move the variable from one CPU to another.} |
| The value of $\mu$ is therefore about 200,000,000 atomic increments |
| per second. |
| |
| Of course, the value of $\lambda$ increases as increasing numbers of CPUs |
| increment a shared variable because each CPU is capable of processing |
| transactions independently (again, ignoring synchronization): |
| |
| \begin{equation} |
| \lambda = n \lambda_0 |
| \end{equation} |
| |
| Here, $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing |
| capability of a single CPU\@. |
| Note that the expected time for a single CPU to execute a single transaction |
| in the absence of contention is $1 / \lambda_0$. |
| |
| Because the CPUs have to ``wait in line'' behind each other to get their |
| chance to increment the single shared variable, we can use the M/M/1 |
| queueing-model expression for the expected total waiting time: |
| |
| \begin{equation} |
| T = \frac{1}{\mu - \lambda} |
| \end{equation} |
| |
| Substituting the above value of $\lambda$: |
| |
| \begin{equation} |
| T = \frac{1}{\mu - n \lambda_0} |
| \end{equation} |
| |
Now, the efficiency is just the ratio of the time required to process
a transaction in the absence of synchronization ($1 / \lambda_0$)
to the time required including synchronization ($T + 1 / \lambda_0$):
| |
| \begin{equation} |
| e = \frac{1 / \lambda_0}{T + 1 / \lambda_0} |
| \end{equation} |
| |
| Substituting the above value for $T$ and simplifying: |
| |
| \begin{equation} |
| e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)} |
| \end{equation} |
| |
| But the value of $\mu / \lambda_0$ is just the ratio of the time required |
| to process the transaction (absent synchronization overhead) to that of |
| the synchronization overhead itself (absent contention). |
| If we call this ratio $f$, we have: |
| |
| \begin{equation} |
| e = \frac{f - n}{f - (n - 1)} |
| \end{equation} |
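
\Cref{lst:SMPdesign:Synchronization-Efficiency Tabulation} shows a
rough sketch (which is not part of this book's \path{CodeSamples}
tree) that tabulates $e$ for several values of $f$ and $n$\@.
Readers wishing to experiment with this model can easily modify it
to explore other values.

\begin{listing}
\begin{VerbatimL}
#include <stdio.h>

int main(void)
{
        double f[] = { 10., 25., 50., 75., 100. };
        int i, n;

        /* Tabulate e = (f - n) / (f - (n - 1)). */
        for (n = 1; n <= 10; n++) {
                printf("n = %2d:", n);
                for (i = 0; i < 5; i++)
                        printf(" %5.3f",
                               (f[i] - n) /
                               (f[i] - (n - 1)));
                printf("\n");
        }
        return 0;
}
\end{VerbatimL}
\caption{Synchronization-Efficiency Tabulation}
\label{lst:SMPdesign:Synchronization-Efficiency Tabulation}
\end{listing}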
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{SMPdesign/synceff}} |
| \caption{Synchronization Efficiency} |
| \label{fig:SMPdesign:Synchronization Efficiency} |
| \end{figure} |
| |
| \Cref{fig:SMPdesign:Synchronization Efficiency} plots the synchronization |
| efficiency $e$ as a function of the number of CPUs/threads $n$ for |
| a few values of the overhead ratio $f$. |
| For example, again using the 5-nanosecond atomic increment, the $f=10$ |
| line corresponds to each CPU attempting an atomic increment every |
| 50~nanoseconds, and the $f=100$ line corresponds to each CPU attempting |
| an atomic increment every 500~nanoseconds, which in turn corresponds to |
| some hundreds (perhaps thousands) of instructions. |
| Given that each trace drops off sharply with increasing numbers of |
| CPUs or threads, we can conclude that |
| synchronization mechanisms based on |
| atomic manipulation of a single global shared variable will not |
| scale well if used heavily on current commodity hardware. |
| This is an abstract mathematical depiction of the forces leading |
| to the parallel counting algorithms that were discussed in |
| \cref{chp:Counting}. |
| Your real-world mileage may differ. |
| |
Nevertheless, the concept of efficiency is useful, even in cases
having little or no formal synchronization.
| Consider for example a matrix multiply, in which the columns of one |
| matrix are multiplied (via ``dot product'') by the rows of another, |
| resulting in an entry in a third matrix. |
| Because none of these operations conflict, it is possible to partition |
| the columns of the first matrix among a group of threads, with each thread |
| computing the corresponding columns of the result matrix. |
| The threads can therefore operate entirely independently, with no |
| synchronization overhead whatsoever, as is done in |
| \path{matmul.c}. |
| One might therefore expect a perfect efficiency of 1.0. |
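
\Cref{lst:SMPdesign:Column-Partitioned Matrix Multiply Sketch}
shows a rough sketch of this approach.
This sketch is not \path{matmul.c}, nor is it taken from this book's
\path{CodeSamples} tree, and its names and sizes are arbitrary.
Each worker thread owns a contiguous range of columns of the result
matrix, so the only synchronization is thread creation and the final
\co{pthread_join()} invocations.

\begin{listing}
\begin{VerbatimL}
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 256           /* Matrix dimension. */
#define NTHREADS 4      /* Must divide N evenly. */

double a[N][N], b[N][N], c[N][N];

struct chunk {
        int firstcol;   /* First owned column of c. */
        int lastcol;    /* One past last owned column. */
};

void *mm_worker(void *arg)
{
        struct chunk *cp = arg;
        int i, j, k;

        /* This thread writes only its own columns
         * of c[][], so no locking is needed. */
        for (j = cp->firstcol; j < cp->lastcol; j++)
                for (i = 0; i < N; i++) {
                        double sum = 0.0;

                        for (k = 0; k < N; k++)
                                sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        struct chunk ch[NTHREADS];
        int i, j, t;

        for (i = 0; i < N; i++)
                for (j = 0; j < N; j++) {
                        a[i][j] = 1.0;
                        b[i][j] = 2.0;
                }
        for (t = 0; t < NTHREADS; t++) {
                ch[t].firstcol = t * (N / NTHREADS);
                ch[t].lastcol = (t + 1) * (N / NTHREADS);
                if (pthread_create(&tid[t], NULL,
                                   mm_worker, &ch[t]))
                        abort();
        }
        for (t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        printf("c[0][0] = %g (expect %g)\n",
               c[0][0], 2.0 * N);
        return 0;
}
\end{VerbatimL}
\caption{Column-Partitioned Matrix Multiply Sketch}
\label{lst:SMPdesign:Column-Partitioned Matrix Multiply Sketch}
\end{listing}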
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/data/hps.2020.03.30a/matmuleff}} |
| \caption{Matrix Multiply Efficiency} |
| \label{fig:SMPdesign:Matrix Multiply Efficiency} |
| \end{figure} |
| |
| However, |
| \cref{fig:SMPdesign:Matrix Multiply Efficiency} |
| tells a different story, especially for a 64-by-64 matrix multiply, |
| which never gets above an efficiency of about 0.3, even when running |
| single-threaded, and drops sharply as more threads are added.\footnote{ |
| In contrast to the smooth traces of |
| \cref{fig:SMPdesign:Synchronization Efficiency}, |
| the wide error bars and jagged traces of |
| \cref{fig:SMPdesign:Matrix Multiply Efficiency} |
give evidence of its real-world nature.}
| The 128-by-128 matrix does better, but still fails to demonstrate |
| much performance increase with added threads. |
| The 256-by-256 matrix does scale reasonably well, but only up to a handful |
| of CPUs. |
| The 512-by-512 matrix multiply's efficiency is measurably less |
| than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix |
| multiply deviates noticeably from perfection at a few tens of threads. |
| Nevertheless, this figure clearly demonstrates the performance and |
| scalability benefits of batching: |
| If you must incur synchronization overhead, you may as well get your |
| money's worth, which is the solution to the problem of deciding on |
| granularity of synchronization put forth on |
| \cpageref{sec:SMPdesign:Problems Granularity}. |
| |
| \QuickQuiz{ |
How can a single-threaded 64-by-64 matrix multiply possibly
| have an efficiency of less than 1.0? |
| Shouldn't all of the traces in |
| \cref{fig:SMPdesign:Matrix Multiply Efficiency} |
| have efficiency of exactly 1.0 when running on one thread? |
| }\QuickQuizAnswer{ |
| The \path{matmul.c} program creates the specified number of |
| worker threads, so even the single-worker-thread case incurs |
| thread-creation overhead. |
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise for the reader.
| }\QuickQuizEnd |
| |
| Given these inefficiencies, |
| it is worthwhile to look into more-scalable approaches |
| such as the data locking described in |
| \cref{sec:SMPdesign:Data Locking} |
| or the parallel-fastpath approach discussed in the next section. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How are data-parallel techniques going to help with matrix |
| multiply? |
| It is \emph{already} data parallel!!! |
| }\QuickQuizAnswerB{ |
| I am glad that you are paying attention! |
| This example serves to show that although data parallelism can |
| be a very good thing, it is not some magic wand that automatically |
| wards off any and all sources of inefficiency. |
| Linear scaling at full performance, even to ``only'' 64 threads, |
| requires care at all phases of design and implementation. |
| |
| In particular, you need to pay careful attention to the |
| size of the partitions. |
| For example, if you split a 64-by-64 matrix multiply across |
| 64 threads, each thread gets only 64 floating-point multiplies. |
| The cost of a floating-point multiply is minuscule compared to |
| the overhead of thread creation, and cache-miss overhead |
| also plays a role in spoiling the theoretically perfect scalability |
| (and also in making the traces so jagged). |
| The full 448~hardware threads would require a matrix with |
| hundreds of thousands of rows and columns to attain good |
| scalability, but by that point GPGPUs become quite attractive, |
| especially from a price/performance viewpoint. |
| |
| Moral: |
| If you have a parallel program with variable input, |
| always include a check for the input size being too small to |
| be worth parallelizing. |
| And when it is not helpful to parallelize, it is not helpful |
| to incur the overhead required to spawn a thread, now is it? |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| What did you do to validate this matrix multiply algorithm? |
| }\QuickQuizAnswerE{ |
| For this simple approach, very little. |
| |
| However, the validation of production-quality matrix multiply |
| requires great care and attention. |
| Some cases require careful handling of floating-point rounding |
| errors, others involve complex sparse-matrix data structures, |
| and still others make use of special-purpose arithmetic hardware |
| such as vector units or GPGPUs. |
| Adequate tests for handling of floating-point rounding errors |
| can be especially challenging. |
| }\QuickQuizEndE |
| } |
| |
| \section{Parallel Fastpath} |
| \label{sec:SMPdesign:Parallel Fastpath} |
| % |
| \epigraph{There are two ways of meeting difficulties: |
| You alter the difficulties, or you alter yourself to meet them.} |
| {Phyllis Bottome} |
| |
| Fine-grained (and therefore \emph{usually} higher-performance) |
| designs are typically more complex than are coarser-grained designs. |
| In many cases, most of the overhead is incurred by a small fraction |
| of the code~\cite{Knuth73}. |
| So why not focus effort on that small fraction? |
| |
This is the idea behind the parallel-fastpath design pattern:
Aggressively parallelize the common-case code path without incurring the
complexity that would be required to aggressively parallelize the entire
algorithm.
| You must understand not only the specific algorithm you wish |
| to parallelize, but also the workload that the algorithm will |
| be subjected to. |
Great creativity and design effort are often required to construct
a parallel fastpath.
| |
| Parallel fastpath combines different patterns (one for the |
| fastpath, one elsewhere) and is therefore a template pattern. |
| The following instances of parallel |
| fastpath occur often enough to warrant their own patterns, |
| as depicted in \cref{fig:SMPdesign:Parallel-Fastpath Design Patterns}: |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.3in}{!}{\includegraphics{SMPdesign/ParallelFastpath}} |
| % \includegraphics{SMPdesign/ParallelFastpath} |
| \caption{Parallel-Fastpath Design Patterns} |
| \label{fig:SMPdesign:Parallel-Fastpath Design Patterns} |
| \end{figure} |
| |
| \begin{enumerate} |
| \item Reader/Writer Locking |
| (described below in \cref{sec:SMPdesign:Reader/Writer Locking}). |
| \item Read-copy update (RCU), which may be used as a high-performance |
| replacement for reader/writer locking, is introduced in |
| \cref{sec:defer:Read-Copy Update (RCU)}. |
| Other alternatives include hazard pointers |
| (\cref{sec:defer:Hazard Pointers}) |
| and sequence locking (\cref{sec:defer:Sequence Locks}). |
| These alternatives will not be discussed further in this chapter. |
| \item Hierarchical Locking~(\cite{McKenney95b}), which is touched upon |
| in \cref{sec:SMPdesign:Hierarchical Locking}. |
| \item Resource Allocator Caches~(\cite{McKenney95b,McKenney93}). |
| See \cref{sec:SMPdesign:Resource Allocator Caches} |
| for more detail. |
| \end{enumerate} |
| |
| \subsection{Reader/Writer Locking} |
| \label{sec:SMPdesign:Reader/Writer Locking} |
| |
| If synchronization overhead is negligible (for example, if the program |
| uses coarse-grained parallelism with large \IXpl{critical section}), and if |
| only a small fraction of the critical sections modify data, then allowing |
| multiple readers to proceed in parallel can greatly increase scalability. |
| Writers exclude both readers and each other. |
| There are many implementations of reader-writer locking, including |
| the POSIX implementation described in |
| \cref{sec:toolsoftrade:POSIX Reader-Writer Locking}. |
| \Cref{lst:SMPdesign:Reader-Writer-Locking Hash Table Search} |
| shows how the hash search might be implemented using reader-writer locking. |
| |
| \begin{listing} |
| \begin{VerbatimL}[commandchars=\\\@\$] |
| rwlock_t hash_lock; |
| |
| struct hash_table |
| { |
| long nbuckets; |
| struct node **buckets; |
| }; |
| |
| typedef struct node { |
| unsigned long key; |
| struct node *next; |
| } node_t; |
| |
| int hash_search(struct hash_table *h, long key) |
| { |
| struct node *cur; |
| int retval; |
| |
| read_lock(&hash_lock); |
| cur = h->buckets[key % h->nbuckets]; |
| while (cur != NULL) { |
| if (cur->key >= key) { |
| retval = (cur->key == key); |
| read_unlock(&hash_lock); |
| return retval; |
| } |
| cur = cur->next; |
| } |
| read_unlock(&hash_lock); |
| return 0; |
| } |
| \end{VerbatimL} |
| \caption{Reader-Writer-Locking Hash Table Search} |
| \label{lst:SMPdesign:Reader-Writer-Locking Hash Table Search} |
| \end{listing} |
| |
| Reader/writer locking is a simple instance of asymmetric locking. |
| Snaman~\cite{Snaman87} describes a more ornate six-mode |
| asymmetric locking design used in several clustered systems. |
| Locking in general and reader-writer locking in particular is described |
| extensively in |
| \cref{chp:Locking}. |
| |
| \subsection{Hierarchical Locking} |
| \label{sec:SMPdesign:Hierarchical Locking} |
| |
| The idea behind hierarchical locking is to have a coarse-grained lock |
| that is held only long enough to work out which fine-grained lock |
| to acquire. |
| \Cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search} |
| shows how our hash-table search might be adapted to do hierarchical |
| locking, but also shows the great weakness of this approach: |
| We have paid the overhead of acquiring a second lock, but we only |
| hold it for a short time. |
| In this case, the data-locking approach would be simpler |
| and likely perform better. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:SMPdesign:Hierarchical-Locking Hash Table Search] |
| \begin{VerbatimL}[commandchars=\\\@\$] |
| struct hash_table |
| { |
| long nbuckets; |
| struct bucket **buckets; |
| }; |
| |
| struct bucket { |
| spinlock_t bucket_lock; |
| node_t *list_head; |
| }; |
| |
| typedef struct node { |
| spinlock_t node_lock; |
| unsigned long key; |
| struct node *next; |
| } node_t; |
| |
| int hash_search(struct hash_table *h, long key) |
| { |
| struct bucket *bp; |
| struct node *cur; |
| int retval; |
| |
| bp = h->buckets[key % h->nbuckets]; |
| spin_lock(&bp->bucket_lock); |
| cur = bp->list_head; |
| while (cur != NULL) { |
| if (cur->key >= key) { |
| spin_lock(&cur->node_lock); |
| spin_unlock(&bp->bucket_lock); |
| retval = (cur->key == key);\lnlbl@retval$ |
| spin_unlock(&cur->node_lock); |
| return retval; |
| } |
| cur = cur->next; |
| } |
| spin_unlock(&bp->bucket_lock); |
| return 0; |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Hierarchical-Locking Hash Table Search} |
| \label{lst:SMPdesign:Hierarchical-Locking Hash Table Search} |
| \end{listing} |
| |
| \QuickQuiz{ |
| In what situation would hierarchical locking work well? |
| }\QuickQuizAnswer{ |
| If the comparison on |
| \clnrefr{ln:SMPdesign:Hierarchical-Locking Hash Table Search:retval} of |
| \cref{lst:SMPdesign:Hierarchical-Locking Hash Table Search} |
| were replaced by a much heavier-weight operation, |
| then releasing \co{bp->bucket_lock} \emph{might} reduce lock |
| contention enough to outweigh the overhead of the extra |
| acquisition and release of \co{cur->node_lock}. |
| }\QuickQuizEnd |
| |
| \subsection{Resource Allocator Caches} |
| \label{sec:SMPdesign:Resource Allocator Caches} |
| |
| This section presents a simplified schematic of a parallel fixed-block-size |
| memory allocator. |
| More detailed descriptions may be found in the |
| literature~\cite{McKenney92a,McKenney93,Bonwick01slab,McKenney01e,JasonEvans2011jemalloc,ChrisKennelly2020tcmalloc} |
| or in the Linux kernel~\cite{Torvalds2.6kernel}. |
| |
| \subsubsection{Parallel Resource Allocation Problem} |
| |
The basic problem facing a parallel memory allocator is the tension
between the need to provide extremely fast memory allocation and
freeing in the common case and the need to efficiently distribute
memory in the face of unfavorable allocation and freeing patterns.
| |
| To see this tension, consider a straightforward application of |
| data ownership to this problem---simply carve up memory so that |
| each CPU owns its share. |
| For example, suppose that a system with 12~CPUs has 64~gigabytes |
| of memory, for example, the laptop I am using right now. |
| We could simply assign each CPU a five-gigabyte region of memory, |
| and allow each CPU to allocate from its own region, without the need |
| for locking and its complexities and overheads. |
| Unfortunately, this scheme fails when CPU~0 only allocates memory and |
| CPU~1 only frees it, as happens in simple producer-consumer workloads. |
| |
| The other extreme, \IXh{code}{locking}, suffers from excessive |
| \IX{lock contention} |
| and overhead~\cite{McKenney93}. |
| |
| \subsubsection{Parallel Fastpath for Resource Allocation} |
| \label{sec:SMPdesign:Parallel Fastpath for Resource Allocation} |
| |
| The commonly used solution uses parallel fastpath with each CPU |
| owning a modest cache of blocks, and with a large code-locked |
| shared pool for additional blocks. |
| To prevent any given CPU from monopolizing the memory blocks, |
| we place a limit on the number of blocks that can be in each CPU's |
| cache. |
| In a two-CPU system, the flow of memory blocks will be as shown |
| in \cref{fig:SMPdesign:Allocator Cache Schematic}: |
When a given CPU tries to free a block while its pool is full,
it sends blocks to the global pool, and, similarly, when that CPU
tries to allocate a block while its pool is empty, it retrieves
blocks from the global pool.
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/allocatorcache}} |
| \caption{Allocator Cache Schematic} |
| \label{fig:SMPdesign:Allocator Cache Schematic} |
| \end{figure} |
| |
| \subsubsection{Data Structures} |
| |
| The actual data structures for a ``toy'' implementation of allocator |
| caches are shown in |
| \cref{lst:SMPdesign:Allocator-Cache Data Structures} |
| (``\path{smpalloc.c}''). |
| The ``Global Pool'' of \cref{fig:SMPdesign:Allocator Cache Schematic} |
| is implemented by \co{globalmem} of type \co{struct globalmempool}, |
| and the two CPU pools by the per-thread variable \co{perthreadmem} of |
| type \co{struct perthreadmempool}. |
| Both of these data structures have arrays of pointers to blocks |
| in their \co{pool} fields, which are filled from index zero upwards. |
| Thus, if \co{globalmem.pool[3]} is \co{NULL}, then the remainder of |
| the array from index 4 up must also be \co{NULL}. |
| The \co{cur} fields contain the index of the highest-numbered full |
| element of the \co{pool} array, or $-1$ if all elements are empty. |
| All elements from \co{globalmem.pool[0]} through |
| \co{globalmem.pool[globalmem.cur]} must be full, and all the rest |
| must be empty.\footnote{ |
| Both pool sizes (\co{TARGET_POOL_SIZE} and |
| \co{GLOBAL_POOL_SIZE}) are unrealistically small, but this small |
| size makes it easier to single-step the program in order to get |
| a feel for its operation.} |
| |
| \begin{listing} |
| \input{CodeSamples/SMPdesign/smpalloc@data_struct.fcv} |
| \caption{Allocator-Cache Data Structures} |
| \label{lst:SMPdesign:Allocator-Cache Data Structures} |
| \end{listing} |
| |
| The operation of the pool data structures is illustrated by |
| \cref{fig:SMPdesign:Allocator Pool Schematic}, |
| with the six boxes representing the array of pointers making up |
| the \co{pool} field, and the number preceding them representing |
| the \co{cur} field. |
| The shaded boxes represent non-\co{NULL} pointers, while the empty |
| boxes represent \co{NULL} pointers. |
| An important, though potentially confusing, invariant of this |
| data structure is that the \co{cur} field is always one |
| smaller than the number of non-\co{NULL} pointers. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.6in}{!}{\includegraphics{SMPdesign/AllocatorPool}} |
| \caption{Allocator Pool Schematic} |
| \label{fig:SMPdesign:Allocator Pool Schematic} |
| \end{figure} |
| |
| \subsubsection{Allocation Function} |
| |
| \begin{fcvref}[ln:SMPdesign:smpalloc:alloc] |
| The allocation function \co{memblock_alloc()} may be seen in |
| \cref{lst:SMPdesign:Allocator-Cache Allocator Function}. |
| \Clnref{pick} picks up the current thread's per-thread pool, |
| and \clnref{chk:empty} checks to see if it is empty. |
| |
| If so, \clnrefrange{ack}{rel} attempt to refill it |
| from the global pool |
| under the spinlock acquired on \clnref{ack} and released on \clnref{rel}. |
| \Clnrefrange{loop:b}{loop:e} move blocks from the global |
| to the per-thread pool until |
| either the local pool reaches its target size (half full) or |
| the global pool is exhausted, and \clnref{set} sets the per-thread pool's |
| count to the proper value. |
| |
| In either case, \clnref{chk:notempty} checks for the per-thread |
| pool still being |
| empty, and if not, \clnrefrange{rem:b}{rem:e} remove a block and return it. |
| Otherwise, \clnref{ret:NULL} tells the sad tale of memory exhaustion. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/SMPdesign/smpalloc@alloc.fcv} |
| \caption{Allocator-Cache Allocator Function} |
| \label{lst:SMPdesign:Allocator-Cache Allocator Function} |
| \end{listing} |
| |
| \subsubsection{Free Function} |
| |
| \begin{fcvref}[ln:SMPdesign:smpalloc:free] |
| \Cref{lst:SMPdesign:Allocator-Cache Free Function} shows |
| the memory-block free function. |
| \Clnref{get} gets a pointer to this thread's pool, and |
| \clnref{chk:full} checks to see if this per-thread pool is full. |
| |
| If so, \clnrefrange{acq}{empty:e} empty half of the per-thread pool |
| into the global pool, |
| with \clnref{acq,rel} acquiring and releasing the spinlock. |
| \Clnrefrange{loop:b}{loop:e} implement the loop moving blocks |
| from the local to the |
| global pool, and \clnref{set} sets the per-thread pool's count to the proper |
| value. |
| |
| In either case, \clnref{place} then places the newly freed block into the |
| per-thread pool. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/SMPdesign/smpalloc@free.fcv} |
| \caption{Allocator-Cache Free Function} |
| \label{lst:SMPdesign:Allocator-Cache Free Function} |
| \end{listing} |
| |
| \QuickQuiz{ |
| Doesn't this resource-allocator design resemble that of |
| the approximate limit counters covered in |
| \cref{sec:count:Approximate Limit Counters}? |
| }\QuickQuizAnswer{ |
| Indeed it does! |
| We are used to thinking of allocating and freeing memory, |
| but the algorithms in |
| \cref{sec:count:Approximate Limit Counters} |
| are taking very similar actions to allocate and free |
| ``count''. |
| }\QuickQuizEnd |
| |
| \subsubsection{Performance} |
| |
| Rough performance results\footnote{ |
| This data was not collected in a statistically meaningful way, |
| and therefore should be viewed with great skepticism and suspicion. |
| Good data-collection and -reduction practice is discussed |
| in \cref{chp:Validation}. |
| That said, repeated runs gave similar results, and these results |
| match more careful evaluations of similar algorithms.} |
| are shown in |
| \cref{fig:SMPdesign:Allocator Cache Performance}, |
collected on a dual-core Intel x86 running at 1\,GHz (4300 bogomips per CPU)
| with at most six blocks allowed in each CPU's cache. |
| In this micro-benchmark, |
| each thread repeatedly allocates a group of blocks and then frees all |
| the blocks in that group, with |
| the number of blocks in the group being the ``allocation run length'' |
| displayed on the x-axis. |
| The y-axis shows the number of successful allocation/free pairs per |
| microsecond---failed allocations are not counted. |
| The ``X''s are from a two-thread run, while the ``+''s are from a |
| single-threaded run. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/SMPdesign/smpalloc}} |
| \caption{Allocator Cache Performance} |
| \label{fig:SMPdesign:Allocator Cache Performance} |
| \end{figure} |
| |
| Note that run lengths up to six scale linearly and give excellent performance, |
| while run lengths greater than six show poor performance and almost always |
| also show \emph{negative} scaling. |
| It is therefore quite important to size \co{TARGET_POOL_SIZE} |
| sufficiently large, |
| which fortunately is usually quite easy to do in actual |
| practice~\cite{McKenney01e}, especially given today's large memories. |
| For example, in most systems, it is quite reasonable to set |
| \co{TARGET_POOL_SIZE} to 100, in which case allocations and frees |
| are guaranteed to be confined to per-thread pools at least 99\pct\ of |
| the time. |
| |
As can be seen from the figure, the situations where common-case
data ownership applies (run lengths up to six) provide greatly improved
performance compared to the cases where locks must be acquired.
| Avoiding synchronization in the common case will be a recurring theme through |
| this book. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| In \cref{fig:SMPdesign:Allocator Cache Performance}, |
| there is a pattern of performance rising with increasing run |
| length in groups of three samples, for example, for run lengths |
| 10, 11, and 12. |
| Why? |
| }\QuickQuizAnswerB{ |
| This is due to the per-CPU target value being three. |
| A run length of 12 must acquire the global-pool lock twice, |
| while a run length of 13 must acquire the global-pool lock |
| three times. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Allocation failures were observed in the two-thread |
| tests at run lengths of 19 and greater. |
| Given the global-pool size of 40 and the per-thread target |
| pool size $s$ of three, number of threads $n$ equal to two, |
| and assuming that the per-thread pools are initially |
| empty with none of the memory in use, what is the smallest allocation |
| run length $m$ at which failures can occur? |
| (Recall that each thread repeatedly allocates $m$ blocks of memory, |
| and then frees the $m$ blocks of memory.) |
| Alternatively, given $n$ threads each with pool size $s$, and |
| where each thread repeatedly first allocates $m$ blocks of memory |
| and then frees those $m$ blocks, how large must the global pool |
| size be? |
| \emph{Note:} |
| Obtaining the correct answer will require you to |
| examine the \path{smpalloc.c} source code, and very likely |
| single-step it as well. |
| You have been warned! |
| }\QuickQuizAnswerE{ |
| This solution is adapted from one put forward by Alexey Roytman. |
| It is based on the following definitions: |
| |
| \begin{description} |
| \item[$g$] Number of blocks globally available. |
| \item[$i$] Number of blocks left in the initializing thread's |
| per-thread pool. |
| (This is one reason you needed to look at the code!) |
| \item[$m$] Allocation/free run length. |
| \item[$n$] Number of threads, excluding the initialization thread. |
| \item[$p$] Per-thread maximum block consumption, including |
| both the blocks actually allocated and the blocks |
| remaining in the per-thread pool. |
| \end{description} |
| |
| The values $g$, $m$, and $n$ are given. |
| The value for $p$ is $m$ rounded up to the next multiple of $s$, |
| as follows: |
| |
| \begin{equation} |
| p = s \left \lceil \frac{m}{s} \right \rceil |
| \label{sec:SMPdesign:p} |
| \end{equation} |
| |
| The value for $i$ is as follows: |
| |
| \begin{equation} |
| i = \left \{ |
| \begin{array}{l} |
| g \pmod{2 s} = 0: 2 s \\ |
| g \pmod{2 s} \ne 0: g \pmod{2 s} |
| \end{array} |
| \right . |
| \label{sec:SMPdesign:i} |
| \end{equation} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/smpalloclim}} |
| \caption{Allocator Cache Run-Length Analysis} |
| \label{fig:SMPdesign:Allocator Cache Run-Length Analysis} |
| \end{figure} |
| |
| The relationships between these quantities are shown in |
| \cref{fig:SMPdesign:Allocator Cache Run-Length Analysis}. |
| The global pool is shown on the top of this figure, and |
| the ``extra'' initializer thread's per-thread pool and |
| per-thread allocations are the left-most pair of boxes. |
| The initializer thread has no blocks allocated, but has |
| $i$ blocks stranded in its per-thread pool. |
| The rightmost two pairs of boxes are the per-thread pools and |
| per-thread allocations of threads holding the maximum possible |
| number of blocks, while the second-from-left pair of boxes |
| represents the thread currently trying to allocate. |
| |
| The total number of blocks is $g$, and adding up the per-thread |
| allocations and per-thread pools, we see that the global pool |
| contains $g-i-p(n-1)$ blocks. |
| If the allocating thread is to be successful, it needs at least |
| $m$ blocks in the global pool, in other words: |
| |
| \begin{equation} |
| g - i - p(n - 1) \ge m |
| \label{sec:SMPdesign:g-vs-m} |
| \end{equation} |
| |
| The question has $g=40$, $s=3$, and $n=2$. |
| \Cref{sec:SMPdesign:i} gives $i=4$, and |
| \cref{sec:SMPdesign:p} gives $p=18$ for $m=18$ |
| and $p=21$ for $m=19$. |
| Plugging these into \cref{sec:SMPdesign:g-vs-m} |
| shows that a run length of $m=18$ cannot exhaust the global pool, |
| but that $m=19$ might well do so. |
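| Working the arithmetic out, $m=18$ gives $p=18$, so that the |
| left-hand side of \cref{sec:SMPdesign:g-vs-m} evaluates to |
| |
| \begin{equation} |
| 40 - 4 - 18 \times (2 - 1) = 18 \ge 18 , |
| \end{equation} |
| |
| while $m=19$ gives $p=21$, so that it instead evaluates to |
| |
| \begin{equation} |
| 40 - 4 - 21 \times (2 - 1) = 15 < 19 , |
| \end{equation} |
| |
| consistent with the failures observed at run lengths of 19 and greater. |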
| |
| The presence of $i$ could be considered to be a bug. |
| After all, why allocate memory only to have it stranded in |
| the initialization thread's cache? |
| One way of fixing this would be to provide a \co{memblock_flush()} |
| function that flushed the current thread's pool into the |
| global pool. |
| The initialization thread could then invoke this function |
| after freeing all of the blocks. |
| }\QuickQuizEndE |
| } |
| |
| \subsubsection{Validation} |
| |
| Validation of this simple allocator spawns a specified number of threads, |
| with each thread repeatedly allocating a specified number of memory |
| blocks and then deallocating them. |
| This simple regimen suffices to exercise both the per-thread caches |
| and the global pool, as can be seen in |
| \cref{fig:SMPdesign:Allocator Cache Performance}. |
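| |
| The following minimal sketch gives the general shape of such a test |
| harness. |
| It is not the book's actual test code: |
| The \co{block_alloc()} and \co{block_free()} declarations merely stand |
| in for whatever allocator is under test, and the thread count, run |
| length, and repetition count are hard-wired purely for illustration. |
| |
| \begin{verbatim} |
| /* Hypothetical validation harness: each thread repeatedly |
|  * allocates a run of blocks and then frees them. */ |
| #include <pthread.h> |
| #include <stdio.h> |
| |
| #define NTHREADS   2 |
| #define RUN_LENGTH 6 |
| #define NREPS      1000000 |
| |
| /* Allocator under test (definitions not shown). */ |
| extern void *block_alloc(void); |
| extern void block_free(void *p); |
| |
| static void *test_thread(void *arg) |
| { |
|         void *blocks[RUN_LENGTH]; |
|         long nfail = 0; |
|         int i, rep; |
| |
|         for (rep = 0; rep < NREPS; rep++) { |
|                 /* Allocate a run of blocks... */ |
|                 for (i = 0; i < RUN_LENGTH; i++) { |
|                         blocks[i] = block_alloc(); |
|                         if (!blocks[i]) |
|                                 nfail++; |
|                 } |
|                 /* ...then free them all. */ |
|                 for (i = 0; i < RUN_LENGTH; i++) |
|                         if (blocks[i]) |
|                                 block_free(blocks[i]); |
|         } |
|         return (void *)nfail; |
| } |
| |
| int main(void) |
| { |
|         pthread_t tid[NTHREADS]; |
|         void *nfail; |
|         int t; |
| |
|         for (t = 0; t < NTHREADS; t++) |
|                 pthread_create(&tid[t], NULL, test_thread, NULL); |
|         for (t = 0; t < NTHREADS; t++) { |
|                 pthread_join(tid[t], &nfail); |
|                 printf("Thread %d: %ld allocation failures\n", |
|                        t, (long)nfail); |
|         } |
|         return 0; |
| } |
| \end{verbatim} |
| |
| A production-oriented harness would additionally vary the run length, |
| check that no block is handed out twice, and verify that freed blocks |
| eventually become available again. |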
| |
| Much more aggressive validation is required for memory allocators that |
| are to be used in production. |
| The test suites for tcmalloc~\cite{ChrisKennelly2020tcmalloc} and |
| jemalloc~\cite{JasonEvans2011jemalloc} are instructive, as are the |
| tests for the Linux kernel's memory allocator. |
| |
| \subsubsection{Real-World Design} |
| |
| The toy parallel resource allocator was quite simple, but real-world |
| designs expand on this approach in a number of ways. |
| |
| First, real-world allocators are required to handle a wide range |
| of allocation sizes, as opposed to the single size shown in this |
| toy example. |
| One popular way to do this is to offer a fixed set of sizes, spaced |
| so as to balance external and internal \IX{fragmentation}, such as in |
| the late-1980s BSD memory allocator~\cite{McKusick88}. |
| Doing this would mean that the \co{globalmem} variable would need |
| to be replicated on a per-size basis, and that the associated |
| lock would similarly be replicated, resulting in \IXh{data}{locking} |
| rather than the toy program's code locking. |
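| |
| As a rough illustration of this per-size data locking, consider the |
| following sketch, which is not drawn from any production allocator: |
| The size classes, pool capacity, and names are assumptions chosen |
| purely for illustration. |
| |
| \begin{verbatim} |
| /* Hypothetical per-size-class global pools, illustrating |
|  * data locking: each size class has its own pool and lock. */ |
| #include <pthread.h> |
| #include <stddef.h> |
| |
| #define NSIZES        8   /* 16, 32, 64, ..., 2048 bytes. */ |
| #define POOL_CAPACITY 64  /* Blocks per size-class pool. */ |
| |
| struct sized_pool { |
|         /* Each size class has its own lock and pool. */ |
|         pthread_mutex_t mutex; |
|         int count; |
|         void *blocks[POOL_CAPACITY]; |
| }; |
| |
| static struct sized_pool global_pool[NSIZES]; |
| |
| static void sized_pools_init(void) |
| { |
|         int i; |
| |
|         for (i = 0; i < NSIZES; i++) |
|                 pthread_mutex_init(&global_pool[i].mutex, NULL); |
| } |
| |
| /* Map a request to the smallest size class that fits. */ |
| static int size_class(size_t size) |
| { |
|         size_t class_size = 16; |
|         int i; |
| |
|         for (i = 0; i < NSIZES; i++, class_size *= 2) |
|                 if (size <= class_size) |
|                         return i; |
|         return -1;  /* Too big: defer to the underlying system. */ |
| } |
| \end{verbatim} |
| |
| With this arrangement, threads allocating blocks of different sizes |
| contend for different locks, which is the essence of data locking. |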
| |
| Second, production-quality systems must be able to repurpose memory, |
| meaning that they must be able to coalesce blocks into larger structures, |
| such as pages~\cite{McKenney93}. |
| This coalescing will also need to be protected by a lock, which again |
| could be replicated on a per-size basis. |
| |
| Third, coalesced memory must be returned to the underlying memory |
| system, and pages of memory must also be allocated from the underlying |
| memory system. |
| The locking required at this level will depend on that of the underlying |
| memory system, but could well be code locking. |
| Code locking can often be tolerated at this level, because this |
| level is so infrequently reached in well-designed systems~\cite{McKenney01e}. |
| |
| Concurrent userspace allocators face similar |
| challenges~\cite{ChrisKennelly2020tcmalloc,JasonEvans2011jemalloc}. |
| |
| Despite this real-world design's greater complexity, the underlying |
| idea is the same---repeated application of parallel fastpath, |
| as shown in |
| \cref{tab:SMPdesign:Schematic of Real-World Parallel Allocator}. |
| |
| And ``parallel fastpath'' is one of the solutions to the non-partitionable |
| application problem put forth on |
| \cpageref{sec:SMPdesign:Problems Parallel Fastpath}. |
| |
| \begin{table} |
| \rowcolors{1}{}{lightgray} |
| \small |
| \centering |
| \renewcommand*{\arraystretch}{1.25} |
| \begin{tabularx}{\twocolumnwidth}{ll>{\raggedright\arraybackslash}X} |
| \toprule |
| Level & Locking & Purpose \\ |
| \midrule |
| Per-thread pool & Data ownership & High-speed allocation \\ |
| Global block pool & Data locking & Distributing blocks among threads \\ |
| Coalescing & Data locking & Combining blocks into pages \\ |
| System memory & Code locking & Memory from/to system \\ |
| \bottomrule |
| \end{tabularx} |
| \caption{Schematic of Real-World Parallel Allocator} |
| \label{tab:SMPdesign:Schematic of Real-World Parallel Allocator} |
| \end{table} |
| |
| \input{SMPdesign/beyond} |
| |
| \QuickQuizAnswersChp{qqzSMPdesign} |