| % SMPdesign/partexercises.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Partitioning Exercises} |
| \label{sec:SMPdesign:Partitioning Exercises} |
| % |
| \epigraph{Whenever a theory appears to you as the only possible one, |
| take this as a sign that you have neither understood the theory |
| nor the problem which it was intended to solve.} |
| {Karl Popper} |
| |
| Although partitioning is more widely understood than it was in the early |
| 2000s, its value is still underappreciated. |
| \Cref{sec:SMPdesign:Dining Philosophers Problem} |
therefore takes a more highly parallel look at the classic Dining
| Philosophers problem and |
| \cref{sec:SMPdesign:Double-Ended Queue} |
| revisits the double-ended queue. |
| |
| \subsection{Dining Philosophers Problem} |
| \label{sec:SMPdesign:Dining Philosophers Problem} |
| |
| \begin{figure} |
| \centering |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5} |
| \caption{Dining Philosophers Problem} |
| \ContributedBy{fig:SMPdesign:Dining Philosophers Problem}{Kornilios Kourtis} |
| \end{figure} |
| |
| \Cref{fig:SMPdesign:Dining Philosophers Problem} shows a diagram |
| of the classic \IX{Dining Philosophers problem}~\cite{Dijkstra1971HOoSP}. |
| This problem features five philosophers who do nothing but think and |
| eat a ``very difficult kind of spaghetti'' which requires two forks |
| to eat.\footnote{ |
| But feel free to instead think in terms of chopsticks.} |
| A given philosopher is permitted to use only the forks to his or her |
| immediate right and left, but will not put a given fork down until sated. |
| |
| \begin{figure*} |
| \centering |
| \IfTwoColumn{ |
| \resizebox{5in}{!}{\includegraphics{cartoons/Dining-philosophers}} |
| }{ |
| \resizebox{\onecolumntextwidth}{!}{\includegraphics{cartoons/Dining-philosophers}} |
| } |
| \caption{Partial Starvation Is Also Bad} |
| \ContributedBy{fig:cpu:Partial Starvation Is Also Bad}{Melissa Broussard} |
| \end{figure*} |
| |
| The object is to construct an algorithm that, quite literally, |
| prevents \IX{starvation}. |
| One starvation scenario would be if all of the philosophers picked up |
| their leftmost forks simultaneously. |
Because none of them will put down their fork until after they have finished
| eating, and because none of them may pick up their second fork until at |
| least one of them has finished eating, they all starve. |
| Please note that it is not sufficient to allow at least one philosopher |
| to eat. |
| As \cref{fig:cpu:Partial Starvation Is Also Bad} |
| shows, starvation of even a few of the philosophers is to be avoided. |
| |
| \begin{figure} |
| \centering |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5TB} |
| \caption{Dining Philosophers Problem, Textbook Solution} |
| \ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}{Kornilios Kourtis} |
| \end{figure} |
| |
| \pplsur{Edsger W.}{Dijkstra}'s solution used a global semaphore, |
| which works fine assuming |
| negligible communications delays, an assumption that became invalid |
| in the late 1980s or early 1990s.\footnote{ |
| It is all too easy to denigrate Dijkstra from the viewpoint |
| of the year 2021, more than 50 years after the fact. |
| If you still feel the need to denigrate Dijkstra, my advice |
| is to publish something, wait 50 years, and then see |
| how well \emph{your} ideas stood the test of time.} |
| More recent solutions number the forks as shown in |
| \cref{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}. |
| Each philosopher picks up the lowest-numbered fork next to his or her |
| plate, then picks up the other fork. |
| The philosopher sitting in the uppermost position in the diagram thus |
| picks up the leftmost fork first, then the rightmost fork, while the |
| rest of the philosophers instead pick up their rightmost fork first. |
| Because two of the philosophers will attempt to pick up fork~1 first, |
| and because only one of those two philosophers will succeed, |
| there will be five forks available to four philosophers. |
| At least one of these four will have two forks, and will thus be able |
| to eat. |
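
For concreteness, the following minimal sketch shows one way to express
this fork-numbering discipline in C, using POSIX \co{pthread_mutex_t}
locks rather than this book's locking primitives.
The \co{fork_lock} array, \co{NR_PHIL}, and \co{philosopher_eat()} are
illustrative assumptions rather than part of this book's code samples,
and the forks are numbered from zero rather than from one as in the
figure.

\begin{verbatim}
#include <pthread.h>

#define NR_PHIL 5

/* One lock per fork, numbered 0 through NR_PHIL - 1.
 * Initialize each lock with pthread_mutex_init() before use. */
pthread_mutex_t fork_lock[NR_PHIL];

/* Philosopher i uses forks i and (i + 1) % NR_PHIL,
 * always acquiring the lower-numbered fork first. */
void philosopher_eat(int i)
{
    int first = i;
    int second = (i + 1) % NR_PHIL;

    if (second < first) {
        first = second;
        second = i;
    }
    pthread_mutex_lock(&fork_lock[first]);
    pthread_mutex_lock(&fork_lock[second]);
    /* Eat. */
    pthread_mutex_unlock(&fork_lock[second]);
    pthread_mutex_unlock(&fork_lock[first]);
}
\end{verbatim}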
| |
| This general technique of numbering resources and acquiring them in |
| numerical order is heavily used as a deadlock-prevention technique. |
| However, it is easy to imagine a sequence of events that will result |
| in only one philosopher eating at a time even though all are hungry: |
| |
| \begin{enumerate} |
| \item P2 picks up fork~1, preventing P1 from taking a fork. |
| \item P3 picks up fork~2. |
| \item P4 picks up fork~3. |
| \item P5 picks up fork~4. |
| \item P5 picks up fork~5 and eats. |
| \item P5 puts down forks~4 and 5. |
| \item P4 picks up fork~4 and eats. |
| \end{enumerate} |
| |
| In short, this algorithm can result in only one philosopher eating at |
| a given time, even when all five philosophers are hungry, |
| despite the fact that there are more than enough forks for two |
| philosophers to eat concurrently. |
| It should be possible to do better than this! |
| |
| \begin{figure} |
| \centering |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher4part-b} |
| \caption{Dining Philosophers Problem, Partitioned} |
| \ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Partitioned}{Kornilios Kourtis} |
| \end{figure} |
| |
| One approach is shown in |
| \cref{fig:SMPdesign:Dining Philosophers Problem; Partitioned}, |
| which includes four philosophers rather than five to better illustrate the |
| partition technique. |
| Here the upper and rightmost philosophers share a pair of forks, |
| while the lower and leftmost philosophers share another pair of forks. |
| If all philosophers are simultaneously hungry, at least two will |
| always be able to eat concurrently. |
| In addition, as shown in the figure, the forks can now be bundled |
| so that the pair are picked up and put down simultaneously, simplifying |
| the acquisition and release algorithms. |
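
As a rough illustration of this bundling, the following sketch (again
using pthreads rather than this book's primitives, and with the mapping
of philosophers to bundles chosen arbitrarily) protects each bundled
pair of forks with a single lock shared by two philosophers:

\begin{verbatim}
#include <pthread.h>

/* Four philosophers, two bundles of two forks each.
 * Initialize each lock with pthread_mutex_init() before use. */
pthread_mutex_t bundle_lock[2];

/* Philosophers 0 and 1 share bundle 0, while
 * philosophers 2 and 3 share bundle 1. */
void philosopher_eat(int i)
{
    int b = i / 2;

    pthread_mutex_lock(&bundle_lock[b]);   /* Pick up both forks. */
    /* Eat. */
    pthread_mutex_unlock(&bundle_lock[b]); /* Put down both forks. */
}
\end{verbatim}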
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Is there a better solution to the Dining |
| Philosophers Problem? |
| }\QuickQuizAnswerB{ |
| One such improved solution is shown in |
| \cref{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned}, |
| where the philosophers are simply provided with an additional |
| five forks. |
| All five philosophers may now eat simultaneously, and there |
| is never any need for philosophers to wait on one another. |
| In addition, this approach offers greatly improved disease control. |
| |
| \begin{figure} |
| \centering |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM} |
| \caption{Dining Philosophers Problem, Fully Partitioned} |
| \QContributedBy{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned}{Kornilios Kourtis} |
| \end{figure} |
| |
| This solution might seem like cheating to some, but such |
| ``cheating'' is key to finding good solutions to many |
| concurrency problems, as any hungry philosopher would agree. |
| |
| And this is one solution to the Dining Philosophers |
| concurrent-consumption problem called out on |
| \cpageref{sec:SMPdesign:Problems Dining Philosophers}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
		How would you validate an algorithm alleged to solve the Dining
| Philosophers Problem? |
| }\QuickQuizAnswerE{ |
| Much depends on the details of the algorithm, but here are a |
| couple of places to start. |
| |
| First, for algorithms in which picking up left-hand and right-hand |
| forks are separate operations, start with all forks on the table. |
| Then have all philosophers attempt to pick up their first fork. |
| Once all philosophers either have their first fork or are waiting |
| for someone to put down their first fork, have each non-waiting |
| philosopher pick up their second fork. |
| At this point in any starvation-free solution, at least one |
| philosopher will be eating. |
| If there were any waiting philosophers, repeat this test, |
| preferably imposing random variations in timing. |
| |
		Second, create a stress test in which philosophers start and
| stop eating at random times. |
| Generate starvation and fairness conditions and verify that |
| these conditions are met. |
| Here are a couple of example starvation and fairness conditions: |
| |
| \begin{enumerate} |
| \item If all other philosophers have stopped eating $N$ times |
| since a given philosopher attempted to pick up a given |
| fork, that philosopher should have succeeded in picking |
| up that fork. |
| For high-quality solutions using high-quality locking |
| primitives (or high-quality atomic operations), $N=1$ |
| is doable. |
| \item Given an upper bound $T$ on the time any philosopher holds |
| onto both forks before putting them down, the maximum |
| waiting time for any philosopher should be bounded by |
| $NT$ for some $N$ that is not hugely larger than the |
| number of philosophers. |
| \item Generate some statistic representing the time from |
| when philosophers attempt to pick up their first fork |
| to the time when they start eating. |
| The smaller this statistic, the better the solution. |
| Mean, median, and maximum are all useful statistics, |
| but examining the full distribution can also be |
| enlightening. |
| \end{enumerate} |
| |
| Readers are encouraged to actually try testing any of the |
| solutions presented in this book, and especially testing solutions |
| of their own devising. |
| }\QuickQuizEndE |
| } |
| |
| This is an example of ``horizontal parallelism''~\cite{Inman85} |
| or ``data parallelism'', |
| so named because there is no dependency among the pairs of philosophers. |
| In a horizontally parallel data-processing system, a given item of data |
| would be processed by only one of a replicated set of software |
| components. |
| |
| \QuickQuiz{ |
| And in just what sense can this ``horizontal parallelism'' be |
| said to be ``horizontal''? |
| }\QuickQuizAnswer{ |
| Jack Inman~\cite{Inman85} |
| was working with protocol stacks, which are normally |
| depicted vertically, with the application on top and the |
| hardware interconnect on the bottom. |
| Data flows up and down this stack. |
| ``Horizontal parallelism'' processes packets from different network |
| connections in parallel, while ``vertical parallelism'' |
| handles different protocol-processing steps for a given |
| packet in parallel. |
| |
| ``Vertical parallelism'' is also called ``pipelining''. |
| }\QuickQuizEnd |
| |
| \subsection{Double-Ended Queue} |
| \label{sec:SMPdesign:Double-Ended Queue} |
| |
| A double-ended queue is a data structure containing a list of elements |
| that may be inserted or removed from either end~\cite{Knuth73}. |
| It has been claimed that a lock-based implementation permitting |
| concurrent operations on both ends of the double-ended queue is |
| difficult~\cite{DanGrossman2007TMGCAnalogy}. |
| This section shows how a partitioning design strategy can result |
| in a reasonably simple implementation, looking at three |
| general approaches in the following sections. |
| But first, how should we validate a concurrent double-ended queue? |
| |
| \subsubsection{Double-Ended Queue Validation} |
| \label{sec:SMPdesign:Double-Ended Queue Validation} |
| |
| A good place to start is with invariants. |
| For example, if elements are pushed onto one end of a double-ended |
| queue and popped off of the other, the order of those elements must |
| be preserved. |
| Similarly, if elements are pushed onto one end of the queue and popped |
| off of that same end, the order of those elements must be reversed. |
| Any element popped from the queue must have been most recently pushed |
| onto that queue, and if the queue is emptied, all elements pushed onto |
| it must have already been popped from it. |
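
As a concrete example of the first of these invariants, the following
sketch pushes a known sequence of values onto the left-hand end of a
sequential double-ended queue and verifies that they emerge from the
right-hand end in the same order.
The \co{deq_init()}, \co{deq_push_l()}, and \co{deq_pop_r()}
declarations are assumptions standing in for whatever interface the
implementation under test actually provides.

\begin{verbatim}
#include <assert.h>
#include <stddef.h>

#define N_ELEMS 16

/* Hypothetical sequential double-ended-queue interface. */
struct deq;
void deq_init(struct deq *d);
void deq_push_l(void *e, struct deq *d);
void *deq_pop_r(struct deq *d);

/* Elements pushed onto the left-hand end must be popped
 * from the right-hand end in the same order. */
void check_order_preserved(struct deq *d)
{
    long i;

    deq_init(d);
    for (i = 1; i <= N_ELEMS; i++)
        deq_push_l((void *)i, d);
    for (i = 1; i <= N_ELEMS; i++)
        assert((long)deq_pop_r(d) == i);
    assert(deq_pop_r(d) == NULL); /* Queue must now be empty. */
}
\end{verbatim}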
| |
The beginnings of a test suite for concurrent double-ended queues
(``\path{deqtorture.h}'') provide the following checks:
| |
| \begin{enumerate} |
| \item Element-ordering checks provided by \co{CHECK_SEQUENCE_PAIR()}. |
| \item Checks that elements popped were most recently pushed, provided |
| by \co{melee()}. |
| \item Checks that elements pushed are popped before the queue is |
| emptied, also provided by \co{melee()}. |
| \end{enumerate} |
| |
| This suite includes both sequential and concurrent tests. |
| Although this suite is good and sufficient for textbook code, you |
| should test considerably more thoroughly for code intended for |
| production use. |
| \Cref{chp:Validation,chp:Formal Verification} cover a large array |
| of validation tools and techniques. |
| |
| But with a prototype test suite in place, we are ready to look at the |
| double-ended-queue algorithms in the next sections. |
| |
| \subsubsection{Left- and Right-Hand Locks} |
| \label{sec:SMPdesign:Left- and Right-Hand Locks} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeq}} |
| \caption{Double-Ended Queue With Left- and Right-Hand Locks} |
| \label{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks} |
| \end{figure} |
| |
| One seemingly straightforward approach would be to use a doubly |
| linked list with a left-hand lock |
| for left-hand-end enqueue and dequeue operations along with a right-hand |
| lock for right-hand-end operations, as shown in |
| \cref{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}. |
| However, the problem with this approach is that the two locks' |
| domains must overlap when there are fewer than four elements on the |
| list. |
| This overlap is due to the fact that removing any given element affects |
| not only that element, but also its left- and right-hand neighbors. |
| These domains are indicated by color in the figure, with blue |
| with downward stripes indicating |
| the domain of the left-hand lock, red with upward stripes |
| indicating the domain of the right-hand |
| lock, and purple (with no stripes) indicating overlapping domains. |
| Although it is possible to create an algorithm that works this way, |
| perhaps using a dummy element similar to the two-lock queue |
| presented by Michael and Scott~\cite{MichaelScott96}, |
| the fact that it has no fewer than five special cases should raise |
| a big red flag, especially given that concurrent activity at the other |
| end of the list can shift the queue from one special case to another |
| at any time. |
| |
| It is far better to consider other designs. |
| |
| \subsubsection{Compound Double-Ended Queue} |
| \label{sec:SMPdesign:Compound Double-Ended Queue} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqpair}} |
| \caption{Compound Double-Ended Queue} |
| \label{fig:SMPdesign:Compound Double-Ended Queue} |
| \end{figure} |
| |
| One way of forcing non-overlapping lock domains is shown in |
| \cref{fig:SMPdesign:Compound Double-Ended Queue}. |
| Two separate double-ended queues are run in tandem, each protected by |
| its own lock. |
| This means that elements must occasionally be shuttled from one of |
| the double-ended queues to the other, in which case both locks must |
| be held. |
| A simple lock hierarchy may be used to avoid \IX{deadlock}, for example, |
| always acquiring the left-hand lock before acquiring the right-hand lock. |
| This will be much simpler than applying two locks to the same |
| double-ended queue, as we can unconditionally left-enqueue elements |
| to the left-hand queue and right-enqueue elements to the right-hand |
| queue. |
| The main complication arises when dequeuing from an empty queue, in |
| which case it is necessary to: |
| |
| \begin{enumerate} |
| \item If holding the right-hand lock, release it and acquire the |
| left-hand lock. |
| \item Acquire the right-hand lock. |
| \item Rebalance the elements across the two queues. |
| \item Remove the required element if there is one. |
| \item Release both locks. |
| \end{enumerate} |
| |
| \QuickQuiz{ |
| In this compound double-ended queue implementation, what should |
| be done if the queue has become non-empty while releasing |
| and reacquiring the lock? |
| }\QuickQuizAnswer{ |
| In this case, simply dequeue an item from the non-empty |
| queue, release both locks, and return. |
| }\QuickQuizEnd |
| |
| The resulting code (\path{locktdeq.c}) is quite straightforward. |
| The rebalancing operation might well shuttle a given element back |
| and forth between the two queues, wasting time and possibly requiring |
| workload-dependent heuristics to obtain optimal performance. |
| Although this might well be the best approach in some cases, it is |
| interesting to try for an algorithm with greater determinism. |
| |
| \subsubsection{Hashed Double-Ended Queue} |
| \label{sec:SMPdesign:Hashed Double-Ended Queue} |
| |
| One of the simplest and most effective ways to deterministically |
| partition a data structure is to hash it. |
| It is possible to trivially hash a double-ended queue by assigning |
| each element a sequence number based on its position in the list, |
| so that the first element left-enqueued into an empty queue is numbered |
| zero and the first element right-enqueued into an empty queue is numbered |
| one. |
| A series of elements left-enqueued into an otherwise-idle queue would |
| be assigned decreasing numbers ($-1$, $-2$, $-3$, \ldots), while a series of |
| elements right-enqueued into an otherwise-idle queue would be assigned |
| increasing numbers (2, 3, 4, \ldots). |
| A key point is that it is not necessary to actually represent a given |
| element's number, as this number will be implied by its position in |
| the queue. |
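
To illustrate this numbering scheme, the following sketch (purely
illustrative, and not the code presented later in this section) maps
an element's implied number to one of four hash chains.
The \co{PDEQ_N_BKTS} and \co{deq_bucket()} names are assumptions, and
note the adjustment for C's remainder operator, which can return a
negative value for the negative numbers assigned to left-enqueued
elements.

\begin{verbatim}
#define PDEQ_N_BKTS 4

/* Map an element's implied sequence number to a hash chain. */
static int deq_bucket(int num)
{
    int b = num % PDEQ_N_BKTS;

    return b < 0 ? b + PDEQ_N_BKTS : b; /* Fix up negatives. */
}
\end{verbatim}

For example, element number~1 (the first right-enqueued element) maps
to hash chain~1, while element number~$-1$ maps to hash chain~3.
In practice, the implementation need not compute this mapping for each
element, instead simply maintaining left-hand and right-hand indexes
that reference the chains to be used by the next left-enqueue and
right-enqueue operations, respectively.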
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash}} |
| \caption{Hashed Double-Ended Queue} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue} |
| \end{figure} |
| |
| Given this approach, we assign one lock to guard the left-hand index, |
| one to guard the right-hand index, and one lock for each hash chain. |
| \Cref{fig:SMPdesign:Hashed Double-Ended Queue} shows the resulting |
| data structure given four hash chains. |
| Note that the lock domains do not overlap, and that deadlock is avoided |
| by acquiring the index locks before the chain locks, and by never |
| acquiring more than one lock of a given type (index or chain) at a time. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash1R}} |
| \caption{Hashed Double-Ended Queue After Insertions} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue After Insertions} |
| \end{figure} |
| |
| Each hash chain is itself a double-ended queue, and in this example, |
| each holds every fourth element. |
| The uppermost portion of |
| \cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions} |
| shows the state after a single element (``R$_1$'') has been |
| right-enqueued, with the right-hand index having been incremented to |
| reference hash chain~2. |
| The middle portion of this same figure shows the state after |
| three more elements have been right-enqueued. |
| As you can see, the indexes are back to their initial states |
(see \cref{fig:SMPdesign:Hashed Double-Ended Queue}); however,
| each hash chain is now non-empty. |
| The lower portion of this figure shows the state after three additional |
| elements have been left-enqueued and an additional element has been |
| right-enqueued. |
| |
| From the last state shown in |
| \cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}, |
| a left-dequeue operation would return element ``L$_{-2}$'' and leave |
| the left-hand index referencing hash chain~2, which would then |
| contain only a single element (``R$_2$''). |
| In this state, a left-enqueue running concurrently with a right-enqueue |
| would result in \IX{lock contention}, but the probability of such contention |
| can be reduced to arbitrarily low levels by using a larger hash table. |
| |
| \begin{figure} |
| \centering |
| \resizebox{1.5in}{!}{\includegraphics{SMPdesign/lockdeqhashlots}} |
| \caption{Hashed Double-Ended Queue With 16 Elements} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements} |
| \end{figure} |
| |
| \Cref{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements} |
| shows how 16 elements would be organized in a four-hash-bucket |
| parallel double-ended queue. |
| Each underlying single-lock double-ended queue holds a one-quarter |
| slice of the full parallel double-ended queue. |
| |
| \begin{listing} |
| \input{CodeSamples/SMPdesign/lockhdeq@struct_pdeq.fcv} |
| \caption{Lock-Based Parallel Double-Ended Queue Data Structure} |
| \label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure} |
| \end{listing} |
| |
| \Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure} |
| shows the corresponding C-language data structure, assuming an |
| existing \co{struct deq} that provides a trivially locked |
| double-ended-queue implementation. |
| \begin{fcvref}[ln:SMPdesign:lockhdeq:struct_pdeq] |
| This data structure contains the left-hand lock on \clnref{llock}, |
| the left-hand index on \clnref{lidx}, the right-hand lock on \clnref{rlock} |
| (which is cache-aligned in the actual implementation), |
| the right-hand index on \clnref{ridx}, and, finally, the hashed array |
| of simple lock-based double-ended queues on \clnref{bkt}. |
| A high-performance implementation would of course use padding or special |
| alignment directives to avoid \IX{false sharing}. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/SMPdesign/lockhdeq@pop_push.fcv} |
| \caption{Lock-Based Parallel Double-Ended Queue Implementation} |
| \label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation} |
| \end{listing} |
| |
| \Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation} |
| (\path{lockhdeq.c}) |
| shows the implementation of the enqueue and dequeue functions.\footnote{ |
| One could easily create a polymorphic implementation in any |
| number of languages, but doing so is left as an exercise for |
| the reader.} |
| Discussion will focus on the left-hand operations, as the right-hand |
| operations are trivially derived from them. |
| |
| \begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:popl] |
| \Clnrefrange{b}{e} show \co{pdeq_pop_l()}, |
| which left\-/dequeues and returns |
| an element if possible, returning \co{NULL} otherwise. |
| \Clnref{acq} acquires the left-hand spinlock, |
| and \clnref{idx} computes the |
| index to be dequeued from. |
| \Clnref{deque} dequeues the element, and, |
| if \clnref{check} finds the result to be |
| non-\co{NULL}, \clnref{record} records the new left-hand index. |
| Either way, \clnref{rel} releases the lock, and, |
| finally, \clnref{return} returns |
| the element if there was one, or \co{NULL} otherwise. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:pushl] |
| \Clnrefrange{b}{e} show \co{pdeq_push_l()}, |
| which left-enqueues the specified |
| element. |
| \Clnref{acq} acquires the left-hand lock, |
| and \clnref{idx} picks up the left-hand |
| index. |
| \Clnref{enque} left-enqueues the specified element |
| onto the double-ended queue |
| indexed by the left-hand index. |
| \Clnref{update} then updates the left-hand index |
| and \clnref{rel} releases the lock. |
| \end{fcvref} |
| |
| As noted earlier, the right-hand operations are completely analogous |
| to their left-handed counterparts, so their analysis is left as an |
| exercise for the reader. |
| |
| \QuickQuiz{ |
| Is the hashed double-ended queue a good solution? |
| Why or why not? |
| }\QuickQuizAnswer{ |
| The best way to answer this is to run \path{lockhdeq.c} on |
| a number of different multiprocessor systems, and you are |
| encouraged to do so in the strongest possible terms. |
| One reason for concern is that each operation on this |
| implementation must acquire not one but two locks. |
| % Getting about 500 nanoseconds per element when used as |
| % a queue on a 4.2GHz Power system. This is roughly the same as |
| % the version covered by a single lock. Sequential (unlocked |
| % variant is more than an order of magnitude faster! |
| |
| The first well-designed performance study will be cited.\footnote{ |
| The studies by Dalessandro |
| et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque} |
| and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are |
| excellent starting points.} |
| Do not forget to compare to a sequential implementation! |
| }\QuickQuizEnd |
| |
| \subsubsection{Compound Double-Ended Queue Revisited} |
| \label{sec:SMPdesign:Compound Double-Ended Queue Revisited} |
| |
| This section revisits the compound double-ended queue, using a trivial |
| rebalancing scheme that moves all the elements from the non-empty |
| queue to the now-empty queue. |
| |
| \QuickQuiz{ |
| Move \emph{all} the elements to the queue that became empty? |
| In what possible universe is this brain-dead solution in any |
| way optimal??? |
| }\QuickQuizAnswer{ |
| It is optimal in the case where data flow switches direction only |
| rarely. |
| It would of course be an extremely poor choice if the double-ended |
| queue was being emptied from both ends concurrently. |
| This of course raises another question, namely, in what possible |
| universe emptying from both ends concurrently would be a reasonable |
| thing to do. |
| Work-stealing queues are one possible answer to this question. |
| }\QuickQuizEnd |
| |
| In contrast to the hashed implementation presented in |
| the previous section, the compound implementation will build on |
| a sequential implementation of a double-ended queue that uses |
| neither locks nor atomic operations. |
| |
| \begin{listing} |
| \ebresizeverb{.97}{\input{CodeSamples/SMPdesign/locktdeq@pop_push.fcv}} |
| \caption{Compound Parallel Double-Ended Queue Implementation} |
| \label{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| \end{listing} |
| |
| \Cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| shows the implementation. |
| Unlike the hashed implementation, this compound implementation is |
| asymmetric, so that we must consider the \co{pdeq_pop_l()} |
| and \co{pdeq_pop_r()} implementations separately. |
| |
| \QuickQuiz{ |
| Why can't the compound parallel double-ended queue |
| implementation be symmetric? |
| }\QuickQuizAnswer{ |
| The need to avoid deadlock by imposing a lock hierarchy |
| forces the asymmetry, just as it does in the fork-numbering |
| solution to the Dining Philosophers Problem |
| (see \cref{sec:SMPdesign:Dining Philosophers Problem}). |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popl] |
| The \co{pdeq_pop_l()} implementation is shown on |
| \clnrefrange{b}{e} |
of the listing.
| \Clnref{acq:l} acquires the left-hand lock, |
| which \clnref{rel:l} releases. |
| \Clnref{deq:ll} attempts to left-dequeue an element |
| from the left-hand underlying |
| double-ended queue, and, if successful, |
| skips \clnrefrange{acq:r}{skip} to simply |
| return this element. |
Otherwise, \clnref{acq:r} acquires the right-hand lock, \clnref{deq:lr}
left-dequeues an element from the right-hand queue,
\clnref{move} moves any remaining elements on the right-hand
queue to the left-hand queue, \clnref{init:r} initializes
the right-hand queue,
and \clnref{rel:r} releases the right-hand lock.
| The element, if any, that was dequeued on \clnref{deq:lr} will be returned. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr] |
| The \co{pdeq_pop_r()} implementation is shown on \clnrefrange{b}{e} |
of the listing.
| As before, \clnref{acq:r1} acquires the right-hand lock |
| (and \clnref{rel:r2} |
| releases it), and \clnref{deq:rr1} attempts to right-dequeue an element |
| from the right-hand queue, and, if successful, |
| skips \clnrefrange{rel:r1}{skip2} |
| to simply return this element. |
| However, if \clnref{check1} determines that there was no element to dequeue, |
| \clnref{rel:r1} releases the right-hand lock and |
| \clnrefrange{acq:l}{acq:r2} acquire both |
| locks in the proper order. |
| \Clnref{deq:rr2} then attempts to right-dequeue an element |
| from the right-hand |
| list again, and if \clnref{check2} determines that this second attempt has |
| failed, \clnref{deq:rl} right-dequeues an element from the left-hand queue |
| (if there is one available), \clnref{move} moves any remaining elements |
| from the left-hand queue to the right-hand queue, and \clnref{init:l} |
| initializes the left-hand queue. |
| Either way, \clnref{rel:l} releases the left-hand lock. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Why is it necessary to retry the right-dequeue operation |
| on \clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:deq:rr2} of |
| \cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}? |
| }\QuickQuizAnswerB{ |
| \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr] |
| This retry is necessary because some other thread might have |
| enqueued an element between the time that this thread dropped |
| \co{d->rlock} on \clnref{rel:r1} and the time that it reacquired this |
| same lock on \clnref{acq:r2}. |
| \end{fcvref} |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Surely the left-hand lock must \emph{sometimes} be available!!! |
| So why is it necessary that |
| \clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:rel:r1} of |
| \cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| unconditionally release the right-hand lock? |
| }\QuickQuizAnswerE{ |
| It would be possible to use \co{spin_trylock()} to attempt |
| to acquire the left-hand lock when it was available. |
| However, the failure case would still need to drop the |
| right-hand lock and then re-acquire the two locks in order. |
| Making this transformation (and determining whether or not |
| it is worthwhile) is left as an exercise for the reader. |
| }\QuickQuizEndE |
| } |
| |
| \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushl] |
| The \co{pdeq_push_l()} implementation is shown on |
| \clnrefrange{b}{e} of |
| \cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}. |
| \Clnref{acq:l} acquires the left-hand spinlock, |
| \clnref{que:l} left-enqueues the |
| element onto the left-hand queue, and finally \clnref{rel:l} releases |
| the lock. |
| \end{fcvref} |
| \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushr] |
| The \co{pdeq_push_r()} implementation (shown on \clnrefrange{b}{e}) |
| is quite similar. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| But in the case where data is flowing in only one direction, |
| the algorithm shown in |
| \cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| will have both ends attempting to acquire the same lock |
| whenever the consuming end empties its underlying |
| double-ended queue. |
| Doesn't that mean that sometimes this algorithm fails to |
| provide concurrent access to both ends of the queue even |
| when the queue contains an arbitrarily large number of elements? |
| }\QuickQuizAnswer{ |
| Indeed it does! |
| |
| But the same is true of other algorithms claiming this property. |
| For example, in solutions using software transactional memory |
| mechanisms based on hashed arrays of locks, |
| the leftmost and rightmost elements' addresses will sometimes |
| happen to hash to the same lock. |
| These hash collisions will also prevent concurrent access. |
| For another example, solutions using hardware transactional |
| memory mechanisms with software |
| fallbacks~\cite{Yoo:2013:PEI:2503210.2503232,RickMerrit2011PowerTM,ChristianJacobi2012MainframeTM} |
| often use locking within those software fallbacks, and thus |
| suffer (albeit hopefully rarely) from whatever concurrency |
| limitations that these locking solutions suffer from. |
| |
| Therefore, as of 2021, all practical solutions to the |
| concurrent double-ended queue problem fail to provide full |
| concurrency in at least some circumstances, including the |
| compound double-ended queue. |
| }\QuickQuizEnd |
| |
| \subsubsection{Double-Ended Queue Discussion} |
| \label{sec:SMPdesign:Double-Ended Queue Discussion} |
| |
| The compound implementation is somewhat more complex than the |
| hashed variant presented in |
| \cref{sec:SMPdesign:Hashed Double-Ended Queue}, |
| but is still reasonably simple. |
| Of course, a more intelligent rebalancing scheme could be arbitrarily |
| complex, but the simple scheme shown here has been shown to |
| perform well compared to software |
| alternatives~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque} |
| and even compared to algorithms using hardware |
| assist~\cite{DavidDice:2010:SCA:HTM:deque}. |
| Nevertheless, the best we can hope for from such a scheme |
is 2x scalability, as at most two threads can be holding the
double-ended queue's locks concurrently.
| This limitation also applies to algorithms based on \IXacrl{nbs}, |
| such as the compare-and-swap-based dequeue algorithm of |
| Michael~\cite{DBLP:conf/europar/Michael03}.\footnote{ |
| This paper is interesting in that it showed that special |
| double-compare-and-swap (DCAS) instructions are not needed |
| for lock-free implementations of double-ended queues. |
| Instead, the common compare-and-swap (e.g., x86 cmpxchg) |
| suffices.} |
| |
| \QuickQuiz{ |
| Why are there not one but two solutions to the double-ended queue |
| problem? |
| }\QuickQuizAnswer{ |
| There are actually at least three. |
| The third, by Dominik Dingel, makes interesting use of |
| reader-writer locking, and may be found in \path{lockrwdeq.c}. |
| |
| And so there is not one, but rather three solutions to the |
| lock-based double-ended queue problem on |
| \cpageref{sec:SMPdesign:Problems Double-Ended Queue}! |
| }\QuickQuizEnd |
| |
| In fact, as noted by Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque}, |
| an unsynchronized single-threaded double-ended queue significantly |
| outperforms any of the parallel implementations they studied. |
| Therefore, the key point is that there can be significant overhead enqueuing to |
| or dequeuing from a shared queue, regardless of implementation. |
| This should come as no surprise in light of the material in |
| \cref{chp:Hardware and its Habits}, given the strict |
| first-in-first-out (FIFO) nature of these queues. |
| |
| Furthermore, these strict FIFO queues are strictly FIFO only with |
| respect to |
| \emph{linearization points}~\cite{Herlihy:1990:LCC:78969.78972}\footnote{ |
| In short, a linearization point is a single point within a given |
| function where that function can be said to have taken effect. |
| In this lock-based implementation, the linearization points |
| can be said to be anywhere within the critical section that |
| does the work.} |
that are not visible to the caller.
In fact, in these examples, the linearization points are buried in
the lock-based critical sections.
| These queues are not strictly FIFO with respect to (say) the times at which |
| the individual operations started~\cite{AndreasHaas2012FIFOisnt}. |
| This indicates that the strict FIFO property is not all that valuable |
| in concurrent programs, and in fact, Kirsch et al.\ present less-strict |
| queues that provide improved performance and |
| scalability~\cite{ChristophMKirsch2012FIFOisntTR}.\footnote{ |
| Nir Shavit produced relaxed stacks for roughly the same |
| reasons~\cite{Shavit:2011:DSM:1897852.1897873}. |
| This situation leads some to believe that the linearization |
| points are useful to theorists rather than developers, and |
| leads others to wonder to what extent the designers of such |
| data structures and algorithms were considering the needs |
| of their users.} |
| All that said, if you are pushing all the data used by your concurrent |
| program through a single queue, you really need to rethink your |
| overall design. |
| |
| \subsection{Partitioning Example Discussion} |
| \label{sec:SMPdesign:Partitioning Example Discussion} |
| |
| The optimal solution to the dining philosophers problem given in |
| the answer to the Quick Quiz in |
| \cref{sec:SMPdesign:Dining Philosophers Problem} |
| is an excellent example of ``horizontal parallelism'' or |
| ``data parallelism''. |
| The synchronization overhead in this case is nearly (or even exactly) |
| zero. |
| In contrast, the double-ended |
| queue implementations are examples of ``vertical parallelism'' or |
| ``pipelining'', given that data moves from one thread to another. |
| The tighter coordination required for pipelining in turn requires |
| larger units of work to obtain a given level of \IX{efficiency}. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| The tandem double-ended queue runs about twice as fast as |
| the hashed double-ended queue, even when I increase the |
| size of the hash table to an insanely large number. |
| Why is that? |
| }\QuickQuizAnswerB{ |
| The hashed double-ended queue's locking design only permits |
| one thread at a time at each end, and further requires |
| two lock acquisitions for each operation. |
| The tandem double-ended queue also permits one thread at a time |
| at each end, and in the common case requires only one lock |
| acquisition per operation. |
| Therefore, the tandem double-ended queue should be expected to |
| outperform the hashed double-ended queue. |
| |
| Can you create a double-ended queue that allows multiple |
| concurrent operations at each end? |
| If so, how? |
| If not, why not? |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Is there a significantly better way of handling concurrency |
| for double-ended queues? |
| }\QuickQuizAnswerE{ |
| One approach is to transform the problem to be solved |
| so that multiple double-ended queues can be used in parallel, |
| allowing the simpler single-lock double-ended queue to be used, |
| and perhaps also replace each double-ended queue with a pair of |
| conventional single-ended queues. |
| Without such ``horizontal scaling'', the speedup is limited |
| to 2.0. |
| In contrast, horizontal-scaling designs can achieve very large |
| speedups, and are especially attractive if there are multiple threads |
	working either end of the queue, because in this
	multiple-thread case the double-ended queue
	simply cannot provide strong ordering guarantees.
| After all, the fact that a given thread removed an item first |
| in no way implies that it will process that item |
| first~\cite{AndreasHaas2012FIFOisnt}. |
| And if there are no guarantees, we may as well obtain the |
| performance benefits that come with refusing to provide these |
| guarantees. |
| % about twice as fast as hashed version on 4.2GHz Power. |
| |
| Regardless of whether or not the problem can be transformed |
| to use multiple queues, it is worth asking whether work can |
| be batched so that each enqueue and dequeue operation corresponds |
| to larger units of work. |
| This batching approach decreases contention on the queue data |
| structures, which increases both performance and scalability, |
| as will be seen in |
| \cref{sec:SMPdesign:Synchronization Granularity}. |
| After all, if you must incur high synchronization overheads, |
| be sure you are getting your money's worth. |
| |
| Other researchers are working on other ways to take advantage |
| of limited ordering guarantees in |
| queues~\cite{ChristophMKirsch2012FIFOisntTR}. |
| }\QuickQuizEndE |
| } |
| |
| These two examples show just how powerful partitioning can be in |
| devising parallel algorithms. |
| \Cref{sec:SMPdesign:Locking Granularity and Performance} |
| looks briefly at a third example, matrix multiply. |
| However, all three of these examples beg for more and better design |
| criteria for parallel programs, a topic taken up in the next section. |