| % SMPdesign/partexercises.tex |
| |
| \section{Partitioning Exercises} |
| \label{sec:SMPdesign:Partitioning Exercises} |
| |
| This section uses a pair of exercises (the classic Dining Philosophers |
| problem and a double-ended queue) to demonstrate the value of partitioning. |
| |
| \subsection{Dining Philosophers Problem} |
| \label{sec:SMPdesign:Dining Philosophers Problem} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5} |
| \end{center} |
| \caption{Dining Philosophers Problem} |
| \ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem}{Kornilios Kourtis} |
| \end{figure} |
| |
| Figure~\ref{fig:SMPdesign:Dining Philosophers Problem} shows a diagram |
| of the classic Dining Philosophers problem~\cite{Dijkstra1971HOoSP}. |
| This problem features five philosophers who do nothing but think and |
| eat a ``very difficult kind of spaghetti'' which requires two forks |
| to eat. |
| A given philosopher is permitted to use only the forks to his or her |
| immediate right and left, and once a philosopher picks up a fork, |
| he or she will not put it down until sated.\footnote{ |
| Readers who have difficulty imagining a food that requires |
| two forks are invited to instead think in terms of chopsticks.} |
| |
| \begin{figure*}[tb] |
| \begin{center} |
| \resizebox{5in}{!}{\includegraphics{cartoons/Dining-philosophers}} |
| \end{center} |
| \caption{Partial Starvation Is Also Bad} |
| \ContributedBy{Figure}{fig:cpu:Partial Starvation Is Also Bad}{Melissa Broussard} |
| \end{figure*} |
| |
| The object is to construct an algorithm that, quite literally, |
| prevents starvation. |
| One starvation scenario would be if all of the philosophers picked up |
| their leftmost forks simultaneously. |
| Because none of them would put down their fork until after they ate, |
| and because none of them may pick up their second fork until at least |
| one has finished eating, they all starve. |
| Please note that it is not sufficient to allow at least one philosopher |
| to eat. |
| As Figure~\ref{fig:cpu:Partial Starvation Is Also Bad} |
| shows, starvation of even a few of the philosophers is to be avoided. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5TB} |
| \end{center} |
| \caption{Dining Philosophers Problem, Textbook Solution} |
| \ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Textbook Solution}{Kornilios Kourtis} |
| \end{figure} |
| |
| Dijkstra's solution used a global semaphore, which works fine assuming |
| negligible communications delays, an assumption that became invalid |
| in the late 1980s or early 1990s.\footnote{ |
| It is all too easy to denigrate Dijkstra from the viewpoint |
| of the year 2012, more than 40 years after the fact. |
| If you still feel the need to denigrate Dijkstra, my advice |
| is to publish something, wait 40 years, and then see |
| how \emph{your} words stood the test of time.} |
| Therefore, recent solutions number the forks as shown in |
| Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Textbook Solution}. |
| Each philosopher picks up the lowest-numbered fork next to his or her |
| plate, then picks up the highest-numbered fork. |
| The philosopher sitting in the uppermost position in the diagram thus |
| picks up the leftmost fork first, then the rightmost fork, while the |
| rest of the philosophers instead pick up their rightmost fork first. |
| Because two of the philosophers will attempt to pick up fork~1 first, |
| and because only one of those two philosophers will succeed, |
| there will be five forks available to four philosophers. |
| At least one of these four will be guaranteed to have two forks, |
| and thus be able to proceed eating. |
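In C, this fork-numbering discipline might be sketched as follows, with POSIX mutexes standing in for forks and zero-based fork numbers; the helper names are illustrative rather than taken from any published solution:

```c
#include <pthread.h>

#define NR_FORKS 5

/* One mutex per fork; forks are numbered 0 through NR_FORKS - 1. */
pthread_mutex_t fork_lock[NR_FORKS];

/* Philosopher i sits between forks i and (i + 1) % NR_FORKS. */
int first_fork(int i)
{
	int left = i;
	int right = (i + 1) % NR_FORKS;

	return left < right ? left : right;  /* lowest-numbered fork first */
}

int second_fork(int i)
{
	int left = i;
	int right = (i + 1) % NR_FORKS;

	return left < right ? right : left;  /* then the highest-numbered */
}

void philosopher_eat(int i)
{
	pthread_mutex_lock(&fork_lock[first_fork(i)]);
	pthread_mutex_lock(&fork_lock[second_fork(i)]);
	/* Eat, then put both forks down. */
	pthread_mutex_unlock(&fork_lock[second_fork(i)]);
	pthread_mutex_unlock(&fork_lock[first_fork(i)]);
}
```

Because the philosopher seated between the highest- and lowest-numbered forks reaches for the lowest-numbered fork first, no cycle of waiting philosophers can form.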
| |
| This general technique of numbering resources and acquiring them in |
| numerical order is heavily used as a deadlock-prevention technique. |
| However, it is easy to imagine a sequence of events that will result |
| in only one philosopher eating at a time even though all are hungry: |
| |
| \begin{enumerate} |
| \item P2 picks up fork~1, preventing P1 from taking a fork. |
| \item P3 picks up fork~2. |
| \item P4 picks up fork~3. |
| \item P5 picks up fork~4. |
| \item P5 picks up fork~5 and eats. |
| \item P5 puts down forks~4 and 5. |
| \item P4 picks up fork~4 and eats. |
| \end{enumerate} |
| |
| In short, this algorithm can result in only one philosopher eating at |
| a given time, even when all five philosophers are hungry, |
| despite the fact that there are more than enough forks for two |
| philosophers to eat concurrently. |
| |
| Please think about ways of partitioning the Dining Philosophers Problem |
| before reading further. |
| \cleardoublepage |
| |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher4part-b} |
| \end{center} |
| \caption{Dining Philosophers Problem, Partitioned} |
| \ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Partitioned}{Kornilios Kourtis} |
| \end{figure} |
| |
| One approach is shown in |
| Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Partitioned}, |
| which includes four philosophers rather than five to better illustrate the |
partitioning technique.
| Here the upper and rightmost philosophers share a pair of forks, |
| while the lower and leftmost philosophers share another pair of forks. |
| If all philosophers are simultaneously hungry, at least two will |
| always be able to eat concurrently. |
| In addition, as shown in the figure, the forks can now be bundled |
| so that the pair are picked up and put down simultaneously, simplifying |
| the acquisition and release algorithms. |
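The bundled approach might be sketched as follows, with one mutex representing each bundled pair of forks; the mapping from philosophers to bundles is illustrative:

```c
#include <pthread.h>

#define NR_PHILOSOPHERS 4
#define NR_BUNDLES (NR_PHILOSOPHERS / 2)

/* One mutex per bundled pair of forks. */
pthread_mutex_t bundle_lock[NR_BUNDLES];

/* Philosophers 0 and 1 share bundle 0, while philosophers 2 and 3
 * share bundle 1, so picking up (or putting down) both forks in a
 * bundle is a single lock operation. */
int bundle_of(int philosopher)
{
	return philosopher / 2;
}

void partitioned_eat(int philosopher)
{
	pthread_mutex_lock(&bundle_lock[bundle_of(philosopher)]);
	/* Eat, holding both forks in the bundle. */
	pthread_mutex_unlock(&bundle_lock[bundle_of(philosopher)]);
}
```

Because the two bundles are independent, one philosopher from each pair can always eat concurrently.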
| |
| \QuickQuiz{} |
| Is there a better solution to the Dining |
| Philosophers Problem? |
| \QuickQuizAnswer{ |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM} |
| \end{center} |
| \caption{Dining Philosophers Problem, Fully Partitioned} |
| \QContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis} |
| \end{figure} |
| |
| One such improved solution is shown in |
| Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}, |
| where the philosophers are simply provided with an additional |
| five forks. |
| All five philosophers may now eat simultaneously, and there |
| is never any need for philosophers to wait on one another. |
| In addition, this approach offers greatly improved disease control. |
| |
| This solution might seem like cheating to some, but such |
| ``cheating'' is key to finding good solutions to many |
| concurrency problems. |
| } \QuickQuizEnd |
| |
| This is an example of ``horizontal parallelism''~\cite{Inman85} |
| or ``data parallelism'', |
| so named because there is no dependency among the pairs of philosophers. |
| In a horizontally parallel data-processing system, a given item of data |
| would be processed by only one of a replicated set of software |
| components. |
| |
| \QuickQuiz{} |
| And in just what sense can this ``horizontal parallelism'' be |
| said to be ``horizontal''? |
| \QuickQuizAnswer{ |
| Inman was working with protocol stacks, which are normally |
| depicted vertically, with the application on top and the |
| hardware interconnect on the bottom. |
| Data flows up and down this stack. |
| ``Horizontal parallelism'' processes packets from different network |
| connections in parallel, while ``vertical parallelism'' |
| handles different protocol-processing steps for a given |
| packet in parallel. |
| |
| ``Vertical parallelism'' is also called ``pipelining''. |
| } \QuickQuizEnd |
| |
| \subsection{Double-Ended Queue} |
| \label{sec:SMPdesign:Double-Ended Queue} |
| |
| A double-ended queue is a data structure containing a list of elements |
| that may be inserted or removed from either end~\cite{Knuth73}. |
| It has been claimed that a lock-based implementation permitting |
| concurrent operations on both ends of the double-ended queue is |
| difficult~\cite{DanGrossman2007TMGCAnalogy}. |
| This section shows how a partitioning design strategy can result |
| in a reasonably simple implementation, looking at three |
| general approaches in the following sections. |
| |
| \subsubsection{Left- and Right-Hand Locks} |
| \label{sec:SMPdesign:Left- and Right-Hand Locks} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeq}} |
| \end{center} |
| \caption{Double-Ended Queue With Left- and Right-Hand Locks} |
| \label{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks} |
| \end{figure} |
| |
| One seemingly straightforward approach would be to use a doubly |
| linked list with a left-hand lock |
| for left-hand-end enqueue and dequeue operations along with a right-hand |
| lock for right-hand-end operations, as shown in |
| Figure~\ref{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}. |
| However, the problem with this approach is that the two locks' |
| domains must overlap when there are fewer than four elements on the |
| list. |
| This overlap is due to the fact that removing any given element affects |
| not only that element, but also its left- and right-hand neighbors. |
| These domains are indicated by color in the figure, with blue |
| with downward stripes indicating |
| the domain of the left-hand lock, red with upward stripes |
| indicating the domain of the right-hand |
| lock, and purple (with no stripes) indicating overlapping domains. |
| Although it is possible to create an algorithm that works this way, |
| the fact that it has no fewer than five special cases should raise |
| a big red flag, especially given that concurrent activity at the other |
| end of the list can shift the queue from one special case to another |
| at any time. |
| It is far better to consider other designs. |
| |
| \subsubsection{Compound Double-Ended Queue} |
| \label{sec:SMPdesign:Compound Double-Ended Queue} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqpair}} |
| \end{center} |
| \caption{Compound Double-Ended Queue} |
| \label{fig:SMPdesign:Compound Double-Ended Queue} |
| \end{figure} |
| |
| One way of forcing non-overlapping lock domains is shown in |
| Figure~\ref{fig:SMPdesign:Compound Double-Ended Queue}. |
| Two separate double-ended queues are run in tandem, each protected by |
| its own lock. |
| This means that elements must occasionally be shuttled from one of |
| the double-ended queues to the other, in which case both locks must |
| be held. |
| A simple lock hierarchy may be used to avoid deadlock, for example, |
| always acquiring the left-hand lock before acquiring the right-hand lock. |
| This will be much simpler than applying two locks to the same |
| double-ended queue, as we can unconditionally left-enqueue elements |
| to the left-hand queue and right-enqueue elements to the right-hand |
| queue. |
| The main complication arises when dequeuing from an empty queue, in |
| which case it is necessary to: |
| |
| \begin{enumerate} |
| \item If holding the right-hand lock, release it and acquire the |
| left-hand lock. |
| \item Acquire the right-hand lock. |
| \item Rebalance the elements across the two queues. |
| \item Remove the required element if there is one. |
| \item Release both locks. |
| \end{enumerate} |
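The lock hierarchy underlying these steps can be captured in a pair of helpers. The following is a sketch using POSIX mutexes in place of the implementation's spinlocks, with illustrative names:

```c
#include <pthread.h>

/* Stand-ins for the tandem queue's two locks. */
pthread_mutex_t llock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t rlock = PTHREAD_MUTEX_INITIALIZER;

/* Every code path that needs both locks acquires them in the same
 * order, left-hand lock first, so no deadlock cycle can form. */
void lock_both(void)
{
	pthread_mutex_lock(&llock);
	pthread_mutex_lock(&rlock);
}

void unlock_both(void)
{
	pthread_mutex_unlock(&rlock);
	pthread_mutex_unlock(&llock);
}
```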
| |
| \QuickQuiz{} |
| In this compound double-ended queue implementation, what should |
| be done if the queue has become non-empty while releasing |
| and reacquiring the lock? |
| \QuickQuizAnswer{ |
| In this case, simply dequeue an item from the non-empty |
| queue, release both locks, and return. |
| } \QuickQuizEnd |
| |
| The rebalancing operation might well shuttle a given element back |
| and forth between the two queues, wasting time and possibly requiring |
| workload-dependent heuristics to obtain optimal performance. |
| Although this might well be the best approach in some cases, it is |
| interesting to try for an algorithm with greater determinism. |
| |
| \subsubsection{Hashed Double-Ended Queue} |
| \label{sec:SMPdesign:Hashed Double-Ended Queue} |
| |
| One of the simplest and most effective ways to deterministically |
| partition a data structure is to hash it. |
| It is possible to trivially hash a double-ended queue by assigning |
| each element a sequence number based on its position in the list, |
| so that the first element left-enqueued into an empty queue is numbered |
| zero and the first element right-enqueued into an empty queue is numbered |
| one. |
| A series of elements left-enqueued into an otherwise-idle queue would |
be assigned decreasing numbers ($-1$, $-2$, $-3$, \ldots), while a series of
elements right-enqueued into an otherwise-idle queue would be assigned
increasing numbers (2, 3, 4, \ldots).
| A key point is that it is not necessary to actually represent a given |
| element's number, as this number will be implied by its position in |
| the queue. |
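Although the numbers are only implied, the mapping from a given number to its hash chain is worth making concrete. The following sketch assumes four hash chains; note that the negative numbers produced by left-enqueues need an adjustment, because the C \co{\%} operator truncates toward zero:

```c
#define PDEQ_N_CHAINS 4  /* illustrative; any chain count works */

/* Map an element's implied sequence number to its hash chain.
 * Negative sequence numbers (from left-enqueues) must be adjusted
 * upward because C's % operator truncates toward zero. */
int chain_of(long seqnum)
{
	long m = seqnum % PDEQ_N_CHAINS;

	return (int)(m < 0 ? m + PDEQ_N_CHAINS : m);
}
```

For example, elements 0 through 3 land on chains 0 through 3, while element $-1$ lands on chain~3, immediately to the left of element~0's chain.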
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash}} |
| \end{center} |
| \caption{Hashed Double-Ended Queue} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue} |
| \end{figure} |
| |
| Given this approach, we assign one lock to guard the left-hand index, |
| one to guard the right-hand index, and one lock for each hash chain. |
| Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue} shows the resulting |
| data structure given four hash chains. |
| Note that the lock domains do not overlap, and that deadlock is avoided |
| by acquiring the index locks before the chain locks, and by never |
| acquiring more than one lock of each type (index or chain) at a time. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash1R}} |
| \end{center} |
| \caption{Hashed Double-Ended Queue After Insertions} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue After Insertions} |
| \end{figure} |
| |
| Each hash chain is itself a double-ended queue, and in this example, |
| each holds every fourth element. |
| The uppermost portion of |
| Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions} |
| shows the state after a single element (``R1'') has been |
| right-enqueued, with the right-hand index having been incremented to |
| reference hash chain~2. |
| The middle portion of this same figure shows the state after |
| three more elements have been right-enqueued. |
As you can see, the indexes are back to their initial states
(see Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue}); however,
each hash chain is now non-empty.
| The lower portion of this figure shows the state after three additional |
| elements have been left-enqueued and an additional element has been |
| right-enqueued. |
| |
| From the last state shown in |
| Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}, |
| a left-dequeue operation would return element ``L-2'' and leave |
| the left-hand index referencing hash chain~2, which would then |
| contain only a single element (``R2''). |
| In this state, a left-enqueue running concurrently with a right-enqueue |
| would result in lock contention, but the probability of such contention |
| can be reduced to arbitrarily low levels by using a larger hash table. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{1.5in}{!}{\includegraphics{SMPdesign/lockdeqhashlots}} |
| \end{center} |
| \caption{Hashed Double-Ended Queue With 12 Elements} |
| \label{fig:SMPdesign:Hashed Double-Ended Queue With 12 Elements} |
| \end{figure} |
| |
| Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue With 12 Elements} |
| shows how 12 elements would be organized in a four-hash-bucket |
| parallel double-ended queue. |
| Each underlying single-lock double-ended queue holds a one-quarter |
| slice of the full parallel double-ended queue. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct pdeq { |
| 2 spinlock_t llock; |
| 3 int lidx; |
| 4 spinlock_t rlock; |
| 5 int ridx; |
| 6 struct deq bkt[DEQ_N_BKTS]; |
| 7 }; |
| \end{verbatim} |
| } |
| \caption{Lock-Based Parallel Double-Ended Queue Data Structure} |
| \label{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure} |
| \end{figure} |
| |
| Figure~\ref{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure} |
| shows the corresponding C-language data structure, assuming an |
| existing \co{struct deq} that provides a trivially locked |
| double-ended-queue implementation. |
| This data structure contains the left-hand lock on line~2, |
| the left-hand index on line~3, the right-hand lock on line~4 |
| (which is cache-aligned in the actual implementation), |
| the right-hand index on line~5, and, finally, the hashed array |
| of simple lock-based double-ended queues on line~6. |
| A high-performance implementation would of course use padding or special |
| alignment directives to avoid false sharing. |
| |
| \begin{figure*}[bp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct cds_list_head *pdeq_pop_l(struct pdeq *d) |
| 2 { |
| 3 struct cds_list_head *e; |
| 4 int i; |
| 5 |
| 6 spin_lock(&d->llock); |
| 7 i = moveright(d->lidx); |
| 8 e = deq_pop_l(&d->bkt[i]); |
| 9 if (e != NULL) |
| 10 d->lidx = i; |
| 11 spin_unlock(&d->llock); |
| 12 return e; |
| 13 } |
| 14 |
| 15 struct cds_list_head *pdeq_pop_r(struct pdeq *d) |
| 16 { |
| 17 struct cds_list_head *e; |
| 18 int i; |
| 19 |
| 20 spin_lock(&d->rlock); |
| 21 i = moveleft(d->ridx); |
| 22 e = deq_pop_r(&d->bkt[i]); |
| 23 if (e != NULL) |
| 24 d->ridx = i; |
| 25 spin_unlock(&d->rlock); |
| 26 return e; |
| 27 } |
| 28 |
| 29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d) |
| 30 { |
| 31 int i; |
| 32 |
| 33 spin_lock(&d->llock); |
| 34 i = d->lidx; |
| 35 deq_push_l(e, &d->bkt[i]); |
| 36 d->lidx = moveleft(d->lidx); |
| 37 spin_unlock(&d->llock); |
| 38 } |
| 39 |
| 40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d) |
| 41 { |
| 42 int i; |
| 43 |
| 44 spin_lock(&d->rlock); |
| 45 i = d->ridx; |
| 46 deq_push_r(e, &d->bkt[i]); |
| 47 d->ridx = moveright(d->ridx); |
| 48 spin_unlock(&d->rlock); |
| 49 } |
| \end{verbatim} |
| } |
| \caption{Lock-Based Parallel Double-Ended Queue Implementation} |
| \label{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation} |
| \end{figure*} |
| |
| Figure~\ref{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation} |
| (\co{lockhdeq.c}) |
| shows the implementation of the enqueue and dequeue functions.\footnote{ |
| One could easily create a polymorphic implementation in any |
| number of languages, but doing so is left as an exercise for |
| the reader.} |
| Discussion will focus on the left-hand operations, as the right-hand |
| operations are trivially derived from them. |
| |
| Lines~1-13 show \co{pdeq_pop_l()}, which left-dequeues and returns |
| an element if possible, returning \co{NULL} otherwise. |
| Line~6 acquires the left-hand spinlock, and line~7 computes the |
| index to be dequeued from. |
| Line~8 dequeues the element, and, if line~9 finds the result to be |
| non-\co{NULL}, line~10 records the new left-hand index. |
| Either way, line~11 releases the lock, and, finally, line~12 returns |
| the element if there was one, or \co{NULL} otherwise. |
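The \co{moveleft()} and \co{moveright()} helpers invoked by this code are not shown in the figure. One plausible implementation, assuming that \co{DEQ_N_BKTS} is a power of two, simply steps the index circularly through the array of hash chains:

```c
#define DEQ_N_BKTS 4  /* assumed to be a power of two */

/* Step a bucket index one hash chain to the right, wrapping
 * around the end of the bkt[] array. */
int moveright(int idx)
{
	return (idx + 1) & (DEQ_N_BKTS - 1);
}

/* Step a bucket index one hash chain to the left, again wrapping. */
int moveleft(int idx)
{
	return (idx + DEQ_N_BKTS - 1) & (DEQ_N_BKTS - 1);
}
```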
| |
Lines~29-38 show \co{pdeq_push_l()}, which left-enqueues the specified
| element. |
| Line~33 acquires the left-hand lock, and line~34 picks up the left-hand |
| index. |
| Line~35 left-enqueues the specified element onto the double-ended queue |
| indexed by the left-hand index. |
| Line~36 then updates the left-hand index and line~37 releases the lock. |
| |
| As noted earlier, the right-hand operations are completely analogous |
| to their left-handed counterparts, so their analysis is left as an |
| exercise for the reader. |
| |
| \QuickQuiz{} |
| Is the hashed double-ended queue a good solution? |
| Why or why not? |
| \QuickQuizAnswer{ |
The best way to answer this is to run \co{lockhdeq.c} on
| a number of different multiprocessor systems, and you are |
| encouraged to do so in the strongest possible terms. |
| One reason for concern is that each operation on this |
| implementation must acquire not one but two locks. |
| % Getting about 500 nanoseconds per element when used as |
| % a queue on a 4.2GHz Power system. This is roughly the same as |
| % the version covered by a single lock. Sequential (unlocked |
| % variant is more than an order of magnitude faster! |
| |
| The first well-designed performance study will be cited.\footnote{ |
| The studies by Dalessandro |
| et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque} |
| and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are |
| good starting points.} |
| Do not forget to compare to a sequential implementation! |
| } \QuickQuizEnd |
| |
| \subsubsection{Compound Double-Ended Queue Revisited} |
| \label{sec:SMPdesign:Compound Double-Ended Queue Revisited} |
| |
| This section revisits the compound double-ended queue, using a trivial |
| rebalancing scheme that moves all the elements from the non-empty |
| queue to the now-empty queue. |
| |
| \QuickQuiz{} |
| Move \emph{all} the elements to the queue that became empty? |
| In what possible universe is this brain-dead solution in any |
| way optimal??? |
| \QuickQuizAnswer{ |
| It is optimal in the case where data flow switches direction only |
| rarely. |
| It would of course be an extremely poor choice if the double-ended |
| queue was being emptied from both ends concurrently. |
| This of course raises another question, namely, in what possible |
| universe emptying from both ends concurrently would be a reasonable |
| thing to do. |
| Work-stealing queues are one possible answer to this question. |
| } \QuickQuizEnd |
| |
| In contrast to the hashed implementation presented in |
| the previous section, the compound implementation will build on |
| a sequential implementation of a double-ended queue that uses |
| neither locks nor atomic operations. |
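Such a sequential double-ended queue might be sketched as follows. The \co{cds_list} primitives mirror those of the userspace-RCU library (real code would instead use that library's \co{urcu/list.h} header rather than redefining them), and the \co{deq} function names match those called by the compound implementation:

```c
#include <stddef.h>

/* Minimal stand-ins for the userspace-RCU list primitives; real
 * code would use the liburcu header instead of defining these. */
struct cds_list_head {
	struct cds_list_head *next, *prev;
};

void CDS_INIT_LIST_HEAD(struct cds_list_head *h)
{
	h->next = h->prev = h;
}

void cds_list_add(struct cds_list_head *e, struct cds_list_head *h)
{
	e->next = h->next;
	e->prev = h;
	h->next->prev = e;
	h->next = e;
}

void cds_list_add_tail(struct cds_list_head *e, struct cds_list_head *h)
{
	e->prev = h->prev;
	e->next = h;
	h->prev->next = e;
	h->prev = e;
}

void cds_list_del(struct cds_list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

int cds_list_empty(struct cds_list_head *h)
{
	return h->next == h;
}

/* The sequential double-ended queue: a bare circular list header,
 * with neither locks nor atomic operations. */
struct deq {
	struct cds_list_head chain;
};

void deq_init(struct deq *p)
{
	CDS_INIT_LIST_HEAD(&p->chain);
}

struct cds_list_head *deq_pop_l(struct deq *p)
{
	struct cds_list_head *e = p->chain.next;

	if (cds_list_empty(&p->chain))
		return NULL;
	cds_list_del(e);
	return e;
}

struct cds_list_head *deq_pop_r(struct deq *p)
{
	struct cds_list_head *e = p->chain.prev;

	if (cds_list_empty(&p->chain))
		return NULL;
	cds_list_del(e);
	return e;
}

void deq_push_l(struct cds_list_head *e, struct deq *p)
{
	cds_list_add(e, &p->chain);	/* insert at the left-hand end */
}

void deq_push_r(struct cds_list_head *e, struct deq *p)
{
	cds_list_add_tail(e, &p->chain);	/* insert at the right-hand end */
}
```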
| |
| \begin{figure*}[bp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct cds_list_head *pdeq_pop_l(struct pdeq *d) |
| 2 { |
| 3 struct cds_list_head *e; |
| 4 |
| 5 spin_lock(&d->llock); |
| 6 e = deq_pop_l(&d->ldeq); |
| 7 if (e == NULL) { |
| 8 spin_lock(&d->rlock); |
| 9 e = deq_pop_l(&d->rdeq); |
| 10 cds_list_splice(&d->rdeq.chain, &d->ldeq.chain); |
| 11 CDS_INIT_LIST_HEAD(&d->rdeq.chain); |
| 12 spin_unlock(&d->rlock); |
| 13 } |
| 14 spin_unlock(&d->llock); |
| 15 return e; |
| 16 } |
| 17 |
| 18 struct cds_list_head *pdeq_pop_r(struct pdeq *d) |
| 19 { |
| 20 struct cds_list_head *e; |
| 21 |
| 22 spin_lock(&d->rlock); |
| 23 e = deq_pop_r(&d->rdeq); |
| 24 if (e == NULL) { |
| 25 spin_unlock(&d->rlock); |
| 26 spin_lock(&d->llock); |
| 27 spin_lock(&d->rlock); |
| 28 e = deq_pop_r(&d->rdeq); |
| 29 if (e == NULL) { |
| 30 e = deq_pop_r(&d->ldeq); |
| 31 cds_list_splice(&d->ldeq.chain, &d->rdeq.chain); |
| 32 CDS_INIT_LIST_HEAD(&d->ldeq.chain); |
| 33 } |
| 34 spin_unlock(&d->llock); |
| 35 } |
| 36 spin_unlock(&d->rlock); |
| 37 return e; |
| 38 } |
| 39 |
| 40 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d) |
| 41 { |
| 42 int i; |
| 43 |
| 44 spin_lock(&d->llock); |
| 45 deq_push_l(e, &d->ldeq); |
| 46 spin_unlock(&d->llock); |
| 47 } |
| 48 |
| 49 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d) |
| 50 { |
| 51 int i; |
| 52 |
| 53 spin_lock(&d->rlock); |
| 54 deq_push_r(e, &d->rdeq); |
| 55 spin_unlock(&d->rlock); |
| 56 } |
| \end{verbatim} |
| } |
| \caption{Compound Parallel Double-Ended Queue Implementation} |
| \label{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| \end{figure*} |
| |
| Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| shows the implementation. |
| Unlike the hashed implementation, this compound implementation is |
| asymmetric, so that we must consider the \co{pdeq_pop_l()} |
| and \co{pdeq_pop_r()} implementations separately. |
| |
| \QuickQuiz{} |
| Why can't the compound parallel double-ended queue |
| implementation be symmetric? |
| \QuickQuizAnswer{ |
| The need to avoid deadlock by imposing a lock hierarchy |
| forces the asymmetry, just as it does in the fork-numbering |
| solution to the Dining Philosophers Problem |
| (see Section~\ref{sec:SMPdesign:Dining Philosophers Problem}). |
| } \QuickQuizEnd |
| |
| The \co{pdeq_pop_l()} implementation is shown on lines~1-16 |
| of the figure. |
Line~5 acquires the left-hand lock, which line~14 releases.
| Line~6 attempts to left-dequeue an element from the left-hand underlying |
| double-ended queue, and, if successful, skips lines~8-13 to simply |
| return this element. |
Otherwise, line~8 acquires the right-hand lock, line~9
left-dequeues an element from the right-hand queue,
line~10 moves any remaining elements on the right-hand
queue to the left-hand queue, line~11 initializes the right-hand queue,
and line~12 releases the right-hand lock.
The element, if any, that was dequeued on line~9 will be returned.
| |
| The \co{pdeq_pop_r()} implementation is shown on lines~18-38 |
| of the figure. |
| As before, line~22 acquires the right-hand lock (and line~36 |
| releases it), and line~23 attempts to right-dequeue an element |
| from the right-hand queue, and, if successful, skips lines~24-35 |
| to simply return this element. |
| However, if line~24 determines that there was no element to dequeue, |
| line~25 releases the right-hand lock and lines~26-27 acquire both |
| locks in the proper order. |
| Line~28 then attempts to right-dequeue an element from the right-hand |
| list again, and if line~29 determines that this second attempt has |
| failed, line~30 right-dequeues an element from the left-hand queue |
| (if there is one available), line~31 moves any remaining elements |
| from the left-hand queue to the right-hand queue, and line~32 |
| initializes the left-hand queue. |
| Either way, line~34 releases the left-hand lock. |
| |
| \QuickQuiz{} |
| Why is it necessary to retry the right-dequeue operation |
| on line~28 of |
| Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}? |
| \QuickQuizAnswer{ |
| This retry is necessary because some other thread might have |
| enqueued an element between the time that this thread dropped |
| \co{d->rlock} on line~25 and the time that it reacquired this |
| same lock on line~27. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Surely the left-hand lock must \emph{sometimes} be available!!! |
| So why is it necessary that line~25 of |
| Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation} |
| unconditionally release the right-hand lock? |
| \QuickQuizAnswer{ |
| It would be possible to use \co{spin_trylock()} to attempt |
| to acquire the left-hand lock when it was available. |
| However, the failure case would still need to drop the |
| right-hand lock and then re-acquire the two locks in order. |
| Making this transformation (and determining whether or not |
| it is worthwhile) is left as an exercise for the reader. |
| } \QuickQuizEnd |
| |
| The \co{pdeq_push_l()} implementation is shown on lines~40-47 of |
| Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}. |
| Line~44 acquires the left-hand spinlock, line~45 left-enqueues the |
| element onto the left-hand queue, and finally line~46 releases |
| the lock. |
The \co{pdeq_push_r()} implementation (shown on lines~49-56)
| is quite similar. |
| |
| \subsubsection{Double-Ended Queue Discussion} |
| \label{sec:SMPdesign:Double-Ended Queue Discussion} |
| |
| The compound implementation is somewhat more complex than the |
| hashed variant presented in |
| Section~\ref{sec:SMPdesign:Hashed Double-Ended Queue}, |
| but is still reasonably simple. |
| Of course, a more intelligent rebalancing scheme could be arbitrarily |
| complex, but the simple scheme shown here has been shown to |
| perform well compared to software |
| alternatives~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque} |
| and even compared to algorithms using hardware |
| assist~\cite{DavidDice:2010:SCA:HTM:deque}. |
Nevertheless, the best we can hope for from such a scheme
is 2x scalability, as at most two threads can be holding the
double-ended queue's locks concurrently.
| This limitation also applies to algorithms based on non-blocking |
| synchronization, such as the compare-and-swap-based dequeue algorithm of |
| Michael~\cite{DBLP:conf/europar/Michael03}.\footnote{ |
| This paper is interesting in that it showed that special |
| double-compare-and-swap (DCAS) instructions are not needed |
| for lock-free implementations of double-ended queues. |
Instead, the common compare-and-swap instruction (e.g., x86 \co{cmpxchg})
| suffices.} |
| |
| In fact, as noted by Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque}, |
| an unsynchronized single-threaded double-ended queue significantly |
| outperforms any of the parallel implementations they studied. |
| Therefore, the key point is that there can be significant overhead enqueuing to |
| or dequeuing from a shared queue, regardless of implementation. |
| This should come as no surprise given the material in |
| Chapter~\ref{chp:Hardware and its Habits}, given the strict |
| FIFO nature of these queues. |
| |
| Furthermore, these strict FIFO queues are strictly FIFO only with |
| respect to |
| \emph{linearization points}~\cite{Herlihy:1990:LCC:78969.78972}\footnote{ |
| In short, a linearization point is a single point within a given |
| function where that function can be said to have taken effect. |
| In this lock-based implementation, the linearization points |
| can be said to be anywhere within the critical section that |
| does the work.} |
that are not visible to the caller.  In fact, in these examples,
the linearization points are buried in the lock-based critical
sections.
| These queues are not strictly FIFO with respect to (say) the times at which |
| the individual operations started~\cite{AndreasHaas2012FIFOisnt}. |
| This indicates that the strict FIFO property is not all that valuable |
| in concurrent programs, and in fact, Kirsch et al.~present less-strict |
| queues that provide improved performance and |
| scalability~\cite{ChristophMKirsch2012FIFOisntTR}.\footnote{ |
| Nir Shavit produced relaxed stacks for roughly the same |
| reasons~\cite{Shavit:2011:DSM:1897852.1897873}. |
| This situation leads some to believe that the linearization |
| points are useful to theorists rather than developers, and |
| leads others to wonder to what extent the designers of such |
| data structures and algorithms were considering the needs |
| of their users.} |
| All that said, if you are pushing all the data used by your concurrent |
| program through a single queue, you really need to rethink your |
| overall design. |
| |
| \subsection{Partitioning Example Discussion} |
| \label{sec:SMPdesign:Partitioning Example Discussion} |
| |
| The optimal solution to the dining philosophers problem given in |
| the answer to the Quick Quiz in |
| Section~\ref{sec:SMPdesign:Dining Philosophers Problem} |
| is an excellent example of ``horizontal parallelism'' or |
| ``data parallelism''. |
| The synchronization overhead in this case is nearly (or even exactly) |
| zero. |
| In contrast, the double-ended |
| queue implementations are examples of ``vertical parallelism'' or |
| ``pipelining'', given that data moves from one thread to another. |
| The tighter coordination required for pipelining in turn requires |
| larger units of work to obtain a given level of efficiency. |
| |
| \QuickQuiz{} |
| The tandem double-ended queue runs about twice as fast as |
| the hashed double-ended queue, even when I increase the |
| size of the hash table to an insanely large number. |
| Why is that? |
| \QuickQuizAnswer{ |
| The hashed double-ended queue's locking design only permits |
| one thread at a time at each end, and further requires |
| two lock acquisitions for each operation. |
| The tandem double-ended queue also permits one thread at a time |
| at each end, and in the common case requires only one lock |
| acquisition per operation. |
| Therefore, the tandem double-ended queue should be expected to |
| outperform the hashed double-ended queue. |
| |
Can you create a double-ended queue that allows multiple
| concurrent operations at each end? |
| If so, how? If not, why not? |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Is there a significantly better way of handling concurrency |
| for double-ended queues? |
| \QuickQuizAnswer{ |
| One approach is to transform the problem to be solved |
| so that multiple double-ended queues can be used in parallel, |
| allowing the simpler single-lock double-ended queue to be used, |
| and perhaps also replace each double-ended queue with a pair of |
| conventional single-ended queues. |
| Without such ``horizontal scaling'', the speedup is limited |
| to 2.0. |
| In contrast, horizontal-scaling designs can achieve very large |
| speedups, and are especially attractive if there are multiple threads |
| working either end of the queue, because in this |
multiple-thread case the queue
| simply cannot provide strong ordering guarantees. |
| After all, the fact that a given thread removed an item first |
| in no way implies that it will process that item |
| first~\cite{AndreasHaas2012FIFOisnt}. |
| And if there are no guarantees, we may as well obtain the |
| performance benefits that come with refusing to provide these |
| guarantees. |
| % about twice as fast as hashed version on 4.2GHz Power. |
| |
| Regardless of whether or not the problem can be transformed |
| to use multiple queues, it is worth asking whether work can |
| be batched so that each enqueue and dequeue operation corresponds |
| to larger units of work. |
| This batching approach decreases contention on the queue data |
| structures, which increases both performance and scalability, |
| as will be seen in |
| Section~\ref{sec:SMPdesign:Synchronization Granularity}. |
| After all, if you must incur high synchronization overheads, |
| be sure you are getting your money's worth. |
| |
| Other researchers are working on other ways to take advantage |
| of limited ordering guarantees in |
| queues~\cite{ChristophMKirsch2012FIFOisntTR}. |
| } \QuickQuizEnd |
| |
| These two examples show just how powerful partitioning can be in |
| devising parallel algorithms. |
| Section~\ref{sec:SMPdesign:Locking Granularity and Performance} |
| looks briefly at a third example, matrix multiply. |
| However, all three of these examples beg for more and better design |
| criteria for parallel programs, a topic taken up in the next section. |