% SMPdesign/partexercises.tex
\section{Partitioning Exercises}
\label{sec:SMPdesign:Partitioning Exercises}
This section uses a pair of exercises (the classic Dining Philosophers
problem and a double-ended queue) to demonstrate the value of partitioning.
\subsection{Dining Philosophers Problem}
\label{sec:SMPdesign:Dining Philosophers Problem}
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5}
\end{center}
\caption{Dining Philosophers Problem}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem}{Kornilios Kourtis}
\end{figure}
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem} shows a diagram
of the classic Dining Philosophers problem~\cite{Dijkstra1971HOoSP}.
This problem features five philosophers who do nothing but think and
eat a ``very difficult kind of spaghetti'' which requires two forks
to eat.
A given philosopher is permitted to use only the forks to his or her
immediate right and left, and once a philosopher picks up a fork,
he or she will not put it down until sated.\footnote{
Readers who have difficulty imagining a food that requires
two forks are invited to instead think in terms of chopsticks.}
\begin{figure*}[tb]
\begin{center}
\resizebox{5in}{!}{\includegraphics{cartoons/Dining-philosophers}}
\end{center}
\caption{Partial Starvation Is Also Bad}
\ContributedBy{Figure}{fig:cpu:Partial Starvation Is Also Bad}{Melissa Broussard}
\end{figure*}
The object is to construct an algorithm that, quite literally,
prevents starvation.
One starvation scenario would be if all of the philosophers picked up
their leftmost forks simultaneously.
Because none of them will put down their fork until after they have
eaten, and because none of them can pick up their second fork until at
least one of them has finished eating, they all starve.
Please note that it is not sufficient to allow at least one philosopher
to eat.
As Figure~\ref{fig:cpu:Partial Starvation Is Also Bad}
shows, starvation of even a few of the philosophers is to be avoided.
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5TB}
\end{center}
\caption{Dining Philosophers Problem, Textbook Solution}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Textbook Solution}{Kornilios Kourtis}
\end{figure}
Dijkstra's solution used a global semaphore, which works fine assuming
negligible communications delays, an assumption that became invalid
in the late 1980s or early 1990s.\footnote{
It is all too easy to denigrate Dijkstra from the viewpoint
of the year 2012, more than 40 years after the fact.
If you still feel the need to denigrate Dijkstra, my advice
is to publish something, wait 40 years, and then see
how \emph{your} words stood the test of time.}
Therefore, recent solutions number the forks as shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Textbook Solution}.
Each philosopher picks up the lowest-numbered fork next to his or her
plate, then picks up the highest-numbered fork.
The philosopher sitting in the uppermost position in the diagram thus
picks up the leftmost fork first, then the rightmost fork, while the
rest of the philosophers instead pick up their rightmost fork first.
Because two of the philosophers will attempt to pick up fork~1 first,
and because only one of those two philosophers will succeed,
there will be five forks available to four philosophers.
At least one of these four will be guaranteed to have two forks,
and thus be able to proceed eating.
This general technique of numbering resources and acquiring them in
numerical order is heavily used as a deadlock-prevention technique.
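To make this ordering discipline concrete, the following minimal sketch
represents each fork as a POSIX mutex and always acquires the
lower-numbered fork first.
The use of pthreads and all names here are illustrative assumptions,
not the textbook solution's actual code.

{ \scriptsize
\begin{verbatim}
/* Minimal sketch: forks as pthread mutexes, acquired in
 * numerical order to prevent deadlock.  All names here
 * are illustrative assumptions. */
#include <pthread.h>

#define NR_FORKS 5
pthread_mutex_t fork_mutex[NR_FORKS];

void philosopher_eat(int id)          /* id 0-4 */
{
  int first = id;                     /* fork on one side */
  int second = (id + 1) % NR_FORKS;   /* fork on the other */
  int tmp;

  if (first > second) {               /* lowest number first */
    tmp = first;
    first = second;
    second = tmp;
  }
  pthread_mutex_lock(&fork_mutex[first]);
  pthread_mutex_lock(&fork_mutex[second]);
  /* eat */
  pthread_mutex_unlock(&fork_mutex[second]);
  pthread_mutex_unlock(&fork_mutex[first]);
}
\end{verbatim}
}

Note that only the philosopher whose pair of forks wraps around the
numbering (here \co{id == 4}) acquires them in the opposite physical
order, mirroring the uppermost philosopher in the figure.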
However, it is easy to imagine a sequence of events that will result
in only one philosopher eating at a time even though all are hungry:
\begin{enumerate}
\item P2 picks up fork~1, preventing P1 from taking a fork.
\item P3 picks up fork~2.
\item P4 picks up fork~3.
\item P5 picks up fork~4.
\item P5 picks up fork~5 and eats.
\item P5 puts down forks~4 and 5.
\item P4 picks up fork~4 and eats.
\end{enumerate}
In short, this algorithm can result in only one philosopher eating at
a given time, even when all five philosophers are hungry,
despite the fact that there are more than enough forks for two
philosophers to eat concurrently.
Please think about ways of partitioning the Dining Philosophers Problem
before reading further.
\cleardoublepage
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher4part-b}
\end{center}
\caption{Dining Philosophers Problem, Partitioned}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Partitioned}{Kornilios Kourtis}
\end{figure}
One approach is shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Partitioned},
which includes four philosophers rather than five to better illustrate the
partitioning technique.
Here the upper and rightmost philosophers share a pair of forks,
while the lower and leftmost philosophers share another pair of forks.
If all philosophers are simultaneously hungry, at least two will
always be able to eat concurrently.
In addition, as shown in the figure, the forks can now be bundled
so that each pair is picked up and put down as a unit, simplifying
the acquisition and release algorithms.
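Because each bundled pair of forks is acquired and released as a unit,
a single lock per bundle suffices, as in the following sketch.
As with the earlier sketch, the pthread-based names are illustrative
assumptions.

{ \scriptsize
\begin{verbatim}
/* Sketch: one lock per bundled pair of forks, each pair
 * shared by two of the four philosophers (ids 0-3).
 * Names are illustrative assumptions. */
#include <pthread.h>

pthread_mutex_t bundle_mutex[2];

void philosopher_eat(int id)          /* id 0-3 */
{
  pthread_mutex_t *bundle = &bundle_mutex[id / 2];

  pthread_mutex_lock(bundle);         /* pick up both forks */
  /* eat */
  pthread_mutex_unlock(bundle);       /* put down both forks */
}
\end{verbatim}
}

Because each philosopher acquires only a single lock, deadlock is
impossible, and a philosopher using the other bundle can always eat
concurrently.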
\QuickQuiz{}
Is there a better solution to the Dining
Philosophers Problem?
\QuickQuizAnswer{
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}
\end{center}
\caption{Dining Philosophers Problem, Fully Partitioned}
\QContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis}
\end{figure}
One such improved solution is shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned},
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, this approach offers greatly improved disease control.
This solution might seem like cheating to some, but such
``cheating'' is key to finding good solutions to many
concurrency problems.
} \QuickQuizEnd
This is an example of ``horizontal parallelism''~\cite{Inman85}
or ``data parallelism'',
so named because there is no dependency among the pairs of philosophers.
In a horizontally parallel data-processing system, a given item of data
would be processed by only one of a replicated set of software
components.
\QuickQuiz{}
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
\QuickQuizAnswer{
Inman was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
} \QuickQuizEnd
\subsection{Double-Ended Queue}
\label{sec:SMPdesign:Double-Ended Queue}
A double-ended queue is a data structure containing a list of elements
that may be inserted or removed from either end~\cite{Knuth73}.
It has been claimed that a lock-based implementation permitting
concurrent operations on both ends of the double-ended queue is
difficult~\cite{DanGrossman2007TMGCAnalogy}.
This section shows how a partitioning design strategy can result
in a reasonably simple implementation, looking at three
general approaches in the following sections.
\subsubsection{Left- and Right-Hand Locks}
\label{sec:SMPdesign:Left- and Right-Hand Locks}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeq}}
\end{center}
\caption{Double-Ended Queue With Left- and Right-Hand Locks}
\label{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}
\end{figure}
One seemingly straightforward approach would be to use a doubly
linked list with a left-hand lock
for left-hand-end enqueue and dequeue operations along with a right-hand
lock for right-hand-end operations, as shown in
Figure~\ref{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}.
However, the problem with this approach is that the two locks'
domains must overlap when there are fewer than four elements on the
list.
This overlap is due to the fact that removing any given element affects
not only that element, but also its left- and right-hand neighbors.
These domains are indicated by color in the figure, with blue
with downward stripes indicating
the domain of the left-hand lock, red with upward stripes
indicating the domain of the right-hand
lock, and purple (with no stripes) indicating overlapping domains.
Although it is possible to create an algorithm that works this way,
the fact that it has no fewer than five special cases should raise
a big red flag, especially given that concurrent activity at the other
end of the list can shift the queue from one special case to another
at any time.
It is far better to consider other designs.
\subsubsection{Compound Double-Ended Queue}
\label{sec:SMPdesign:Compound Double-Ended Queue}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqpair}}
\end{center}
\caption{Compound Double-Ended Queue}
\label{fig:SMPdesign:Compound Double-Ended Queue}
\end{figure}
One way of forcing non-overlapping lock domains is shown in
Figure~\ref{fig:SMPdesign:Compound Double-Ended Queue}.
Two separate double-ended queues are run in tandem, each protected by
its own lock.
This means that elements must occasionally be shuttled from one of
the double-ended queues to the other, in which case both locks must
be held.
A simple lock hierarchy may be used to avoid deadlock, for example,
always acquiring the left-hand lock before acquiring the right-hand lock.
This will be much simpler than applying two locks to the same
double-ended queue, as we can unconditionally left-enqueue elements
to the left-hand queue and right-enqueue elements to the right-hand
queue.
The main complication arises when dequeuing from an empty queue, in
which case it is necessary to:
\begin{enumerate}
\item If holding the right-hand lock, release it and acquire the
left-hand lock.
\item Acquire the right-hand lock.
\item Rebalance the elements across the two queues.
\item Remove the required element if there is one.
\item Release both locks.
\end{enumerate}
\QuickQuiz{}
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
\QuickQuizAnswer{
In this case, simply dequeue an item from the non-empty
queue, release both locks, and return.
} \QuickQuizEnd
The rebalancing operation might well shuttle a given element back
and forth between the two queues, wasting time and possibly requiring
workload-dependent heuristics to obtain optimal performance.
Although this might well be the best approach in some cases, it is
interesting to try for an algorithm with greater determinism.
\subsubsection{Hashed Double-Ended Queue}
\label{sec:SMPdesign:Hashed Double-Ended Queue}
One of the simplest and most effective ways to deterministically
partition a data structure is to hash it.
It is possible to trivially hash a double-ended queue by assigning
each element a sequence number based on its position in the list,
so that the first element left-enqueued into an empty queue is numbered
zero and the first element right-enqueued into an empty queue is numbered
one.
A series of elements left-enqueued into an otherwise-idle queue would
be assigned decreasing numbers ($-1$, $-2$, $-3$, \ldots), while a series of
elements right-enqueued into an otherwise-idle queue would be assigned
increasing numbers (2, 3, 4, \ldots).
A key point is that it is not necessary to actually represent a given
element's number, as this number will be implied by its position in
the queue.
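Although element numbers are never stored, the implied mapping from
element number to hash chain can be made concrete, as in the following
sketch, which handles the negative numbers produced by left-enqueues.
This helper is an illustrative assumption, not code from the actual
implementation.

{ \scriptsize
\begin{verbatim}
/* Illustrative sketch: map a (possibly negative) element
 * number to one of DEQ_N_BKTS hash chains. */
#define DEQ_N_BKTS 4

int element_chain(long n)
{
  return (int)(((n % DEQ_N_BKTS) + DEQ_N_BKTS) % DEQ_N_BKTS);
}

/* element_chain(-2) == 2, element_chain(-1) == 3,
 * element_chain(0) == 0, and element_chain(1) == 1. */
\end{verbatim}
}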
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash}}
\end{center}
\caption{Hashed Double-Ended Queue}
\label{fig:SMPdesign:Hashed Double-Ended Queue}
\end{figure}
Given this approach, we assign one lock to guard the left-hand index,
one to guard the right-hand index, and one lock for each hash chain.
Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue} shows the resulting
data structure given four hash chains.
Note that the lock domains do not overlap, and that deadlock is avoided
by acquiring the index locks before the chain locks, and by never
acquiring more than one lock of each type (index or chain) at a time.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash1R}}
\end{center}
\caption{Hashed Double-Ended Queue After Insertions}
\label{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
\end{figure}
Each hash chain is itself a double-ended queue, and in this example,
each holds every fourth element.
The uppermost portion of
Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
shows the state after a single element (``R1'') has been
right-enqueued, with the right-hand index having been incremented to
reference hash chain~2.
The middle portion of this same figure shows the state after
three more elements have been right-enqueued.
As you can see, the indexes are back to their initial states
(see Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue}); however,
each hash chain is now non-empty.
The lower portion of this figure shows the state after three additional
elements have been left-enqueued and an additional element has been
right-enqueued.
From the last state shown in
Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions},
a left-dequeue operation would return element ``L-2'' and leave
the left-hand index referencing hash chain~2, which would then
contain only a single element (``R2'').
In this state, a left-enqueue running concurrently with a right-enqueue
would result in lock contention, but the probability of such contention
can be reduced to arbitrarily low levels by using a larger hash table.
\begin{figure}[tb]
\begin{center}
\resizebox{1.5in}{!}{\includegraphics{SMPdesign/lockdeqhashlots}}
\end{center}
\caption{Hashed Double-Ended Queue With 12 Elements}
\label{fig:SMPdesign:Hashed Double-Ended Queue With 12 Elements}
\end{figure}
Figure~\ref{fig:SMPdesign:Hashed Double-Ended Queue With 12 Elements}
shows how 12 elements would be organized in a four-hash-bucket
parallel double-ended queue.
Each underlying single-lock double-ended queue holds a one-quarter
slice of the full parallel double-ended queue.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 struct pdeq {
2   spinlock_t llock;
3   int lidx;
4   spinlock_t rlock;
5   int ridx;
6   struct deq bkt[DEQ_N_BKTS];
7 };
\end{verbatim}
}
\caption{Lock-Based Parallel Double-Ended Queue Data Structure}
\label{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
\end{figure}
Figure~\ref{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
shows the corresponding C-language data structure, assuming an
existing \co{struct deq} that provides a trivially locked
double-ended-queue implementation.
This data structure contains the left-hand lock on line~2,
the left-hand index on line~3, the right-hand lock on line~4
(which is cache-aligned in the actual implementation),
the right-hand index on line~5, and, finally, the hashed array
of simple lock-based double-ended queues on line~6.
A high-performance implementation would of course use padding or special
alignment directives to avoid false sharing.
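One such approach, assuming GCC-style attributes and 64-byte cache
lines (both of which are illustrative assumptions), is to align the
right-hand lock to a cache-line boundary so that left-hand and
right-hand operations touch disjoint cache lines:

{ \scriptsize
\begin{verbatim}
/* Sketch: cache-align the right-hand state, assuming
 * GCC-style attributes and 64-byte cache lines. */
struct pdeq {
  spinlock_t llock;
  int lidx;
  spinlock_t rlock __attribute__((__aligned__(64)));
  int ridx;
  struct deq bkt[DEQ_N_BKTS];
};
\end{verbatim}
}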
\begin{figure*}[bp]
{ \scriptsize
\begin{verbatim}
 1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
 2 {
 3   struct cds_list_head *e;
 4   int i;
 5
 6   spin_lock(&d->llock);
 7   i = moveright(d->lidx);
 8   e = deq_pop_l(&d->bkt[i]);
 9   if (e != NULL)
10     d->lidx = i;
11   spin_unlock(&d->llock);
12   return e;
13 }
14
15 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
16 {
17   struct cds_list_head *e;
18   int i;
19
20   spin_lock(&d->rlock);
21   i = moveleft(d->ridx);
22   e = deq_pop_r(&d->bkt[i]);
23   if (e != NULL)
24     d->ridx = i;
25   spin_unlock(&d->rlock);
26   return e;
27 }
28
29 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
30 {
31   int i;
32
33   spin_lock(&d->llock);
34   i = d->lidx;
35   deq_push_l(e, &d->bkt[i]);
36   d->lidx = moveleft(d->lidx);
37   spin_unlock(&d->llock);
38 }
39
40 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
41 {
42   int i;
43
44   spin_lock(&d->rlock);
45   i = d->ridx;
46   deq_push_r(e, &d->bkt[i]);
47   d->ridx = moveright(d->ridx);
48   spin_unlock(&d->rlock);
49 }
\end{verbatim}
}
\caption{Lock-Based Parallel Double-Ended Queue Implementation}
\label{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
\end{figure*}
Figure~\ref{fig:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
(\co{lockhdeq.c})
shows the implementation of the enqueue and dequeue functions.\footnote{
One could easily create a polymorphic implementation in any
number of languages, but doing so is left as an exercise for
the reader.}
Discussion will focus on the left-hand operations, as the right-hand
operations are trivially derived from them.
Lines~1-13 show \co{pdeq_pop_l()}, which left-dequeues and returns
an element if possible, returning \co{NULL} otherwise.
Line~6 acquires the left-hand spinlock, and line~7 computes the
index to be dequeued from.
Line~8 dequeues the element, and, if line~9 finds the result to be
non-\co{NULL}, line~10 records the new left-hand index.
Either way, line~11 releases the lock, and, finally, line~12 returns
the element if there was one, or \co{NULL} otherwise.
Lines~29-38 show \co{pdeq_push_l()}, which left-enqueues the specified
element.
Line~33 acquires the left-hand lock, and line~34 picks up the left-hand
index.
Line~35 left-enqueues the specified element onto the double-ended queue
indexed by the left-hand index.
Line~36 then updates the left-hand index and line~37 releases the lock.
As noted earlier, the right-hand operations are completely analogous
to their left-handed counterparts, so their analysis is left as an
exercise for the reader.
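The \co{moveleft()} and \co{moveright()} helpers are not shown in the
figure.
A plausible implementation, assuming that \co{DEQ_N_BKTS} is a power of
two, simply steps the corresponding index around the ring of hash
chains:

{ \scriptsize
\begin{verbatim}
/* Plausible sketch of the index-motion helpers, assuming
 * that DEQ_N_BKTS is a power of two. */
static int moveleft(int idx)
{
  return (idx - 1) & (DEQ_N_BKTS - 1);
}

static int moveright(int idx)
{
  return (idx + 1) & (DEQ_N_BKTS - 1);
}
\end{verbatim}
}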
\QuickQuiz{}
Is the hashed double-ended queue a good solution?
Why or why not?
\QuickQuizAnswer{
The best way to answer this is to run \co{lockhdeq.c} on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
% Getting about 500 nanoseconds per element when used as
% a queue on a 4.2GHz Power system. This is roughly the same as
% the version covered by a single lock.  Sequential (unlocked)
% variant is more than an order of magnitude faster!
The first well-designed performance study will be cited.\footnote{
The studies by Dalessandro
et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are
good starting points.}
Do not forget to compare to a sequential implementation!
} \QuickQuizEnd
\subsubsection{Compound Double-Ended Queue Revisited}
\label{sec:SMPdesign:Compound Double-Ended Queue Revisited}
This section revisits the compound double-ended queue, using a trivial
rebalancing scheme that moves all the elements from the non-empty
queue to the now-empty queue.
\QuickQuiz{}
Move \emph{all} the elements to the queue that became empty?
In what possible universe is this brain-dead solution in any
way optimal???
\QuickQuizAnswer{
It is optimal in the case where data flow switches direction only
rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable
thing to do.
Work-stealing queues are one possible answer to this question.
} \QuickQuizEnd
In contrast to the hashed implementation presented in
the previous section, the compound implementation will build on
a sequential implementation of a double-ended queue that uses
neither locks nor atomic operations.
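The figure shows only the functions; the data structure that they
imply might look like the following sketch, which is an assumption
based on the listing rather than the actual source.
Here \co{struct deq} is the lockless sequential double-ended queue,
with a Userspace-RCU \co{cds_list_head} named \co{chain} serving as
its list header, matching the \co{d->ldeq.chain} and
\co{d->rdeq.chain} references in the listing.

{ \scriptsize
\begin{verbatim}
/* Sketch of the compound structure implied by the listing;
 * an assumption, not the actual source. */
struct deq {
  struct cds_list_head chain;  /* sequential deque header */
};

struct pdeq {
  spinlock_t llock;   /* guards ldeq */
  struct deq ldeq;    /* left-hand sequential deque */
  spinlock_t rlock;   /* guards rdeq */
  struct deq rdeq;    /* right-hand sequential deque */
};
\end{verbatim}
}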
\begin{figure*}[bp]
{ \scriptsize
\begin{verbatim}
 1 struct cds_list_head *pdeq_pop_l(struct pdeq *d)
 2 {
 3   struct cds_list_head *e;
 4
 5   spin_lock(&d->llock);
 6   e = deq_pop_l(&d->ldeq);
 7   if (e == NULL) {
 8     spin_lock(&d->rlock);
 9     e = deq_pop_l(&d->rdeq);
10     cds_list_splice(&d->rdeq.chain, &d->ldeq.chain);
11     CDS_INIT_LIST_HEAD(&d->rdeq.chain);
12     spin_unlock(&d->rlock);
13   }
14   spin_unlock(&d->llock);
15   return e;
16 }
17
18 struct cds_list_head *pdeq_pop_r(struct pdeq *d)
19 {
20   struct cds_list_head *e;
21
22   spin_lock(&d->rlock);
23   e = deq_pop_r(&d->rdeq);
24   if (e == NULL) {
25     spin_unlock(&d->rlock);
26     spin_lock(&d->llock);
27     spin_lock(&d->rlock);
28     e = deq_pop_r(&d->rdeq);
29     if (e == NULL) {
30       e = deq_pop_r(&d->ldeq);
31       cds_list_splice(&d->ldeq.chain, &d->rdeq.chain);
32       CDS_INIT_LIST_HEAD(&d->ldeq.chain);
33     }
34     spin_unlock(&d->llock);
35   }
36   spin_unlock(&d->rlock);
37   return e;
38 }
39
40 void pdeq_push_l(struct cds_list_head *e, struct pdeq *d)
41 {
42   spin_lock(&d->llock);
43   deq_push_l(e, &d->ldeq);
44   spin_unlock(&d->llock);
45 }
46
47 void pdeq_push_r(struct cds_list_head *e, struct pdeq *d)
48 {
49   spin_lock(&d->rlock);
50   deq_push_r(e, &d->rdeq);
51   spin_unlock(&d->rlock);
52 }
\end{verbatim}
}
\caption{Compound Parallel Double-Ended Queue Implementation}
\label{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
\end{figure*}
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
shows the implementation.
Unlike the hashed implementation, this compound implementation is
asymmetric, so that we must consider the \co{pdeq_pop_l()}
and \co{pdeq_pop_r()} implementations separately.
\QuickQuiz{}
Why can't the compound parallel double-ended queue
implementation be symmetric?
\QuickQuizAnswer{
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem
(see Section~\ref{sec:SMPdesign:Dining Philosophers Problem}).
} \QuickQuizEnd
The \co{pdeq_pop_l()} implementation is shown on lines~1-16
of the figure.
Line~5 acquires the left-hand lock, which line~14 releases.
Line~6 attempts to left-dequeue an element from the left-hand underlying
double-ended queue, and, if successful, skips lines~8-13 to simply
return this element.
Otherwise, line~8 acquires the right-hand lock, line~9
left-dequeues an element from the right-hand queue,
line~10 moves any remaining elements on the right-hand
queue to the left-hand queue, line~11 initializes the right-hand queue,
and line~12 releases the right-hand lock.
The element, if any, that was dequeued on line~9 will be returned.
The \co{pdeq_pop_r()} implementation is shown on lines~18-38
of the figure.
As before, line~22 acquires the right-hand lock (and line~36
releases it), and line~23 attempts to right-dequeue an element
from the right-hand queue, skipping lines~24-35 to simply
return this element if successful.
However, if line~24 determines that there was no element to dequeue,
line~25 releases the right-hand lock and lines~26-27 acquire both
locks in the proper order.
Line~28 then attempts to right-dequeue an element from the right-hand
list again, and if line~29 determines that this second attempt has
failed, line~30 right-dequeues an element from the left-hand queue
(if there is one available), line~31 moves any remaining elements
from the left-hand queue to the right-hand queue, and line~32
initializes the left-hand queue.
Either way, line~34 releases the left-hand lock.
\QuickQuiz{}
Why is it necessary to retry the right-dequeue operation
on line~28 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
\QuickQuizAnswer{
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
\co{d->rlock} on line~25 and the time that it reacquired this
same lock on line~27.
} \QuickQuizEnd
\QuickQuiz{}
Surely the left-hand lock must \emph{sometimes} be available!!!
So why is it necessary that line~25 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
unconditionally release the right-hand lock?
\QuickQuizAnswer{
It would be possible to use \co{spin_trylock()} to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
} \QuickQuizEnd
The \co{pdeq_push_l()} implementation is shown on lines~40-45 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}.
Line~42 acquires the left-hand spinlock, line~43 left-enqueues the
element onto the left-hand queue, and finally line~44 releases
the lock.
The \co{pdeq_push_r()} implementation (shown on lines~47-52)
is quite similar.
\subsubsection{Double-Ended Queue Discussion}
\label{sec:SMPdesign:Double-Ended Queue Discussion}
The compound implementation is somewhat more complex than the
hashed variant presented in
Section~\ref{sec:SMPdesign:Hashed Double-Ended Queue},
but is still reasonably simple.
Of course, a more intelligent rebalancing scheme could be arbitrarily
complex, but the simple scheme shown here has been shown to
perform well compared to software
alternatives~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and even compared to algorithms using hardware
assist~\cite{DavidDice:2010:SCA:HTM:deque}.
Nevertheless, the best we can hope for from such a scheme
is 2x scalability, as at most two threads can be holding the
deque's locks concurrently.
This limitation also applies to algorithms based on non-blocking
synchronization, such as the compare-and-swap-based dequeue algorithm of
Michael~\cite{DBLP:conf/europar/Michael03}.\footnote{
This paper is interesting in that it showed that special
double-compare-and-swap (DCAS) instructions are not needed
for lock-free implementations of double-ended queues.
Instead, the common compare-and-swap instruction (e.g., x86
\co{cmpxchg}) suffices.}
In fact, as noted by Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque},
an unsynchronized single-threaded double-ended queue significantly
outperforms any of the parallel implementations they studied.
Therefore, the key point is that there can be significant overhead enqueuing to
or dequeuing from a shared queue, regardless of implementation.
This should come as no surprise in view of the material in
Chapter~\ref{chp:Hardware and its Habits}, given the strict
FIFO nature of these queues.
Furthermore, these strict FIFO queues are strictly FIFO only with
respect to
\emph{linearization points}~\cite{Herlihy:1990:LCC:78969.78972}\footnote{
In short, a linearization point is a single point within a given
function where that function can be said to have taken effect.
In this lock-based implementation, the linearization points
can be said to be anywhere within the critical section that
does the work.}
that are not visible to the caller; in fact, in these examples,
the linearization points are buried in the lock-based critical
sections.
These queues are not strictly FIFO with respect to (say) the times at which
the individual operations started~\cite{AndreasHaas2012FIFOisnt}.
This indicates that the strict FIFO property is not all that valuable
in concurrent programs, and in fact, Kirsch et al.~present less-strict
queues that provide improved performance and
scalability~\cite{ChristophMKirsch2012FIFOisntTR}.\footnote{
Nir Shavit produced relaxed stacks for roughly the same
reasons~\cite{Shavit:2011:DSM:1897852.1897873}.
This situation leads some to believe that the linearization
points are useful to theorists rather than developers, and
leads others to wonder to what extent the designers of such
data structures and algorithms were considering the needs
of their users.}
All that said, if you are pushing all the data used by your concurrent
program through a single queue, you really need to rethink your
overall design.
\subsection{Partitioning Example Discussion}
\label{sec:SMPdesign:Partitioning Example Discussion}
The optimal solution to the dining philosophers problem given in
the answer to the Quick Quiz in
Section~\ref{sec:SMPdesign:Dining Philosophers Problem}
is an excellent example of ``horizontal parallelism'' or
``data parallelism''.
The synchronization overhead in this case is nearly (or even exactly)
zero.
In contrast, the double-ended
queue implementations are examples of ``vertical parallelism'' or
``pipelining'', given that data moves from one thread to another.
The tighter coordination required for pipelining in turn requires
larger units of work to obtain a given level of efficiency.
\QuickQuiz{}
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
\QuickQuizAnswer{
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple
concurrent operations at each end?
If so, how? If not, why not?
} \QuickQuizEnd
\QuickQuiz{}
Is there a significantly better way of handling concurrency
for double-ended queues?
\QuickQuizAnswer{
One approach is to transform the problem to be solved
so that multiple double-ended queues can be used in parallel,
allowing the simpler single-lock double-ended queue to be used,
and perhaps also replacing each double-ended queue with a pair of
conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can achieve very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the deque
simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first
in no way implies that it will process that item
first~\cite{AndreasHaas2012FIFOisnt}.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide these
guarantees.
% about twice as fast as hashed version on 4.2GHz Power.
Regardless of whether or not the problem can be transformed
to use multiple queues, it is worth asking whether work can
be batched so that each enqueue and dequeue operation corresponds
to larger units of work.
This batching approach decreases contention on the queue data
structures, which increases both performance and scalability,
as will be seen in
Section~\ref{sec:SMPdesign:Synchronization Granularity}.
After all, if you must incur high synchronization overheads,
be sure you are getting your money's worth.
Other researchers are working on other ways to take advantage
of limited ordering guarantees in
queues~\cite{ChristophMKirsch2012FIFOisntTR}.
} \QuickQuizEnd
These two examples show just how powerful partitioning can be in
devising parallel algorithms.
Section~\ref{sec:SMPdesign:Locking Granularity and Performance}
looks briefly at a third example, matrix multiply.
However, all three of these examples beg for more and better design
criteria for parallel programs, a topic taken up in the next section.