SMPdesign/partexercises.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % SMPdesign/partexercises.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \section{Partitioning Exercises}
 \label{sec:SMPdesign:Partitioning Exercises}
 %
 \epigraph{Whenever a theory appears to you as the only possible one,
 	  take this as a sign that you have neither understood the theory
 	  nor the problem which it was intended to solve.}
 	  {Karl Popper}

 Although partitioning is more widely understood than it was in the early
 2000s, its value is still underappreciated.
 \Cref{sec:SMPdesign:Dining Philosophers Problem}
 therefore takes more highly parallel look at the classic Dining
 Philosophers problem and
 \cref{sec:SMPdesign:Double-Ended Queue}
 revisits the double-ended queue.

 \subsection{Dining Philosophers Problem}
 \label{sec:SMPdesign:Dining Philosophers Problem}

 \begin{figure}
 \centering
 \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5}
 \caption{Dining Philosophers Problem}
 \ContributedBy{fig:SMPdesign:Dining Philosophers Problem}{Kornilios Kourtis}
 \end{figure}

 \Cref{fig:SMPdesign:Dining Philosophers Problem} shows a diagram
 of the classic \IX{Dining Philosophers problem}~\cite{Dijkstra1971HOoSP}.
 This problem features five philosophers who do nothing but think and
 eat a ``very difficult kind of spaghetti'' which requires two forks
 to eat.\footnote{
 	But feel free to instead think in terms of chopsticks.}
 A given philosopher is permitted to use only the forks to his or her
 immediate right and left, but will not put a given fork down until sated.

 \begin{figure*}
 \centering
 \IfTwoColumn{
 \resizebox{5in}{!}{\includegraphics{cartoons/Dining-philosophers}}
 }{
 \resizebox{\onecolumntextwidth}{!}{\includegraphics{cartoons/Dining-philosophers}}
 }
 \caption{Partial Starvation Is Also Bad}
 \ContributedBy{fig:cpu:Partial Starvation Is Also Bad}{Melissa Broussard}
 \end{figure*}

 The object is to construct an algorithm that, quite literally,
 prevents \IX{starvation}.
 One starvation scenario would be if all of the philosophers picked up
 their leftmost forks simultaneously.
 Because none of them will put down their fork until after they finished
 eating, and because none of them may pick up their second fork until at
 least one of them has finished eating, they all starve.
 Please note that it is not sufficient to allow at least one philosopher
 to eat.
 As \cref{fig:cpu:Partial Starvation Is Also Bad}
 shows, starvation of even a few of the philosophers is to be avoided.

 \begin{figure}
 \centering
 \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5TB}
 \caption{Dining Philosophers Problem, Textbook Solution}
 \ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}{Kornilios Kourtis}
 \end{figure}

 \pplsur{Edsger W.}{Dijkstra}'s solution used a global semaphore,
 which works fine assuming
 negligible communications delays, an assumption that became invalid
 in the late 1980s or early 1990s.\footnote{
 	It is all too easy to denigrate Dijkstra from the viewpoint
 	of the year 2021, more than 50 years after the fact.
 	If you still feel the need to denigrate Dijkstra, my advice
 	is to publish something, wait 50 years, and then see
 	how well \emph{your} ideas stood the test of time.}
 More recent solutions number the forks as shown in
 \cref{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}.
 Each philosopher picks up the lowest-numbered fork next to his or her
 plate, then picks up the other fork.
 The philosopher sitting in the uppermost position in the diagram thus
 picks up the leftmost fork first, then the rightmost fork, while the
 rest of the philosophers instead pick up their rightmost fork first.
 Because two of the philosophers will attempt to pick up fork~1 first,
 and because only one of those two philosophers will succeed,
 there will be five forks available to four philosophers.
 At least one of these four will have two forks, and will thus be able
 to eat.

 This general technique of numbering resources and acquiring them in
 numerical order is heavily used as a deadlock-prevention technique.
 However, it is easy to imagine a sequence of events that will result
 in only one philosopher eating at a time even though all are hungry:

 \begin{enumerate}
     \item P2 picks up fork~1, preventing P1 from taking a fork.
     \item P3 picks up fork~2.
     \item P4 picks up fork~3.
     \item P5 picks up fork~4.
     \item P5 picks up fork~5 and eats.
     \item P5 puts down forks~4 and 5.
     \item P4 picks up fork~4 and eats.
 \end{enumerate}

 In short, this algorithm can result in only one philosopher eating at
 a given time, even when all five philosophers are hungry,
 despite the fact that there are more than enough forks for two
 philosophers to eat concurrently.
 It should be possible to do better than this!

 \begin{figure}
 \centering
 \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher4part-b}
 \caption{Dining Philosophers Problem, Partitioned}
 \ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Partitioned}{Kornilios Kourtis}
 \end{figure}

 One approach is shown in
 \cref{fig:SMPdesign:Dining Philosophers Problem; Partitioned},
 which includes four philosophers rather than five to better illustrate the
 partition technique.
 Here the upper and rightmost philosophers share a pair of forks,
 while the lower and leftmost philosophers share another pair of forks.
 If all philosophers are simultaneously hungry, at least two will
 always be able to eat concurrently.
 In addition, as shown in the figure, the forks can now be bundled
 so that the pair are picked up and put down simultaneously, simplifying
 the acquisition and release algorithms.

 \QuickQuizSeries{%
 \QuickQuizB{
 	Is there a better solution to the Dining
 	Philosophers Problem?
 }\QuickQuizAnswerB{
 	One such improved solution is shown in
 	\cref{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned},
 	where the philosophers are simply provided with an additional
 	five forks.
 	All five philosophers may now eat simultaneously, and there
 	is never any need for philosophers to wait on one another.
 	In addition, this approach offers greatly improved disease control.

 \begin{figure}
 \centering
 \includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}
 \caption{Dining Philosophers Problem, Fully Partitioned}
 \QContributedBy{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned}{Kornilios Kourtis}
 \end{figure}

 	This solution might seem like cheating to some, but such
 	``cheating'' is key to finding good solutions to many
 	concurrency problems, as any hungry philosopher would agree.

 	And this is one solution to the Dining Philosophers
 	concurrent-consumption problem called out on
 	\cpageref{sec:SMPdesign:Problems Dining Philosophers}.
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	How would you valididate an algorithm alleged to solve the Dining
 	Philosophers Problem?
 }\QuickQuizAnswerE{
 	Much depends on the details of the algorithm, but here are a
 	couple of places to start.

 	First, for algorithms in which picking up left-hand and right-hand
 	forks are separate operations, start with all forks on the table.
 	Then have all philosophers attempt to pick up their first fork.
 	Once all philosophers either have their first fork or are waiting
 	for someone to put down their first fork, have each non-waiting
 	philosopher pick up their second fork.
 	At this point in any starvation-free solution, at least one
 	philosopher will be eating.
 	If there were any waiting philosophers, repeat this test,
 	preferably imposing random variations in timing.

 	Second, create a stress test in which philosphers start and
 	stop eating at random times.
 	Generate starvation and fairness conditions and verify that
 	these conditions are met.
 	Here are a couple of example starvation and fairness conditions:

 	\begin{enumerate}
 	\item	If all other philosophers have stopped eating $N$ times
 		since a given philosopher attempted to pick up a given
 		fork, that philosopher should have succeeded in picking
 		up that fork.
 		For high-quality solutions using high-quality locking
 		primitives (or high-quality atomic operations), $N=1$
 		is doable.
 	\item	Given an upper bound $T$ on the time any philosopher holds
 		onto both forks before putting them down, the maximum
 		waiting time for any philosopher should be bounded by
 		$NT$ for some $N$ that is not hugely larger than the
 		number of philosophers.
 	\item	Generate some statistic representing the time from
 		when philosophers attempt to pick up their first fork
 		to the time when they start eating.
 		The smaller this statistic, the better the solution.
 		Mean, median, and maximum are all useful statistics,
 		but examining the full distribution can also be
 		enlightening.
 	\end{enumerate}

 	Readers are encouraged to actually try testing any of the
 	solutions presented in this book, and especially testing solutions
 	of their own devising.
 }\QuickQuizEndE
 }

 This is an example of ``horizontal parallelism''~\cite{Inman85}
 or ``data parallelism'',
 so named because there is no dependency among the pairs of philosophers.
 In a horizontally parallel data-processing system, a given item of data
 would be processed by only one of a replicated set of software
 components.

 \QuickQuiz{
 	And in just what sense can this ``horizontal parallelism'' be
 	said to be ``horizontal''?
 }\QuickQuizAnswer{
 	Jack Inman~\cite{Inman85}
 	was working with protocol stacks, which are normally
 	depicted vertically, with the application on top and the
 	hardware interconnect on the bottom.
 	Data flows up and down this stack.
 	``Horizontal parallelism'' processes packets from different network
 	connections in parallel, while ``vertical parallelism''
 	handles different protocol-processing steps for a given
 	packet in parallel.

 	``Vertical parallelism'' is also called ``pipelining''.
 }\QuickQuizEnd

 \subsection{Double-Ended Queue}
 \label{sec:SMPdesign:Double-Ended Queue}

 A double-ended queue is a data structure containing a list of elements
 that may be inserted or removed from either end~\cite{Knuth73}.
 It has been claimed that a lock-based implementation permitting
 concurrent operations on both ends of the double-ended queue is
 difficult~\cite{DanGrossman2007TMGCAnalogy}.
 This section shows how a partitioning design strategy can result
 in a reasonably simple implementation, looking at three
 general approaches in the following sections.
 But first, how should we validate a concurrent double-ended queue?

 \subsubsection{Double-Ended Queue Validation}
 \label{sec:SMPdesign:Double-Ended Queue Validation}

 A good place to start is with invariants.
 For example, if elements are pushed onto one end of a double-ended
 queue and popped off of the other, the order of those elements must
 be preserved.
 Similarly, if elements are pushed onto one end of the queue and popped
 off of that same end, the order of those elements must be reversed.
 Any element popped from the queue must have been most recently pushed
 onto that queue, and if the queue is emptied, all elements pushed onto
 it must have already been popped from it.

 The beginnings of a test suite for concurrent double-ended queues
 (``\path{deqtorture.h}'') provides the following checks:

 \begin{enumerate}
 \item	Element-ordering checks provided by \co{CHECK_SEQUENCE_PAIR()}.
 \item	Checks that elements popped were most recently pushed, provided
 	by \co{melee()}.
 \item	Checks that elements pushed are popped before the queue is
 	emptied, also provided by \co{melee()}.
 \end{enumerate}

 This suite includes both sequential and concurrent tests.
 Although this suite is good and sufficient for textbook code, you
 should test considerably more thoroughly for code intended for
 production use.
 \Cref{chp:Validation,chp:Formal Verification} cover a large array
 of validation tools and techniques.

 But with a prototype test suite in place, we are ready to look at the
 double-ended-queue algorithms in the next sections.

 \subsubsection{Left- and Right-Hand Locks}
 \label{sec:SMPdesign:Left- and Right-Hand Locks}

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeq}}
 \caption{Double-Ended Queue With Left- and Right-Hand Locks}
 \label{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}
 \end{figure}

 One seemingly straightforward approach would be to use a doubly
 linked list with a left-hand lock
 for left-hand-end enqueue and dequeue operations along with a right-hand
 lock for right-hand-end operations, as shown in
 \cref{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}.
 However, the problem with this approach is that the two locks'
 domains must overlap when there are fewer than four elements on the
 list.
 This overlap is due to the fact that removing any given element affects
 not only that element, but also its left- and right-hand neighbors.
 These domains are indicated by color in the figure, with blue
 with downward stripes indicating
 the domain of the left-hand lock, red with upward stripes
 indicating the domain of the right-hand
 lock, and purple (with no stripes) indicating overlapping domains.
 Although it is possible to create an algorithm that works this way,
 perhaps using a dummy element similar to the two-lock queue
 presented by Michael and Scott~\cite{MichaelScott96},
 the fact that it has no fewer than five special cases should raise
 a big red flag, especially given that concurrent activity at the other
 end of the list can shift the queue from one special case to another
 at any time.

 It is far better to consider other designs.

 \subsubsection{Compound Double-Ended Queue}
 \label{sec:SMPdesign:Compound Double-Ended Queue}

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqpair}}
 \caption{Compound Double-Ended Queue}
 \label{fig:SMPdesign:Compound Double-Ended Queue}
 \end{figure}

 One way of forcing non-overlapping lock domains is shown in
 \cref{fig:SMPdesign:Compound Double-Ended Queue}.
 Two separate double-ended queues are run in tandem, each protected by
 its own lock.
 This means that elements must occasionally be shuttled from one of
 the double-ended queues to the other, in which case both locks must
 be held.
 A simple lock hierarchy may be used to avoid \IX{deadlock}, for example,
 always acquiring the left-hand lock before acquiring the right-hand lock.
 This will be much simpler than applying two locks to the same
 double-ended queue, as we can unconditionally left-enqueue elements
 to the left-hand queue and right-enqueue elements to the right-hand
 queue.
 The main complication arises when dequeuing from an empty queue, in
 which case it is necessary to:

 \begin{enumerate}
 \item	If holding the right-hand lock, release it and acquire the
 	left-hand lock.
 \item	Acquire the right-hand lock.
 \item	Rebalance the elements across the two queues.
 \item	Remove the required element if there is one.
 \item	Release both locks.
 \end{enumerate}

 \QuickQuiz{
 	In this compound double-ended queue implementation, what should
 	be done if the queue has become non-empty while releasing
 	and reacquiring the lock?
 }\QuickQuizAnswer{
 	In this case, simply dequeue an item from the non-empty
 	queue, release both locks, and return.
 }\QuickQuizEnd

 The resulting code (\path{locktdeq.c}) is quite straightforward.
 The rebalancing operation might well shuttle a given element back
 and forth between the two queues, wasting time and possibly requiring
 workload-dependent heuristics to obtain optimal performance.
 Although this might well be the best approach in some cases, it is
 interesting to try for an algorithm with greater determinism.

 \subsubsection{Hashed Double-Ended Queue}
 \label{sec:SMPdesign:Hashed Double-Ended Queue}

 One of the simplest and most effective ways to deterministically
 partition a data structure is to hash it.
 It is possible to trivially hash a double-ended queue by assigning
 each element a sequence number based on its position in the list,
 so that the first element left-enqueued into an empty queue is numbered
 zero and the first element right-enqueued into an empty queue is numbered
 one.
 A series of elements left-enqueued into an otherwise-idle queue would
 be assigned decreasing numbers ($-1$, $-2$, $-3$, \ldots), while a series of
 elements right-enqueued into an otherwise-idle queue would be assigned
 increasing numbers (2, 3, 4, \ldots).
 A key point is that it is not necessary to actually represent a given
 element's number, as this number will be implied by its position in
 the queue.

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash}}
 \caption{Hashed Double-Ended Queue}
 \label{fig:SMPdesign:Hashed Double-Ended Queue}
 \end{figure}

 Given this approach, we assign one lock to guard the left-hand index,
 one to guard the right-hand index, and one lock for each hash chain.
 \Cref{fig:SMPdesign:Hashed Double-Ended Queue} shows the resulting
 data structure given four hash chains.
 Note that the lock domains do not overlap, and that deadlock is avoided
 by acquiring the index locks before the chain locks, and by never
 acquiring more than one lock of a given type (index or chain) at a time.

 \begin{figure}
 \centering
 \resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash1R}}
 \caption{Hashed Double-Ended Queue After Insertions}
 \label{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
 \end{figure}

 Each hash chain is itself a double-ended queue, and in this example,
 each holds every fourth element.
 The uppermost portion of
 \cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
 shows the state after a single element (``R$_1$'') has been
 right-enqueued, with the right-hand index having been incremented to
 reference hash chain~2.
 The middle portion of this same figure shows the state after
 three more elements have been right-enqueued.
 As you can see, the indexes are back to their initial states
 (see \cref{fig:SMPdesign:Hashed Double-Ended Queue}), however,
 each hash chain is now non-empty.
 The lower portion of this figure shows the state after three additional
 elements have been left-enqueued and an additional element has been
 right-enqueued.

 From the last state shown in
 \cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions},
 a left-dequeue operation would return element ``L$_{-2}$'' and leave
 the left-hand index referencing hash chain~2, which would then
 contain only a single element (``R$_2$'').
 In this state, a left-enqueue running concurrently with a right-enqueue
 would result in \IX{lock contention}, but the probability of such contention
 can be reduced to arbitrarily low levels by using a larger hash table.

 \begin{figure}
 \centering
 \resizebox{1.5in}{!}{\includegraphics{SMPdesign/lockdeqhashlots}}
 \caption{Hashed Double-Ended Queue With 16 Elements}
 \label{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements}
 \end{figure}

 \Cref{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements}
 shows how 16 elements would be organized in a four-hash-bucket
 parallel double-ended queue.
 Each underlying single-lock double-ended queue holds a one-quarter
 slice of the full parallel double-ended queue.

 \begin{listing}
 \input{CodeSamples/SMPdesign/lockhdeq@struct_pdeq.fcv}
 \caption{Lock-Based Parallel Double-Ended Queue Data Structure}
 \label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
 \end{listing}

 \Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
 shows the corresponding C-language data structure, assuming an
 existing \co{struct deq} that provides a trivially locked
 double-ended-queue implementation.
 \begin{fcvref}[ln:SMPdesign:lockhdeq:struct_pdeq]
 This data structure contains the left-hand lock on \clnref{llock},
 the left-hand index on \clnref{lidx}, the right-hand lock on \clnref{rlock}
 (which is cache-aligned in the actual implementation),
 the right-hand index on \clnref{ridx}, and, finally, the hashed array
 of simple lock-based double-ended queues on \clnref{bkt}.
 A high-performance implementation would of course use padding or special
 alignment directives to avoid \IX{false sharing}.
 \end{fcvref}

 \begin{listing}
 \input{CodeSamples/SMPdesign/lockhdeq@pop_push.fcv}
 \caption{Lock-Based Parallel Double-Ended Queue Implementation}
 \label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
 \end{listing}

 \Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
 (\path{lockhdeq.c})
 shows the implementation of the enqueue and dequeue functions.\footnote{
 	One could easily create a polymorphic implementation in any
 	number of languages, but doing so is left as an exercise for
 	the reader.}
 Discussion will focus on the left-hand operations, as the right-hand
 operations are trivially derived from them.

 \begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:popl]
 \Clnrefrange{b}{e} show \co{pdeq_pop_l()},
 which left\-/dequeues and returns
 an element if possible, returning \co{NULL} otherwise.
 \Clnref{acq} acquires the left-hand spinlock,
 and \clnref{idx} computes the
 index to be dequeued from.
 \Clnref{deque} dequeues the element, and,
 if \clnref{check} finds the result to be
 non-\co{NULL}, \clnref{record} records the new left-hand index.
 Either way, \clnref{rel} releases the lock, and,
 finally, \clnref{return} returns
 the element if there was one, or \co{NULL} otherwise.
 \end{fcvref}

 \begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:pushl]
 \Clnrefrange{b}{e} show \co{pdeq_push_l()},
 which left-enqueues the specified
 element.
 \Clnref{acq} acquires the left-hand lock,
 and \clnref{idx} picks up the left-hand
 index.
 \Clnref{enque} left-enqueues the specified element
 onto the double-ended queue
 indexed by the left-hand index.
 \Clnref{update} then updates the left-hand index
 and \clnref{rel} releases the lock.
 \end{fcvref}

 As noted earlier, the right-hand operations are completely analogous
 to their left-handed counterparts, so their analysis is left as an
 exercise for the reader.

 \QuickQuiz{
 	Is the hashed double-ended queue a good solution?
 	Why or why not?
 }\QuickQuizAnswer{
 	The best way to answer this is to run \path{lockhdeq.c} on
 	a number of different multiprocessor systems, and you are
 	encouraged to do so in the strongest possible terms.
 	One reason for concern is that each operation on this
 	implementation must acquire not one but two locks.
 	% Getting about 500 nanoseconds per element when used as
 	% a queue on a 4.2GHz Power system.  This is roughly the same as
 	% the version covered by a single lock.  Sequential (unlocked
 	% variant is more than an order of magnitude faster!

 	The first well-designed performance study will be cited.\footnote{
 		The studies by Dalessandro
 		et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
 		and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are
 		excellent starting points.}
 	Do not forget to compare to a sequential implementation!
 }\QuickQuizEnd

 \subsubsection{Compound Double-Ended Queue Revisited}
 \label{sec:SMPdesign:Compound Double-Ended Queue Revisited}

 This section revisits the compound double-ended queue, using a trivial
 rebalancing scheme that moves all the elements from the non-empty
 queue to the now-empty queue.

 \QuickQuiz{
 	Move \emph{all} the elements to the queue that became empty?
 	In what possible universe is this brain-dead solution in any
 	way optimal???
 }\QuickQuizAnswer{
 	It is optimal in the case where data flow switches direction only
 	rarely.
 	It would of course be an extremely poor choice if the double-ended
 	queue was being emptied from both ends concurrently.
 	This of course raises another question, namely, in what possible
 	universe emptying from both ends concurrently would be a reasonable
 	thing to do.
 	Work-stealing queues are one possible answer to this question.
 }\QuickQuizEnd

 In contrast to the hashed implementation presented in
 the previous section, the compound implementation will build on
 a sequential implementation of a double-ended queue that uses
 neither locks nor atomic operations.

 \begin{listing}
 \ebresizeverb{.97}{\input{CodeSamples/SMPdesign/locktdeq@pop_push.fcv}}
 \caption{Compound Parallel Double-Ended Queue Implementation}
 \label{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
 \end{listing}

 \Cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
 shows the implementation.
 Unlike the hashed implementation, this compound implementation is
 asymmetric, so that we must consider the \co{pdeq_pop_l()}
 and \co{pdeq_pop_r()} implementations separately.

 \QuickQuiz{
 	Why can't the compound parallel double-ended queue
 	implementation be symmetric?
 }\QuickQuizAnswer{
 	The need to avoid deadlock by imposing a lock hierarchy
 	forces the asymmetry, just as it does in the fork-numbering
 	solution to the Dining Philosophers Problem
 	(see \cref{sec:SMPdesign:Dining Philosophers Problem}).
 }\QuickQuizEnd

 \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popl]
 The \co{pdeq_pop_l()} implementation is shown on
 \clnrefrange{b}{e}
 of the figure.
 \Clnref{acq:l} acquires the left-hand lock,
 which \clnref{rel:l} releases.
 \Clnref{deq:ll} attempts to left-dequeue an element
 from the left-hand underlying
 double-ended queue, and, if successful,
 skips \clnrefrange{acq:r}{skip} to simply
 return this element.
 Otherwise, \clnref{acq:r} acquires the right-hand lock, \clnref{deq:lr}
 left-dequeues an element from the right-hand queue,
 and \clnref{move} moves any remaining elements on the right-hand
 queue to the left-hand queue, \clnref{init:r} initializes
 the right-hand queue,
 and \clnref{rel:r} releases the right-hand lock.
 The element, if any, that was dequeued on \clnref{deq:lr} will be returned.
 \end{fcvref}

 \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr]
 The \co{pdeq_pop_r()} implementation is shown on \clnrefrange{b}{e}
 of the figure.
 As before, \clnref{acq:r1} acquires the right-hand lock
 (and \clnref{rel:r2}
 releases it), and \clnref{deq:rr1} attempts to right-dequeue an element
 from the right-hand queue, and, if successful,
 skips \clnrefrange{rel:r1}{skip2}
 to simply return this element.
 However, if \clnref{check1} determines that there was no element to dequeue,
 \clnref{rel:r1} releases the right-hand lock and
 \clnrefrange{acq:l}{acq:r2} acquire both
 locks in the proper order.
 \Clnref{deq:rr2} then attempts to right-dequeue an element
 from the right-hand
 list again, and if \clnref{check2} determines that this second attempt has
 failed, \clnref{deq:rl} right-dequeues an element from the left-hand queue
 (if there is one available), \clnref{move} moves any remaining elements
 from the left-hand queue to the right-hand queue, and \clnref{init:l}
 initializes the left-hand queue.
 Either way, \clnref{rel:l} releases the left-hand lock.
 \end{fcvref}

 \QuickQuizSeries{%
 \QuickQuizB{
 	Why is it necessary to retry the right-dequeue operation
 	on \clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:deq:rr2} of
 	\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
 }\QuickQuizAnswerB{
 	\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr]
 	This retry is necessary because some other thread might have
 	enqueued an element between the time that this thread dropped
 	\co{d->rlock} on \clnref{rel:r1} and the time that it reacquired this
 	same lock on \clnref{acq:r2}.
 	\end{fcvref}
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	Surely the left-hand lock must \emph{sometimes} be available!!!
 	So why is it necessary that
 	\clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:rel:r1} of
 	\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
 	unconditionally release the right-hand lock?
 }\QuickQuizAnswerE{
 	It would be possible to use \co{spin_trylock()} to attempt
 	to acquire the left-hand lock when it was available.
 	However, the failure case would still need to drop the
 	right-hand lock and then re-acquire the two locks in order.
 	Making this transformation (and determining whether or not
 	it is worthwhile) is left as an exercise for the reader.
 }\QuickQuizEndE
 }

 \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushl]
 The \co{pdeq_push_l()} implementation is shown on
 \clnrefrange{b}{e} of
 \cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}.
 \Clnref{acq:l} acquires the left-hand spinlock,
 \clnref{que:l} left-enqueues the
 element onto the left-hand queue, and finally \clnref{rel:l} releases
 the lock.
 \end{fcvref}
 \begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushr]
 The \co{pdeq_push_r()} implementation (shown on \clnrefrange{b}{e})
 is quite similar.
 \end{fcvref}

 \QuickQuiz{
 	But in the case where data is flowing in only one direction,
 	the algorithm shown in
 	\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
 	will have both ends attempting to acquire the same lock
 	whenever the consuming end empties its underlying
 	double-ended queue.
 	Doesn't that mean that sometimes this algorithm fails to
 	provide concurrent access to both ends of the queue even
 	when the queue contains an arbitrarily large number of elements?
 }\QuickQuizAnswer{
 	Indeed it does!

 	But the same is true of other algorithms claiming this property.
 	For example, in solutions using software transactional memory
 	mechanisms based on hashed arrays of locks,
 	the leftmost and rightmost elements' addresses will sometimes
 	happen to hash to the same lock.
 	These hash collisions will also prevent concurrent access.
 	For another example, solutions using hardware transactional
 	memory mechanisms with software
 	fallbacks~\cite{Yoo:2013:PEI:2503210.2503232,RickMerrit2011PowerTM,ChristianJacobi2012MainframeTM}
 	often use locking within those software fallbacks, and thus
 	suffer (albeit hopefully rarely) from whatever concurrency
 	limitations that these locking solutions suffer from.

 	Therefore, as of 2021, all practical solutions to the
 	concurrent double-ended queue problem fail to provide full
 	concurrency in at least some circumstances, including the
 	compound double-ended queue.
 }\QuickQuizEnd

 \subsubsection{Double-Ended Queue Discussion}
 \label{sec:SMPdesign:Double-Ended Queue Discussion}

 The compound implementation is somewhat more complex than the
 hashed variant presented in
 \cref{sec:SMPdesign:Hashed Double-Ended Queue},
 but is still reasonably simple.
 Of course, a more intelligent rebalancing scheme could be arbitrarily
 complex, but the simple scheme shown here has been shown to
 perform well compared to software
 alternatives~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
 and even compared to algorithms using hardware
 assist~\cite{DavidDice:2010:SCA:HTM:deque}.
 Nevertheless, the best we can hope for from such a scheme
 is 2x scalability, as at most two threads can be holding the
 dequeue's locks concurrently.
 This limitation also applies to algorithms based on \IXacrl{nbs},
 such as the compare-and-swap-based dequeue algorithm of
 Michael~\cite{DBLP:conf/europar/Michael03}.\footnote{
 	This paper is interesting in that it showed that special
 	double-compare-and-swap (DCAS) instructions are not needed
 	for lock-free implementations of double-ended queues.
 	Instead, the common compare-and-swap (e.g., x86 cmpxchg)
 	suffices.}

 \QuickQuiz{
 	Why are there not one but two solutions to the double-ended queue
 	problem?
 }\QuickQuizAnswer{
 	There are actually at least three.
 	The third, by Dominik Dingel, makes interesting use of
 	reader-writer locking, and may be found in \path{lockrwdeq.c}.

 	And so there is not one, but rather three solutions to the
 	lock-based double-ended queue problem on
 	\cpageref{sec:SMPdesign:Problems Double-Ended Queue}!
 }\QuickQuizEnd

 In fact, as noted by Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque},
 an unsynchronized single-threaded double-ended queue significantly
 outperforms any of the parallel implementations they studied.
 Therefore, the key point is that there can be significant overhead enqueuing to
 or dequeuing from a shared queue, regardless of implementation.
 This should come as no surprise in light of the material in
 \cref{chp:Hardware and its Habits}, given the strict
 first-in-first-out (FIFO) nature of these queues.

 Furthermore, these strict FIFO queues are strictly FIFO only with
 respect to
 \emph{linearization points}~\cite{Herlihy:1990:LCC:78969.78972}\footnote{
 	In short, a linearization point is a single point within a given
 	function where that function can be said to have taken effect.
 	In this lock-based implementation, the linearization points
 	can be said to be anywhere within the critical section that
 	does the work.}
 that are not visible to the caller, in fact, in these examples,
 the linearization points are buried in the lock-based critical
 sections.
 These queues are not strictly FIFO with respect to (say) the times at which
 the individual operations started~\cite{AndreasHaas2012FIFOisnt}.
 This indicates that the strict FIFO property is not all that valuable
 in concurrent programs, and in fact, Kirsch et al.\ present less-strict
 queues that provide improved performance and
 scalability~\cite{ChristophMKirsch2012FIFOisntTR}.\footnote{
 	Nir Shavit produced relaxed stacks for roughly the same
 	reasons~\cite{Shavit:2011:DSM:1897852.1897873}.
 	This situation leads some to believe that the linearization
 	points are useful to theorists rather than developers, and
 	leads others to wonder to what extent the designers of such
 	data structures and algorithms were considering the needs
 	of their users.}
 All that said, if you are pushing all the data used by your concurrent
 program through a single queue, you really need to rethink your
 overall design.

 \subsection{Partitioning Example Discussion}
 \label{sec:SMPdesign:Partitioning Example Discussion}

 The optimal solution to the dining philosophers problem given in
 the answer to the Quick Quiz in
 \cref{sec:SMPdesign:Dining Philosophers Problem}
 is an excellent example of ``horizontal parallelism'' or
 ``data parallelism''.
 The synchronization overhead in this case is nearly (or even exactly)
 zero.
 In contrast, the double-ended
 queue implementations are examples of ``vertical parallelism'' or
 ``pipelining'', given that data moves from one thread to another.
 The tighter coordination required for pipelining in turn requires
 larger units of work to obtain a given level of \IX{efficiency}.

 \QuickQuizSeries{%
 \QuickQuizB{
 	The tandem double-ended queue runs about twice as fast as
 	the hashed double-ended queue, even when I increase the
 	size of the hash table to an insanely large number.
 	Why is that?
 }\QuickQuizAnswerB{
 	The hashed double-ended queue's locking design only permits
 	one thread at a time at each end, and further requires
 	two lock acquisitions for each operation.
 	The tandem double-ended queue also permits one thread at a time
 	at each end, and in the common case requires only one lock
 	acquisition per operation.
 	Therefore, the tandem double-ended queue should be expected to
 	outperform the hashed double-ended queue.

 	Can you create a double-ended queue that allows multiple
 	concurrent operations at each end?
 	If so, how?
 	If not, why not?
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	Is there a significantly better way of handling concurrency
 	for double-ended queues?
 }\QuickQuizAnswerE{
 	One approach is to transform the problem to be solved
 	so that multiple double-ended queues can be used in parallel,
 	allowing the simpler single-lock double-ended queue to be used,
 	and perhaps also replace each double-ended queue with a pair of
 	conventional single-ended queues.
 	Without such ``horizontal scaling'', the speedup is limited
 	to 2.0.
 	In contrast, horizontal-scaling designs can achieve very large
 	speedups, and are especially attractive if there are multiple threads
 	working either end of the queue, because in this
 	multiple-thread case the dequeue
 	simply cannot provide strong ordering guarantees.
 	After all, the fact that a given thread removed an item first
 	in no way implies that it will process that item
 	first~\cite{AndreasHaas2012FIFOisnt}.
 	And if there are no guarantees, we may as well obtain the
 	performance benefits that come with refusing to provide these
 	guarantees.
 	% about twice as fast as hashed version on 4.2GHz Power.

 	Regardless of whether or not the problem can be transformed
 	to use multiple queues, it is worth asking whether work can
 	be batched so that each enqueue and dequeue operation corresponds
 	to larger units of work.
 	This batching approach decreases contention on the queue data
 	structures, which increases both performance and scalability,
 	as will be seen in
 	\cref{sec:SMPdesign:Synchronization Granularity}.
 	After all, if you must incur high synchronization overheads,
 	be sure you are getting your money's worth.

 	Other researchers are working on other ways to take advantage
 	of limited ordering guarantees in
 	queues~\cite{ChristophMKirsch2012FIFOisntTR}.
 }\QuickQuizEndE
 }

 These two examples show just how powerful partitioning can be in
 devising parallel algorithms.
 \Cref{sec:SMPdesign:Locking Granularity and Performance}
 looks briefly at a third example, matrix multiply.
 However, all three of these examples beg for more and better design
 criteria for parallel programs, a topic taken up in the next section.