blob: d21515225c280b1722c0d7c70d088146ab57c675 [file] [log] [blame]
% SMPdesign/partexercises.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Partitioning Exercises}
\label{sec:SMPdesign:Partitioning Exercises}
%
\epigraph{Whenever a theory appears to you as the only possible one,
take this as a sign that you have neither understood the theory
nor the problem which it was intended to solve.}
{Karl Popper}
Although partitioning is more widely understood than it was in the early
2000s, its value is still underappreciated.
\Cref{sec:SMPdesign:Dining Philosophers Problem}
therefore takes more highly parallel look at the classic Dining
Philosophers problem and
\cref{sec:SMPdesign:Double-Ended Queue}
revisits the double-ended queue.
\subsection{Dining Philosophers Problem}
\label{sec:SMPdesign:Dining Philosophers Problem}
\begin{figure}
\centering
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5}
\caption{Dining Philosophers Problem}
\ContributedBy{fig:SMPdesign:Dining Philosophers Problem}{Kornilios Kourtis}
\end{figure}
\Cref{fig:SMPdesign:Dining Philosophers Problem} shows a diagram
of the classic \IX{Dining Philosophers problem}~\cite{Dijkstra1971HOoSP}.
This problem features five philosophers who do nothing but think and
eat a ``very difficult kind of spaghetti'' which requires two forks
to eat.\footnote{
But feel free to instead think in terms of chopsticks.}
A given philosopher is permitted to use only the forks to his or her
immediate right and left, but will not put a given fork down until sated.
\begin{figure*}
\centering
\IfTwoColumn{
\resizebox{5in}{!}{\includegraphics{cartoons/Dining-philosophers}}
}{
\resizebox{\onecolumntextwidth}{!}{\includegraphics{cartoons/Dining-philosophers}}
}
\caption{Partial Starvation Is Also Bad}
\ContributedBy{fig:cpu:Partial Starvation Is Also Bad}{Melissa Broussard}
\end{figure*}
The object is to construct an algorithm that, quite literally,
prevents \IX{starvation}.
One starvation scenario would be if all of the philosophers picked up
their leftmost forks simultaneously.
Because none of them will put down their fork until after they finished
eating, and because none of them may pick up their second fork until at
least one of them has finished eating, they all starve.
Please note that it is not sufficient to allow at least one philosopher
to eat.
As \cref{fig:cpu:Partial Starvation Is Also Bad}
shows, starvation of even a few of the philosophers is to be avoided.
\begin{figure}
\centering
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5TB}
\caption{Dining Philosophers Problem, Textbook Solution}
\ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}{Kornilios Kourtis}
\end{figure}
\pplsur{Edsger W.}{Dijkstra}'s solution used a global semaphore,
which works fine assuming
negligible communications delays, an assumption that became invalid
in the late 1980s or early 1990s.\footnote{
It is all too easy to denigrate Dijkstra from the viewpoint
of the year 2021, more than 50 years after the fact.
If you still feel the need to denigrate Dijkstra, my advice
is to publish something, wait 50 years, and then see
how well \emph{your} ideas stood the test of time.}
More recent solutions number the forks as shown in
\cref{fig:SMPdesign:Dining Philosophers Problem; Textbook Solution}.
Each philosopher picks up the lowest-numbered fork next to his or her
plate, then picks up the other fork.
The philosopher sitting in the uppermost position in the diagram thus
picks up the leftmost fork first, then the rightmost fork, while the
rest of the philosophers instead pick up their rightmost fork first.
Because two of the philosophers will attempt to pick up fork~1 first,
and because only one of those two philosophers will succeed,
there will be five forks available to four philosophers.
At least one of these four will have two forks, and will thus be able
to eat.
This general technique of numbering resources and acquiring them in
numerical order is heavily used as a deadlock-prevention technique.
However, it is easy to imagine a sequence of events that will result
in only one philosopher eating at a time even though all are hungry:
\begin{enumerate}
\item P2 picks up fork~1, preventing P1 from taking a fork.
\item P3 picks up fork~2.
\item P4 picks up fork~3.
\item P5 picks up fork~4.
\item P5 picks up fork~5 and eats.
\item P5 puts down forks~4 and 5.
\item P4 picks up fork~4 and eats.
\end{enumerate}
In short, this algorithm can result in only one philosopher eating at
a given time, even when all five philosophers are hungry,
despite the fact that there are more than enough forks for two
philosophers to eat concurrently.
It should be possible to do better than this!
\begin{figure}
\centering
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher4part-b}
\caption{Dining Philosophers Problem, Partitioned}
\ContributedBy{fig:SMPdesign:Dining Philosophers Problem; Partitioned}{Kornilios Kourtis}
\end{figure}
One approach is shown in
\cref{fig:SMPdesign:Dining Philosophers Problem; Partitioned},
which includes four philosophers rather than five to better illustrate the
partition technique.
Here the upper and rightmost philosophers share a pair of forks,
while the lower and leftmost philosophers share another pair of forks.
If all philosophers are simultaneously hungry, at least two will
always be able to eat concurrently.
In addition, as shown in the figure, the forks can now be bundled
so that the pair are picked up and put down simultaneously, simplifying
the acquisition and release algorithms.
\QuickQuizSeries{%
\QuickQuizB{
Is there a better solution to the Dining
Philosophers Problem?
}\QuickQuizAnswerB{
One such improved solution is shown in
\cref{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned},
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, this approach offers greatly improved disease control.
\begin{figure}
\centering
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}
\caption{Dining Philosophers Problem, Fully Partitioned}
\QContributedBy{fig:SMPdesign:Dining Philosophers Problem; Fully Partitioned}{Kornilios Kourtis}
\end{figure}
This solution might seem like cheating to some, but such
``cheating'' is key to finding good solutions to many
concurrency problems, as any hungry philosopher would agree.
And this is one solution to the Dining Philosophers
concurrent-consumption problem called out on
\cpageref{sec:SMPdesign:Problems Dining Philosophers}.
}\QuickQuizEndB
%
\QuickQuizE{
How would you valididate an algorithm alleged to solve the Dining
Philosophers Problem?
}\QuickQuizAnswerE{
Much depends on the details of the algorithm, but here are a
couple of places to start.
First, for algorithms in which picking up left-hand and right-hand
forks are separate operations, start with all forks on the table.
Then have all philosophers attempt to pick up their first fork.
Once all philosophers either have their first fork or are waiting
for someone to put down their first fork, have each non-waiting
philosopher pick up their second fork.
At this point in any starvation-free solution, at least one
philosopher will be eating.
If there were any waiting philosophers, repeat this test,
preferably imposing random variations in timing.
Second, create a stress test in which philosphers start and
stop eating at random times.
Generate starvation and fairness conditions and verify that
these conditions are met.
Here are a couple of example starvation and fairness conditions:
\begin{enumerate}
\item If all other philosophers have stopped eating $N$ times
since a given philosopher attempted to pick up a given
fork, that philosopher should have succeeded in picking
up that fork.
For high-quality solutions using high-quality locking
primitives (or high-quality atomic operations), $N=1$
is doable.
\item Given an upper bound $T$ on the time any philosopher holds
onto both forks before putting them down, the maximum
waiting time for any philosopher should be bounded by
$NT$ for some $N$ that is not hugely larger than the
number of philosophers.
\item Generate some statistic representing the time from
when philosophers attempt to pick up their first fork
to the time when they start eating.
The smaller this statistic, the better the solution.
Mean, median, and maximum are all useful statistics,
but examining the full distribution can also be
enlightening.
\end{enumerate}
Readers are encouraged to actually try testing any of the
solutions presented in this book, and especially testing solutions
of their own devising.
}\QuickQuizEndE
}
This is an example of ``horizontal parallelism''~\cite{Inman85}
or ``data parallelism'',
so named because there is no dependency among the pairs of philosophers.
In a horizontally parallel data-processing system, a given item of data
would be processed by only one of a replicated set of software
components.
\QuickQuiz{
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
}\QuickQuizAnswer{
Jack Inman~\cite{Inman85}
was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
}\QuickQuizEnd
\subsection{Double-Ended Queue}
\label{sec:SMPdesign:Double-Ended Queue}
A double-ended queue is a data structure containing a list of elements
that may be inserted or removed from either end~\cite{Knuth73}.
It has been claimed that a lock-based implementation permitting
concurrent operations on both ends of the double-ended queue is
difficult~\cite{DanGrossman2007TMGCAnalogy}.
This section shows how a partitioning design strategy can result
in a reasonably simple implementation, looking at three
general approaches in the following sections.
But first, how should we validate a concurrent double-ended queue?
\subsubsection{Double-Ended Queue Validation}
\label{sec:SMPdesign:Double-Ended Queue Validation}
A good place to start is with invariants.
For example, if elements are pushed onto one end of a double-ended
queue and popped off of the other, the order of those elements must
be preserved.
Similarly, if elements are pushed onto one end of the queue and popped
off of that same end, the order of those elements must be reversed.
Any element popped from the queue must have been most recently pushed
onto that queue, and if the queue is emptied, all elements pushed onto
it must have already been popped from it.
The beginnings of a test suite for concurrent double-ended queues
(``\path{deqtorture.h}'') provides the following checks:
\begin{enumerate}
\item Element-ordering checks provided by \co{CHECK_SEQUENCE_PAIR()}.
\item Checks that elements popped were most recently pushed, provided
by \co{melee()}.
\item Checks that elements pushed are popped before the queue is
emptied, also provided by \co{melee()}.
\end{enumerate}
This suite includes both sequential and concurrent tests.
Although this suite is good and sufficient for textbook code, you
should test considerably more thoroughly for code intended for
production use.
\Cref{chp:Validation,chp:Formal Verification} cover a large array
of validation tools and techniques.
But with a prototype test suite in place, we are ready to look at the
double-ended-queue algorithms in the next sections.
\subsubsection{Left- and Right-Hand Locks}
\label{sec:SMPdesign:Left- and Right-Hand Locks}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeq}}
\caption{Double-Ended Queue With Left- and Right-Hand Locks}
\label{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}
\end{figure}
One seemingly straightforward approach would be to use a doubly
linked list with a left-hand lock
for left-hand-end enqueue and dequeue operations along with a right-hand
lock for right-hand-end operations, as shown in
\cref{fig:SMPdesign:Double-Ended Queue With Left- and Right-Hand Locks}.
However, the problem with this approach is that the two locks'
domains must overlap when there are fewer than four elements on the
list.
This overlap is due to the fact that removing any given element affects
not only that element, but also its left- and right-hand neighbors.
These domains are indicated by color in the figure, with blue
with downward stripes indicating
the domain of the left-hand lock, red with upward stripes
indicating the domain of the right-hand
lock, and purple (with no stripes) indicating overlapping domains.
Although it is possible to create an algorithm that works this way,
perhaps using a dummy element similar to the two-lock queue
presented by Michael and Scott~\cite{MichaelScott96},
the fact that it has no fewer than five special cases should raise
a big red flag, especially given that concurrent activity at the other
end of the list can shift the queue from one special case to another
at any time.
It is far better to consider other designs.
\subsubsection{Compound Double-Ended Queue}
\label{sec:SMPdesign:Compound Double-Ended Queue}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqpair}}
\caption{Compound Double-Ended Queue}
\label{fig:SMPdesign:Compound Double-Ended Queue}
\end{figure}
One way of forcing non-overlapping lock domains is shown in
\cref{fig:SMPdesign:Compound Double-Ended Queue}.
Two separate double-ended queues are run in tandem, each protected by
its own lock.
This means that elements must occasionally be shuttled from one of
the double-ended queues to the other, in which case both locks must
be held.
A simple lock hierarchy may be used to avoid \IX{deadlock}, for example,
always acquiring the left-hand lock before acquiring the right-hand lock.
This will be much simpler than applying two locks to the same
double-ended queue, as we can unconditionally left-enqueue elements
to the left-hand queue and right-enqueue elements to the right-hand
queue.
The main complication arises when dequeuing from an empty queue, in
which case it is necessary to:
\begin{enumerate}
\item If holding the right-hand lock, release it and acquire the
left-hand lock.
\item Acquire the right-hand lock.
\item Rebalance the elements across the two queues.
\item Remove the required element if there is one.
\item Release both locks.
\end{enumerate}
\QuickQuiz{
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
}\QuickQuizAnswer{
In this case, simply dequeue an item from the non-empty
queue, release both locks, and return.
}\QuickQuizEnd
The resulting code (\path{locktdeq.c}) is quite straightforward.
The rebalancing operation might well shuttle a given element back
and forth between the two queues, wasting time and possibly requiring
workload-dependent heuristics to obtain optimal performance.
Although this might well be the best approach in some cases, it is
interesting to try for an algorithm with greater determinism.
\subsubsection{Hashed Double-Ended Queue}
\label{sec:SMPdesign:Hashed Double-Ended Queue}
One of the simplest and most effective ways to deterministically
partition a data structure is to hash it.
It is possible to trivially hash a double-ended queue by assigning
each element a sequence number based on its position in the list,
so that the first element left-enqueued into an empty queue is numbered
zero and the first element right-enqueued into an empty queue is numbered
one.
A series of elements left-enqueued into an otherwise-idle queue would
be assigned decreasing numbers ($-1$, $-2$, $-3$, \ldots), while a series of
elements right-enqueued into an otherwise-idle queue would be assigned
increasing numbers (2, 3, 4, \ldots).
A key point is that it is not necessary to actually represent a given
element's number, as this number will be implied by its position in
the queue.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash}}
\caption{Hashed Double-Ended Queue}
\label{fig:SMPdesign:Hashed Double-Ended Queue}
\end{figure}
Given this approach, we assign one lock to guard the left-hand index,
one to guard the right-hand index, and one lock for each hash chain.
\Cref{fig:SMPdesign:Hashed Double-Ended Queue} shows the resulting
data structure given four hash chains.
Note that the lock domains do not overlap, and that deadlock is avoided
by acquiring the index locks before the chain locks, and by never
acquiring more than one lock of a given type (index or chain) at a time.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/lockdeqhash1R}}
\caption{Hashed Double-Ended Queue After Insertions}
\label{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
\end{figure}
Each hash chain is itself a double-ended queue, and in this example,
each holds every fourth element.
The uppermost portion of
\cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions}
shows the state after a single element (``R$_1$'') has been
right-enqueued, with the right-hand index having been incremented to
reference hash chain~2.
The middle portion of this same figure shows the state after
three more elements have been right-enqueued.
As you can see, the indexes are back to their initial states
(see \cref{fig:SMPdesign:Hashed Double-Ended Queue}), however,
each hash chain is now non-empty.
The lower portion of this figure shows the state after three additional
elements have been left-enqueued and an additional element has been
right-enqueued.
From the last state shown in
\cref{fig:SMPdesign:Hashed Double-Ended Queue After Insertions},
a left-dequeue operation would return element ``L$_{-2}$'' and leave
the left-hand index referencing hash chain~2, which would then
contain only a single element (``R$_2$'').
In this state, a left-enqueue running concurrently with a right-enqueue
would result in \IX{lock contention}, but the probability of such contention
can be reduced to arbitrarily low levels by using a larger hash table.
\begin{figure}
\centering
\resizebox{1.5in}{!}{\includegraphics{SMPdesign/lockdeqhashlots}}
\caption{Hashed Double-Ended Queue With 16 Elements}
\label{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements}
\end{figure}
\Cref{fig:SMPdesign:Hashed Double-Ended Queue With 16 Elements}
shows how 16 elements would be organized in a four-hash-bucket
parallel double-ended queue.
Each underlying single-lock double-ended queue holds a one-quarter
slice of the full parallel double-ended queue.
\begin{listing}
\input{CodeSamples/SMPdesign/lockhdeq@struct_pdeq.fcv}
\caption{Lock-Based Parallel Double-Ended Queue Data Structure}
\label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
\end{listing}
\Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Data Structure}
shows the corresponding C-language data structure, assuming an
existing \co{struct deq} that provides a trivially locked
double-ended-queue implementation.
\begin{fcvref}[ln:SMPdesign:lockhdeq:struct_pdeq]
This data structure contains the left-hand lock on \clnref{llock},
the left-hand index on \clnref{lidx}, the right-hand lock on \clnref{rlock}
(which is cache-aligned in the actual implementation),
the right-hand index on \clnref{ridx}, and, finally, the hashed array
of simple lock-based double-ended queues on \clnref{bkt}.
A high-performance implementation would of course use padding or special
alignment directives to avoid \IX{false sharing}.
\end{fcvref}
\begin{listing}
\input{CodeSamples/SMPdesign/lockhdeq@pop_push.fcv}
\caption{Lock-Based Parallel Double-Ended Queue Implementation}
\label{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
\end{listing}
\Cref{lst:SMPdesign:Lock-Based Parallel Double-Ended Queue Implementation}
(\path{lockhdeq.c})
shows the implementation of the enqueue and dequeue functions.\footnote{
One could easily create a polymorphic implementation in any
number of languages, but doing so is left as an exercise for
the reader.}
Discussion will focus on the left-hand operations, as the right-hand
operations are trivially derived from them.
\begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:popl]
\Clnrefrange{b}{e} show \co{pdeq_pop_l()},
which left\-/dequeues and returns
an element if possible, returning \co{NULL} otherwise.
\Clnref{acq} acquires the left-hand spinlock,
and \clnref{idx} computes the
index to be dequeued from.
\Clnref{deque} dequeues the element, and,
if \clnref{check} finds the result to be
non-\co{NULL}, \clnref{record} records the new left-hand index.
Either way, \clnref{rel} releases the lock, and,
finally, \clnref{return} returns
the element if there was one, or \co{NULL} otherwise.
\end{fcvref}
\begin{fcvref}[ln:SMPdesign:lockhdeq:pop_push:pushl]
\Clnrefrange{b}{e} show \co{pdeq_push_l()},
which left-enqueues the specified
element.
\Clnref{acq} acquires the left-hand lock,
and \clnref{idx} picks up the left-hand
index.
\Clnref{enque} left-enqueues the specified element
onto the double-ended queue
indexed by the left-hand index.
\Clnref{update} then updates the left-hand index
and \clnref{rel} releases the lock.
\end{fcvref}
As noted earlier, the right-hand operations are completely analogous
to their left-handed counterparts, so their analysis is left as an
exercise for the reader.
\QuickQuiz{
Is the hashed double-ended queue a good solution?
Why or why not?
}\QuickQuizAnswer{
The best way to answer this is to run \path{lockhdeq.c} on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
% Getting about 500 nanoseconds per element when used as
% a queue on a 4.2GHz Power system. This is roughly the same as
% the version covered by a single lock. Sequential (unlocked
% variant is more than an order of magnitude faster!
The first well-designed performance study will be cited.\footnote{
The studies by Dalessandro
et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are
excellent starting points.}
Do not forget to compare to a sequential implementation!
}\QuickQuizEnd
\subsubsection{Compound Double-Ended Queue Revisited}
\label{sec:SMPdesign:Compound Double-Ended Queue Revisited}
This section revisits the compound double-ended queue, using a trivial
rebalancing scheme that moves all the elements from the non-empty
queue to the now-empty queue.
\QuickQuiz{
Move \emph{all} the elements to the queue that became empty?
In what possible universe is this brain-dead solution in any
way optimal???
}\QuickQuizAnswer{
It is optimal in the case where data flow switches direction only
rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable
thing to do.
Work-stealing queues are one possible answer to this question.
}\QuickQuizEnd
In contrast to the hashed implementation presented in
the previous section, the compound implementation will build on
a sequential implementation of a double-ended queue that uses
neither locks nor atomic operations.
\begin{listing}
\ebresizeverb{.97}{\input{CodeSamples/SMPdesign/locktdeq@pop_push.fcv}}
\caption{Compound Parallel Double-Ended Queue Implementation}
\label{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
\end{listing}
\Cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
shows the implementation.
Unlike the hashed implementation, this compound implementation is
asymmetric, so that we must consider the \co{pdeq_pop_l()}
and \co{pdeq_pop_r()} implementations separately.
\QuickQuiz{
Why can't the compound parallel double-ended queue
implementation be symmetric?
}\QuickQuizAnswer{
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem
(see \cref{sec:SMPdesign:Dining Philosophers Problem}).
}\QuickQuizEnd
\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popl]
The \co{pdeq_pop_l()} implementation is shown on
\clnrefrange{b}{e}
of the figure.
\Clnref{acq:l} acquires the left-hand lock,
which \clnref{rel:l} releases.
\Clnref{deq:ll} attempts to left-dequeue an element
from the left-hand underlying
double-ended queue, and, if successful,
skips \clnrefrange{acq:r}{skip} to simply
return this element.
Otherwise, \clnref{acq:r} acquires the right-hand lock, \clnref{deq:lr}
left-dequeues an element from the right-hand queue,
and \clnref{move} moves any remaining elements on the right-hand
queue to the left-hand queue, \clnref{init:r} initializes
the right-hand queue,
and \clnref{rel:r} releases the right-hand lock.
The element, if any, that was dequeued on \clnref{deq:lr} will be returned.
\end{fcvref}
\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr]
The \co{pdeq_pop_r()} implementation is shown on \clnrefrange{b}{e}
of the figure.
As before, \clnref{acq:r1} acquires the right-hand lock
(and \clnref{rel:r2}
releases it), and \clnref{deq:rr1} attempts to right-dequeue an element
from the right-hand queue, and, if successful,
skips \clnrefrange{rel:r1}{skip2}
to simply return this element.
However, if \clnref{check1} determines that there was no element to dequeue,
\clnref{rel:r1} releases the right-hand lock and
\clnrefrange{acq:l}{acq:r2} acquire both
locks in the proper order.
\Clnref{deq:rr2} then attempts to right-dequeue an element
from the right-hand
list again, and if \clnref{check2} determines that this second attempt has
failed, \clnref{deq:rl} right-dequeues an element from the left-hand queue
(if there is one available), \clnref{move} moves any remaining elements
from the left-hand queue to the right-hand queue, and \clnref{init:l}
initializes the left-hand queue.
Either way, \clnref{rel:l} releases the left-hand lock.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
Why is it necessary to retry the right-dequeue operation
on \clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:deq:rr2} of
\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
}\QuickQuizAnswerB{
\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:popr]
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
\co{d->rlock} on \clnref{rel:r1} and the time that it reacquired this
same lock on \clnref{acq:r2}.
\end{fcvref}
}\QuickQuizEndB
%
\QuickQuizE{
Surely the left-hand lock must \emph{sometimes} be available!!!
So why is it necessary that
\clnrefr{ln:SMPdesign:locktdeq:pop_push:popr:rel:r1} of
\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
unconditionally release the right-hand lock?
}\QuickQuizAnswerE{
It would be possible to use \co{spin_trylock()} to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
}\QuickQuizEndE
}
\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushl]
The \co{pdeq_push_l()} implementation is shown on
\clnrefrange{b}{e} of
\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}.
\Clnref{acq:l} acquires the left-hand spinlock,
\clnref{que:l} left-enqueues the
element onto the left-hand queue, and finally \clnref{rel:l} releases
the lock.
\end{fcvref}
\begin{fcvref}[ln:SMPdesign:locktdeq:pop_push:pushr]
The \co{pdeq_push_r()} implementation (shown on \clnrefrange{b}{e})
is quite similar.
\end{fcvref}
\QuickQuiz{
But in the case where data is flowing in only one direction,
the algorithm shown in
\cref{lst:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
will have both ends attempting to acquire the same lock
whenever the consuming end empties its underlying
double-ended queue.
Doesn't that mean that sometimes this algorithm fails to
provide concurrent access to both ends of the queue even
when the queue contains an arbitrarily large number of elements?
}\QuickQuizAnswer{
Indeed it does!
But the same is true of other algorithms claiming this property.
For example, in solutions using software transactional memory
mechanisms based on hashed arrays of locks,
the leftmost and rightmost elements' addresses will sometimes
happen to hash to the same lock.
These hash collisions will also prevent concurrent access.
For another example, solutions using hardware transactional
memory mechanisms with software
fallbacks~\cite{Yoo:2013:PEI:2503210.2503232,RickMerrit2011PowerTM,ChristianJacobi2012MainframeTM}
often use locking within those software fallbacks, and thus
suffer (albeit hopefully rarely) from whatever concurrency
limitations that these locking solutions suffer from.
Therefore, as of 2021, all practical solutions to the
concurrent double-ended queue problem fail to provide full
concurrency in at least some circumstances, including the
compound double-ended queue.
}\QuickQuizEnd
\subsubsection{Double-Ended Queue Discussion}
\label{sec:SMPdesign:Double-Ended Queue Discussion}
The compound implementation is somewhat more complex than the
hashed variant presented in
\cref{sec:SMPdesign:Hashed Double-Ended Queue},
but is still reasonably simple.
Of course, a more intelligent rebalancing scheme could be arbitrarily
complex, but the simple scheme shown here has been shown to
perform well compared to software
alternatives~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and even compared to algorithms using hardware
assist~\cite{DavidDice:2010:SCA:HTM:deque}.
Nevertheless, the best we can hope for from such a scheme
is 2x scalability, as at most two threads can be holding the
dequeue's locks concurrently.
This limitation also applies to algorithms based on \IXacrl{nbs},
such as the compare-and-swap-based dequeue algorithm of
Michael~\cite{DBLP:conf/europar/Michael03}.\footnote{
This paper is interesting in that it showed that special
double-compare-and-swap (DCAS) instructions are not needed
for lock-free implementations of double-ended queues.
Instead, the common compare-and-swap (e.g., x86 cmpxchg)
suffices.}
\QuickQuiz{
Why are there not one but two solutions to the double-ended queue
problem?
}\QuickQuizAnswer{
There are actually at least three.
The third, by Dominik Dingel, makes interesting use of
reader-writer locking, and may be found in \path{lockrwdeq.c}.
And so there is not one, but rather three solutions to the
lock-based double-ended queue problem on
\cpageref{sec:SMPdesign:Problems Double-Ended Queue}!
}\QuickQuizEnd
In fact, as noted by Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque},
an unsynchronized single-threaded double-ended queue significantly
outperforms any of the parallel implementations they studied.
Therefore, the key point is that there can be significant overhead enqueuing to
or dequeuing from a shared queue, regardless of implementation.
This should come as no surprise in light of the material in
\cref{chp:Hardware and its Habits}, given the strict
first-in-first-out (FIFO) nature of these queues.
Furthermore, these strict FIFO queues are strictly FIFO only with
respect to
\emph{linearization points}~\cite{Herlihy:1990:LCC:78969.78972}\footnote{
In short, a linearization point is a single point within a given
function where that function can be said to have taken effect.
In this lock-based implementation, the linearization points
can be said to be anywhere within the critical section that
does the work.}
that are not visible to the caller, in fact, in these examples,
the linearization points are buried in the lock-based critical
sections.
These queues are not strictly FIFO with respect to (say) the times at which
the individual operations started~\cite{AndreasHaas2012FIFOisnt}.
This indicates that the strict FIFO property is not all that valuable
in concurrent programs, and in fact, Kirsch et al.\ present less-strict
queues that provide improved performance and
scalability~\cite{ChristophMKirsch2012FIFOisntTR}.\footnote{
Nir Shavit produced relaxed stacks for roughly the same
reasons~\cite{Shavit:2011:DSM:1897852.1897873}.
This situation leads some to believe that the linearization
points are useful to theorists rather than developers, and
leads others to wonder to what extent the designers of such
data structures and algorithms were considering the needs
of their users.}
All that said, if you are pushing all the data used by your concurrent
program through a single queue, you really need to rethink your
overall design.
\subsection{Partitioning Example Discussion}
\label{sec:SMPdesign:Partitioning Example Discussion}
The optimal solution to the dining philosophers problem given in
the answer to the Quick Quiz in
\cref{sec:SMPdesign:Dining Philosophers Problem}
is an excellent example of ``horizontal parallelism'' or
``data parallelism''.
The synchronization overhead in this case is nearly (or even exactly)
zero.
In contrast, the double-ended
queue implementations are examples of ``vertical parallelism'' or
``pipelining'', given that data moves from one thread to another.
The tighter coordination required for pipelining in turn requires
larger units of work to obtain a given level of \IX{efficiency}.
\QuickQuizSeries{%
\QuickQuizB{
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
}\QuickQuizAnswerB{
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple
concurrent operations at each end?
If so, how?
If not, why not?
}\QuickQuizEndB
%
\QuickQuizE{
Is there a significantly better way of handling concurrency
for double-ended queues?
}\QuickQuizAnswerE{
One approach is to transform the problem to be solved
so that multiple double-ended queues can be used in parallel,
allowing the simpler single-lock double-ended queue to be used,
and perhaps also replace each double-ended queue with a pair of
conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can achieve very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the dequeue
simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first
in no way implies that it will process that item
first~\cite{AndreasHaas2012FIFOisnt}.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide these
guarantees.
% about twice as fast as hashed version on 4.2GHz Power.
Regardless of whether or not the problem can be transformed
to use multiple queues, it is worth asking whether work can
be batched so that each enqueue and dequeue operation corresponds
to larger units of work.
This batching approach decreases contention on the queue data
structures, which increases both performance and scalability,
as will be seen in
\cref{sec:SMPdesign:Synchronization Granularity}.
After all, if you must incur high synchronization overheads,
be sure you are getting your money's worth.
Other researchers are working on other ways to take advantage
of limited ordering guarantees in
queues~\cite{ChristophMKirsch2012FIFOisntTR}.
}\QuickQuizEndE
}
These two examples show just how powerful partitioning can be in
devising parallel algorithms.
\Cref{sec:SMPdesign:Locking Granularity and Performance}
looks briefly at a third example, matrix multiply.
However, all three of these examples beg for more and better design
criteria for parallel programs, a topic taken up in the next section.