| % beyond.tex |
| |
| \section{Beyond Partitioning} |
| \label{sec:SMPdesign:Beyond Partitioning} |
| \OriginallyPublished{Section}{sec:SMPdesign:Beyond Partitioning}{Retrofitted Parallelism Considered Grossly Sub-Optimal}{4\textsuperscript{th} USENIX Workshop on Hot Topics in Parallelism}{PaulEMcKenney2012HOTPARsuboptimal} |
| |
| This chapter has discussed how data partitioning can be used to design |
| simple linearly scalable parallel programs. |
| Section~\ref{sec:SMPdesign:Data Ownership} hinted at the possibilities |
| of data replication, which will be used to great effect in |
| Section~\ref{sec:defer:Read-Copy Update (RCU)}. |
| |
| The main goal of applying partitioning and replication is to achieve |
| linear speedups, in other words, to ensure that the total amount of |
| work required does not increase significantly as the number of CPUs |
| or threads increases. |
| A problem that can be solved via partitioning and/or replication, |
| resulting in linear speedups, is \emph{embarrassingly parallel}. |
| But can we do better? |
| |
| To answer this question, let us examine the solution of |
| labyrinths and mazes. |
| Of course, labyrinths and mazes have been objects of fascination for |
| millennia~\cite{WikipediaLabyrinth}, |
| so it should come as no surprise that they are generated and solved |
| using computers, including biological |
| computers~\cite{AndrewAdamatzky2011SlimeMold}, |
| GPGPUs~\cite{ChristerEricson2008GPUMaze}, and even |
| discrete hardware~\cite{MIT:TRMag:MemristorMazes}. |
| Parallel solution of mazes is sometimes used as a class project in |
| universities~\cite{ETHZurich:FS2011maze,UMD:CMSC433maze} and |
| as a vehicle to demonstrate the benefits of parallel-programming |
| frameworks~\cite{RonFosner2010maze}. |
| |
| Common advice is to use a parallel work-queue algorithm |
| (PWQ)~\cite{ETHZurich:FS2011maze,RonFosner2010maze}. |
| This section evaluates this advice by comparing PWQ |
| against a sequential algorithm (SEQ) and also against |
| an alternative parallel algorithm, in all cases solving randomly generated |
| square mazes. |
| Section~\ref{sec:SMPdesign:Work-Queue Parallel Maze Solver} discusses PWQ, |
| Section~\ref{sec:SMPdesign:Alternative Parallel Maze Solver} discusses an alternative |
| parallel algorithm, |
| Section~\ref{sec:SMPdesign:Performance Comparison I} analyzes the anomalous performance of this alternative, |
| Section~\ref{sec:SMPdesign:Alternative Sequential Maze Solver} derives an improved |
| sequential algorithm from the alternative parallel algorithm, |
| Section~\ref{sec:SMPdesign:Performance Comparison II} makes further performance |
| comparisons, |
| and finally |
| Section~\ref{sec:SMPdesign:Future Directions and Conclusions} |
| presents future directions and concluding remarks. |
| |
| \subsection{Work-Queue Parallel Maze Solver} |
| \label{sec:SMPdesign:Work-Queue Parallel Maze Solver} |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 int maze_solve(maze *mp, cell sc, cell ec) |
| 2 { |
| 3 cell c = sc; |
| 4 cell n; |
| 5 int vi = 0; |
| 6 |
| 7 maze_try_visit_cell(mp, c, c, &n, 1); |
| 8 for (;;) { |
| 9 while (!maze_find_any_next_cell(mp, c, &n)) { |
| 10 if (++vi >= mp->vi) |
| 11 return 0; |
| 12 c = mp->visited[vi].c; |
| 13 } |
| 14 do { |
| 15 if (n == ec) { |
| 16 return 1; |
| 17 } |
| 18 c = n; |
| 19 } while (maze_find_any_next_cell(mp, c, &n)); |
| 20 c = mp->visited[vi].c; |
| 21 } |
| 22 } |
| \end{verbbox} |
| } |
| \begin{figure}[tbp] |
| \centering |
| \theverbbox |
| \caption{SEQ Pseudocode} |
| \label{fig:SMPdesign:SEQ Pseudocode} |
| \end{figure} |
| |
| PWQ is based on SEQ, which is shown in |
| Figure~\ref{fig:SMPdesign:SEQ Pseudocode} |
| (\path{maze_seq.c}). |
| The maze is represented by a 2D array of cells and |
| a linear-array-based work queue named \co{->visited}. |
| |
| Line~7 visits the initial cell, and each iteration of the loop spanning |
| lines~8-21 traverses passages headed by one cell. |
| The loop spanning lines~9-13 scans the \co{->visited[]} array for a |
| visited cell with an unvisited neighbor, and the loop spanning |
| lines~14-19 traverses one fork of the submaze headed by that neighbor. |
| Line~20 initializes for the next pass through the outer loop. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 int maze_try_visit_cell(struct maze *mp, cell c, cell t, |
| 2 cell *n, int d) |
| 3 { |
| 4 if (!maze_cells_connected(mp, c, t) || |
| 5 (*celladdr(mp, t) & VISITED)) |
| 6 return 0; |
| 7 *n = t; |
| 8 mp->visited[mp->vi] = t; |
| 9 mp->vi++; |
| 10 *celladdr(mp, t) |= VISITED | d; |
| 11 return 1; |
| 12 } |
| 13 |
| 14 int maze_find_any_next_cell(struct maze *mp, cell c, |
| 15 cell *n) |
| 16 { |
| 17 int d = (*celladdr(mp, c) & DISTANCE) + 1; |
| 18 |
| 19 if (maze_try_visit_cell(mp, c, prevcol(c), n, d)) |
| 20 return 1; |
| 21 if (maze_try_visit_cell(mp, c, nextcol(c), n, d)) |
| 22 return 1; |
| 23 if (maze_try_visit_cell(mp, c, prevrow(c), n, d)) |
| 24 return 1; |
| 25 if (maze_try_visit_cell(mp, c, nextrow(c), n, d)) |
| 26 return 1; |
| 27 return 0; |
| 28 } |
| \end{verbbox} |
| } |
| \begin{figure}[tbp] |
| \centering |
| \theverbbox |
| \caption{SEQ Helper Pseudocode} |
| \label{fig:SMPdesign:SEQ Helper Pseudocode} |
| \end{figure} |
| |
| The pseudocode for \co{maze_try_visit_cell()} is shown on lines~1-12 |
| of Figure~\ref{fig:SMPdesign:SEQ Helper Pseudocode} |
| (\path{maze.c}). |
| Line~4 checks to see if cells \co{c} and \co{t} are adjacent and connected, |
| while line~5 checks to see if cell \co{t} has not yet been visited. |
| The \co{celladdr()} function returns the address of the specified cell. |
| If either check fails, line~6 returns failure. |
| Line~7 indicates the next cell, line~8 records this cell in the next |
| slot of the \co{->visited[]} array, line~9 indicates that this slot |
| is now full, and line~10 marks this cell as visited and also records |
| the distance from the maze start. Line~11 then returns success. |
| |
| The pseudocode for \co{maze_find_any_next_cell()} is shown on lines~14-28 |
| of Figure~\ref{fig:SMPdesign:SEQ Helper Pseudocode} |
| (\path{maze.c}). |
| Line~17 picks up the current cell's distance plus 1, |
| while lines~19, 21, 23, and~25 |
| check the cell in each direction, and lines~20, 22, 24, and~26 |
| return true if the corresponding cell is a candidate next cell. |
| The \co{prevcol()}, \co{nextcol()}, \co{prevrow()}, and \co{nextrow()} |
| functions each perform the specified array-index-conversion operation. |
| If none of the cells is a candidate, line~27 returns false. |
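As a concrete illustration of these helpers, the sketch below assumes that a cell is simply a row-major index into the cell array; the function names match the figure, but the encoding, the \co{-1} out-of-bounds sentinel, and the maze dimensions are assumptions rather than the actual \path{maze.c} implementation.

```c
#include <assert.h>

/* Hypothetical sketch: a cell is a row-major index into an
 * NROWS x NCOLS array.  The encoding, the -1 out-of-bounds
 * sentinel, and the dimensions are assumptions, not maze.c's. */
#define NCOLS 500
#define NROWS 500

static int prevcol(int c) { return c % NCOLS == 0 ? -1 : c - 1; }
static int nextcol(int c) { return c % NCOLS == NCOLS - 1 ? -1 : c + 1; }
static int prevrow(int c) { return c < NCOLS ? -1 : c - NCOLS; }
static int nextrow(int c)
{
	return c >= (NROWS - 1) * NCOLS ? -1 : c + NCOLS;
}
```

In SEQ itself, wall and adjacency checks are handled by \co{maze_cells_connected()}, so the real helpers may well be pure index arithmetic.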
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{1.2in}{!}{\includegraphics{SMPdesign/MazeNumberPath}} |
| \caption{Cell-Number Solution Tracking} |
| \label{fig:SMPdesign:Cell-Number Solution Tracking} |
| \end{figure} |
| |
| The path is recorded in the maze by counting the number of cells from |
| the starting point, as shown in |
| Figure~\ref{fig:SMPdesign:Cell-Number Solution Tracking}, |
| where the starting cell is in the upper left and the ending cell is |
| in the lower right. |
| Starting at the ending cell and following |
| consecutively decreasing cell numbers traverses the solution. |
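This traversal can be sketched as follows. The \co{walk_back()} helper, the \co{DISTANCE} mask, and the tiny 1-by-3 "maze" are illustrative assumptions, and only horizontal neighbors are checked for brevity; the real solver also records direction bits, as seen in \co{maze_try_visit_cell()}.

```c
#include <assert.h>

/* Sketch of solution extraction: starting from the ending cell,
 * repeatedly step to a neighbor whose recorded distance is exactly
 * one less, until reaching the starting cell (distance 1). */
#define NCOLS 3
#define DISTANCE 0xff

static int walk_back(const int *dist, int c, int *path, int maxlen)
{
	int len = 0;

	path[len++] = c;
	while ((dist[c] & DISTANCE) > 1 && len < maxlen) {
		int d = dist[c] & DISTANCE;

		if (c % NCOLS != 0 && (dist[c - 1] & DISTANCE) == d - 1)
			c--;
		else if (c % NCOLS != NCOLS - 1 &&
			 (dist[c + 1] & DISTANCE) == d - 1)
			c++;
		else
			break;	/* no predecessor: malformed maze */
		path[len++] = c;
	}
	return len;	/* number of cells on the path, end to start */
}
```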
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seq_fg-cdf}} |
| \caption{CDF of Solution Times For SEQ and PWQ} |
| \label{fig:SMPdesign:CDF of Solution Times For SEQ and PWQ} |
| \end{figure} |
| |
| The parallel work-queue solver is a straightforward parallelization |
| of the algorithm shown in |
| Figures~\ref{fig:SMPdesign:SEQ Pseudocode} and~\ref{fig:SMPdesign:SEQ Helper Pseudocode}. |
| Line~10 of Figure~\ref{fig:SMPdesign:SEQ Pseudocode} must use fetch-and-add, |
| and the local variable \co{vi} must be shared among the various threads. |
| Lines~5 and~10 of Figure~\ref{fig:SMPdesign:SEQ Helper Pseudocode} must be |
| combined into a CAS loop, with CAS failure indicating a loop in the |
| maze. |
| Lines~8-9 of this figure must use fetch-and-add to arbitrate concurrent |
| attempts to record cells in the \co{->visited[]} array. |
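These adaptations can be sketched with C11 atomics as shown below. The \co{VISITED} bit layout, the stand-in globals, and \co{try_visit()} are illustrative assumptions, not the actual PWQ code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of the PWQ adaptations: the shared visit index advances via
 * fetch-and-add, and the visited check and mark are fused into one
 * CAS loop (CAS failure indicating a racing visit, that is, a loop
 * in the maze).  Names and bit layout are assumptions. */
#define VISITED 0x80000000u

static _Atomic unsigned int cellval;	/* stand-in for one maze cell */
static atomic_int shared_vi;		/* stand-in for shared ->vi   */

/* Returns the reserved ->visited[] slot, or -1 if already visited. */
static int try_visit(unsigned int d)
{
	unsigned int old = atomic_load(&cellval);

	do {
		if (old & VISITED)
			return -1;
	} while (!atomic_compare_exchange_weak(&cellval, &old,
					       old | VISITED | d));
	return atomic_fetch_add(&shared_vi, 1);
}
```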
| |
| This approach does provide significant speedups on a dual-CPU |
| Lenovo\textsuperscript\texttrademark W500 |
| running at 2.53GHz, as shown in |
| Figure~\ref{fig:SMPdesign:CDF of Solution Times For SEQ and PWQ}, |
| which shows the cumulative distribution functions (CDFs) for the solution |
| times of the two algorithms, based on the solution of 500 different square |
| 500-by-500 randomly generated mazes. |
| The substantial overlap |
| of the projection of the CDFs onto the x-axis will be addressed in |
| Section~\ref{sec:SMPdesign:Performance Comparison I}. |
| |
| Interestingly enough, the sequential solution-path tracking works unchanged |
| for the parallel algorithm. |
| However, this uncovers a significant weakness in the parallel algorithm: |
| At most one thread may be making progress along the solution path at |
| any given time. |
| This weakness is addressed in the next section. |
| |
| \subsection{Alternative Parallel Maze Solver} |
| \label{sec:SMPdesign:Alternative Parallel Maze Solver} |
| |
| Youthful maze solvers are often urged to start at both ends, and |
| this advice has been repeated more recently in the context of automated |
| maze solving~\cite{UMD:CMSC433maze}. |
| This advice amounts to partitioning, which has been a powerful |
| parallelization strategy |
| in the context of parallel programming for both operating-system |
| kernels~\cite{Beck85,Inman85} and |
| applications~\cite{DavidAPatterson2010TroubleMulticore}. |
| This section applies this strategy, using two child threads that start |
| at opposite ends of the solution path, and takes a brief look at the |
| performance and scalability consequences. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 int maze_solve_child(maze *mp, cell *visited, cell sc) |
| 2 { |
| 3 cell c; |
| 4 cell n; |
| 5 int vi = 0; |
| 6 |
| 7 myvisited = visited; myvi = &vi; |
| 8 c = visited[vi]; |
| 9 do { |
| 10 while (!maze_find_any_next_cell(mp, c, &n)) { |
| 11 if (visited[++vi].row < 0) |
| 12 return 0; |
| 13 if (ACCESS_ONCE(mp->done)) |
| 14 return 1; |
| 15 c = visited[vi]; |
| 16 } |
| 17 do { |
| 18 if (ACCESS_ONCE(mp->done)) |
| 19 return 1; |
| 20 c = n; |
| 21 } while (maze_find_any_next_cell(mp, c, &n)); |
| 22 c = visited[vi]; |
| 23 } while (!ACCESS_ONCE(mp->done)); |
| 24 return 1; |
| 25 } |
| \end{verbbox} |
| } |
| \begin{figure}[tbp] |
| \centering |
| \theverbbox |
| \caption{Partitioned Parallel Solver Pseudocode} |
| \label{fig:SMPdesign:Partitioned Parallel Solver Pseudocode} |
| \end{figure} |
| |
| The partitioned parallel algorithm (PART), shown in |
| Figure~\ref{fig:SMPdesign:Partitioned Parallel Solver Pseudocode} |
| (\path{maze_part.c}), |
| is similar to SEQ, but has a few important differences. |
| First, each child thread has its own \co{visited} array, passed in by |
| the parent as shown on line~1, |
| which must be initialized to all [-1,-1]. |
| Line~7 stores a pointer to this array into the per-thread variable |
| \co{myvisited} to allow access by helper functions, and similarly stores |
| a pointer to the local visit index. |
| Second, the parent visits the first cell on each child's behalf, |
| which the child retrieves on line~8. |
| Third, the maze is solved as soon as one child locates a cell that has |
| been visited by the other child. |
| When \co{maze_try_visit_cell()} detects this, |
| it sets a \co{->done} field in the maze structure. |
| Fourth, each child must therefore periodically check the \co{->done} |
| field, as shown on lines~13, 18, and~23. |
| The \co{ACCESS_ONCE()} primitive must disable any compiler |
| optimizations that might combine consecutive loads or that |
| might reload the value. |
| A C++11 volatile relaxed load suffices~\cite{PeteBecker2011N3242}. |
| Finally, the \co{maze_find_any_next_cell()} function must use |
| compare-and-swap to mark a cell as visited, however |
| no constraints on ordering are required beyond those provided by |
| thread creation and join. |
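For reference, the classic Linux-kernel-style definition of \co{ACCESS_ONCE()} is sketched below, together with the equivalent C11 relaxed atomic load; the surrounding demo code is illustrative only. Note that \co{__typeof__} is a GCC extension.

```c
#include <assert.h>
#include <stdatomic.h>

/* The volatile cast forbids the compiler from fusing, repeating, or
 * hoisting the load.  __typeof__ is a GCC extension; the C11 atomic
 * load below is the portable spelling of the same guarantee. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int done;		/* stand-in for the plain ->done flag */
static _Atomic int done_c11;	/* C11-atomic alternative */

static int check_done(void)
{
	return ACCESS_ONCE(done) ||
	       atomic_load_explicit(&done_c11, memory_order_relaxed);
}

static int demo(void)
{
	int before = check_done();

	done = 1;	/* in PART, the other thread would set this */
	return !before && check_done();
}
```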
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 int maze_try_visit_cell(struct maze *mp, int c, int t, |
| 2 int *n, int d) |
| 3 { |
| 4 cell_t ct; |
| 5 cell_t *tp; |
| 6 int vi; |
| 7 |
| 8 if (!maze_cells_connected(mp, c, t)) |
| 9 return 0; |
| 10 tp = celladdr(mp, t); |
| 11 do { |
| 12 ct = ACCESS_ONCE(*tp); |
| 13 if (ct & VISITED) { |
| 14 if ((ct & TID) != mytid) |
| 15 mp->done = 1; |
| 16 return 0; |
| 17 } |
| 18 } while (!CAS(tp, ct, ct | VISITED | mytid | d)); |
| 19 *n = t; |
| 20 vi = (*myvi)++; |
| 21 myvisited[vi] = t; |
| 22 return 1; |
| 23 } |
| \end{verbbox} |
| } |
| \begin{figure}[tbp] |
| \centering |
| \theverbbox |
| \caption{Partitioned Parallel Helper Pseudocode} |
| \label{fig:SMPdesign:Partitioned Parallel Helper Pseudocode} |
| \end{figure} |
| |
| The pseudocode for \co{maze_find_any_next_cell()} is identical to that shown in |
| Figure~\ref{fig:SMPdesign:SEQ Helper Pseudocode}, |
| but the pseudocode for \co{maze_try_visit_cell()} differs, and |
| is shown in |
| Figure~\ref{fig:SMPdesign:Partitioned Parallel Helper Pseudocode}. |
| Lines~8-9 check to see if the cells are connected, returning failure |
| if not. |
| The loop spanning lines~11-18 attempts to mark the new cell visited. |
| Line~13 checks to see if it has already been visited, in which case |
| line~16 returns failure, but only after line~14 checks to see if |
| we have encountered the other thread, in which case line~15 indicates |
| that the solution has been located. |
| Line~19 updates to the new cell, lines~20 and~21 update this thread's visited |
| array, and line~22 returns success. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seq_fg_part-cdf}} |
| \caption{CDF of Solution Times For SEQ, PWQ, and PART} |
| \label{fig:SMPdesign:CDF of Solution Times For SEQ, PWQ, and PART} |
| \end{figure} |
| |
| Performance testing revealed a surprising anomaly, shown in |
| Figure~\ref{fig:SMPdesign:CDF of Solution Times For SEQ, PWQ, and PART}. |
| The median solution time for PART (17 milliseconds) is more than |
| four times smaller than that of SEQ (79 milliseconds), |
| despite PART running on only two threads. |
| The next section analyzes this anomaly. |
| |
| \subsection{Performance Comparison I} |
| \label{sec:SMPdesign:Performance Comparison I} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seqVfg_part-cdf}} |
| \caption{CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios} |
| \label{fig:SMPdesign:CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios} |
| \end{figure} |
| |
| The first reaction to a performance anomaly is to check for bugs. |
| Although the algorithms were in fact finding valid solutions, the |
| plot of CDFs in |
| Figure~\ref{fig:SMPdesign:CDF of Solution Times For SEQ, PWQ, and PART} |
| assumes independent data points. |
| This is not the case: The performance tests randomly generate a maze, |
| and then run all solvers on that maze. |
| It therefore makes sense to plot the CDF of the ratios of |
| solution times for each generated maze, |
| as shown in |
| Figure~\ref{fig:SMPdesign:CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios}, |
| greatly reducing the CDFs' overlap. |
| This plot reveals that for some mazes, PART |
| is more than \emph{forty} times faster than SEQ. |
| In contrast, PWQ is never more than about |
| two times faster than SEQ. |
| A forty-times speedup on two threads demands explanation. |
| After all, this is not merely embarrassingly parallel, where partitionability |
| means that adding threads does not increase the overall computational cost. |
| It is instead \emph{humiliatingly parallel}: Adding threads |
| significantly reduces the overall computational cost, resulting in |
| large algorithmic superlinear speedups. |
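The paired analysis described above can be sketched as follows. Because every solver runs on the same generated maze, the per-maze ratio of solution times is computed, sorted, and read off as an empirical CDF; the helper names, array cap, and sample data are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define MAXRUNS 64	/* illustrative cap on the number of mazes */

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Fraction of per-maze speedup ratios at or below "threshold". */
static double ratio_cdf_at(const double *seq, const double *part,
			   int n, double threshold)
{
	double ratio[MAXRUNS];
	int i, count = 0;

	for (i = 0; i < n && i < MAXRUNS; i++)
		ratio[i] = seq[i] / part[i];	/* paired, per maze */
	qsort(ratio, n, sizeof(ratio[0]), cmp_double);
	while (count < n && ratio[count] <= threshold)
		count++;
	return (double)count / n;
}
```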
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{1.4in}{!}{\includegraphics{SMPdesign/maze_in_way10a}} |
| \caption{Reason for Small Visit Percentages} |
| \label{fig:SMPdesign:Reason for Small Visit Percentages} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-pctVms_seq_part-sct}} |
| \caption{Correlation Between Visit Percentage and Solution Time} |
| \label{fig:SMPdesign:Correlation Between Visit Percentage and Solution Time} |
| \end{figure} |
| |
| Further investigation showed that |
| PART sometimes visited fewer than 2\% of the maze's cells, |
| while SEQ and PWQ never visited fewer than about 9\%. |
| The reason for this difference is shown by |
| Figure~\ref{fig:SMPdesign:Reason for Small Visit Percentages}. |
| If the thread traversing the solution from the upper left reaches |
| the circle, the other thread cannot reach |
| the upper-right portion of the maze. |
| Similarly, if the other thread reaches the square, |
| the first thread cannot reach the lower-left |
| portion of the maze. |
| Therefore, PART will likely visit a small fraction |
| of the non-solution-path cells. |
| In short, the superlinear speedups are due to threads getting in each |
| other's way. |
| This is in sharp contrast with decades of experience with |
| parallel programming, where practitioners have struggled |
| to keep threads \emph{out} of each other's way. |
| |
| Figure~\ref{fig:SMPdesign:Correlation Between Visit Percentage and Solution Time} |
| confirms a strong correlation between cells visited and solution time |
| for all three methods. |
| The slope of PART's scatterplot is smaller than that of SEQ, |
| indicating that PART's pair of threads visits a given fraction |
| of the maze faster than can SEQ's single thread. |
| PART's scatterplot is also weighted toward small visit |
| percentages, confirming that PART does less total work, hence |
| the observed humiliating parallelism. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{1.4in}{!}{\includegraphics{SMPdesign/maze_PWQ_vs_PART}} |
| \caption{PWQ Potential Contention Points} |
| \label{fig:SMPdesign:PWQ Potential Contention Points} |
| \end{figure} |
| |
| The fraction of cells visited by PWQ is similar to that of SEQ. |
| In addition, PWQ's solution time is greater than that of PART, |
| even for equal visit fractions. |
| The reason for this is shown in |
| Figure~\ref{fig:SMPdesign:PWQ Potential Contention Points}, which has a red |
| circle on each cell with more than two neighbors. |
| Each such cell can result in contention in PWQ, because |
| one thread can enter but two threads can exit, which hurts |
| performance, as noted earlier in this chapter. |
| In contrast, PART can incur such contention but once, namely |
| when the solution is located. |
| Of course, SEQ never contends. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seqVfg_part_seqO3-cdf}} |
| \caption{Effect of Compiler Optimization (-O3)} |
| \label{fig:SMPdesign:Effect of Compiler Optimization (-O3)} |
| \end{figure} |
| |
| Although PART's speedup is impressive, we should not neglect sequential |
| optimizations. |
| Figure~\ref{fig:SMPdesign:Effect of Compiler Optimization (-O3)} shows that |
| SEQ, when compiled with -O3, is about twice as fast |
| as unoptimized PWQ, approaching the performance of unoptimized PART. |
| Compiling all three algorithms with -O3 gives results similar to |
| (albeit faster than) those shown in |
| Figure~\ref{fig:SMPdesign:CDF of SEQ/PWQ and SEQ/PART Solution-Time Ratios}, |
| except that PWQ provides almost no speedup compared |
| to SEQ, in keeping with Amdahl's Law~\cite{GeneAmdahl1967AmdahlsLaw}. |
| However, if the goal is to double performance compared to unoptimized |
| SEQ, as opposed to achieving optimality, compiler |
| optimizations are quite attractive. |
| |
| Cache alignment and padding often improve performance by reducing |
| false sharing. |
| However, for these maze-solution algorithms, aligning and padding the |
| maze-cell array \emph{degrades} performance by up to 42\% for 1000x1000 mazes. |
| Cache locality is more important than avoiding |
| false sharing, especially for large mazes. |
| For smaller 20-by-20 or 50-by-50 mazes, aligning and padding can produce |
| up to a 40\% performance improvement for PART, |
| but for these small sizes, SEQ performs better anyway because there |
| is insufficient time for PART to make up for the overhead of |
| thread creation and destruction. |
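What "aligning and padding" means here can be sketched as follows; the 64-byte cache-line size is an assumption about the test hardware, and the struct names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Packed cells maximize cache locality: sixteen 4-byte cells share
 * one 64-byte line.  Padded cells eliminate false sharing at the
 * cost of a 16x loss of locality. */
struct cell_packed {
	uint32_t val;
};

struct cell_padded {
	_Alignas(64) uint32_t val;	/* C11 alignment specifier */
};
```

For large mazes, the 16x locality loss of the padded layout dominates, consistent with the up-to-42\% degradation noted above.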
| |
| In short, the partitioned parallel maze solver is an interesting example |
| of an algorithmic superlinear speedup. |
| If ``algorithmic superlinear speedup'' causes cognitive dissonance, |
| please proceed to the next section. |
| |
| \subsection{Alternative Sequential Maze Solver} |
| \label{sec:SMPdesign:Alternative Sequential Maze Solver} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seqO3V2seqO3_fgO3_partO3-cdf}} |
| \caption{Partitioned Coroutines} |
| \label{fig:SMPdesign:Partitioned Coroutines} |
| \end{figure} |
| |
| The presence of algorithmic superlinear speedups suggests simulating |
| parallelism via coroutines, for example, manually switching context |
| between threads on each pass through the main do-while loop in |
| Figure~\ref{fig:SMPdesign:Partitioned Parallel Solver Pseudocode}. |
| This context switching is straightforward because the context |
| consists only of the variables \co{c} and \co{vi}: Of the numerous |
| ways to achieve the effect, this is a good tradeoff between |
| context-switch overhead and visit percentage. |
| As can be seen in |
| Figure~\ref{fig:SMPdesign:Partitioned Coroutines}, |
| this coroutine algorithm (COPART) is quite effective, with the performance |
| on one thread being within about 30\% of PART on two threads |
| (\path{maze_2seq.c}). |
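The flavor of this coroutine transformation can be sketched as below: a single thread advances two solver contexts alternately, each context being nothing but a cursor. The array scan and meeting condition are stand-ins for the real cell grid and are not COPART's actual code.

```c
#include <assert.h>

/* Two "coroutine" contexts scanning toward each other from opposite
 * ends of an n-element line, standing in for the two PART children
 * meeting in the maze.  A single thread alternates between them. */
struct ctx {
	int c;		/* cursor, analogous to PART's c  */
	int step;	/* direction of travel            */
};

static int copart_demo(int n)
{
	struct ctx ctx[2] = { { 0, 1 }, { n - 1, -1 } };
	int done = 0, passes = 0, which = 0;

	while (!done) {
		struct ctx *cur = &ctx[which];

		cur->c += cur->step;		/* one pass of "work"     */
		if (ctx[0].c >= ctx[1].c)	/* contexts met: "solved" */
			done = 1;
		which = !which;			/* manual context switch  */
		passes++;
	}
	return passes;	/* total passes through the main loop */
}
```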
| |
| \subsection{Performance Comparison II} |
| \label{sec:SMPdesign:Performance Comparison II} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_seqO3VfgO3_partO3-median}} |
| \caption{Varying Maze Size vs. SEQ} |
| \label{fig:SMPdesign:Varying Maze Size vs. SEQ} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/500-ms_2seqO3VfgO3_partO3-median}} |
| \caption{Varying Maze Size vs. COPART} |
| \label{fig:SMPdesign:Varying Maze Size vs. COPART} |
| \end{figure} |
| |
| Figures~\ref{fig:SMPdesign:Varying Maze Size vs. SEQ} |
| and~\ref{fig:SMPdesign:Varying Maze Size vs. COPART} |
| show the effects of varying maze size, comparing both PWQ and PART |
| running on two threads |
| against either SEQ or COPART, respectively, with 90\%-confidence |
| error bars. |
| PART shows superlinear scalability against SEQ and modest scalability |
| against COPART for 100-by-100 and larger mazes. |
| PART exceeds theoretical energy-efficiency breakeven against COPART at roughly |
| the 200-by-200 maze size, given that power consumption rises as roughly |
| the square of the frequency for high frequencies~\cite{TrevorMudge2000Power}, |
| so that 1.4x scaling on two threads consumes the same energy |
| as a single thread at equal solution speeds. |
| In contrast, PWQ shows poor scalability against both SEQ and COPART |
| unless unoptimized: Figures~\ref{fig:SMPdesign:Varying Maze Size vs. SEQ} |
| and~\ref{fig:SMPdesign:Varying Maze Size vs. COPART} |
| were generated using -O3. |
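The 1.4x breakeven figure follows from the stated power model.
With power scaling as the square of frequency, two threads can be
clocked at $f/\sqrt{2}$ while consuming the same total power as one
thread at $f$:
\begin{displaymath}
  2 \left( \frac{f}{\sqrt{2}} \right)^{2} = f^{2} .
\end{displaymath}
Each thread then runs at $1/\sqrt{2} \approx 0.71$ of the single-thread
speed, so a scalability of $\sqrt{2} \approx 1.4$ on two threads
restores the single-thread solution speed at equal energy consumption.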
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{SMPdesign/1000-ms_2seqO3VfgO3_partO3-mean}} |
| \caption{Mean Speedup vs. Number of Threads, 1000x1000 Maze} |
| \label{fig:SMPdesign:Mean Speedup vs. Number of Threads, 1000x1000 Maze} |
| \end{figure} |
| |
| Figure~\ref{fig:SMPdesign:Mean Speedup vs. Number of Threads, 1000x1000 Maze} |
| shows the performance of PWQ and PART relative to COPART. |
| For PART runs with more than two threads, the additional threads were |
| started evenly spaced along the diagonal connecting the starting and |
| ending cells. |
| Simplified link-state routing~\cite{BERT-87} was used to |
| detect early termination on PART runs with more than two threads |
| (the solution is flagged when |
| a thread is connected to both beginning and end). |
| PWQ performs quite poorly, but |
| PART hits breakeven at two threads and again at five threads, achieving |
| modest speedups beyond five threads. |
| Theoretical energy efficiency breakeven is within the 90\% confidence |
| interval for seven and eight threads. |
| The reasons for the peak at two threads are (1) the lower complexity |
| of termination detection in the two-thread case and (2) the fact that |
| there is a lower probability of the third and subsequent threads making |
| useful forward progress: Only the first two threads are guaranteed to start on |
| the solution line. |
| This disappointing performance compared to results in |
| Figure~\ref{fig:SMPdesign:Varying Maze Size vs. COPART} |
| is due to the less-tightly integrated hardware available in the |
| larger and older Xeon\textsuperscript\textregistered |
| system running at 2.66GHz. |
| |
| \subsection{Future Directions and Conclusions} |
| \label{sec:SMPdesign:Future Directions and Conclusions} |
| |
| Much future work remains. |
| First, this section applied only one technique used by human maze solvers. |
| Others include following walls to exclude portions of the maze |
| and choosing internal starting points based on the |
| locations of previously traversed paths. |
| Second, different choices of |
| starting and ending points might favor different algorithms. |
| Third, although placement of the PART algorithm's |
| first two threads is straightforward, there are any number of |
| placement schemes for the remaining threads. |
| Optimal placement might well depend on the starting and ending points. |
| Fourth, study of unsolvable mazes and cyclic mazes |
| is likely to produce interesting results. |
| Fifth, the lightweight C++11 atomic operations might improve performance. |
| Sixth, it would be interesting to compare the speedups for |
| three-dimensional mazes (or of even higher-order mazes). |
| Finally, for mazes, humiliating parallelism indicated a |
| more-efficient sequential implementation using coroutines. |
| Do humiliatingly parallel algorithms always lead to more-efficient |
| sequential implementations, or are there inherently humiliatingly parallel |
| algorithms for which coroutine context-switch overhead overwhelms the |
| speedups? |
| |
| This section demonstrated and analyzed parallelization of maze-solution |
| algorithms. |
| A conventional work-queue-based algorithm did well only when compiler |
| optimizations were disabled, suggesting that some prior results obtained |
| using high-level/overhead languages will be invalidated |
| by advances in optimization. |
| |
| This section gave a clear example where approaching parallelism |
| as a first-class optimization technique rather than as a derivative of a |
| sequential algorithm paves the way for an improved sequential algorithm. |
| High-level design-time application of parallelism is likely to be a |
| fruitful field of study. |
| This section took the problem of solving mazes from mildly scalable |
| to humiliatingly parallel and back again. |
| It is hoped that this experience will motivate work on parallelism |
| as a first-class design-time whole-application optimization technique, |
| rather than as a grossly suboptimal after-the-fact micro-optimization |
| to be retrofitted into existing programs. |
| |
| \section{Partitioning, Parallelism, and Optimization} |
| \label{sec:SMPdesign:Partitioning, Parallelism, and Optimization} |
| |
| Most important, although this chapter has demonstrated that applying |
| parallelism at the design level gives excellent results, this final |
| section has shown that design-level parallelism alone is not enough. |
| For search problems such as maze solution, this section has shown that |
| search strategy is even more important than parallel design. |
| Yes, for this particular type of maze, intelligently applying parallelism |
| identified a superior search strategy, but this sort of luck is no |
| substitute for a clear focus on search strategy itself. |
| |
| As noted back in Section~\ref{sec:intro:Parallel Programming Goals}, |
| parallelism is but one potential optimization of many. |
| A successful design needs to focus on the most important optimization. |
| Much though I might wish to claim otherwise, that optimization might |
| or might not be parallelism. |
| |
| However, for the many cases where parallelism is the right optimization, |
| the next chapter covers that synchronization workhorse, locking. |