% intro/intro.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Introduction}{Introduction}{qqzintro}
%
\Epigraph{If parallel programming is so hard, why are there so many
parallel programs?}{Unknown}
Parallel programming has earned a reputation as one of the most
difficult areas a hacker can tackle.
Papers and textbooks warn of the perils of \IX{deadlock}, \IX{livelock},
\IXpl{race condition}, non-determinism,
\IXaltr{Amdahl's-Law}{Amdahl's Law} limits to scaling,
and excessive realtime latencies.
And these perils are quite real; we authors have accumulated uncounted
% 2020:
% 30 for Paul E. McKenney
years of experience along with the resulting emotional scars,
grey hairs, and hair loss.
However, new technologies that are difficult to use at introduction
invariably become easier over time.
For example, the once-rare ability to drive a car is now
commonplace in many countries.
This dramatic change came about for two basic reasons:
\begin{enumerate*}[(1)]
\item Cars became cheaper and more readily available, so that
more people had the opportunity to learn to drive, and
\item Cars became easier to operate due to automatic transmissions,
automatic chokes, automatic starters, greatly improved reliability,
and a host of other technological improvements.
\end{enumerate*}
The same is true for many other technologies, including computers.
It is no longer necessary to operate a keypunch in order to program.
Spreadsheets allow most non-programmers to get results from their computers
that would have required a team of specialists a few decades ago.
Perhaps the most compelling example is web-surfing and content creation,
which since the early 2000s has been easily done by
untrained, uneducated people using various now-commonplace
social-networking tools.
As recently as 1968, such content creation was a far-out research
project~\cite{DouglasEngelbart1968}, described at
the time as
``like a UFO landing on the White House lawn''~\cite{ScottGriffen2000}.
% http://www.ibiblio.org/pioneers/engelbart.html
% http://www.histech.rwth-aachen.de/www/quellen/engelbart/ahi62index.html
Therefore, if you wish to argue that parallel programming will remain
as difficult as it is currently perceived by many to be, it is you
who bears the burden of proof, keeping in mind the many centuries of
counter-examples in many fields of endeavor.
\section{Historic Parallel Programming Difficulties}
\label{sec:intro:Historic Parallel Programming Difficulties}
%
\epigraph{Not the power to remember, but its very opposite, the power to
forget, is a necessary condition for our existence.}
{Sholem Asch}
As indicated by its title, this book takes a different approach.
Rather than complain about the difficulty of parallel programming,
it instead examines the reasons why parallel programming is
difficult, and then works to help the reader to overcome these
difficulties.
As will be seen, these difficulties have historically fallen into several
categories, including:
\begin{enumerate}
\item The historic high cost and relative rarity of parallel systems.
\item The typical researcher's and practitioner's lack of experience
with parallel systems.
\item The paucity of publicly accessible parallel code.
\item The lack of a widely understood engineering discipline of
parallel programming.
\item The high \IX{overhead} of communication relative to that of processing,
even in tightly coupled shared-memory computers.
\end{enumerate}
Many of these historic difficulties are well on the way to being overcome.
First, over the past few decades, the cost of parallel systems
has decreased from many multiples of that of a house to that of a
modest meal, courtesy of \IXr{Moore's Law}~\cite{GordonMoore1965MooresLaw}.
Papers calling out the advantages of multicore CPUs were published
as early as 1996~\cite{Olukotun96}.
IBM introduced simultaneous multi-threading
into its high-end \Power{} family in 2000, and multicore in 2001.
Intel introduced hyperthreading into its commodity Pentium line in
November 2000, and both AMD and Intel introduced
dual-core CPUs in 2005.
Sun followed with the multicore/multi-threaded Niagara in late 2005.
In fact, by 2008, it was becoming difficult
to find a single-CPU desktop system, with single-core CPUs being
relegated to netbooks and embedded devices.
By 2012, even smartphones were starting to sport multiple CPUs.
By 2020, safety-critical software standards started addressing
concurrency.
Second, the advent of low-cost and readily available multicore systems
means that the once-rare experience of parallel programming is
now available to almost all researchers and practitioners.
In fact, parallel systems have long been within the budget of students
and hobbyists.
We can therefore expect greatly increased levels of invention and
innovation surrounding parallel systems, and that increased familiarity
will over time make the once prohibitively expensive field of parallel
programming much more friendly and commonplace.
Third, in the 20\textsuperscript{th} century, large systems of
highly parallel software were almost always closely guarded proprietary
secrets.
In happy contrast, the 21\textsuperscript{st} century has seen numerous
open-source (and thus publicly available) parallel software projects,
including the Linux kernel~\cite{Torvalds2.6kernel},
database systems~\cite{PostgreSQL2008,MySQL2008},
and message-passing systems~\cite{OpenMPI2008,BOINC2008}.
This book will draw primarily from the Linux kernel, but will
provide much material suitable for user-level applications.
Fourth, even though the large-scale parallel\-/programming projects of
the 1980s and 1990s were almost all proprietary, these
projects have seeded other communities with cadres of developers who
understand the engineering discipline required to develop production-quality
parallel code.
A major purpose of this book is to present this engineering discipline.
Unfortunately, the fifth difficulty, the high cost of communication
relative to that of processing, remains largely in force.
This difficulty has been receiving increasing attention during
the new millennium.
However, according to \ppl{Stephen}{Hawking},
the finite speed of light and the atomic
nature of matter will limit progress in this
area~\cite{BryanGardiner2007,GordonMoore03a}.
Fortunately, this difficulty has been in force since the late 1980s,
so that the aforementioned engineering discipline has evolved practical
and effective strategies for handling it.
In addition, hardware designers are increasingly aware of these issues,
so perhaps future hardware will be more friendly to parallel software,
as discussed in \cref{sec:cpu:Hardware Free Lunch?}.
\QuickQuiz{
Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
}\QuickQuizAnswer{
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming was really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be:
``Why is parallel programming \emph{perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000---in 1991 US dollars.
In contrast, in 2020, Paul finds himself typing these words on a
six-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
64\,GB of main memory, a 1\,TB solid-state disk, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
Parallel programming is in some ways more difficult than
sequential programming, for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.}
}\QuickQuizEnd
However, even though parallel programming might not be as hard as
is commonly advertised, it is often more work than is sequential
programming.
\QuickQuiz{
How could parallel programming \emph{ever} be as easy
as sequential programming?
}\QuickQuizAnswer{
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
\begin{VerbatimU}
get_input | grep "interesting" | sort
\end{VerbatimU}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
In short, parallel programming is just as easy as sequential
programming---at least in those environments that hide the parallelism
from the user!
}\QuickQuizEnd
It therefore makes sense to consider alternatives to parallel programming.
However, it is not possible to reasonably consider parallel-programming
alternatives without understanding parallel-programming goals.
This topic is addressed in the next section.
\section{Parallel Programming Goals}
\label{sec:intro:Parallel Programming Goals}
%
\epigraph{If you don't know where you are going, you will end up somewhere
else.}{Yogi Berra}
The three major goals of parallel programming (over and above those
of sequential programming) are as follows:
\begin{enumerate}
\item \IX{Performance}.
\item \IX{Productivity}.
\item \IX{Generality}.
\end{enumerate}
Unfortunately, given the current state of the art, it is possible to
achieve at best two of these three goals for any given parallel program.
These three goals therefore form the \emph{iron triangle of parallel
programming},
a triangle upon which overly optimistic hopes all too often come to
grief.\footnote{
Kudos to Michael Wong for naming the iron triangle.}
\QuickQuizSeries{%
\QuickQuizB{
Oh, really???
What about correctness, maintainability, robustness, and so on?
}\QuickQuizAnswerB{
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
}\QuickQuizEndB
%
\QuickQuizM{
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
}\QuickQuizAnswerM{
Given that parallel programming is perceived to be much harder
than sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL serve a specific purpose, hence generality must
also be added to the list.
}\QuickQuizEndM
%
\QuickQuizM{
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
}\QuickQuizAnswerM{
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
}\QuickQuizEndM
%
\QuickQuizE{
What about just having fun?
}\QuickQuizAnswerE{
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
}\QuickQuizEndE
}
Each of these goals is elaborated upon in the following sections.
\subsection{Performance}
\label{sec:intro:Performance}
Performance is the primary goal behind most parallel-programming effort.
After all, if performance is not a concern, why not do yourself a favor:
Just write sequential code, and be happy?
It will very likely be easier
and you will probably get done much more quickly.
\QuickQuiz{
Are there no cases where parallel programming is about something
other than performance?
}\QuickQuizAnswer{
There certainly are cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
Parallelism is also sometimes used for reliability.
For but one example,
triple-modular redundancy has three systems run in parallel
and vote on the result.
In extreme cases, the three systems will be independently
implemented using different algorithms and technologies.
}\QuickQuizEnd
Note that ``performance'' is interpreted broadly here,
including for example \IX{scalability} (performance per CPU) and \IX{efficiency}
(performance per watt).
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}}
\caption{MIPS/Clock-Frequency Trend for Intel CPUs}
\label{fig:intro:Clock-Frequency Trend for Intel CPUs}
\end{figure}
That said, the focus of performance improvement has shifted from hardware to
parallel software.
This change in focus is due to the fact that, although \IXr{Moore's Law}
continues to deliver increases in transistor density, it has ceased to
provide the traditional single-threaded performance increases.
This can be seen in
\cref{fig:intro:Clock-Frequency Trend for Intel CPUs},\footnote{
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS
(millions of instructions per second, usually from the old
Dhrystone benchmark)
for older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for shifting between these two measures is that the
newer CPUs' ability to retire multiple instructions per clock is
typically limited by memory-system performance.
Furthermore, the benchmarks commonly used on the older CPUs
are obsolete, and it is difficult to run the newer benchmarks
on systems containing the old CPUs, in part because it is hard
to find working instances of the old CPUs.}
which shows that writing single-threaded code and simply waiting
a year or two for the CPUs to catch up may no longer be an option.
Given the recent trends on the part of all major manufacturers towards
multicore/multithreaded systems, parallelism is the way to go for
those wanting to avail themselves of the full performance of their
systems.
\QuickQuiz{
Why not instead rewrite programs from inefficient scripting
languages to C or C++?
}\QuickQuizAnswer{
If the developers, budget, and time are available for such a
rewrite, and if the result will attain the required levels
of performance on a single CPU, this can be a reasonable
approach.
}\QuickQuizEnd
Even so, the first goal is performance rather than scalability,
especially given that the easiest way to attain linear scalability
is to reduce the performance of each CPU~\cite{LinusTorvalds2001a}.
Given a four-CPU system, which would you prefer?
A program that provides 100 transactions per second on a single CPU,
but does not scale at all?
Or a program that provides 10 transactions per second on a single CPU,
but scales perfectly?
The first program seems like a better bet, though the answer might
change if you happened to have a 32-CPU system.
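In round numbers, the non-scaling program delivers 100 transactions per
second regardless of the number of CPUs, while the perfectly scaling
program delivers $10N$ transactions per second on $N$ CPUs:
40 versus 100 on the four-CPU system, but 320 versus 100 on a 32-CPU
system, with break-even at $N = 10$.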
That said, the mere fact that you have multiple CPUs is not necessarily
in and of itself a reason to use them all, especially given the
recent decreases in the price of multi-CPU systems.
The key point to understand is that parallel programming is primarily
a performance optimization, and, as such, it is one potential optimization
of many.
If your program is fast enough as currently written, there is no
reason to optimize, either by parallelizing it or by applying any
of a number of potential sequential optimizations.\footnote{
Of course, if you are a hobbyist whose primary interest is
writing parallel software, that is more than enough reason to
parallelize whatever software you are interested in.}
By the same token, if you are looking to apply parallelism as an
optimization to a sequential program, then you will need to compare
parallel algorithms to the best sequential algorithms.
This may require some care, as far too many publications ignore the
sequential case when analyzing the performance of parallel algorithms.
\subsection{Productivity}
\label{sec:intro:Productivity}
\EQuickQuiz{
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
}\EQuickQuizAnswer{
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
}\EQuickQuizEnd
\IX{Productivity} has been becoming increasingly important in recent decades.
To see this, consider that the price of early computers was tens
of millions of dollars at
a time when engineering salaries were but a few thousand dollars a year.
If dedicating a team of ten engineers to such a machine would improve
its performance, even by only 10\pct, then their salaries
would be repaid many times over.
One such machine was the CSIRAC, the oldest still-intact stored-program
computer, which was put into operation in
1949~\cite{CSIRACMuseumVictoria,CSIRACUniversityMelbourne}.
Because this machine was built before the transistor era, it was constructed
of 2,000 vacuum tubes, ran with a clock frequency of 1\,kHz,
consumed 30\,kW of power, and weighed more than three metric tons.
Given that this machine had but 768 words of RAM, it is safe to say that
it did not suffer from the productivity issues that often plague
today's large-scale software projects.
Today, it would be quite difficult to purchase a machine with so
little computing power.
Perhaps the closest equivalents
are 8-bit embedded microprocessors exemplified by the venerable
Z80~\cite{z80Wikipedia}, but even the old Z80 had a CPU clock
frequency more than 1,000 times faster than the CSIRAC\@.
The Z80 CPU had 8,500 transistors, and could be purchased in 2008
for less than \$2 US per unit in 1,000-unit quantities.
In stark contrast to the CSIRAC, software-development costs are
anything but insignificant for the Z80.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/mipsperbuck}}
\caption{MIPS per Die for Intel CPUs}
\label{fig:intro:MIPS per Die for Intel CPUs}
\end{figure}
The CSIRAC and the Z80 are two points in a long-term trend, as can be
seen in
\cref{fig:intro:MIPS per Die for Intel CPUs}.
This figure plots an approximation to computational power per die
over the past four decades, showing an impressive six-order-of-magnitude
increase over a period of forty years.
Note that the advent of multicore CPUs has permitted this increase to
continue apace despite the clock-frequency wall encountered in 2003,
albeit courtesy of dies supporting more than 50 hardware threads each.
One of the inescapable consequences of the rapid decrease in
the cost of hardware is that software productivity becomes increasingly
important.
It is no longer sufficient merely to make efficient use of the hardware:
It is now necessary to make extremely efficient use of software
developers as well.
This has long been the case for sequential hardware, but
parallel hardware has become a low-cost commodity only recently.
Therefore, only recently has high productivity become critically important
when creating parallel software.
\QuickQuiz{
Given how cheap parallel systems have become, how can anyone
afford to pay people to program them?
}\QuickQuizAnswer{
There are a number of answers to this question:
\begin{enumerate}
\item Given a large computational cluster of parallel machines,
the aggregate cost of the cluster can easily justify
substantial developer effort, because the development
cost can be spread over the large number of machines.
\item Popular software that is run by tens of millions of users
can easily justify substantial developer effort,
as the cost of this development can be spread over the tens
of millions of users.
Note that this includes things like kernels and system
libraries.
\item If the low-cost parallel machine is controlling the operation
of a valuable piece of equipment, then the cost of this
piece of equipment might easily justify substantial
developer effort.
\item If the software for the low-cost parallel machine produces an
extremely valuable result (e.g., energy savings),
then this valuable result might again justify substantial
developer cost.
\item Safety-critical systems protect lives, which can clearly
justify very large developer effort.
\item Hobbyists and researchers might instead seek knowledge,
experience, fun, or glory.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
}\QuickQuizEnd
Perhaps at one time, the sole purpose of parallel software was performance.
Now, however, productivity is gaining the spotlight.
\subsection{Generality}
\label{sec:intro:Generality}
One way to justify the high cost of developing parallel software
is to strive for maximal \IX{generality}.
All else being equal, the cost of a more-general software artifact
can be spread over more users than that of a less-general one.
In fact, this economic force explains much of the maniacal focus
on portability, which can be seen as an important special case
of generality.\footnote{
Kudos to Michael Wong for pointing this out.}
Unfortunately, generality often comes at the cost of performance,
productivity, or both.
For example, portability is often achieved via adaptation layers,
which inevitably exact a performance penalty.
To see this more generally, consider the following popular parallel
programming environments:
\begin{description}
\item[C/C++ ``Locking Plus Threads'':] This category, which includes
POSIX Threads (pthreads)~\cite{OpenGroup1997pthreads},
Windows Threads, and numerous
operating-system kernel environments, offers excellent performance
(at least within the confines of a single SMP system)
and also offers good generality.
Pity about the relatively low productivity.
\item[Java:] This general-purpose and inherently multithreaded
programming environment is widely believed to offer much higher
productivity than C or C++, courtesy of the automatic garbage collector
and the rich set of class libraries.
However, its performance, though greatly improved in the early
2000s, lags that of C and C++.
\item[MPI:] This Message Passing Interface~\cite{MPIForum2008} powers
the largest scientific and technical computing clusters in
the world and offers unparalleled performance and scalability.
In theory, it is general purpose, but it is mainly used
for scientific and technical computing.
Its productivity is believed by many to be even lower than that
of C/C++ ``locking plus threads'' environments.
\item[OpenMP:] This set of compiler directives can be used
	to parallelize loops, as illustrated by the sketch following
	this list.
It is thus quite specific to this
task, and this specificity often limits its performance.
It is, however, much easier to use than MPI or C/C++
``locking plus threads.''
\item[SQL:] Structured Query Language~\cite{DIS9075SQL92} is
specific to relational database queries.
However, its performance is quite good as measured by the
Transaction Processing Performance Council (TPC)
benchmark results~\cite{TPC}.
Productivity is excellent; in fact, this parallel programming
environment enables people to make good use of a large parallel
system despite having little or no knowledge of parallel
programming concepts.
\end{description}
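
To make the OpenMP entry concrete, here is a minimal sketch, not drawn
from any particular application, that parallelizes a simple vector
addition.
The array size and contents are arbitrary, and the code must be compiled
with OpenMP support (for example, \co{gcc -fopenmp}) for the directive
to take effect:

\begin{VerbatimU}
#include <stdio.h>

#define N 1000000

double a[N], b[N], c[N];

int main(void)
{
	long i;

	for (i = 0; i < N; i++) {	/* Sequential initialization. */
		b[i] = i;
		c[i] = 2 * i;
	}
#pragma omp parallel for		/* Directive parallelizes this loop. */
	for (i = 0; i < N; i++)
		a[i] = b[i] + c[i];
	printf("a[N-1] = %.0f\n", a[N - 1]);	/* Prints 3 * (N - 1). */
	return 0;
}
\end{VerbatimU}

Without OpenMP support, the directive is simply ignored and the loop
runs sequentially, which illustrates both OpenMP's ease of use and its
narrow focus on loop-level parallelism.
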
\QuickQuiz{
SQL???
That is ancient history.
Here in the 2020s, if you really want to make good use of a
large parallel system, why not use machine learning?
}\QuickQuizAnswer{
It certainly is the case that machine learning can keep many
large parallel systems busy!
However, the jury is still out as to whether this constitutes
\emph{good} use of those systems.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{intro/PPGrelation}}
\caption{Software Layers and Performance, Productivity, and Generality}
\label{fig:intro:Software Layers and Performance; Productivity; and Generality}
\end{figure}
The nirvana of parallel programming environments, one that offers
world-class performance, productivity, and generality, simply does
not yet exist.
Until such a nirvana appears, it will be necessary to make engineering
tradeoffs among performance, productivity, and generality.
One such tradeoff is depicted by the green ``iron triangle''\footnote{
Kudos to Michael Wong for coining ``iron triangle.''}
shown in
\cref{fig:intro:Software Layers and Performance; Productivity; and Generality},
which shows how productivity becomes increasingly important at the upper layers
of the system stack,
while performance and generality become increasingly important at the
lower layers of the system stack.
The huge development costs incurred at the lower layers
must be spread over equally huge numbers of users
(hence the importance of generality), and
performance lost in lower layers cannot easily be
recovered further up the stack.
In the upper layers of the stack, there might be very few users for a given
specific application, in which case productivity concerns are paramount.
This explains the tendency towards ``bloatware'' further up the stack:
Extra hardware is often cheaper than extra developers.
This book is intended for developers working near the bottom
of the stack, where performance and generality are of greatest concern.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/Generality}}
\caption{Tradeoff Between Productivity and Generality}
\label{fig:intro:Tradeoff Between Productivity and Generality}
\end{figure}
It is important to note that a tradeoff between productivity and
generality has existed for centuries in many fields.
For but one example, a nailgun is more productive than a hammer for
driving nails, but in contrast to the nailgun, a hammer can be used for
many things besides driving nails.
It should therefore be no surprise to see similar tradeoffs
appear in the field of parallel computing.
This tradeoff is shown schematically in
\cref{fig:intro:Tradeoff Between Productivity and Generality}.
Here, users~1, 2, 3, and~4 have specific jobs that they need the computer
to help them with.
The most productive possible language or environment for a given user is one
that simply does that user's job, without requiring any programming,
configuration, or other setup.
\QuickQuiz{
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
}\QuickQuizAnswer{
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who indulge in excessive generality will therefore fail to set
the productivity bar high enough to succeed near the top of the
software stack.
This fact of life even has its own acronym:
YAGNI, or ``You Ain't Gonna Need It.''
}\QuickQuizEnd
Unfortunately, a system that does the job required by user~1 is
unlikely to do user~2's job.
In other words, the most productive languages and environments are
domain-specific, and thus by definition lacking generality.
Another option is to tailor a given programming language or environment
to the hardware system (for example, low-level languages such as
assembly, C, C++, or Java) or to some abstraction (for example,
Haskell, Prolog, or Snobol), as is shown by the circular region near
the center of
\cref{fig:intro:Tradeoff Between Productivity and Generality}.
These languages can be considered to be general in the sense that they
are equally ill-suited to the jobs required by users~1, 2, 3, and~4.
In other words, their generality comes at the expense of
decreased productivity when compared to domain-specific languages
and environments.
Worse yet, a language that is tailored to a given abstraction
is likely to suffer from performance and scalability problems
unless and until it can be efficiently mapped to real hardware.
Is there no escape from the iron triangle's three conflicting goals of
performance, productivity, and generality?
It turns out that there often is an escape, for example,
using the alternatives to parallel programming discussed in the next section.
After all, parallel programming can be a great deal of fun, but
it is not always the best tool for the job.
\section{Alternatives to Parallel Programming}
\label{sec:intro:Alternatives to Parallel Programming}
%
\epigraph{Experiment is folly when experience shows the way.}
{Roger M. Babson}
In order to properly consider alternatives to parallel programming,
you must first decide on what exactly you expect the parallelism
to do for you.
As seen in \cref{sec:intro:Parallel Programming Goals},
the primary goals of parallel programming are performance, productivity,
and generality.
Because this book is intended for developers working on
performance-critical code near the bottom of the software stack,
the remainder of this section focuses primarily on performance improvement.
It is important to keep in mind that parallelism is but one way to
improve performance.
Other well-known approaches include the following, in roughly increasing
order of difficulty:
\begin{enumerate}
\item Run multiple instances of a sequential application.
\item Make the application use existing parallel software.
\item Optimize the serial application.
\end{enumerate}
These approaches are covered in the following sections.
\subsection{Multiple Instances of a Sequential Application}
\label{sec:intro:Multiple Instances of a Sequential Application}
Running multiple instances of a sequential application can allow you
to do parallel programming without actually doing parallel programming.
There are a large number of ways to approach this, depending on the
structure of the application.
If your program is analyzing a large number of different scenarios,
or is analyzing a large number of independent data sets, one easy
and effective approach is to create a single sequential program that
carries out a single analysis, then use any of a number of scripting
environments (for example the \co{bash} shell) to run a number of
instances of that sequential program in parallel.
In some cases, this approach can be easily extended to a cluster of
machines.
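
By way of illustration, the following minimal C sketch forks several
instances of a hypothetical sequential \co{analyze_one} program, one per
scenario, and waits for them all to complete; a few lines of \co{bash}
would accomplish the same thing:

\begin{VerbatimU}
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NINSTANCES 4

int main(void)
{
	char arg[32];
	int i;
	pid_t pid;

	for (i = 0; i < NINSTANCES; i++) {
		pid = fork();
		if (pid < 0) {		/* fork() failed, give up. */
			perror("fork");
			exit(EXIT_FAILURE);
		}
		if (pid == 0) {		/* Child: run one analysis. */
			snprintf(arg, sizeof(arg), "%d", i);
			execlp("./analyze_one", "analyze_one", arg,
			       (char *)NULL);
			perror("execlp");
			exit(EXIT_FAILURE);
		}
	}
	while (wait(NULL) > 0)		/* Parent: await all children. */
		continue;
	return 0;
}
\end{VerbatimU}
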
This approach may seem like cheating, and in fact some denigrate such
programs as ``\IX{embarrassingly parallel}''.
This approach does indeed have some potential disadvantages,
including increased memory consumption, waste of CPU cycles recomputing
common intermediate results, and increased copying of data.
However, it is often extremely productive, garnering substantial performance
gains with little or no added effort.
\subsection{Use Existing Parallel Software}
\label{sec:intro:Use Existing Parallel Software}
There is no longer any shortage of parallel software environments that
can present a single-threaded programming environment,
including relational
databases~\cite{Date82},
web-application servers, and map-reduce environments.
For example, a common design provides a separate process for each
user, each of which generates SQL from user queries.
This per-user SQL is run against a common relational database, which
automatically runs the users' queries concurrently.
The per-user programs are responsible only for the user interface,
with the relational database taking full responsibility for the
difficult issues surrounding parallelism and persistence.
In addition, there are a growing number of parallel library functions,
particularly for numeric computation.
Even better, some libraries take advantage of special\-/purpose
hardware such as vector units and general\-/purpose graphical processing
units (GPGPUs).
Taking this approach often sacrifices some performance, at least when
compared to carefully hand-coding a fully parallel application.
However, such sacrifice is often well repaid by a huge reduction in
development effort.
\QuickQuiz{
Wait a minute!
Doesn't this approach simply shift the development effort from
you to whoever wrote the existing parallel software you are using?
}\QuickQuizAnswer{
Exactly!
And that is the whole point of using existing software.
One team's work can be used by many other teams, resulting in a
large decrease in overall effort compared to all teams
needlessly reinventing the wheel.
}\QuickQuizEnd
\subsection{Performance Optimization}
\label{sec:intro:Performance Optimization}
Up through the early 2000s, CPU clock frequencies doubled every 18 months.
It was therefore usually more important to create new functionality than to
carefully optimize performance.
Now that \IXr{Moore's Law} is ``only'' increasing transistor density instead
of increasing both transistor density and per-transistor performance,
it might be a good time to rethink the importance of performance
optimization.
After all, new hardware generations no longer bring significant
single-threaded performance improvements.
Furthermore, many performance optimizations can also conserve energy.
From this viewpoint, parallel programming is but another performance
optimization, albeit one that is becoming much more attractive
as parallel systems become cheaper and more readily available.
However, it is wise to keep in mind that the speedup available from
parallelism is limited to roughly the number of CPUs
(but see \cref{sec:SMPdesign:Beyond Partitioning}
for an interesting exception).
In contrast, the speedup available from traditional single-threaded
software optimizations can be much larger.
For example, replacing a long linked list with a hash table
or a search tree can improve performance by many orders of magnitude.
This highly optimized single-threaded program might run much
faster than its unoptimized parallel counterpart, making parallelization
unnecessary.
Of course, a highly optimized parallel program would be even better,
aside from the added development effort required.
Furthermore, different programs might have different performance
bottlenecks.
For example, if your program spends most of its time
waiting on data from your disk drive,
using multiple CPUs will probably just increase the time wasted waiting
for the disks.
In fact, if the program was reading from a single large file laid out
sequentially on a rotating disk, parallelizing your program might
well make it a lot slower due to the added seek overhead.
You should instead optimize the data layout so that
the file can be smaller (thus faster to read), split the file into chunks
which can be accessed in parallel from different drives,
cache frequently accessed data in main memory,
or, if possible,
reduce the amount of data that must be read.
\QuickQuiz{
What other bottlenecks might prevent additional CPUs from
providing additional performance?
}\QuickQuizAnswer{
There are any number of potential bottlenecks:
\begin{enumerate}
\item Main memory.
If a single thread consumes all available
memory, additional threads will simply page themselves
silly.
\item Cache.
If a single thread's cache footprint completely
fills any shared CPU cache(s), then adding more threads
will simply thrash those affected caches, as will be
seen in \cref{chp:Data Structures}.
\item Memory bandwidth.
If a single thread consumes all available
memory bandwidth, additional threads will simply
result in additional queuing on the system interconnect.
\item I/O bandwidth.
If a single thread is I/O bound,
adding more threads will simply result in them all
waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems might have any number of additional
bottlenecks.
The fact is that every resource which is shared between
multiple CPUs or threads is a potential bottleneck.
}\QuickQuizEnd
Parallelism can be a powerful optimization technique, but
it is not the only such technique, nor is it appropriate for all
situations.
Of course, the easier it is to parallelize your program, the
more attractive parallelization becomes as an optimization.
Parallelization has a reputation of being quite difficult,
which leads to the question ``exactly what makes parallel
programming so difficult?''
\section{What Makes Parallel Programming Hard?}
\label{sec:intro:What Makes Parallel Programming Hard?}
%
\epigraph{Real difficulties can be overcome; it is only the imaginary
ones that are unconquerable.}{Theodore N.~Vail}
\OriginallyPublished{sec:intro:What Makes Parallel Programming Hard?}{What Makes Parallel Programming Hard?}{a Portland State University Technical Report}{PaulEMcKenney2009ProgrammingHard}
It is important to note that the difficulty of parallel programming
is as much a human-factors issue as it is a set of technical properties of the
parallel programming problem.
We do need human beings to be able to tell parallel
systems what to do, otherwise known as programming.
But parallel programming involves two-way communication, with
a program's performance and scalability being the communication from
the machine to the human.
In short, the human writes a program telling the computer what to do,
and the computer critiques this program via the resulting performance and
scalability.
Therefore, appeals to abstractions or to mathematical analyses will
often be of severely limited utility.
In the Industrial Revolution, the interface between human and machine
was evaluated by human-factor studies, then called time-and-motion
studies.
Although there have been a few human-factor studies examining parallel
programming~\cite{RyanEccles2005HPCSNovice,RyanEccles2006HPCSNoviceNeeds,
LorinHochstein2005SC,DuaneSzafron1994PEMPDS}, these studies have
been extremely narrowly focused, and hence unable to demonstrate any
general results.
Furthermore, given that the normal range of programmer productivity
spans more than an order of magnitude, it is unrealistic to expect
an affordable study to be capable of detecting (say) a 10\pct\ difference
in productivity.
Although the multiple-order-of-magnitude differences that such studies
\emph{can} reliably detect are extremely valuable, the most impressive
improvements tend to be based on a long series of 10\pct\ improvements.
We must therefore take a different approach.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/FourTaskCategories}}
\caption{Categories of Tasks Required of Parallel Programmers}
\label{fig:intro:Categories of Tasks Required of Parallel Programmers}
\end{figure}
One such approach is to carefully consider the tasks that parallel
programmers must undertake that are not required of sequential programmers.
We can then evaluate how well a given programming language or environment
assists the developer with these tasks.
These tasks fall into the four categories shown in
\cref{fig:intro:Categories of Tasks Required of Parallel Programmers},
each of which is covered in the following sections.
\subsection{Work Partitioning}
\label{sec:intro:Work Partitioning}
Work partitioning is absolutely required for parallel execution:
If there is but one ``glob'' of work, then it can be executed by at
most one CPU at a time, which is by definition sequential execution.
However, partitioning the work requires great care.
For example, uneven partitioning can result in sequential execution
once the small partitions have completed~\cite{GeneAmdahl1967AmdahlsLaw}.
In less extreme cases, load balancing can be used to fully utilize
available hardware and restore performance and scalability.
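
To make this concrete, the following minimal POSIX-threads sketch
(compiled with \co{-lpthread}) statically partitions a summation over an
array into equal chunks, one per thread; the array size, contents, and
thread count are arbitrary placeholders:

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static long a[N];
static long partial[NTHREADS];	/* Pad in real code to avoid false sharing. */

static void *sum_range(void *arg)
{
	long me = (long)arg;
	long lo = me * (N / NTHREADS);
	long hi = (me == NTHREADS - 1) ? N : lo + N / NTHREADS;
	long i;

	for (i = lo; i < hi; i++)	/* Sum this thread's partition. */
		partial[me] += a[i];
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i, sum = 0;

	for (i = 0; i < N; i++)
		a[i] = 1;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, sum_range, (void *)i);
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		sum += partial[i];	/* Combine the per-thread results. */
	}
	printf("sum = %ld\n", sum);
	return 0;
}
\end{VerbatimU}

An uneven split, for example, one that assigned most of the array to a
single thread, would leave the remaining threads idle, illustrating the
load-balancing concern called out above.
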
Although partitioning can greatly improve performance and scalability,
it can also increase complexity.
For example, partitioning can complicate handling of global
errors and events:
A parallel program may need to carry out non-trivial synchronization
in order to safely process such global events.
More generally, each partition requires some sort of communication:
After all, if
a given thread did not communicate at all, it would have no effect and
would thus not need to be executed.
However, because communication incurs overhead, careless partitioning choices
can result in severe performance degradation.
Furthermore, the number of concurrent threads must often be controlled,
as each such thread occupies common resources, for example,
space in CPU caches.
If too many threads are permitted to execute concurrently, the
CPU caches will overflow, resulting in a high cache-miss rate, which in
turn degrades performance.
Conversely, large numbers of threads are often required to
overlap computation and I/O so as to fully utilize I/O devices.
\QuickQuiz{
Other than CPU cache capacity, what might require limiting the
number of concurrent threads?
}\QuickQuizAnswer{
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item Main memory.
Each thread consumes some memory
(for its stack if nothing else), so that excessive
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures.
\item I/O bandwidth.
If each thread initiates a given
amount of mass-storage I/O or networking traffic,
excessive numbers of threads can result in excessive
I/O queuing delays, again degrading performance.
Some networking protocols may be subject to timeouts
or other failures if there are so many threads that
networking events cannot be responded to in a timely
fashion.
\item Synchronization overhead.
For many synchronization protocols, excessive numbers
of threads can result in excessive spinning, blocking,
or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
}\QuickQuizEnd
Finally, permitting threads to execute concurrently greatly increases
the program's state space, which can make the program difficult to
understand and debug, degrading productivity.
All else being equal, smaller state spaces having more regular structure
are more easily understood, but this is a human-factors statement as much
as it is a technical or mathematical statement.
Good parallel designs might have extremely large state spaces, but
nevertheless be easy to understand due to their regular structure,
while poor designs can be impenetrable despite having a comparatively
small state space.
The best designs exploit embarrassing parallelism, or transform the
problem to one having an embarrassingly parallel solution.
In either case, ``embarrassingly parallel'' is in fact
an embarrassment of riches.
The current state of the art enumerates good designs; more work is
required to make more general judgments on
state-space size and structure.
\subsection{Parallel Access Control}
\label{sec:Parallel Access Control}
Given a single-threaded sequential program, that single
thread has full access to all of the program's resources.
These resources are most often in-memory data structures, but can be CPUs,
memory (including caches), I/O devices, computational accelerators, files,
and much else besides.
The first parallel-access-control issue is whether the form of access to
a given resource depends on that resource's location.
For example, in many message-passing environments, local-variable
access is via expressions and assignments,
while remote-variable access uses an entirely different
syntax, usually involving messaging.
The POSIX Threads environment~\cite{OpenGroup1997pthreads},
Structured Query Language (SQL)~\cite{DIS9075SQL92}, and
partitioned global address-space (PGAS) environments
such as Universal Parallel C (UPC)~\cite{ElGhazawi2003UPC,UPCConsortium2013}
offer implicit access,
while Message Passing Interface (MPI)~\cite{MPIForum2008} offers
explicit access because access to remote data requires explicit
messaging.
The other parallel-access-control issue is how threads coordinate
access to the resources.
This coordination is carried out by
the very large number of synchronization mechanisms
provided by various parallel languages and environments,
including message passing, locking, transactions,
reference counting, explicit timing, shared atomic variables, and data
ownership.
Many traditional parallel-programming concerns such as \IX{deadlock},
\IX{livelock}, and transaction rollback stem from this coordination.
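
For example, the following minimal sketch uses a POSIX mutex, one of the
simplest such mechanisms, to coordinate concurrent increments of a shared
counter; the thread and iteration counts are arbitrary:

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NUPDATES 1000000

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *incrementer(void *arg)
{
	long i;

	for (i = 0; i < NUPDATES; i++) {
		pthread_mutex_lock(&counter_lock);	/* Coordinate access. */
		counter++;
		pthread_mutex_unlock(&counter_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, incrementer, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %ld\n", counter);	/* NTHREADS * NUPDATES. */
	return 0;
}
\end{VerbatimU}

Omitting the lock would transform these increments into the classic
lost-update race condition.
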
This framework can be elaborated to include comparisons
of these synchronization mechanisms, for example locking vs.\@ transactional
memory~\cite{McKenney2007PLOSTM}, but such elaboration is beyond the
scope of this section.
(See
\cref{sec:future:Transactional Memory,%
sec:future:Hardware Transactional Memory}
for more information on transactional memory.)
\QuickQuiz{
Just what is ``explicit timing''???
}\QuickQuizAnswer{
Where each thread is given access to some set of resources during
an agreed-to slot of time.
For example, a parallel program with eight threads might be
organized into eight-millisecond time intervals, so that the
first thread is given access during the first millisecond of
each interval, the second thread during the second millisecond,
and so on.
This approach clearly requires carefully synchronized clocks
and careful control of execution times, and therefore should
be used with considerable caution.
In fact, outside of hard realtime environments, you almost
certainly want to use something else instead.
Explicit timing is nevertheless worth a mention, as it is
always there when you need it.
}\QuickQuizEnd
\subsection{Resource Partitioning and Replication}
\label{sec:Resource Partitioning and Replication}
The most effective parallel algorithms and systems exploit resource
parallelism, so much so that it is
usually wise to begin parallelization by partitioning your write-intensive
resources and replicating frequently accessed read-mostly resources.
The resource in question is most frequently data, which might be
partitioned over computer systems, mass-storage devices, \IXplr{NUMA node},
CPU cores (or dies or hardware threads), pages, cache lines, instances
of synchronization primitives, or critical sections of code.
For example, partitioning over locking primitives is termed
``\IXh{data}{locking}''~\cite{Beck85}.
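
As a minimal illustration of data locking, the following sketch gives
each hash-table bucket its own lock, so that operations on different
buckets proceed in parallel; the names are illustrative, and node
allocation, insertion, and lifetime issues are ignored:

\begin{VerbatimU}
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 1024

struct node {
	struct node *next;
	unsigned long key;
};

struct bucket {
	pthread_mutex_t lock;	/* One lock per bucket: data locking. */
	struct node *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	int i;

	for (i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

static int table_contains(unsigned long key)
{
	struct bucket *b = &table[key % NBUCKETS];
	struct node *p;
	int found = 0;

	pthread_mutex_lock(&b->lock);	/* Serialize only this bucket. */
	for (p = b->head; p != NULL; p = p->next)
		if (p->key == key) {
			found = 1;
			break;
		}
	pthread_mutex_unlock(&b->lock);
	return found;
}
\end{VerbatimU}

Threads accessing keys that hash to different buckets never contend for
the same lock, which is precisely the sort of partitioning of
write-intensive resources advocated above.
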
Resource partitioning is frequently application dependent.
For example, numerical applications frequently partition matrices
by row, column, or sub-matrix, while commercial applications frequently
partition write-intensive data structures and replicate
read-mostly data structures.
Thus, a commercial application might assign the data for a
given customer to a given few computers out of a large cluster.
An application might statically partition data, or dynamically
change the partitioning over time.
Resource partitioning is extremely effective, but
it can be quite challenging for complex multilinked data
structures.
\subsection{Interacting With Hardware}
\label{sec:Interacting With Hardware}
Hardware interaction is normally the domain of the operating system,
the compiler, libraries, or other software-environment infrastructure.
However, developers working with novel hardware features and components
will often need to work directly with such hardware.
In addition, direct access to the hardware can be required when squeezing
the last drop of performance out of a given system.
In this case, the developer may need to tailor or configure the application
to the cache geometry, system topology, or interconnect protocol of the
target hardware.
In some cases, hardware may be considered to be a resource which
is subject to partitioning or access control, as described in
the previous sections.
\subsection{Composite Capabilities}
\label{sec:Composite Capabilities}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/FourTaskOrder}}
\caption{Ordering of Parallel-Programming Tasks}
\label{fig:intro:Ordering of Parallel-Programming Tasks}
\end{figure}
Although these four capabilities are fundamental,
good engineering practice uses composites of
these capabilities.
For example, the data-parallel approach first
partitions the data so as to minimize the need for
inter-partition communication, partitions the code accordingly,
and finally maps data partitions and threads so as to maximize
throughput while minimizing inter-thread communication,
as shown in
\cref{fig:intro:Ordering of Parallel-Programming Tasks}.
The developer can then
consider each partition separately, greatly reducing the size
of the relevant state space, in turn increasing productivity.
Even though some problems are non-partitionable,
clever transformations into forms permitting partitioning can
sometimes greatly enhance
both performance and scalability~\cite{PanagiotisMetaxas1999PDCS}.
\subsection{How Do Languages and Environments Assist With These Tasks?}
\label{sec:intro:How Do Languages and Environments Assist With These Tasks?}
Although many environments require the developer to deal manually
with these tasks, there are long-standing environments that bring
significant automation to bear.
The poster child for these environments is SQL, many implementations
of which automatically parallelize single large queries and also
automate concurrent execution of independent queries and updates.
These four categories of tasks must be carried out in all parallel
programs, but that of course does not necessarily mean that the developer
must manually carry out these tasks.
We can expect to see ever-increasing automation of these four tasks
as parallel systems continue to become cheaper and more readily available.
\QuickQuiz{
Are there any other obstacles to parallel programming?
}\QuickQuizAnswer{
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item The only known algorithms for a given project might
be inherently sequential in nature.
In this case, either avoid parallel programming
(there being no law saying that your project \emph{has}
to run in parallel) or invent a new parallel algorithm.
\item The project allows binary-only plugins that share the same
address space, such that no one developer has access to
all of the source code for the project.
Because many parallel bugs, including deadlocks, are
global in nature, such binary-only plugins pose a severe
challenge to current software development methodologies.
This might well change, but for the time being, all
developers of parallel code sharing a given address space
need to be able to see \emph{all} of the code running in
that address space.
\item The project contains heavily used APIs that were designed
without regard to
parallelism~\cite{HagitAttiya2011LawsOfOrder,Clements:2013:SCR:2517349.2522712}.
Some of the more ornate features of the System V
message-queue API form a case in point.
Of course, if your project has been around for a few
decades, and its developers did not have access to
parallel hardware, it undoubtedly has at least
its share of such APIs.
\item The project was implemented without regard to parallelism.
Given that there are a great many techniques that work
extremely well in a sequential environment, but that
fail miserably in parallel environments, if your project
ran only on sequential hardware for most of its lifetime,
then your project undoubtedly has at least its share of
parallel-unfriendly code.
\item The project was implemented without regard to good
software-development practice.
The cruel truth is that shared-memory parallel
environments are often much less forgiving of sloppy
development practices than are sequential environments.
You may be well-served to clean up the existing design
and code prior to attempting parallelization.
\item The people who originally did the development on your
project have since moved on, and the people remaining,
while well able to maintain it or add small features,
are unable to make ``big animal'' changes.
In this case, unless you can work out a very simple
way to parallelize your project, you will probably
be best off leaving it sequential.
That said, there are a number of simple approaches that
you might use
to parallelize your project, including running multiple
instances of it, using a parallel implementation of
some heavily used library function, or making use of
some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization of a large body of code
can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
}\QuickQuizEnd
\section{Discussion}
\label{sec:intro:Discussion}
%
\epigraph{Until you try, you don't know what you can't do.}
{Henry James}
This chapter has given an overview of the difficulties with, goals of,
and alternatives to parallel programming.
This overview was followed by a discussion of
what can make parallel programming hard, along with a high-level
approach for dealing with parallel programming's difficulties.
Those who still insist that parallel programming is impossibly difficult
should review some of the older guides to parallel
programming~\cite{SQNTParallel,AndrewDBirrell1989Threads,Beck85,Inman85}.
The following quote from Andrew Birrell's
monograph~\cite{AndrewDBirrell1989Threads} is especially telling:
\begin{quote}
Writing concurrent programs has a reputation for being exotic
and difficult.
I~believe it is neither.
You need a system that provides you with good primitives
and suitable libraries,
you need a basic caution and carefulness, you need an armory of
useful techniques, and you need to know of the common pitfalls.
I~hope that this paper has helped you towards sharing my belief.
\end{quote}
The authors of these older guides were well up to the parallel programming
challenge back in the 1980s.
As such, there are simply no excuses for refusing to step up to the
parallel-programming challenge here in the 21\textsuperscript{st} century!
We are now ready to proceed to the next chapter, which dives into the
relevant properties of the parallel hardware underlying our parallel
software.
\QuickQuizAnswersChp{qqzintro}