% intro/intro.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Introduction}{Introduction}{qqzintro}
%
\Epigraph{If parallel programming is so hard, why are there so many
parallel programs?}{Unknown}
Parallel programming has earned a reputation as one of the most
difficult areas a hacker can tackle.
Papers and textbooks warn of the perils of \IX{deadlock}, \IX{livelock},
\IXpl{race condition}, non-determinism,
\IXaltr{Amdahl's-Law}{Amdahl's Law} limits to scaling,
and excessive realtime latencies.
And these perils are quite real; we authors have accumulated uncounted
% 2020:
% 30 for Paul E. McKenney
years of experience along with the resulting emotional scars,
grey hairs, and hair loss.
However, new technologies that are difficult to use at introduction
invariably become easier over time.
For example, the once-rare ability to drive a car is now
commonplace in many countries.
This dramatic change came about for two basic reasons:
\begin{enumerate*}[(1)]
\item Cars became cheaper and more readily available, so that
more people had the opportunity to learn to drive, and
\item Cars became easier to operate due to automatic transmissions,
automatic chokes, automatic starters, greatly improved reliability,
and a host of other technological improvements.
\end{enumerate*}
The same is true for many other technologies, including computers.
It is no longer necessary to operate a keypunch in order to program.
Spreadsheets allow most non-programmers to get results from their computers
that would have required a team of specialists a few decades ago.
Perhaps the most compelling example is web-surfing and content creation,
which since the early 2000s has been easily done by
untrained, uneducated people using various now-commonplace
social-networking tools.
As recently as 1968, such content creation was a far-out research
project~\cite{DouglasEngelbart1968}, described at
the time as
``like a UFO landing on the White House lawn''~\cite{ScottGriffen2000}.
% http://www.ibiblio.org/pioneers/engelbart.html
% http://www.histech.rwth-aachen.de/www/quellen/engelbart/ahi62index.html
Therefore, if you wish to argue that parallel programming will remain
as difficult as it is currently perceived by many to be, it is you
who bears the burden of proof, keeping in mind the many centuries of
counter-examples in many fields of endeavor.
\section{Historic Parallel Programming Difficulties}
\label{sec:intro:Historic Parallel Programming Difficulties}
%
\epigraph{Not the power to remember, but its very opposite, the power to
forget, is a necessary condition for our existence.}
{Sholem Asch}
As indicated by its title, this book takes a different approach.
Rather than complain about the difficulty of parallel programming,
it instead examines the reasons why parallel programming is
difficult, and then works to help the reader to overcome these
difficulties.
As will be seen, these difficulties have historically fallen into several
categories, including:
\begin{enumerate}
\item The historic high cost and relative rarity of parallel systems.
\item The typical researcher's and practitioner's lack of experience
with parallel systems.
\item The paucity of publicly accessible parallel code.
\item The lack of a widely understood engineering discipline of
parallel programming.
\item The high \IX{overhead} of communication relative to that of processing,
even in tightly coupled shared-memory computers.
\end{enumerate}
Many of these historic difficulties are well on the way to being overcome.
First, over the past few decades, the cost of parallel systems
has decreased from many multiples of that of a house to that of a
modest meal, courtesy of \IXr{Moore's Law}~\cite{GordonMoore1965MooresLaw}.
Papers calling out the advantages of multicore CPUs were published
as early as 1996~\cite{Olukotun96}.
IBM introduced simultaneous multi-threading
into its high-end \Power{} family in 2000, and multicore in 2001.
Intel introduced hyperthreading into its commodity Pentium line in
November 2000, and both AMD and Intel introduced
dual-core CPUs in 2005.
Sun followed with the multicore/multi-threaded Niagara in late 2005.
In fact, by 2008, it was becoming difficult
to find a single-CPU desktop system, with single-core CPUs being
relegated to netbooks and embedded devices.
By 2012, even smartphones were starting to sport multiple CPUs.
By 2020, safety-critical software standards started addressing
concurrency.
Second, the advent of low-cost and readily available multicore systems
means that the once-rare experience of parallel programming is
now available to almost all researchers and practitioners.
In fact, parallel systems have long been within the budget of students
and hobbyists.
We can therefore expect greatly increased levels of invention and
innovation surrounding parallel systems, and that increased familiarity
will over time make the once prohibitively expensive field of parallel
programming much more friendly and commonplace.
Third, in the 20\textsuperscript{th} century, large systems of
highly parallel software were almost always closely guarded proprietary
secrets.
In happy contrast, the 21\textsuperscript{st} century has seen numerous
open-source (and thus publicly available) parallel software projects,
including the Linux kernel~\cite{Torvalds2.6kernel},
database systems~\cite{PostgreSQL2008,MySQL2008},
and message-passing systems~\cite{OpenMPI2008,BOINC2008}.
This book will draw primarily from the Linux kernel, but will
provide much material suitable for user-level applications.
Fourth, even though the large-scale parallel\-/programming projects of
the 1980s and 1990s were almost all proprietary, these
projects have seeded other communities with cadres of developers who
understand the engineering discipline required to develop production-quality
parallel code.
A major purpose of this book is to present this engineering discipline.
Unfortunately, the fifth difficulty, the high cost of communication
relative to that of processing, remains largely in force.
This difficulty has been receiving increasing attention during
the new millennium.
However, according to \ppl{Stephen}{Hawking},
the finite speed of light and the atomic
nature of matter will limit progress in this
area~\cite{BryanGardiner2007,GordonMoore03a}.
Fortunately, this difficulty has been in force since the late 1980s,
so that the aforementioned engineering discipline has evolved practical
and effective strategies for handling it.
In addition, hardware designers are increasingly aware of these issues,
so perhaps future hardware will be more friendly to parallel software,
as discussed in \cref{sec:cpu:Hardware Free Lunch?}.
\QuickQuiz{
Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
}\QuickQuizAnswer{
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming was really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be:
``Why is parallel programming \emph{perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000---in 1991 US dollars.
In contrast, in 2020, Paul finds himself typing these words on a
six-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
64\,GB of main memory, a 1\,TB solid-state disk, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
Parallel programming is in some ways more difficult than
sequential programming, for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.}
}\QuickQuizEnd
However, even though parallel programming might not be as hard as
is commonly advertised, it is often more work than is sequential
programming.
\QuickQuiz{
How could parallel programming \emph{ever} be as easy
as sequential programming?
}\QuickQuizAnswer{
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
\begin{VerbatimU}
get_input | grep "interesting" | sort
\end{VerbatimU}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
In short, parallel programming is just as easy as sequential
programming---at least in those environments that hide the parallelism
from the user!
}\QuickQuizEnd
It therefore makes sense to consider alternatives to parallel programming.
However, it is not possible to reasonably consider parallel-programming
alternatives without understanding parallel-programming goals.
This topic is addressed in the next section.
\section{Parallel Programming Goals}
\label{sec:intro:Parallel Programming Goals}
%
\epigraph{If you don't know where you are going, you will end up somewhere
else.}{Yogi Berra}
The three major goals of parallel programming (over and above those
of sequential programming) are as follows:
\begin{enumerate}
\item \IX{Performance}.
\item \IX{Productivity}.
\item \IX{Generality}.
\end{enumerate}
Unfortunately, given the current state of the art, it is possible to
achieve at best two of these three goals for any given parallel program.
These three goals therefore form the \emph{iron triangle of parallel
programming},
a triangle upon which overly optimistic hopes all too often come to
grief.\footnote{
Kudos to Michael Wong for naming the iron triangle.}
\QuickQuizSeries{%
\QuickQuizB{
Oh, really???
What about correctness, maintainability, robustness, and so on?
}\QuickQuizAnswerB{
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
}\QuickQuizEndB
%
\QuickQuizM{
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
}\QuickQuizAnswerM{
Given that parallel programming is perceived to be much harder
than sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL serve a specific purpose, hence generality must
also be added to the list.
}\QuickQuizEndM
%
\QuickQuizM{
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
}\QuickQuizAnswerM{
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
}\QuickQuizEndM
%
\QuickQuizE{
What about just having fun?
}\QuickQuizAnswerE{
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
}\QuickQuizEndE
}
Each of these goals is elaborated upon in the following sections.
\subsection{Performance}
\label{sec:intro:Performance}
Performance is the primary goal behind most parallel-programming effort.
After all, if performance is not a concern, why not do yourself a favor:
Just write sequential code, and be happy?
It will very likely be easier
and you will probably get done much more quickly.
\QuickQuiz{
Are there no cases where parallel programming is about something
other than performance?
}\QuickQuizAnswer{
There certainly are cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
Parallelism is also sometimes used for reliability.
For but one example,
triple-modular redundancy has three systems run in parallel
and vote on the result.
In extreme cases, the three systems will be independently
implemented using different algorithms and technologies.
}\QuickQuizEnd
Note that ``performance'' is interpreted broadly here,
including for example \IX{scalability} (performance per CPU) and \IX{efficiency}
(performance per watt).
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}}
\caption{MIPS/Clock-Frequency Trend for Intel CPUs}
\label{fig:intro:Clock-Frequency Trend for Intel CPUs}
\end{figure}
That said, the focus of performance improvement has shifted from hardware to
parallel software.
This change in focus is due to the fact that, although \IXr{Moore's Law}
continues to deliver increases in transistor density, it has ceased to
provide the traditional single-threaded performance increases.
This can be seen in
\cref{fig:intro:Clock-Frequency Trend for Intel CPUs},\footnote{
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS
(millions of instructions per second, usually from the old
Dhrystone benchmark)
for older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for shifting between these two measures is that the
newer CPUs' ability to retire multiple instructions per clock is
typically limited by memory-system performance.
Furthermore, the benchmarks commonly used on the older CPUs
are obsolete, and it is difficult to run the newer benchmarks
on systems containing the old CPUs, in part because it is hard
to find working instances of the old CPUs.}
which shows that writing single-threaded code and simply waiting
a year or two for the CPUs to catch up may no longer be an option.
Given the recent trends on the part of all major manufacturers towards
multicore/multithreaded systems, parallelism is the way to go for
those wanting to avail themselves of the full performance of their
systems.
\QuickQuiz{
Why not instead rewrite programs from inefficient scripting
languages to C or C++?
}\QuickQuizAnswer{
If the developers, budget, and time are available for such a
rewrite, and if the result will attain the required levels
of performance on a single CPU, this can be a reasonable
approach.
}\QuickQuizEnd
Even so, the first goal is performance rather than scalability,
especially given that the easiest way to attain linear scalability
is to reduce the performance of each CPU~\cite{LinusTorvalds2001a}.
Given a four-CPU system, which would you prefer?
A program that provides 100 transactions per second on a single CPU,
but does not scale at all?
Or a program that provides 10 transactions per second on a single CPU,
but scales perfectly?
The first program seems like a better bet, though the answer might
change if you happened to have a 32-CPU system.
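In round numbers, the non-scaling program delivers 100 transactions per
second regardless of the number of CPUs, while the perfectly scaling
program delivers $10N$ transactions per second on $N$ CPUs:
40 versus 100 on the four-CPU system, but 320 versus 100 on a 32-CPU
system, with break-even at $N = 10$.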
That said, the mere fact that you have multiple CPUs is not necessarily
in and of itself a reason to use them all, especially given the
recent decreases in the price of multi-CPU systems.
The key point to understand is that parallel programming is primarily
a performance optimization, and, as such, it is one potential optimization
of many.
If your program is fast enough as currently written, there is no
reason to optimize, either by parallelizing it or by applying any
of a number of potential sequential optimizations.\footnote{
Of course, if you are a hobbyist whose primary interest is
writing parallel software, that is more than enough reason to
parallelize whatever software you are interested in.}
By the same token, if you are looking to apply parallelism as an
optimization to a sequential program, then you will need to compare
parallel algorithms to the best sequential algorithms.
This may require some care, as far too many publications ignore the
sequential case when analyzing the performance of parallel algorithms.
\subsection{Productivity}
\label{sec:intro:Productivity}
\EQuickQuiz{
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
}\EQuickQuizAnswer{
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
}\EQuickQuizEnd
\IX{Productivity} has been becoming increasingly important in recent decades.
To see this, consider that the price of early computers was tens
of millions of dollars at
a time when engineering salaries were but a few thousand dollars a year.
If dedicating a team of ten engineers to such a machine would improve
its performance, even by only 10\pct, then their salaries
would be repaid many times over.
One such machine was the CSIRAC, the oldest still-intact stored-program
computer, which was put into operation in
1949~\cite{CSIRACMuseumVictoria,CSIRACUniversityMelbourne}.
Because this machine was built before the transistor era, it was constructed
of 2,000 vacuum tubes, ran with a clock frequency of 1\,kHz,
consumed 30\,kW of power, and weighed more than three metric tons.
Given that this machine had but 768 words of RAM, it is safe to say that
it did not suffer from the productivity issues that often plague
today's large-scale software projects.
Today, it would be quite difficult to purchase a machine with so
little computing power.
Perhaps the closest equivalents
are 8-bit embedded microprocessors exemplified by the venerable
Z80~\cite{z80Wikipedia}, but even the old Z80 had a CPU clock
frequency more than 1,000 times faster than the CSIRAC\@.
The Z80 CPU had 8,500 transistors, and could be purchased in 2008
for less than \$2 US per unit in 1,000-unit quantities.
In stark contrast to the CSIRAC, software-development costs are
anything but insignificant for the Z80.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{SMPdesign/mipsperbuck}}
\caption{MIPS per Die for Intel CPUs}
\label{fig:intro:MIPS per Die for Intel CPUs}
\end{figure}
The CSIRAC and the Z80 are two points in a long-term trend, as can be
seen in
\cref{fig:intro:MIPS per Die for Intel CPUs}.
This figure plots an approximation to computational power per die
over the past four decades, showing an impressive six-order-of-magnitude
increase over a period of forty years.
Note that the advent of multicore CPUs has permitted this increase to
continue apace despite the clock-frequency wall encountered in 2003,
albeit courtesy of dies supporting more than 50 hardware threads each.
One of the inescapable consequences of the rapid decrease in
the cost of hardware is that software productivity becomes increasingly
important.
It is no longer sufficient merely to make efficient use of the hardware:
It is now necessary to make extremely efficient use of software
developers as well.
This has long been the case for sequential hardware, but
parallel hardware has become a low-cost commodity only recently.
Therefore, only recently has high productivity become critically important
when creating parallel software.
\QuickQuiz{
Given how cheap parallel systems have become, how can anyone
afford to pay people to program them?
}\QuickQuizAnswer{
There are a number of answers to this question:
\begin{enumerate}
\item Given a large computational cluster of parallel machines,
the aggregate cost of the cluster can easily justify
substantial developer effort, because the development
cost can be spread over the large number of machines.
\item Popular software that is run by tens of millions of users
can easily justify substantial developer effort,
as the cost of this development can be spread over the tens
of millions of users.
Note that this includes things like kernels and system
libraries.
\item If the low-cost parallel machine is controlling the operation
of a valuable piece of equipment, then the cost of this
piece of equipment might easily justify substantial
developer effort.
\item If the software for the low-cost parallel machine produces an
extremely valuable result (e.g., energy savings),
then this valuable result might again justify substantial
developer cost.
\item Safety-critical systems protect lives, which can clearly
justify very large developer effort.
\item Hobbyists and researchers might instead seek knowledge,
experience, fun, or glory.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
}\QuickQuizEnd
Perhaps at one time, the sole purpose of parallel software was performance.
Now, however, productivity is gaining the spotlight.
\subsection{Generality}
\label{sec:intro:Generality}
One way to justify the high cost of developing parallel software
is to strive for maximal \IX{generality}.
All else being equal, the cost of a more-general software artifact
can be spread over more users than that of a less-general one.
In fact, this economic force explains much of the maniacal focus
on portability, which can be seen as an important special case
of generality.\footnote{
Kudos to Michael Wong for pointing this out.}
Unfortunately, generality often comes at the cost of performance,
productivity, or both.
For example, portability is often achieved via adaptation layers,
which inevitably exact a performance penalty.
To see this more generally, consider the following popular parallel
programming environments:
\begin{description}
\item[C/C++ ``Locking Plus Threads'':] This category, which includes
POSIX Threads (pthreads)~\cite{OpenGroup1997pthreads},
Windows Threads, and numerous
operating-system kernel environments, offers excellent performance
(at least within the confines of a single SMP system)
and also offers good generality.
Pity about the relatively low productivity.
\item[Java:] This general-purpose and inherently multithreaded
programming environment is widely believed to offer much higher
productivity than C or C++, courtesy of the automatic garbage collector
and the rich set of class libraries.
However, its performance, though greatly improved in the early
2000s, lags that of C and C++.
\item[MPI:] This Message Passing Interface~\cite{MPIForum2008} powers
the largest scientific and technical computing clusters in
the world and offers unparalleled performance and scalability.
In theory, it is general purpose, but it is mainly used
for scientific and technical computing.
Its productivity is believed by many to be even lower than that
of C/C++ ``locking plus threads'' environments.
\item[OpenMP:] This set of compiler directives can be used
	to parallelize loops, as illustrated by the sketch following
	this list.
It is thus quite specific to this
task, and this specificity often limits its performance.
It is, however, much easier to use than MPI or C/C++
``locking plus threads.''
\item[SQL:] Structured Query Language~\cite{DIS9075SQL92} is
specific to relational database queries.
However, its performance is quite good as measured by the
Transaction Processing Performance Council (TPC)
benchmark results~\cite{TPC}.
Productivity is excellent; in fact, this parallel programming
environment enables people to make good use of a large parallel
system despite having little or no knowledge of parallel
programming concepts.
\end{description}
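
To make the OpenMP entry concrete, here is a minimal sketch, not drawn
from any particular application, that parallelizes a simple vector
addition.
The array size and contents are arbitrary, and the code must be compiled
with OpenMP support (for example, \co{gcc -fopenmp}) for the directive
to take effect:

\begin{VerbatimU}
#include <stdio.h>

#define N 1000000

double a[N], b[N], c[N];

int main(void)
{
	long i;

	for (i = 0; i < N; i++) {	/* Sequential initialization. */
		b[i] = i;
		c[i] = 2 * i;
	}
#pragma omp parallel for		/* Directive parallelizes this loop. */
	for (i = 0; i < N; i++)
		a[i] = b[i] + c[i];
	printf("a[N-1] = %.0f\n", a[N - 1]);	/* Prints 3 * (N - 1). */
	return 0;
}
\end{VerbatimU}

Without OpenMP support, the directive is simply ignored and the loop
runs sequentially, which illustrates both OpenMP's ease of use and its
narrow focus on loop-level parallelism.
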
\QuickQuiz{
SQL???
That is ancient history.
Here in the 2020s, if you really want to make good use of a
large parallel system, why not use machine learning?
}\QuickQuizAnswer{
It certainly is the case that machine learning can keep many
large parallel systems busy!
However, the jury is still out as to whether this constitutes
\emph{good} use of those systems.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{intro/PPGrelation}}
\caption{Software Layers and Performance, Productivity, and Generality}
\label{fig:intro:Software Layers and Performance; Productivity; and Generality}
\end{figure}
The nirvana of parallel programming environments, one that offers
world-class performance, productivity, and generality, simply does
not yet exist.
Until such a nirvana appears, it will be necessary to make engineering
tradeoffs among performance, productivity, and generality.
One such tradeoff is depicted by the green ``iron triangle''\footnote{
Kudos to Michael Wong for coining ``iron triangle.''}
shown in
\cref{fig:intro:Software Layers and Performance; Productivity; and Generality},
which shows how productivity becomes increasingly important at the upper layers
of the system stack,
while performance and generality become increasingly important at the
lower layers of the system stack.
The huge development costs incurred at the lower layers
must be spread over equally huge numbers of users
(hence the importance of generality), and
performance lost in lower layers cannot easily be
recovered further up the stack.
In the upper layers of the stack, there might be very few users for a given
specific application, in which case productivity concerns are paramount.
This explains the tendency towards ``bloatware'' further up the stack:
Extra hardware is often cheaper than extra developers.
This book is intended for developers working near the bottom
of the stack, where performance and generality are of greatest concern.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/Generality}}
\caption{Tradeoff Between Productivity and Generality}
\label{fig:intro:Tradeoff Between Productivity and Generality}
\end{figure}
It is important to note that a tradeoff between productivity and
generality has existed for centuries in many fields.
For but one example, a nailgun is more productive than a hammer for
driving nails, but in contrast to the nailgun, a hammer can be used for
many things besides driving nails.
It should therefore be no surprise to see similar tradeoffs
appear in the field of parallel computing.
This tradeoff is shown schematically in
\cref{fig:intro:Tradeoff Between Productivity and Generality}.
Here, users~1, 2, 3, and~4 have specific jobs that they need the computer
to help them with.
The most productive possible language or environment for a given user is one
that simply does that user's job, without requiring any programming,
configuration, or other setup.
\QuickQuiz{
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
}\QuickQuizAnswer{
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who indulge in excessive generality will therefore fail to set
the productivity bar high enough to succeed near the top of the
software stack.
This fact of life even has its own acronym:
YAGNI, or ``You Ain't Gonna Need It.''
}\QuickQuizEnd
Unfortunately, a system that does the job required by user~1 is
unlikely to do user~2's job.
In other words, the most productive languages and environments are
domain-specific, and thus by definition lacking generality.
Another option is to tailor a given programming language or environment
to the hardware system (for example, low-level languages such as
assembly, C, C++, or Java) or to some abstraction (for example,
Haskell, Prolog, or Snobol), as is shown by the circular region near
the center of
\cref{fig:intro:Tradeoff Between Productivity and Generality}.
These languages can be considered to be general in the sense that they
are equally ill-suited to the jobs required by users~1, 2, 3, and~4.
In other words, their generality comes at the expense of
decreased productivity when compared to domain-specific languages
and environments.
Worse yet, a language that is tailored to a given abstraction
is likely to suffer from performance and scalability problems
unless and until it can be efficiently mapped to real hardware.
Is there no escape from the iron triangle's three conflicting goals of
performance, productivity, and generality?
It turns out that there often is an escape, for example,
using the alternatives to parallel programming discussed in the next section.
After all, parallel programming can be a great deal of fun, but
it is not always the best tool for the job.
\section{Alternatives to Parallel Programming}
\label{sec:intro:Alternatives to Parallel Programming}
%
\epigraph{Experiment is folly when experience shows the way.}
{Roger M. Babson}
In order to properly consider alternatives to parallel programming,
you must first decide on what exactly you expect the parallelism
to do for you.
As seen in \cref{sec:intro:Parallel Programming Goals},
the primary goals of parallel programming are performance, productivity,
and generality.
Because this book is intended for developers working on
performance-critical code near the bottom of the software stack,
the remainder of this section focuses primarily on performance improvement.
It is important to keep in mind that parallelism is but one way to
improve performance.
Other well-known approaches include the following, in roughly increasing
order of difficulty:
\begin{enumerate}
\item Run multiple instances of a sequential application.
\item Make the application use existing parallel software.
\item Optimize the serial application.
\end{enumerate}
These approaches are covered in the following sections.
\subsection{Multiple Instances of a Sequential Application}
\label{sec:intro:Multiple Instances of a Sequential Application}
Running multiple instances of a sequential application can allow you
to do parallel programming without actually doing parallel programming.
There are a large number of ways to approach this, depending on the
structure of the application.
If your program is analyzing a large number of different scenarios,
or is analyzing a large number of independent data sets, one easy
and effective approach is to create a single sequential program that
carries out a single analysis, then use any of a number of scripting
environments (for example the \co{bash} shell) to run a number of
instances of that sequential program in parallel.
In some cases, this approach can be easily extended to a cluster of
machines.
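
By way of illustration, the following minimal C sketch forks several
instances of a hypothetical sequential \co{analyze_one} program, one per
scenario, and waits for them all to complete; a few lines of \co{bash}
would accomplish the same thing:

\begin{VerbatimU}
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NINSTANCES 4

int main(void)
{
	char arg[32];
	int i;
	pid_t pid;

	for (i = 0; i < NINSTANCES; i++) {
		pid = fork();
		if (pid < 0) {		/* fork() failed, give up. */
			perror("fork");
			exit(EXIT_FAILURE);
		}
		if (pid == 0) {		/* Child: run one analysis. */
			snprintf(arg, sizeof(arg), "%d", i);
			execlp("./analyze_one", "analyze_one", arg,
			       (char *)NULL);
			perror("execlp");
			exit(EXIT_FAILURE);
		}
	}
	while (wait(NULL) > 0)		/* Parent: await all children. */
		continue;
	return 0;
}
\end{VerbatimU}
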
This approach may seem like cheating, and in fact some denigrate such
programs as ``\IX{embarrassingly parallel}''.
This approach does indeed have some potential disadvantages,
including increased memory consumption, waste of CPU cycles recomputing
common intermediate results, and increased copying of data.
However, it is often extremely productive, garnering substantial performance
gains with little or no added effort.
\subsection{Use Existing Parallel Software}
\label{sec:intro:Use Existing Parallel Software}
There is no longer any shortage of parallel software environments that
can present a single-threaded programming environment,
including relational
databases~\cite{Date82},
web-application servers, and map-reduce environments.
For example, a common design provides a separate process for each
user, each of which generates SQL from user queries.
This per-user SQL is run against a common relational database, which
automatically runs the users' queries concurrently.
The per-user programs are responsible only for the user interface,
with the relational database taking full responsibility for the
difficult issues surrounding parallelism and persistence.
In addition, there are a growing number of parallel library functions,
particularly for numeric computation.
Even better, some libraries take advantage of special\-/purpose
hardware such as vector units and general\-/purpose graphical processing
units (GPGPUs).
Taking this approach often sacrifices some performance, at least when
compared to carefully hand-coding a fully parallel application.
However, such sacrifice is often well repaid by a huge reduction in
development effort.
\QuickQuiz{
Wait a minute!
Doesn't this approach simply shift the development effort from
you to whoever wrote the existing parallel software you are using?
}\QuickQuizAnswer{
Exactly!
And that is the whole point of using existing software.
One team's work can be used by many other teams, resulting in a
large decrease in overall effort compared to all teams
needlessly reinventing the wheel.
}\QuickQuizEnd
\subsection{Performance Optimization}
\label{sec:intro:Performance Optimization}
Up through the early 2000s, CPU clock frequencies doubled every 18 months.
It was therefore usually more important to create new functionality than to
carefully optimize performance.
Now that \IXr{Moore's Law} is ``only'' increasing transistor density instead
of increasing both transistor density and per-transistor performance,
it might be a good time to rethink the importance of performance
optimization.
After all, new hardware generations no longer bring significant
single-threaded performance improvements.
Furthermore, many performance optimizations can also conserve energy.
From this viewpoint, parallel programming is but another performance
optimization, albeit one that is becoming much more attractive
as parallel systems become cheaper and more readily available.
However, it is wise to keep in mind that the speedup available from
parallelism is limited to roughly the number of CPUs
(but see \cref{sec:SMPdesign:Beyond Partitioning}
for an interesting exception).
In contrast, the speedup available from traditional single-threaded
software optimizations can be much larger.
For example, replacing a long linked list with a hash table
or a search tree can improve performance by many orders of magnitude.
This highly optimized single-threaded program might run much
faster than its unoptimized parallel counterpart, making parallelization
unnecessary.
Of course, a highly optimized parallel program would be even better,
aside from the added development effort required.
Furthermore, different programs might have different performance
bottlenecks.
For example, if your program spends most of its time
waiting on data from your disk drive,
using multiple CPUs will probably just increase the time wasted waiting
for the disks.
In fact, if the program was reading from a single large file laid out
sequentially on a rotating disk, parallelizing your program might
well make it a lot slower due to the added seek overhead.
You should instead optimize the data layout so that
the file can be smaller (thus faster to read), split the file into chunks
which can be accessed in parallel from different drives,
cache frequently accessed data in main memory,
or, if possible,
reduce the amount of data that must be read.
\QuickQuiz{
What other bottlenecks might prevent additional CPUs from
providing additional performance?
}\QuickQuizAnswer{
There are any number of potential bottlenecks:
\begin{enumerate}
\item Main memory.
If a single thread consumes all available
memory, additional threads will simply page themselves
silly.
\item Cache.
If a single thread's cache footprint completely
fills any shared CPU cache(s), then adding more threads
will simply thrash those affected caches, as will be
seen in \cref{chp:Data Structures}.
\item Memory bandwidth.
If a single thread consumes all available
memory bandwidth, additional threads will simply
result in additional queuing on the system interconnect.
\item I/O bandwidth.
If a single thread is I/O bound,
adding more threads will simply result in them all
waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems might have any number of additional
bottlenecks.
The fact is that every resource which is shared between
multiple CPUs or threads is a potential bottleneck.
}\QuickQuizEnd
Parallelism can be a powerful optimization technique, but
it is not the only such technique, nor is it appropriate for all
situations.
Of course, the easier it is to parallelize your program, the
more attractive parallelization becomes as an optimization.
Parallelization has a reputation of being quite difficult,
which leads to the question ``exactly what makes parallel
programming so difficult?''
\section{What Makes Parallel Programming Hard?}
\label{sec:intro:What Makes Parallel Programming Hard?}
%
\epigraph{Real difficulties can be overcome; it is only the imaginary
ones that are unconquerable.}{Theodore N.~Vail}
\OriginallyPublished{sec:intro:What Makes Parallel Programming Hard?}{What Makes Parallel Programming Hard?}{a Portland State University Technical Report}{PaulEMcKenney2009ProgrammingHard}
It is important to note that the difficulty of parallel programming
is as much a human-factors issue as it is a set of technical properties of the
parallel programming problem.
We do need human beings to be able to tell parallel
systems what to do, otherwise known as programming.
But parallel programming involves two-way communication, with
a program's performance and scalability being the communication from
the machine to the human.
In short, the human writes a program telling the computer what to do,
and the computer critiques this program via the resulting performance and
scalability.
Therefore, appeals to abstractions or to mathematical analyses will
often be of severely limited utility.
In the Industrial Revolution, the interface between human and machine
was evaluated by human-factor studies, then called time-and-motion
studies.
Although there have been a few human-factor studies examining parallel
programming~\cite{RyanEccles2005HPCSNovice,RyanEccles2006HPCSNoviceNeeds,
LorinHochstein2005SC,DuaneSzafron1994PEMPDS}, these studies have
been extremely narrowly focused, and hence unable to demonstrate any
general results.
Furthermore, given that the normal range of programmer productivity
spans more than an order of magnitude, it is unrealistic to expect
an affordable study to be capable of detecting (say) a 10\pct\ difference
in productivity.
Although the multiple-order-of-magnitude differences that such studies
\emph{can} reliably detect are extremely valuable, the most impressive
improvements tend to be based on a long series of 10\pct\ improvements.
We must therefore take a different approach.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/FourTaskCategories}}
\caption{Categories of Tasks Required of Parallel Programmers}
\label{fig:intro:Categories of Tasks Required of Parallel Programmers}
\end{figure}
One such approach is to carefully consider the tasks that parallel
programmers must undertake that are not required of sequential programmers.
We can then evaluate how well a given programming language or environment
assists the developer with these tasks.
These tasks fall into the four categories shown in
\cref{fig:intro:Categories of Tasks Required of Parallel Programmers},
each of which is covered in the following sections.
\subsection{Work Partitioning}
\label{sec:intro:Work Partitioning}
Work partitioning is absolutely required for parallel execution:
If there is but one ``glob'' of work, then it can be executed by at
most one CPU at a time, which is by definition sequential execution.
However, partitioning the work requires great care.
For example, uneven partitioning can result in sequential execution
once the small partitions have completed~\cite{GeneAmdahl1967AmdahlsLaw}.
In less extreme cases, load balancing can be used to fully utilize
available hardware and restore performance and scalability.
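
To make this concrete, the following minimal POSIX-threads sketch
(compiled with \co{-lpthread}) statically partitions a summation over an
array into equal chunks, one per thread; the array size, contents, and
thread count are arbitrary placeholders:

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static long a[N];
static long partial[NTHREADS];	/* Pad in real code to avoid false sharing. */

static void *sum_range(void *arg)
{
	long me = (long)arg;
	long lo = me * (N / NTHREADS);
	long hi = (me == NTHREADS - 1) ? N : lo + N / NTHREADS;
	long i;

	for (i = lo; i < hi; i++)	/* Sum this thread's partition. */
		partial[me] += a[i];
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i, sum = 0;

	for (i = 0; i < N; i++)
		a[i] = 1;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, sum_range, (void *)i);
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		sum += partial[i];	/* Combine the per-thread results. */
	}
	printf("sum = %ld\n", sum);
	return 0;
}
\end{VerbatimU}

An uneven split, for example, one that assigned most of the array to a
single thread, would leave the remaining threads idle, illustrating the
load-balancing concern called out above.
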
Although partitioning can greatly improve performance and scalability,
it can also increase complexity.
For example, partitioning can complicate handling of global
errors and events:
A parallel program may need to carry out non-trivial synchronization
in order to safely process such global events.
More generally, each partition requires some sort of communication:
After all, if
a given thread did not communicate at all, it would have no effect and
would thus not need to be executed.
However, because communication incurs overhead, careless partitioning choices
can result in severe performance degradation.
Furthermore, the number of concurrent threads must often be controlled,
as each such thread occupies common resources, for example,
space in CPU caches.
If too many threads are permitted to execute concurrently, the
CPU caches will overflow, resulting in a high cache-miss rate, which in
turn degrades performance.
Conversely, large numbers of threads are often required to
overlap computation and I/O so as to fully utilize I/O devices.
\QuickQuiz{
Other than CPU cache capacity, what might require limiting the
number of concurrent threads?
}\QuickQuizAnswer{
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item Main memory.
Each thread consumes some memory
(for its stack if nothing else), so that excessive
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures.
\item I/O bandwidth.
If each thread initiates a given
amount of mass-storage I/O or networking traffic,
excessive numbers of threads can result in excessive
I/O queuing delays, again degrading performance.
Some networking protocols may be subject to timeouts
or other failures if there are so many threads that
networking events cannot be responded to in a timely
fashion.
\item Synchronization overhead.
For many synchronization protocols, excessive numbers
of threads can result in excessive spinning, blocking,
or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
}\QuickQuizEnd
Finally, permitting threads to execute concurrently greatly increases
the program's state space, which can make the program difficult to
understand and debug, degrading productivity.
All else being equal, smaller state spaces having more regular structure
are more easily understood, but this is a human-factors statement as much
as it is a technical or mathematical statement.
Good parallel designs might have extremely large state spaces, but
nevertheless be easy to understand due to their regular structure,
while poor designs can be impenetrable despite having a comparatively
small state space.
The best designs exploit embarrassing parallelism, or transform the
problem to one having an embarrassingly parallel solution.
In either case, ``embarrassingly parallel'' is in fact
an embarrassment of riches.
The current state of the art enumerates good designs; more work is
required to make more general judgments on
state-space size and structure.
\subsection{Parallel Access Control}
\label{sec:Parallel Access Control}
Given a single-threaded sequential program, that single
thread has full access to all of the program's resources.
These resources are most often in-memory data structures, but can be CPUs,
memory (including caches), I/O devices, computational accelerators, files,
and much else besides.
The first parallel-access-control issue is whether the form of access to
a given resource depends on that resource's location.
For example, in many message-passing environments, local-variable
access is via expressions and assignments,
while remote-variable access uses an entirely different
syntax, usually involving messaging.
The POSIX Threads environment~\cite{OpenGroup1997pthreads},
Structured Query Language (SQL)~\cite{DIS9075SQL92}, and
partitioned global address-space (PGAS) environments
such as Universal Parallel C (UPC)~\cite{ElGhazawi2003UPC,UPCConsortium2013}
offer implicit access,
while Message Passing Interface (MPI)~\cite{MPIForum2008} offers
explicit access because access to remote data requires explicit
messaging.
The other parallel-access-control issue is how threads coordinate
access to the resources.
This coordination is carried out by
the very large number of synchronization mechanisms
provided by various parallel languages and environments,
including message passing, locking, transactions,
reference counting, explicit timing, shared atomic variables, and data
ownership.
Many traditional parallel-programming concerns such as \IX{deadlock},
\IX{livelock}, and transaction rollback stem from this coordination.
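
For example, the following minimal sketch uses a POSIX mutex, one of the
simplest such mechanisms, to coordinate concurrent increments of a shared
counter; the thread and iteration counts are arbitrary:

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NUPDATES 1000000

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *incrementer(void *arg)
{
	long i;

	for (i = 0; i < NUPDATES; i++) {
		pthread_mutex_lock(&counter_lock);	/* Coordinate access. */
		counter++;
		pthread_mutex_unlock(&counter_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, incrementer, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("counter = %ld\n", counter);	/* NTHREADS * NUPDATES. */
	return 0;
}
\end{VerbatimU}

Omitting the lock would transform these increments into the classic
lost-update race condition.
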
This framework can be elaborated to include comparisons
of these synchronization mechanisms, for example locking vs.\@ transactional
memory~\cite{McKenney2007PLOSTM}, but such elaboration is beyond the
scope of this section.
(See
\cref{sec:future:Transactional Memory,%
sec:future:Hardware Transactional Memory}
for more information on transactional memory.)
\QuickQuiz{
Just what is ``explicit timing''???
}\QuickQuizAnswer{
Where each thread is given access to some set of resources during
an agreed-to slot of time.
For example, a parallel program with eight threads might be
organized into eight-millisecond time intervals, so that the
first thread is given access during the first millisecond of
each interval, the second thread during the second millisecond,
and so on.
This approach clearly requires carefully synchronized clocks
and careful control of execution times, and therefore should
be used with considerable caution.
In fact, outside of hard realtime environments, you almost
certainly want to use something else instead.
Explicit timing is nevertheless worth a mention, as it is
always there when you need it.
}\QuickQuizEnd
\subsection{Resource Partitioning and Replication}
\label{sec:Resource Partitioning and Replication}
The most effective parallel algorithms and systems exploit resource
parallelism, so much so that it is
usually wise to begin parallelization by partitioning your write-intensive
resources and replicating frequently accessed read-mostly resources.
The resource in question is most frequently data, which might be
partitioned over computer systems, mass-storage devices, \IXplr{NUMA node},
CPU cores (or dies or hardware threads), pages, cache lines, instances
of synchronization primitives, or critical sections of code.
For example, partitioning over locking primitives is termed
``\IXh{data}{locking}''~\cite{Beck85}.
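
As a minimal illustration of data locking, the following sketch gives
each hash-table bucket its own lock, so that operations on different
buckets proceed in parallel; the names are illustrative, and node
allocation, insertion, and lifetime issues are ignored:

\begin{VerbatimU}
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 1024

struct node {
	struct node *next;
	unsigned long key;
};

struct bucket {
	pthread_mutex_t lock;	/* One lock per bucket: data locking. */
	struct node *head;
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	int i;

	for (i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

static int table_contains(unsigned long key)
{
	struct bucket *b = &table[key % NBUCKETS];
	struct node *p;
	int found = 0;

	pthread_mutex_lock(&b->lock);	/* Serialize only this bucket. */
	for (p = b->head; p != NULL; p = p->next)
		if (p->key == key) {
			found = 1;
			break;
		}
	pthread_mutex_unlock(&b->lock);
	return found;
}
\end{VerbatimU}

Threads accessing keys that hash to different buckets never contend for
the same lock, which is precisely the sort of partitioning of
write-intensive resources advocated above.
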
Resource partitioning is frequently application dependent.
For example, numerical applications frequently partition matrices
by row, column, or sub-matrix, while commercial applications frequently
partition write-intensive data structures and replicate
read-mostly data structures.
Thus, a commercial application might assign the data for a
given customer to a given few computers out of a large cluster.
An application might statically partition data, or dynamically
change the partitioning over time.
Resource partitioning is extremely effective, but
it can be quite challenging for complex multilinked data
structures.
\subsection{Interacting With Hardware}
\label{sec:Interacting With Hardware}
Hardware interaction is normally the domain of the operating system,
the compiler, libraries, or other software-environment infrastructure.
However, developers working with novel hardware features and components
will often need to work directly with such hardware.
In addition, direct access to the hardware can be required when squeezing
the last drop of performance out of a given system.
In this case, the developer may need to tailor or configure the application
to the cache geometry, system topology, or interconnect protocol of the
target hardware.
In some cases, hardware may be considered to be a resource which
is subject to partitioning or access control, as described in
the previous sections.
\subsection{Composite Capabilities}
\label{sec:Composite Capabilities}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{intro/FourTaskOrder}}
\caption{Ordering of Parallel-Programming Tasks}
\label{fig:intro:Ordering of Parallel-Programming Tasks}
\end{figure}
Although these four capabilities are fundamental,
good engineering practice uses composites of
these capabilities.
For example, the data-parallel approach first
partitions the data so as to minimize the need for
inter-partition communication, partitions the code accordingly,
and finally maps data partitions and threads so as to maximize
throughput while minimizing inter-thread communication,
as shown in
\cref{fig:intro:Ordering of Parallel-Programming Tasks}.
The developer can then
consider each partition separately, greatly reducing the size
of the relevant state space, in turn increasing productivity.
Even though some problems are non-partitionable,
clever transformations into forms permitting partitioning can
sometimes greatly enhance
both performance and scalability~\cite{PanagiotisMetaxas1999PDCS}.
\subsection{How Do Languages and Environments Assist With These Tasks?}
\label{sec:intro:How Do Languages and Environments Assist With These Tasks?}
Although many environments require the developer to deal manually
with these tasks, there are long-standing environments that bring
significant automation to bear.
The poster child for these environments is SQL, many implementations
of which automatically parallelize single large queries and also
automate concurrent execution of independent queries and updates.
These four categories of tasks must be carried out in all parallel
programs, but that of course does not necessarily mean that the developer
must manually carry out these tasks.
We can expect to see ever-increasing automation of these four tasks
as parallel systems continue to become cheaper and more readily available.
\QuickQuiz{
Are there any other obstacles to parallel programming?
}\QuickQuizAnswer{
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item The only known algorithms for a given project might
be inherently sequential in nature.
In this case, either avoid parallel programming
(there being no law saying that your project \emph{has}
to run in parallel) or invent a new parallel algorithm.
\item The project allows binary-only plugins that share the same
address space, such that no one developer has access to
all of the source code for the project.
Because many parallel bugs, including deadlocks, are
global in nature, such binary-only plugins pose a severe
challenge to current software development methodologies.
This might well change, but for the time being, all
developers of parallel code sharing a given address space
need to be able to see \emph{all} of the code running in
that address space.
\item The project contains heavily used APIs that were designed
without regard to
parallelism~\cite{HagitAttiya2011LawsOfOrder,Clements:2013:SCR:2517349.2522712}.
Some of the more ornate features of the System V
message-queue API form a case in point.
Of course, if your project has been around for a few
decades, and its developers did not have access to
parallel hardware, it undoubtedly has at least
its share of such APIs.
\item The project was implemented without regard to parallelism.
Given that there are a great many techniques that work
extremely well in a sequential environment, but that
fail miserably in parallel environments, if your project
ran only on sequential hardware for most of its lifetime,
then your project undoubtedly has at least its share of
parallel-unfriendly code.
\item The project was implemented without regard to good
software-development practice.
The cruel truth is that shared-memory parallel
environments are often much less forgiving of sloppy
development practices than are sequential environments.
You may be well-served to clean up the existing design
and code prior to attempting parallelization.
\item The people who originally did the development on your
project have since moved on, and the people remaining,
while well able to maintain it or add small features,
are unable to make ``big animal'' changes.
In this case, unless you can work out a very simple
way to parallelize your project, you will probably
be best off leaving it sequential.
That said, there are a number of simple approaches that
you might use
to parallelize your project, including running multiple
instances of it, using a parallel implementation of
some heavily used library function, or making use of
some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization of a large body of code
can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
}\QuickQuizEnd
\section{Discussion}
\label{sec:intro:Discussion}
%
\epigraph{Until you try, you don't know what you can't do.}
{Henry James}
This chapter has given an overview of the difficulties with, goals of,
and alternatives to parallel programming.
This overview was followed by a discussion of
what can make parallel programming hard, along with a high-level
approach for dealing with parallel programming's difficulties.
Those who still insist that parallel programming is impossibly difficult
should review some of the older guides to parallel
programming~\cite{SQNTParallel,AndrewDBirrell1989Threads,Beck85,Inman85}.
The following quote from Andrew Birrell's
monograph~\cite{AndrewDBirrell1989Threads} is especially telling:
\begin{quote}
Writing concurrent programs has a reputation for being exotic
and difficult.
I~believe it is neither.
You need a system that provides you with good primitives
and suitable libraries,
you need a basic caution and carefulness, you need an armory of
useful techniques, and you need to know of the common pitfalls.
I~hope that this paper has helped you towards sharing my belief.
\end{quote}
The authors of these older guides were well up to the parallel programming
challenge back in the 1980s.
As such, there are simply no excuses for refusing to step up to the
parallel-programming challenge here in the 21\textsuperscript{st} century!
We are now ready to proceed to the next chapter, which dives into the
relevant properties of the parallel hardware underlying our parallel
software.
\QuickQuizAnswersChp{qqzintro}