% SMPdesign/criteria.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Design Criteria}
\label{sec:SMPdesign:Design Criteria}
One way to obtain the best performance and scalability is to simply
hack away until you converge on the best possible parallel program.
Unfortunately, if your program is other than microscopically tiny,
the space of possible parallel programs is so huge
that convergence is not guaranteed in the lifetime of the universe.
Besides, what exactly is the ``best possible parallel program''?
After all, Section~\ref{sec:intro:Parallel Programming Goals}
called out no fewer than three parallel-programming goals of
performance, productivity, and generality,
and the best possible performance will likely come at a cost in
terms of productivity and generality.
We clearly need to be able to make higher-level choices at design
time in order to arrive at an acceptably good parallel program
before that program becomes obsolete.
However, more detailed design criteria are required to
actually produce a real-world design, a task taken up in this section.
This being the real world, these criteria often conflict to a
greater or lesser degree, requiring that the designer carefully
balance the resulting tradeoffs.
As such, these criteria may be thought of as the ``forces''
acting on the design, with particularly good tradeoffs between
these forces being called ``design patterns''~\cite{Alexander79,GOF95}.
The design criteria for attaining the three parallel-programming goals
are speedup, contention, work-to-synchronization ratio,
read-to-write ratio, and complexity:
\begin{description}
\item[Speedup:] As noted in
Section~\ref{sec:intro:Parallel Programming Goals},
increased performance is the major reason
to go to all of the time and trouble
required to parallelize a program.
Speedup is defined to be the ratio of the time required
to run a sequential version of the program to the time
required to run a parallel version, as written out in the
equations following this list.
\item[Contention:] If more CPUs are applied to a parallel
program than can be kept busy by that program,
the excess CPUs are prevented from doing
useful work by contention.
This may be lock contention, memory contention, or a host
of other performance killers.
\item[Work-to-Synchronization Ratio:] A uniprocessor,
single-threaded, non-preemptible, and non-interruptible\footnote{
Either by masking interrupts or by being oblivious to them.}
version of a given parallel
program would not need any synchronization primitives.
Therefore, any time consumed by these primitives
(including communication cache misses as well as
message latency, locking primitives, atomic instructions,
and memory barriers)
is overhead that does not contribute directly to the useful
work that the program is intended to accomplish.
Note that the important measure is the
relationship between the synchronization overhead
and the overhead of the code in the critical section, with larger
critical sections able to tolerate greater synchronization overhead.
The work-to-synchronization ratio is related to
the notion of synchronization efficiency
(a small illustration follows this list). % @@@ reference.
\item[Read-to-Write Ratio:] A data structure that is
rarely updated may often be replicated rather than partitioned,
and furthermore may be protected with asymmetric
synchronization primitives that reduce readers' synchronization
overhead at the expense of that of writers, thereby
reducing overall synchronization overhead
(a simple sketch appears after this list).
Corresponding optimizations are possible for frequently
updated data structures, as discussed in
Chapter~\ref{chp:Counting}.
\item[Complexity:] A parallel program is more complex than
an equivalent sequential program because the parallel
program has a much larger state space than does the
sequential program, although these larger state spaces
can in some cases be easily understood given sufficient
regularity and structure.
A parallel programmer must
consider synchronization primitives, messaging, locking design,
critical-section identification,
and deadlock in the context of this larger state space.
This greater complexity often translates
to higher development and maintenance costs.
Therefore, budgetary constraints can
limit the number and types of modifications made to
an existing program, since a given degree of speedup is
worth only so much time and trouble.
Worse yet, added complexity can actually \emph{reduce}
performance and scalability.
Therefore, beyond a certain point,
there may be potential sequential optimizations
that are cheaper and more effective than parallelization.
As noted in
Section~\ref{sec:intro:Performance},
parallelization is but one performance optimization of
many, and is furthermore an optimization that applies
most readily to CPU-based bottlenecks.
\end{description}
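To make the speedup and efficiency criteria concrete, they can be
written out as follows, where $T_1$ is the execution time of the
sequential version of the program, $T_n$ is the execution time of the
parallel version on $n$ CPUs, $S_n$ is the speedup, and $E_n$ is the
efficiency with which those $n$ CPUs are being used
(this notation is introduced here purely for illustration):
\begin{equation}
S_n = \frac{T_1}{T_n} \qquad\mbox{and}\qquad E_n = \frac{S_n}{n}
\end{equation}
An efficiency much less than 1.0 indicates that contention,
synchronization overhead, or both are consuming CPU time that does
not contribute to the program's useful work.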
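The work-to-synchronization ratio can be illustrated with a minimal
pthreads sketch (the lock, counter, and function names below are
invented for this example, not taken from any real code base).
In the first function, the lock and unlock operations, with their
atomic instructions, memory barriers, and possible communication
cache misses, cost far more than the single increment they protect;
the second function improves the ratio by batching many updates
under a single lock acquisition:
\begin{verbatim}
#include <pthread.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long count;

/* Poor ratio: heavyweight synchronization around trivial work. */
void count_inc(void)
{
        pthread_mutex_lock(&count_lock);
        count++;
        pthread_mutex_unlock(&count_lock);
}

/* Better ratio: amortize one lock acquisition over many updates. */
void count_add_batch(const unsigned long *deltas, int n)
{
        int i;

        pthread_mutex_lock(&count_lock);
        for (i = 0; i < n; i++)
                count += deltas[i];
        pthread_mutex_unlock(&count_lock);
}
\end{verbatim}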
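Similarly, for a read-mostly data structure, reader-writer locking is
the simplest way to exploit a high read-to-write ratio, because
readers need not exclude one another; the asymmetric primitives
discussed in Chapter~\ref{chp:Deferred Processing} reduce the
readers' overhead further still.
The following sketch, again with invented names, protects a rarely
updated configuration value:
\begin{verbatim}
#include <pthread.h>

static pthread_rwlock_t cfg_lock = PTHREAD_RWLOCK_INITIALIZER;
static int cfg_value; /* read often, updated rarely */

/* Readers may run concurrently with one another. */
int cfg_read(void)
{
        int v;

        pthread_rwlock_rdlock(&cfg_lock);
        v = cfg_value;
        pthread_rwlock_unlock(&cfg_lock);
        return v;
}

/* Writers exclude both readers and other writers. */
void cfg_write(int v)
{
        pthread_rwlock_wrlock(&cfg_lock);
        cfg_value = v;
        pthread_rwlock_unlock(&cfg_lock);
}
\end{verbatim}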
These criteria will act together to enforce a maximum speedup.
The first three criteria are deeply interrelated, so
the remainder of this section analyzes these
interrelationships.\footnote{
A real-world parallel system will be subject to many additional
design criteria, such as data-structure layout,
memory size, memory-hierarchy latencies, bandwidth limitations,
and I/O issues.}
Note that these criteria may also appear as part of the requirements
specification.
For example, speedup may act as a relative desideratum
(``the faster, the better'')
or as an absolute requirement of the workload (``the system
must support at least 1,000,000 web hits per second'').
Classic design pattern languages describe relative desiderata as forces
and absolute requirements as context.
An understanding of the relationships between these design criteria can
be very helpful when identifying appropriate design tradeoffs for a
parallel program.
\begin{enumerate}
\item The less time a program spends in critical sections,
the greater the potential speedup.
This is a consequence of Amdahl's Law~\cite{GeneAmdahl1967AmdahlsLaw}
and of the fact that only one CPU may execute within a given
critical section at a given time.
More specifically, the fraction of time that the program spends in
a given exclusive critical section must be much less than
the reciprocal of the number of CPUs for the
actual speedup to approach the number of CPUs.
For example, a program running on 10 CPUs must spend
much less than one tenth of its time in the most-restrictive
critical section if it is to scale at all well
(a worked example follows this list).
\item Contention effects will consume the excess CPU and/or
wallclock time should the actual speedup be less than
the number of available CPUs. The
larger the gap between the number of CPUs
and the actual speedup, the less efficiently the
CPUs will be used.
Similarly, the greater the desired efficiency, the smaller
the achievable speedup.
\item If the available synchronization primitives have
high overhead compared to the critical sections
that they guard, the best way to improve speedup
is to reduce the number of times that the primitives
are invoked (perhaps by batching critical sections,
using data ownership, using asymmetric primitives
(see Chapter~\ref{chp:Deferred Processing}),
or by moving toward a more coarse-grained design
such as code locking).
\item If the critical sections have high overhead compared
to the primitives guarding them, the best way
to improve speedup is to increase parallelism
by moving to reader/writer locking, data locking,
asymmetric primitives, or data ownership.
\item If the critical sections have high overhead compared
to the primitives guarding them and the data structure
being guarded is read much more often than modified,
the best way to increase parallelism is to move
to reader/writer locking or asymmetric primitives.
\item Many changes that improve SMP performance, for example,
reducing lock contention, also improve real-time
latencies~\cite{PaulMcKenney2005h}.
\end{enumerate}
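To put the first of these relationships in numerical terms, suppose
that a fraction $f$ of the program's execution time is serialized by
its most-restrictive exclusive critical section, and that the
remainder parallelizes perfectly across $n$ CPUs (the symbols $f$ and
$n$ are introduced here only for this example).
Amdahl's Law then bounds the speedup as follows:
\begin{equation}
S_n = \frac{1}{f + \frac{1 - f}{n}} \le \frac{1}{f}
\end{equation}
With $n = 10$ and $f = 0.1$, this gives
$S_{10} = 1 / (0.1 + 0.09) \approx 5.3$, only about half of the
ideal tenfold speedup, which is why the fraction of time spent in the
critical section must be much smaller than $1/n$ if the speedup is to
approach $n$.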
\QuickQuiz{}
Don't all these problems with critical sections mean that
we should just always use
non-blocking synchronization~\cite{MauriceHerlihy90a},
which doesn't have critical sections?
\QuickQuizAnswer{
Although non-blocking synchronization can be very useful
in some situations, it is no panacea.
Also, non-blocking synchronization really does have
critical sections, as noted by Josh Triplett.
For example, in a non-blocking algorithm based on
compare-and-swap operations, the code starting at the
initial load and continuing to the compare-and-swap
is in many ways analogous to a lock-based critical section.
} \QuickQuizEnd
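As a rough illustration of why non-blocking algorithms nevertheless
contain something quite like critical sections, consider the
following minimal C11-atomics sketch (the variable and function names
are invented for this example).
The window from the initial load to the successful compare-and-swap
plays the role of a critical section: if another thread updates the
variable within that window, this thread must retry:
\begin{verbatim}
#include <stdatomic.h>

static atomic_long counter;

/* Atomically add delta to counter using a CAS retry loop. */
static void nb_add(long delta)
{
        long old, new;

        do {
                /* The span from this load to the successful CAS
                 * below is the analog of a lock-based critical
                 * section. */
                old = atomic_load(&counter);
                new = old + delta;
        } while (!atomic_compare_exchange_weak(&counter, &old, new));
}
\end{verbatim}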