% SMPdesign/criteria.tex
% SPDX-License-Identifier: CC-BY-SA-3.0

\section{Design Criteria}
\label{sec:SMPdesign:Design Criteria}

One way to obtain the best performance and scalability is to simply
hack away until you converge on the best possible parallel program.
Unfortunately, if your program is other than microscopically tiny,
the space of possible parallel programs is so huge
that convergence is not guaranteed in the lifetime of the universe.
Besides, what exactly is the ``best possible parallel program''?
After all, Section~\ref{sec:intro:Parallel Programming Goals}
called out no fewer than three parallel-programming goals of
performance, productivity, and generality,
and the best possible performance will likely come at a cost in
terms of productivity and generality.
We clearly need to be able to make higher-level choices at design
time in order to arrive at an acceptably good parallel program
before that program becomes obsolete.

However, more detailed design criteria are required to
actually produce a real-world design, a task taken up in this section.
This being the real world, these criteria often conflict to a
greater or lesser degree, requiring that the designer carefully
balance the resulting tradeoffs.

As such, these criteria may be thought of as the ``forces''
acting on the design, with particularly good tradeoffs between
these forces being called ``design patterns''~\cite{Alexander79,GOF95}.

The design criteria for attaining the three parallel-programming goals
are speedup, contention, work-to-synchronization ratio,
read-to-write ratio, and complexity:
\begin{description}
\item[Speedup:] As noted in
Section~\ref{sec:intro:Parallel Programming Goals},
increased performance is the major reason
to go to all of the time and trouble
required to parallelize a program.
Speedup is defined to be the ratio of the time required
to run a sequential version of the program to the time
required to run a parallel version; a minimal formalization
appears just after this list.
\item[Contention:] If more CPUs are applied to a parallel
program than can be kept busy by that program,
the excess CPUs are prevented from doing
useful work by contention.
This may be lock contention, memory contention, or a host
of other performance killers.
\item[Work-to-Synchronization Ratio:] A uniprocessor,
single\-/threaded, non-preemptible, and non\-/interruptible\footnote{
Either by masking interrupts or by being oblivious to them.}
version of a given parallel
program would not need any synchronization primitives.
Therefore, any time consumed by these primitives
(including communication cache misses as well as
message latency, locking primitives, atomic instructions,
and memory barriers)
is overhead that does not contribute directly to the useful
work that the program is intended to accomplish.
Note that the important measure is the
relationship between the synchronization overhead
and the overhead of the code in the critical section, with larger
critical sections able to tolerate greater synchronization overhead.
The work-to-synchronization ratio is related to
the notion of synchronization efficiency; % @@@ reference.
one simple formalization appears just after this list.
\item[Read-to-Write Ratio:] A data structure that is
rarely updated may often be replicated rather than partitioned,
and furthermore may be protected with asymmetric
synchronization primitives that reduce readers' synchronization
overhead at the expense of that of writers, thereby
reducing overall synchronization overhead.
Corresponding optimizations are possible for frequently
updated data structures, as discussed in
Chapter~\ref{chp:Counting}.
A brief reader-writer-lock sketch appears just after this list.
\item[Complexity:] A parallel program is more complex than
an equivalent sequential program because the parallel
program has a much larger state space than does the
sequential program, although these larger state spaces
can in some cases be easily understood given sufficient
regularity and structure.
A parallel programmer must
consider synchronization primitives, messaging, locking design,
critical-section identification,
and deadlock in the context of this larger state space.

This greater complexity often translates
to higher development and maintenance costs.
Therefore, budgetary constraints can
limit the number and types of modifications made to
an existing program, since a given degree of speedup is
worth only so much time and trouble.
Worse yet, added complexity can actually \emph{reduce}
performance and scalability.

Therefore, beyond a certain point,
there may be potential sequential optimizations
that are cheaper and more effective than parallelization.
As noted in
Section~\ref{sec:intro:Performance},
parallelization is but one performance optimization of
many, and is furthermore an optimization that applies
most readily to CPU-based bottlenecks.
\end{description}
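To make the speedup and work-to-synchronization-ratio criteria
concrete, here is one minimal formalization.
The symbols $T_1$, $T_n$, $T_w$, and $T_s$ are introduced purely
for illustration:
\begin{equation}
S(n) = \frac{T_1}{T_n}
\qquad
e = \frac{T_w}{T_w + T_s}
\end{equation}
Here $T_1$ is the execution time of the sequential version of the
program, $T_n$ is the execution time of the parallel version on
$n$ CPUs, $T_w$ is the useful work done per synchronization
operation, and $T_s$ is the overhead of that synchronization
operation.
The efficiency $e$ approaches unity as the work-to-synchronization
ratio $T_w/T_s$ grows, and falls toward zero as synchronization
overhead comes to dominate.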
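As a concrete, though deliberately simplified, illustration of the
read-to-write-ratio criterion, the following sketch protects a
rarely updated, hypothetical configuration structure with a POSIX
reader-writer lock, so that the common-case readers need not
exclude one another.
All names are illustrative only:

\begin{verbatim}
#include <pthread.h>

/* Hypothetical read-mostly configuration data. */
struct config {
	int max_connections;
	int timeout_ms;
};

static struct config cfg = { 100, 5000 };
static pthread_rwlock_t cfg_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Readers run concurrently with one another. */
int get_timeout_ms(void)
{
	int t;

	pthread_rwlock_rdlock(&cfg_lock);
	t = cfg.timeout_ms;
	pthread_rwlock_unlock(&cfg_lock);
	return t;
}

/* Rare updates exclude readers and other writers alike. */
void set_timeout_ms(int t)
{
	pthread_rwlock_wrlock(&cfg_lock);
	cfg.timeout_ms = t;
	pthread_rwlock_unlock(&cfg_lock);
}
\end{verbatim}

Asymmetric mechanisms such as those described in
Chapter~\ref{chp:Deferred Processing} can reduce the readers'
overhead still further, but even this simple reader-writer scheme
shows how a high read-to-write ratio can be traded for cheaper
read-side synchronization.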
These criteria will act together to enforce a maximum speedup.
The first three criteria are deeply interrelated, so
the remainder of this section analyzes these
interrelationships.\footnote{
A real-world parallel system will be subject to many additional
design criteria, such as data-structure layout,
memory size, memory-hierarchy latencies, bandwidth limitations,
and I/O issues.}

Note that these criteria may also appear as part of the requirements
specification.
For example, speedup may act as a relative desideratum
(``the faster, the better'')
or as an absolute requirement of the workload (``the system
must support at least 1,000,000 web hits per second'').
Classic design pattern languages describe relative desiderata as forces
and absolute requirements as context.

An understanding of the relationships between these design criteria can
be very helpful when identifying appropriate design tradeoffs for a
parallel program.
\begin{enumerate}
\item The less time a program spends in critical sections,
the greater the potential speedup.
This is a consequence of Amdahl's Law~\cite{GeneAmdahl1967AmdahlsLaw}
and of the fact that only one CPU may execute within a given
critical section at a given time.

More specifically, the fraction of time that the program spends in
a given exclusive critical section must be much less than
the reciprocal of the number of CPUs for the
actual speedup to approach the number of CPUs.
For example, a program running on 10 CPUs must spend
much less than one tenth of its time in the most-restrictive
critical section if it is to scale at all well.
This bound is worked through in the sketch following this list.
\item Contention effects will consume the excess CPU and/or
wallclock time should the actual speedup be less than
the number of available CPUs. The
larger the gap between the number of CPUs
and the actual speedup, the less efficiently the
CPUs will be used.
Similarly, the greater the desired efficiency, the smaller
the achievable speedup.
\item If the available synchronization primitives have
high overhead compared to the critical sections
that they guard, the best way to improve speedup
is to reduce the number of times that the primitives
are invoked.
This may be done by batching critical sections
(a sketch of which follows this list),
using data ownership, using asymmetric primitives
(see Chapter~\ref{chp:Deferred Processing}),
or by moving toward a more coarse-grained design
such as code locking.
\item If the critical sections have high overhead compared
to the primitives guarding them, the best way
to improve speedup is to increase parallelism
by moving to reader/writer locking, data locking,
asymmetric primitives, or data ownership.
\item If the critical sections have high overhead compared
to the primitives guarding them and the data structure
being guarded is read much more often than modified,
the best way to increase parallelism is to move
to reader/writer locking or asymmetric primitives.
\item Many changes that improve SMP performance, for example,
reducing lock contention, also improve real-time
latencies~\cite{PaulMcKenney2005h}.
\end{enumerate}
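The bound mentioned in the first item above can be made explicit
with a back-of-the-envelope application of Amdahl's Law.
If a fraction $f$ of the program's execution must be serialized in
an exclusive critical section, then the speedup on $n$ CPUs is
limited as follows (the symbols $f$, $n$, and $S$ are again used
only for this illustration):
\begin{equation}
S(n) \le \frac{1}{f + (1 - f)/n}
\end{equation}
For the 10-CPU example, $f = 0.1$ gives
$S(10) \le 1/(0.1 + 0.9/10) \approx 5.3$,
barely more than half of the ideal speedup of 10,
which is why the serialized fraction must be much smaller than
one tenth for the program to scale well on 10 CPUs.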
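The third item's batching strategy is illustrated by the following
sketch, which amortizes a single lock acquisition over several
updates instead of acquiring the lock once per update.
The names and the batch size are hypothetical, and the approach is
similar in spirit to the counting algorithms in
Chapter~\ref{chp:Counting}:

\begin{verbatim}
#include <pthread.h>

#define BATCH_SIZE 16

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long global_counter;

/* Unbatched: one lock acquisition per update. */
void count_one(void)
{
	pthread_mutex_lock(&counter_lock);
	global_counter++;
	pthread_mutex_unlock(&counter_lock);
}

/* Batched: accumulate per-thread counts, then apply
 * BATCH_SIZE of them under a single lock acquisition,
 * improving the work-to-synchronization ratio. */
static __thread long local_count;

void count_batched(void)
{
	if (++local_count >= BATCH_SIZE) {
		pthread_mutex_lock(&counter_lock);
		global_counter += local_count;
		pthread_mutex_unlock(&counter_lock);
		local_count = 0;
	}
}
\end{verbatim}

Of course, batching delays the visibility of updates to the global
count, so this transformation is appropriate only when such delays
are acceptable.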

\QuickQuiz{}
Don't all these problems with critical sections mean that
we should just always use
non-blocking synchronization~\cite{MauriceHerlihy90a},
which doesn't have critical sections?
\QuickQuizAnswer{
Although non-blocking synchronization can be very useful
in some situations, it is no panacea.
Also, non-blocking synchronization really does have
critical sections, as noted by Josh Triplett.
For example, in a non-blocking algorithm based on
compare-and-swap operations, the code starting at the
initial load and continuing to the compare-and-swap
is in many ways analogous to a lock-based critical section.
} \QuickQuizEnd