% cpu/swdesign.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Software Design Implications}
\label{sec:cpu:Software Design Implications}
%
\epigraph{One ship drives east and another west \\
While the self-same breezes blow; \\
'Tis the set of the sail and not the gale \\
That bids them where to go.}
{Ella Wheeler Wilcox}

The values of the ratios in
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
are critically important, as they limit the
efficiency of a given parallel application.
To see this, suppose that the parallel application uses \IXacr{cas}
operations to communicate among threads.
These CAS operations will typically involve a cache miss, at least
assuming that the threads are communicating primarily with each other
rather than with themselves.
Suppose further that the unit of work corresponding to each CAS communication
operation takes 300\,ns, which is sufficient time to compute several
floating-point transcendental functions.
Then about half of the execution time will be consumed by the CAS
communication operations!
This in turn means that a two-CPU system running such a parallel program
would run no faster than a sequential implementation running on a
single CPU\@.
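To see where the ``about half'' comes from, let $c$ denote the cost of
a CAS-induced cache miss and $w$ the per-communication unit of work.
The fraction of execution time consumed by communication is then
\[
  \frac{c}{c + w},
\]
which approaches one half as $c$ approaches $w$.
With a 300\,ns unit of work and a cache-miss cost of the same order,
half of each CPU's time goes to communication, which is why adding a
second CPU merely breaks even.
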
The situation is even worse in the distributed-system case, where the
latency of a single communications operation might take as long as
thousands or even millions of floating-point operations.
This illustrates how important it is for communications operations to
be extremely infrequent and to enable very large quantities of processing.

\QuickQuiz{
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with such systems?
}\QuickQuizAnswer{
There are a number of reasons:
\begin{enumerate}
\item Shared-memory multiprocessor systems have strict size limits.
If you need more than a few thousand CPUs, you have no
choice but to use a distributed system.
\item Large shared-memory systems tend to be more expensive
per unit computation than their smaller counterparts.
\item Large shared-memory systems tend to have much longer
cache-miss latencies than do smaller systems.
To see this, compare
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
on \cpageref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
with
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}.
\item The distributed-systems communications operations do
not necessarily use much CPU, so that computation can
proceed in parallel with message transfer.
\item Many important problems are ``embarrassingly parallel'',
so that extremely large quantities of processing may
be enabled by a very small number of messages.
SETI@HOME~\cite{SETIatHOME2008}
was but one example of such an application.
These sorts of applications can make good use of networks
of computers despite extremely long communications
latencies.
\end{enumerate}
Thus, large shared-memory systems tend to be used for applications
that benefit from lower latencies than can be provided by
distributed computing, and particularly for those applications
that benefit from a large shared memory.
It is likely that continued work on parallel applications will
increase the number of embarrassingly parallel applications that
can run well on machines and/or clusters having long communications
latencies, reduced cost being the driving force that it is.
That said, greatly reduced hardware latencies would be an
extremely welcome development, both for single-system and
for distributed computing.
}\QuickQuizEnd

The lesson should be quite clear:
Parallel algorithms must be explicitly designed with these hardware
properties firmly in mind.

One approach is to run nearly independent threads.
The less frequently the threads communicate, whether by \IX{atomic} operations,
locks, or explicit messages, the better the application's performance
and scalability will be.
This approach will be touched on in \cref{chp:Counting},
explored in \cref{chp:Partitioning and Synchronization Design},
and taken to its logical extreme in \cref{chp:Data Ownership}.
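For example, the following rough sketch (with arbitrarily chosen thread
and operation counts, and assuming 64-byte cache lines) gives each
thread sole ownership of a cache-aligned counter, so that the hot loop
involves no communication at all and the only cross-thread traffic is a
single summation after the threads complete:
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NOPS 100000000UL

/* One counter per thread, padded to its own cache line so
 * that threads never write to a shared line (GCC-style
 * alignment attribute, 64-byte lines assumed). */
struct thread_count {
        unsigned long v;
} __attribute__((aligned(64)));

static struct thread_count counts[NTHREADS];

static void *worker(void *arg)
{
        struct thread_count *c = &counts[(long)arg];

        for (unsigned long i = 0; i < NOPS; i++)
                c->v++;  /* No locks, atomics, or cache misses. */
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        unsigned long sum = 0;

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        /* The only communication: one pass at the very end. */
        for (int i = 0; i < NTHREADS; i++)
                sum += counts[i].v;
        printf("sum = %lu\n", sum);
        return 0;
}
\end{verbatim}
Because each thread updates only data that it owns, the loop runs at
full single-CPU speed; the price is that the total becomes available
only after the threads finish, a tradeoff taken up in
\cref{chp:Counting}.
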

Another approach is to make sure that any sharing is read-mostly, which
allows the CPUs' caches to replicate the read-mostly data, in turn
allowing all CPUs fast access.
This approach is touched on in
\cref{sec:count:Eventually Consistent Implementation},
and explored more deeply in \cref{chp:Deferred Processing}.
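As a rough sketch of this pattern, consider a rarely updated
configuration structure, with purely illustrative names and with C11
atomics standing in for the techniques introduced later.
Readers do nothing but load a pointer, so both the pointer and the
structure it references remain replicated in every CPU's cache; the
occasional updater publishes a replacement and, in this simplified
sketch, simply leaks the old structure, safe reclamation being among
the problems that \cref{chp:Deferred Processing} addresses:
\begin{verbatim}
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Immutable once published: readers never write to it. */
struct config {
        int interval;
        int verbose;
};

static _Atomic(struct config *) cur_config;

/* Read side: a single acquire load, no writes to shared
 * data, so the relevant cache lines stay shared. */
static int get_interval(void)
{
        struct config *cp;

        cp = atomic_load_explicit(&cur_config,
                                  memory_order_acquire);
        return cp ? cp->interval : 1;
}

/* Update side: rare.  Build the new structure, then publish
 * it with a release store.  The old structure is deliberately
 * leaked in this sketch. */
static void set_interval(int interval)
{
        struct config *cp = malloc(sizeof(*cp));

        if (!cp)
                abort();
        cp->interval = interval;
        cp->verbose = 0;
        atomic_store_explicit(&cur_config, cp,
                              memory_order_release);
}

int main(void)
{
        set_interval(10);
        printf("interval = %d\n", get_interval());
        return 0;
}
\end{verbatim}
Between the rare updates, the readers' loads hit in their local caches,
so that the read-side cost approaches that of unsynchronized access.
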
In short, achieving excellent parallel performance and scalability means
striving for embarrassingly parallel algorithms and implementations,
whether by careful choice of data structures and algorithms, use of
existing parallel applications and environments, or transforming the
problem into an embarrassingly parallel form.

\QuickQuiz{
OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just
always use these distributed techniques and dispense with
shared memory?
}\QuickQuizAnswer{
Because it is often the case that only a small fraction of
the program is performance-critical.
Shared-memory parallelism allows us to focus distributed-programming
techniques on that small fraction, allowing simpler shared-memory
techniques to be used on the non-performance-critical bulk of
the program.
}\QuickQuizEnd