 % cpu/hwfreelunch.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \section{Hardware Free Lunch?}
 \label{sec:cpu:Hardware Free Lunch?}
 %
 \epigraph{The great trouble today is that there are too many people looking
 	  for someone else to do something for them.
 	  The solution to most of our troubles is to be found in everyone
 	  doing something for themselves.}
 	 {\emph{Henry Ford, updated}}

 The major reason that concurrency has been receiving so much focus over
 the past few years is the end of Moore's-Law-induced single-threaded
 performance increases
 (or ``free lunch''~\cite{HerbSutter2008EffectiveConcurrency}),
 as shown in
 Figure~\ref{fig:intro:Clock-Frequency Trend for Intel CPUs} on
 page~\pageref{fig:intro:Clock-Frequency Trend for Intel CPUs}.
 This section briefly surveys a few ways that hardware designers
 might be able to bring back some form of the ``free lunch''.

 However, the preceding section presented some substantial hardware
 obstacles to exploiting concurrency.
 One severe physical limitation that hardware designers face is the
 finite speed of light.
 As noted in
 Figure~\ref{fig:cpu:System Hardware Architecture} on
 page~\pageref{fig:cpu:System Hardware Architecture},
 light can make only about an 8-centimeter round trip
 in a vacuum during a single 1.8\,GHz clock period.
 This distance drops to about 3~centimeters for a 5\,GHz clock.
 Both of these distances are relatively small compared to the size
 of a modern computer system.
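
 As a rough check on these figures, take $c \approx 3 \times 10^{10}$\,cm/s
 for the speed of light in a vacuum and require that the signal go out and
 back within one clock period $1/f$.
 The symbol $d$ for the one-way reach is introduced here purely for
 illustration:
 \[
 	d = \frac{c}{2f},
 	\qquad
 	d_{1.8\,\mathrm{GHz}} \approx
 	\frac{3 \times 10^{10}}{2 \times 1.8 \times 10^{9}}\,\mathrm{cm}
 	\approx 8.3\,\mathrm{cm},
 	\qquad
 	d_{5\,\mathrm{GHz}} \approx
 	\frac{3 \times 10^{10}}{2 \times 5 \times 10^{9}}\,\mathrm{cm}
 	= 3\,\mathrm{cm}.
 \]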

 To make matters even worse, electric waves in silicon move from three to
 thirty times more slowly than does light in a vacuum, and common
 clocked logic constructs run still more slowly.
 For example, a memory reference may need to wait for a local cache lookup
 to complete before the request may be passed on to the rest of the system.
 Furthermore, relatively low-speed and high-power drivers are required
 to move electrical signals from one silicon die to another, for example,
 to communicate between a CPU and main memory.

 \QuickQuiz{}
 	But individual electrons don't move anywhere near that fast,
 	even in conductors!!!
 	The electron drift velocity in a conductor under the low voltages
 	found in semiconductors is on the order of only one \emph{millimeter}
 	per second.
 	What gives???
 \QuickQuizAnswer{
 	Electron drift velocity tracks the long-term movement of individual
 	electrons.
 	It turns out that individual electrons bounce around quite
 	randomly, so that their instantaneous speed is very high, but
 	over the long term, they don't move very far.
 	In this, electrons resemble long-distance commuters, who
 	might spend most of their time traveling at full highway
 	speed, but over the long term going nowhere.
 	These commuters' speed might be 70 miles per hour
 	(113 kilometers per hour), but their long-term drift velocity
 	relative to the planet's surface is zero.

 	Therefore, we should pay attention not to the electrons'
 	drift velocity, but to their instantaneous velocities.
 	However, even their instantaneous velocities are nowhere near
 	a significant fraction of the speed of light.
 	Nevertheless, the measured velocity of electric waves
 	in conductors \emph{is} a substantial fraction of the
 	speed of light, so we still have a mystery on our hands.

 	The other trick is that electrons interact with each other at
 	significant distances (from an atomic perspective, anyway),
 	courtesy of their negative charge.
 	This interaction is carried out by photons, which \emph{do}
 	move at the speed of light.
 	So even with electricity's electrons, it is photons
 	doing most of the fast footwork.

 	Extending the commuter analogy, a driver might use a smartphone
 	to inform other drivers of an accident or congestion, thus
 	allowing a change in traffic flow to propagate much faster
 	than the instantaneous velocity of the individual cars.
 	Summarizing the analogy between electricity and traffic flow:

 	\begin{enumerate}
 	\item	The (very low) drift velocity of an electron is similar
 		to the long-term velocity of a commuter, both being
 		very nearly zero.
 	\item	The (still rather low) instantaneous velocity of
 		an electron is similar to the instantaneous velocity
 		of a car in traffic.
 		Both are much higher than the drift velocity, but
 		quite small compared to the rate at which changes
 		propagate.
 	\item	The (much higher) propagation velocity of an electric
 		wave is primarily due to photons transmitting
 		electromagnetic force among the electrons.
 		Similarly, traffic patterns can change quite quickly
 		due to communication among drivers.
 		Not that this is necessarily of much help to the
 		drivers already stuck in traffic, any more than it
 		is to the electrons already pooled in a given capacitor.
 	\end{enumerate}

 	Of course, to fully understand this topic, you should read
 	up on electrodynamics.
 } \QuickQuizEnd

 There are nevertheless some technologies (both hardware and software)
 that might help improve matters:

 \begin{enumerate}
 \item	3D integration,
 \item	Novel materials and processes,
 \item	Substituting light for electricity,
 \item	Special-purpose accelerators, and
 \item	Existing parallel software.
 \end{enumerate}

 Each of these is described in one of the following sections.

 \subsection{3D Integration}
 \label{sec:cpu:3D Integration}

 3-dimensional integration (3DI) is the practice of bonding
 very thin silicon dies to each other in a vertical stack.
 This practice provides potential benefits, but also poses
 significant fabrication challenges~\cite{JohnKnickerbocker2008:3DI}.

 \begin{figure}[tb]
 \centering
 \resizebox{3in}{!}{\includegraphics{cpu/3DI}}
 \caption{Latency Benefit of 3D Integration}
 \label{fig:cpu:Latency Benefit of 3D Integration}
 \end{figure}

 Perhaps the most important benefit of 3DI is decreased path length through
 the system, as shown in
 Figure~\ref{fig:cpu:Latency Benefit of 3D Integration}.
 A 3-centimeter silicon die is replaced with a stack of four 1.5-centimeter
 dies, in theory decreasing the maximum path through the system by a factor
 of two, keeping in mind that each layer is quite thin.
 In addition, given proper attention to design and placement,
 long horizontal electrical connections (which are both slow and
 power hungry) can be replaced by short vertical electrical connections,
 which are both faster and more power efficient.
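
 The factor of two can be sketched with a back-of-the-envelope calculation.
 Assume square dies, wires that run only horizontally and vertically across
 a die, and negligible layer thickness.
 These are simplifying assumptions made here for illustration, not
 properties of any particular design.
 A square die of side $s$ then has a worst-case corner-to-corner path of
 $2s$, while four stacked dies of side $s/2$ provide the same total area
 with half the worst-case path:
 \[
 	4 \left( \frac{s}{2} \right)^{2} = s^{2},
 	\qquad
 	2 \cdot \frac{s}{2} = s = \frac{1}{2} \, (2s).
 \]
 With $s = 3$\,cm, the worst-case path under these assumptions drops from
 6\,cm to 3\,cm.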

 However, delays due to levels of clocked logic will not be decreased
 by 3D integration, and significant manufacturing, testing, power-supply,
 and heat-dissipation problems must be solved for 3D integration to
 reach production while still delivering on its promise.
 The heat-dissipation problems might be solved using
 semiconductors based on diamond, which is a good conductor
 for heat, but an electrical insulator.
 That said, it remains difficult to grow large single diamond crystals,
 to say nothing of slicing them into wafers.
 In addition, it seems unlikely that any of these technologies will be able to
 deliver the exponential increases to which some people have become accustomed.
 That said, they may be necessary steps on the path to the late Jim Gray's
 ``smoking hairy golf balls''~\cite{JimGray2002SmokingHairyGolfBalls}.

 \subsection{Novel Materials and Processes}
 \label{sec:cpu:Novel Materials and Processes}

 Stephen Hawking is said to have claimed that semiconductor manufacturers
 have but two fundamental problems: (1) the finite speed of light and
 (2) the atomic nature of matter~\cite{BryanGardiner2007}.
 It is possible that semiconductor manufacturers are approaching these
 limits, but there are nevertheless a few avenues of research and
 development focused on working around these fundamental limits.

 One workaround for the atomic nature of matter is the use of so-called
 ``high-K dielectric'' materials, which allow larger devices to mimic the
 electrical properties of infeasibly small devices.
 These materials pose some severe fabrication challenges, but nevertheless
 may help push the frontiers out a bit farther.
 Another more-exotic workaround stores multiple bits in a single electron,
 relying on the fact that a given electron can exist at a number of
 energy levels.
 It remains to be seen if this particular approach can be made to work
 reliably in production semiconductor devices.

 Another proposed workaround is the ``quantum dot'' approach that
 allows much smaller device sizes, but which is still in the research
 stage.

 One challenge is that many recent hardware-device-level breakthroughs
 require very tight control of which atoms are placed
 where~\cite{MichaelJKelly2017DeviceLevel}.
 It therefore seems likely that whoever finds a good way to hand-place
 atoms on each of the billions of devices on a chip will have most
 excellent bragging rights, if nothing else!

 \subsection{Light, Not Electrons}
 \label{sec:cpu:Light, Not Electrons}

 Although the speed of light would be a hard limit, the fact is that
 semiconductor devices are limited by the speed of electricity rather
 than that of light, given that electric waves in semiconductor materials
 move at between 3\,\% and 30\,\% of the speed of light in a vacuum.
 The use of copper connections on silicon devices is one way to increase
 the speed of electricity, and it is quite possible that additional
 advances will push closer still to the actual speed of light.
 In addition, there have been some experiments with tiny optical fibers
 as interconnects within and between chips, based on the fact that
 the speed of light in glass is more than 60\,\% of the speed of light
 in a vacuum.
 One obstacle to such optical fibers is the inefficient conversion
 between electricity and light and vice versa, resulting in both
 power-consumption and heat-dissipation problems.
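
 To put rough numbers on these fractions, consider the time $t$ needed for
 a signal to cross a distance $d$ at a propagation speed of $kc$, where
 $c \approx 3 \times 10^{10}$\,cm/s is the speed of light in a vacuum.
 The distance $d = 3$\,cm is an assumption chosen here for illustration,
 as are the symbols $t$, $d$, and $k$:
 \[
 	t = \frac{d}{k\,c}
 	= \frac{3\,\mathrm{cm}}{k \times 3 \times 10^{10}\,\mathrm{cm/s}}
 	= \frac{0.1\,\mathrm{ns}}{k}.
 \]
 This gives 0.1\,ns for light in a vacuum ($k = 1$), at most about 0.17\,ns
 for light in glass ($k > 0.6$), and from about 0.33\,ns down to about
 3.3\,ns for electric waves ($k = 0.3$ and $k = 0.03$, respectively).
 For comparison, a 5\,GHz clock period is 0.2\,ns.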

 That said, absent some fundamental advances in the field of physics,
 any exponential increases in the speed of data flow
 will be sharply limited by the actual speed of light in a vacuum.

 \subsection{Special-Purpose Accelerators}
 \label{sec:cpu:Special-Purpose Accelerators}

 A general-purpose CPU working on a specialized problem is often spending
 significant time and energy doing work that is only tangentially related
 to the problem at hand.
 For example, when taking the dot product of a pair of vectors, a
 general-purpose CPU will normally use a loop (possibly unrolled)
 with a loop counter.
 Decoding the instructions, incrementing the loop counter, testing this
 counter, and branching back to the
 top of the loop are in some sense wasted effort: the real goal is
 instead to multiply corresponding elements of the two vectors.
 Therefore, a specialized piece of hardware designed specifically to
 multiply vectors could get the job done more quickly and with less
 energy consumed.

 This is in fact the motivation for the vector instructions present in
 many commodity microprocessors.
 Because these instructions operate on multiple data items simultaneously,
 they would permit a dot product to be computed with less instruction-decode
 and loop overhead.
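
 The loop overhead described above can be made concrete with a minimal
 sketch in C.
 The function names \texttt{dot\_scalar()} and \texttt{dot\_unrolled4()}
 are hypothetical and chosen here for illustration only.
 The first function pays for one counter increment, one test, and one branch
 per element, while the second pays for them once per four elements and
 keeps four independent partial sums, loosely mirroring the lanes of a
 four-wide vector instruction.
 This is not how any particular CPU or compiler implements vector
 operations; it only illustrates where the overhead goes.

```c
#include <stddef.h>

/* Hypothetical helper: straightforward dot product.
 * Loop-counter increment, test, and branch are paid once per element. */
float dot_scalar(const float *a, const float *b, size_t n)
{
	float sum = 0.0f;

	for (size_t i = 0; i < n; i++)
		sum += a[i] * b[i];
	return sum;
}

/* Hypothetical helper: four-way unrolled dot product.
 * Loop overhead is paid once per four elements, and the four
 * independent partial sums resemble the lanes of a four-wide
 * vector register (an analogy, not a use of vector instructions). */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
	float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
	size_t i;

	/* Main loop: four multiply-adds per iteration. */
	for (i = 0; i + 4 <= n; i += 4) {
		s0 += a[i] * b[i];
		s1 += a[i + 1] * b[i + 1];
		s2 += a[i + 2] * b[i + 2];
		s3 += a[i + 3] * b[i + 3];
	}

	/* Cleanup loop: handle the zero to three leftover elements. */
	for (; i < n; i++)
		s0 += a[i] * b[i];

	return (s0 + s1) + (s2 + s3);
}
```

 Because floating-point addition is not associative, the two functions can
 return slightly different results for general inputs, which is one reason
 that compilers are often conservative about applying this transformation
 on their own.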

 Similarly, specialized hardware can more efficiently encrypt and decrypt,
 compress and decompress, encode and decode, and perform many other tasks
 besides.
 Unfortunately, this efficiency does not come for free.
 A computer system incorporating this specialized hardware will contain
 more transistors, which will consume some power even when not in use.
 Software must be modified to take advantage of this specialized hardware,
 and this specialized hardware must be sufficiently generally useful
 that the high up-front hardware-design costs can be spread over enough
 users to make the specialized hardware affordable.
 In part due to these sorts of economic considerations, specialized
 hardware has thus far appeared only for a few application areas,
 including graphics processing (GPUs), vector processors (MMX, SSE,
 and VMX instructions), and, to a lesser extent, encryption.

 Unlike the server and PC arena, smartphones have long used a wide
 variety of hardware accelerators.
 These hardware accelerators are often used for media decoding,
 so much so that a high-end MP3 player might be able to play audio
 for several minutes---with its CPU fully powered off the entire time.
 The purpose of these accelerators is to improve energy efficiency
 and thus extend battery life: special-purpose hardware can often
 compute more efficiently than can a general-purpose CPU.
 This is another example of the principle called out in
 Section~\ref{sec:intro:Generality}: Generality is almost never free.

 Nevertheless, given the end of Moore's-Law-induced single-threaded
 performance increases, it seems safe to predict that there will
 be an increasing variety of special-purpose hardware going forward.

 \subsection{Existing Parallel Software}
 \label{sec:cpu:Existing Parallel Software}

 Although multicore CPUs seem to have taken the computing industry
 by surprise, the fact remains that shared-memory parallel computer
 systems have been commercially available for more than a quarter
 century.
 This is more than enough time for significant parallel software
 to make its appearance, and it indeed has.
 Parallel operating systems are quite commonplace, as are parallel
 threading libraries, parallel relational database management systems,
 and parallel numerical software.
 Use of existing parallel software can go a long way towards solving any
 parallel-software crisis we might encounter.

 Perhaps the most common example is the parallel relational database
 management system.
 It is not unusual for single-threaded programs, often written in
 high-level scripting languages, to access a central relational
 database concurrently.
 In the resulting highly parallel system, only the database need actually
 deal directly with parallelism.
 A very nice trick when it works!