% cpu/hwfreelunch.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Hardware Free Lunch?}
\label{sec:cpu:Hardware Free Lunch?}
%
\epigraph{The great trouble today is that there are too many people looking
for someone else to do something for them.
The solution to most of our troubles is to be found in everyone
doing something for themselves.}
{\emph{Henry Ford, updated}}
The major reason that concurrency has been receiving so much focus over
the past few years is the end of Moore's-Law-induced single-threaded
performance increases
(or ``free lunch''~\cite{HerbSutter2008EffectiveConcurrency}),
as shown in
Figure~\ref{fig:intro:Clock-Frequency Trend for Intel CPUs} on
page~\pageref{fig:intro:Clock-Frequency Trend for Intel CPUs}.
This section briefly surveys a few ways that hardware designers
might be able to bring back some form of the ``free lunch''.
However, the preceding section presented some substantial hardware
obstacles to exploiting concurrency.
One severe physical limitation that hardware designers face is the
finite speed of light.
As noted in
Figure~\ref{fig:cpu:System Hardware Architecture} on
page~\pageref{fig:cpu:System Hardware Architecture},
light can manage only about an 8-centimeter round trip
in a vacuum during a single 1.8\,GHz clock period.
This distance drops to about 3~centimeters for a 5\,GHz clock.
Both of these distances are relatively small compared to the size
of a modern computer system.
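
As a rough check on these figures, assume a speed of light of
$c \approx 3.0 \times 10^{10}$\,cm/s.
The symbols $c$, $f$, and $d$ are introduced here for illustration only.
A signal making a round trip within one period of a clock of
frequency~$f$ can reach a distance of at most $d = c / (2f)$:
\[
d_{1.8\,\mathrm{GHz}} \approx
	\frac{3.0 \times 10^{10}\,\mathrm{cm/s}}
	     {2 \times 1.8 \times 10^{9}\,\mathrm{Hz}}
	\approx 8.3\,\mathrm{cm},
\qquad
d_{5\,\mathrm{GHz}} \approx
	\frac{3.0 \times 10^{10}\,\mathrm{cm/s}}
	     {2 \times 5 \times 10^{9}\,\mathrm{Hz}}
	= 3.0\,\mathrm{cm}.
\]
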
To make matters even worse, electric waves in silicon move from three to
thirty times more slowly than does light in a vacuum, and common
clocked logic constructs run still more slowly.
For example, a memory reference may need to wait for a local cache
lookup to complete before the request may be passed on to the rest
of the system.
Furthermore, relatively low-speed and high-power drivers are required
to move electrical signals from one silicon die to another, for example,
to communicate between a CPU and main memory.
\QuickQuiz{}
But individual electrons don't move anywhere near that fast,
even in conductors!!!
The electron drift velocity in a conductor under the low voltages
found in semiconductors is on the order of only one \emph{millimeter}
per second.
What gives???
\QuickQuizAnswer{
Electron drift velocity tracks the long-term movement of individual
electrons.
It turns out that individual electrons bounce around quite
randomly, so that their instantaneous speed is very high, but
over the long term, they don't move very far.
In this, electrons resemble long-distance commuters, who
might spend most of their time traveling at full highway
speed, but over the long term going nowhere.
These commuters' speed might be 70 miles per hour
(113 kilometers per hour), but their long-term drift velocity
relative to the planet's surface is zero.
Therefore, we should pay attention not to the electrons'
drift velocity, but to their instantaneous velocities.
However, even their instantaneous velocities are nowhere near
a significant fraction of the speed of light.
Nevertheless, the measured velocity of electric waves
in conductors \emph{is} a substantial fraction of the
speed of light, so we still have a mystery on our hands.
The other trick is that electrons interact with each other at
significant distances (from an atomic perspective, anyway),
courtesy of their negative charge.
This interaction is carried out by photons, which \emph{do}
move at the speed of light.
So even with electricity's electrons, it is photons
doing most of the fast footwork.
Extending the commuter analogy, a driver might use a smartphone
to inform other drivers of an accident or congestion, thus
allowing a change in traffic flow to propagate much faster
than the instantaneous velocity of the individual cars.
Summarizing the analogy between electricity and traffic flow:
\begin{enumerate}
\item The (very low) drift velocity of an electron is similar
to the long-term velocity of a commuter, both being
very nearly zero.
\item The (still rather low) instantaneous velocity of
an electron is similar to the instantaneous velocity
of a car in traffic.
Both are much higher than the drift velocity, but
quite small compared to the rate at which changes
propagate.
\item The (much higher) propagation velocity of an electric
wave is primarily due to photons transmitting
electromagnetic force among the electrons.
Similarly, traffic patterns can change quite quickly
due to communication among drivers.
Not that this is necessarily of much help to the
drivers already stuck in traffic, any more than it
is to the electrons already pooled in a given capacitor.
\end{enumerate}
Of course, to fully understand this topic, you should read
up on electrodynamics.
} \QuickQuizEnd
There are nevertheless some technologies (both hardware and software)
that might help improve matters:
\begin{enumerate}
\item 3D integration,
\item Novel materials and processes,
\item Substituting light for electricity,
\item Special-purpose accelerators, and
\item Existing parallel software.
\end{enumerate}
Each of these is described in one of the following sections.
\subsection{3D Integration}
\label{sec:cpu:3D Integration}
3-dimensional integration (3DI) is the practice of bonding
very thin silicon dies to each other in a vertical stack.
This practice provides potential benefits, but also poses
significant fabrication challenges~\cite{JohnKnickerbocker2008:3DI}.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cpu/3DI}}
\caption{Latency Benefit of 3D Integration}
\label{fig:cpu:Latency Benefit of 3D Integration}
\end{figure}
Perhaps the most important benefit of 3DI is decreased path length through
the system, as shown in
Figure~\ref{fig:cpu:Latency Benefit of 3D Integration}.
A 3-centimeter silicon die is replaced with a stack of four 1.5-centimeter
dies, in theory decreasing the maximum path through the system by a factor
of two, keeping in mind that each layer is quite thin.
In addition, given proper attention to design and placement,
long horizontal electrical connections (which are both slow and
power hungry) can be replaced by short vertical electrical connections,
which are both faster and more power efficient.
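
One way to see where the factor of two might come from is to assume
that the stack preserves the total silicon area and that the longest
horizontal path scales with the side length of a square die.
These assumptions, along with the symbols $A$ (total area),
$n$ (number of layers), and $s$ (side length of each layer),
are introduced here for illustration only:
\[
s = \sqrt{\frac{A}{n}}
  = \sqrt{\frac{(3\,\mathrm{cm})^2}{4}}
  = 1.5\,\mathrm{cm}.
\]
Under these assumptions, the horizontal path shrinks by a factor of
$\sqrt{n}$, which is two for a four-layer stack.
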
However, delays due to levels of clocked logic will not be decreased
by 3D integration, and significant manufacturing, testing, power-supply,
and heat-dissipation problems must be solved for 3D integration to
reach production while still delivering on its promise.
The heat-dissipation problems might be solved using
semiconductors based on diamond, which is a good conductor
for heat, but an electrical insulator.
That said, it remains difficult to grow large single diamond crystals,
to say nothing of slicing them into wafers.
In addition, it seems unlikely that any of these technologies will be able to
deliver the exponential increases to which some people have become accustomed.
That said, they may be necessary steps on the path to the late Jim Gray's
``smoking hairy golf balls''~\cite{JimGray2002SmokingHairyGolfBalls}.
\subsection{Novel Materials and Processes}
\label{sec:cpu:Novel Materials and Processes}
Stephen Hawking is said to have claimed that semiconductor manufacturers
have but two fundamental problems: (1) the finite speed of light and
(2) the atomic nature of matter~\cite{BryanGardiner2007}.
It is possible that semiconductor manufacturers are approaching these
limits, but there are nevertheless a few avenues of research and
development focused on working around these fundamental limits.
One workaround for the atomic nature of matter is the use of so-called
``high-K dielectric'' materials, which allow larger devices to mimic the
electrical properties of infeasibly small devices.
These materials pose some severe fabrication challenges, but nevertheless
may help push the frontiers out a bit farther.
Another more-exotic workaround stores multiple bits in a single electron,
relying on the fact that a given electron can exist at a number of
energy levels.
It remains to be seen if this particular approach can be made to work
reliably in production semiconductor devices.
Another proposed workaround is the ``quantum dot'' approach that
allows much smaller device sizes, but which is still in the research
stage.
One challenge is that many recent hardware-device-level breakthroughs
require very tight control of which atoms are placed
where~\cite{MichaelJKelly2017DeviceLevel}.
It therefore seems likely that whoever finds a good way to hand-place
atoms on each of the billions of devices on a chip will have most
excellent bragging rights, if nothing else!
\subsection{Light, Not Electrons}
\label{sec:cpu:Light, Not Electrons}
Although the speed of light would be a hard limit, the fact is that
semiconductor devices are limited by the speed of electricity rather
than that of light, given that electric waves in semiconductor materials
move at between 3\,\% and 30\,\% of the speed of light in a vacuum.
The use of copper connections on silicon devices is one way to increase
the speed of electricity, and it is quite possible that additional
advances will push closer still to the actual speed of light.
In addition, there have been some experiments with tiny optical fibers
as interconnects within and between chips, based on the fact that
the speed of light in glass is more than 60\,\% of the speed of light
in a vacuum.
One obstacle to such optical fibers is the inefficient conversion
from electricity to light and vice versa, resulting in both
power-consumption and heat-dissipation problems.
That said, absent some fundamental advances in the field of physics,
any exponential increases in the speed of data flow
will be sharply limited by the actual speed of light in a vacuum.
\subsection{Special-Purpose Accelerators}
\label{sec:cpu:Special-Purpose Accelerators}
A general-purpose CPU working on a specialized problem is often spending
significant time and energy doing work that is only tangentially related
to the problem at hand.
For example, when taking the dot product of a pair of vectors, a
general-purpose CPU will normally use a loop (possibly unrolled)
with a loop counter.
Decoding the instructions, incrementing the loop counter, testing this
counter, and branching back to the
top of the loop are in some sense wasted effort: the real goal is
instead to multiply corresponding elements of the two vectors.
Therefore, a specialized piece of hardware designed specifically to
multiply vectors could get the job done more quickly and with less
energy consumed.
This is in fact the motivation for the vector instructions present in
many commodity microprocessors.
Because these instructions operate on multiple data items simultaneously,
they would permit a dot product to be computed with less instruction-decode
and loop overhead.
Similarly, specialized hardware can more efficiently encrypt and decrypt,
compress and decompress, encode and decode, and carry out many other
tasks besides.
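
The dot-product example above can be sketched in C as follows.
This is only an illustration, not a tuned implementation: the function
names \texttt{dot\_scalar()} and \texttt{dot\_vector()} are hypothetical,
the vector path assumes a compiler that defines \texttt{\_\_SSE\_\_} and
provides the SSE intrinsics in \texttt{xmmintrin.h}, and all other builds
fall back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>	/* SSE intrinsics: four floats per 128-bit register. */
#endif

/* Scalar version: one multiply-add per trip through the loop. */
float dot_scalar(const float *x, const float *y, size_t n)
{
	float sum = 0.0f;
	size_t i;

	assert(x != NULL && y != NULL);
	for (i = 0; i < n; i++)
		sum += x[i] * y[i];
	return sum;
}

/*
 * Vector version (hypothetical name): four multiply-adds per trip
 * through the loop when SSE is available, so the loop-counter and
 * branch overhead is paid once per four elements instead of once
 * per element.
 */
float dot_vector(const float *x, const float *y, size_t n)
{
	float sum = 0.0f;
	size_t i = 0;

	assert(x != NULL && y != NULL);
#if defined(__SSE__)
	{
		__m128 acc = _mm_setzero_ps();
		float lanes[4];

		for (; i + 4 <= n; i += 4) {
			__m128 vx = _mm_loadu_ps(&x[i]);	/* Unaligned loads. */
			__m128 vy = _mm_loadu_ps(&y[i]);

			acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
		}
		_mm_storeu_ps(lanes, acc);	/* Combine the four partial sums. */
		sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
	}
#endif
	for (; i < n; i++)	/* Leftover elements, or all of them without SSE. */
		sum += x[i] * y[i];
	return sum;
}
```

Because the vector version adds the products in a different order,
its result can differ from that of the scalar version in the
low-order bits for general floating-point inputs.
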
Unfortunately, this efficiency does not come for free.
A computer system incorporating this specialized hardware will contain
more transistors, which will consume some power even when not in use.
Software must be modified to take advantage of this specialized hardware,
and this specialized hardware must be sufficiently generally useful
that the high up-front hardware-design costs can be spread over enough
users to make the specialized hardware affordable.
In part due to these sorts of economic considerations, specialized
hardware has thus far appeared only for a few application areas,
including graphics processing (GPUs), vector processors (MMX, SSE,
and VMX instructions), and, to a lesser extent, encryption.
Unlike systems in the server and PC arena, smartphones have long used a wide
variety of hardware accelerators.
These hardware accelerators are often used for media decoding,
so much so that a high-end MP3 player might be able to play audio
for several minutes---with its CPU fully powered off the entire time.
The purpose of these accelerators is to improve energy efficiency
and thus extend battery life: special-purpose hardware can often
compute more efficiently than can a general-purpose CPU.
This is another example of the principle called out in
Section~\ref{sec:intro:Generality}: Generality is almost never free.
Nevertheless, given the end of Moore's-Law-induced single-threaded
performance increases, it seems safe to predict that there will
be an increasing variety of special-purpose hardware going forward.
\subsection{Existing Parallel Software}
\label{sec:cpu:Existing Parallel Software}
Although multicore CPUs seem to have taken the computing industry
by surprise, the fact remains that shared-memory parallel computer
systems have been commercially available for more than a quarter
century.
This is more than enough time for significant parallel software
to make its appearance, and it indeed has.
Parallel operating systems are quite commonplace, as are parallel
threading libraries, parallel relational database management systems,
and parallel numerical software.
Use of existing parallel software can go a long way toward solving any
parallel-software crisis we might encounter.
Perhaps the most common example is the parallel relational database
management system.
It is not unusual for single-threaded programs, often written in
high-level scripting languages, to access a central relational
database concurrently.
In the resulting highly parallel system, only the database need actually
deal directly with parallelism.
A very nice trick when it works!