| % cpu/hwfreelunch.tex |
| |
| \section{Hardware Free Lunch?} |
| \label{sec:cpu:Hardware Free Lunch?} |
| |
| The major reason that concurrency has been receiving so much focus over |
| the past few years is the end of Moore's-Law-induced single-threaded |
| performance increases |
| (or ``free lunch''~\cite{HerbSutter2008EffectiveConcurrency}), |
| as shown in |
| Figure~\ref{fig:intro:Clock-Frequency Trend for Intel CPUs} on |
| page~\pageref{fig:intro:Clock-Frequency Trend for Intel CPUs}. |
| This section briefly surveys a few ways that hardware designers |
| might be able to bring back some form of the ``free lunch''. |
| |
| However, the preceding section presented some substantial hardware |
| obstacles to exploiting concurrency. |
| One severe physical limitation that hardware designers face is the |
| finite speed of light. |
| As noted in |
| Figure~\ref{fig:cpu:System Hardware Architecture} on |
| page~\pageref{fig:cpu:System Hardware Architecture}, |
| light can travel only about an 8-centimeter round trip |
| in a vacuum during a 1.8~GHz clock period. |
| This distance drops to about 3 centimeters for a 5~GHz clock. |
| Both of these distances are relatively small compared to the size |
| of a modern computer system. |
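| |
| As a rough check on these figures, the round-trip distance~$d$ that |
| light covers in one clock period follows from the speed of light~$c$ |
| and the clock frequency~$f$. |
| The rounded value $c \approx 3.0 \times 10^{8}$~m/s is an assumption |
| made here for convenience: |
| \[ |
| d = \frac{c}{2f} |
| \] |
| \[ |
| d_{1.8\,\mathrm{GHz}} \approx |
| \frac{3.0 \times 10^{8}\,\mathrm{m/s}}{2 \times 1.8 \times 10^{9}\,\mathrm{Hz}} |
| \approx 8.3\,\mathrm{cm}, |
| \qquad |
| d_{5\,\mathrm{GHz}} \approx |
| \frac{3.0 \times 10^{8}\,\mathrm{m/s}}{2 \times 5 \times 10^{9}\,\mathrm{Hz}} |
| = 3\,\mathrm{cm} |
| \] |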
| |
| To make matters even worse, electrons in silicon move from three to |
| thirty times more slowly than does light in a vacuum, and common |
| clocked logic constructs run more slowly still. |
| For example, a memory reference may need to wait for a local cache |
| lookup to complete before the request may be passed on to the rest |
| of the system. |
| Furthermore, relatively low-speed and high-power drivers are required |
| to move electrical signals from one silicon die to another, for example, |
| to communicate between a CPU and main memory. |
| |
| \QuickQuiz{} |
| But individual electrons don't move anywhere near that fast, |
| even in conductors!!! |
| The electron drift velocity in a conductor under the low voltages |
| found in semiconductors is on the order of only one \emph{millimeter} |
| per second. |
| What gives??? |
| \QuickQuizAnswer{ |
| Electron drift velocity tracks the long-term movement of individual |
| electrons. |
| It turns out that individual electrons bounce around quite |
| randomly, so that their instantaneous speed is very high, but |
| over the long term, they don't move very far. |
| In this, electrons resemble long-distance commuters, who |
| might spend most of their time traveling at full highway |
| speed, but who go nowhere over the long term. |
| These commuters' speed might be 70 miles per hour |
| (113 kilometers per hour), but their long-term drift velocity |
| relative to the planet's surface is zero. |
| |
| When designing circuitry, electrons' instantaneous speed is |
| often more important than their drift velocity. |
| When a voltage is applied to a wire, more electrons enter the |
| wire than leave it, but the electrons entering cause the |
| electrons already there to move a bit further down the wire, |
| which causes other electrons to move down, and so on. |
| The result is that the electric field moves quite quickly down |
| the wire. |
| Just as the speed of sound in air is much greater than is |
| the typical wind speed, the electric field propagates down |
| the wire at a much higher velocity than the electron drift |
| velocity. |
| } \QuickQuizEnd |
| |
| There are nevertheless some technologies (both hardware and software) |
| that might help improve matters: |
| |
| \begin{enumerate} |
| \item 3D integration, |
| \item Novel materials and processes, |
| \item Substituting light for electrons, |
| \item Special-purpose accelerators, and |
| \item Existing parallel software. |
| \end{enumerate} |
| |
| Each of these is described in one of the following sections. |
| |
| \subsection{3D Integration} |
| \label{sec:cpu:3D Integration} |
| |
| 3-dimensional integration (3DI) is the practice of bonding |
| very thin silicon dies to each other in a vertical stack. |
| This practice provides potential benefits, but also poses |
| significant fabrication challenges~\cite{JohnKnickerbocker2008:3DI}. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cpu/3DI}} |
| \end{center} |
| \caption{Latency Benefit of 3D Integration} |
| \label{fig:cpu:Latency Benefit of 3D Integration} |
| \end{figure} |
| |
| Perhaps the most important benefit of 3DI is decreased path length through |
| the system, as shown in |
| Figure~\ref{fig:cpu:Latency Benefit of 3D Integration}. |
| A 3-centimeter silicon die is replaced with a stack of four 1.5-centimeter |
| dies, in theory decreasing the maximum path through the system by a factor |
| of two, keeping in mind that each layer is quite thin. |
| In addition, given proper attention to design and placement, |
| long horizontal electrical connections (which are both slow and |
| power hungry) can be replaced by short vertical electrical connections, |
| which are both faster and more power efficient. |
| |
| However, delays due to levels of clocked logic will not be decreased |
| by 3D integration, and significant manufacturing, testing, power-supply, |
| and heat-dissipation problems must be solved for 3D integration to |
| reach production while still delivering on its promise. |
| The heat-dissipation problems might be solved using |
| semiconductors based on diamond, which is a good conductor |
| of heat, but an electrical insulator. |
| That said, it remains difficult to grow large single diamond crystals, |
| to say nothing of slicing them into wafers. |
| In addition, it seems unlikely that any of these technologies will be able to |
| deliver the exponential increases to which some people have become accustomed. |
| Nevertheless, they may be necessary steps on the path to the late Jim Gray's |
| ``smoking hairy golf balls''~\cite{JimGray2002SmokingHairyGolfBalls}. |
| |
| \subsection{Novel Materials and Processes} |
| \label{sec:cpu:Novel Materials and Processes} |
| |
| Stephen Hawking is said to have claimed that semiconductor manufacturers |
| have but two fundamental problems: (1) the finite speed of light and |
| (2) the atomic nature of matter~\cite{BryanGardiner2007}. |
| It is possible that semiconductor manufacturers are approaching these |
| limits, but there are nevertheless a few avenues of research and |
| development focused on working around these fundamental limits. |
| |
| One workaround for the atomic nature of matter is the use of so-called |
| ``high-K dielectric'' materials, which allow larger devices to mimic the |
| electrical properties of infeasibly small devices. |
| These materials pose some severe fabrication challenges, but nevertheless |
| may help push the frontiers out a bit farther. |
| Another more-exotic workaround stores multiple bits in a single electron, |
| relying on the fact that a given electron can exist at a number of |
| energy levels. |
| It remains to be seen if this particular approach can be made to work |
| reliably in production semiconductor devices. |
| |
| Another proposed workaround is the ``quantum dot'' approach that |
| allows much smaller device sizes, but which is still in the research |
| stage. |
| |
| \subsection{Light, Not Electrons} |
| \label{sec:cpu:Light, Not Electrons} |
| |
| Although the speed of light would be a hard limit, the fact is that |
| semiconductor devices are limited by the speed of electrons rather |
| than that of light, given that electrons in semiconductor materials |
| move at between 3\% and 30\% of the speed of light in a vacuum. |
| The use of copper connections on silicon devices is one way to increase |
| the speed of electrons, and it is quite possible that additional |
| advances will push closer still to the actual speed of light. |
| In addition, there have been some experiments with tiny optical fibers |
| as interconnects within and between chips, based on the fact that |
| the speed of light in glass is more than 60\% of the speed of light |
| in a vacuum. |
| One obstacle to such optical fibers is the inefficient conversion |
| between electricity and light and vice versa, resulting in both |
| power-consumption and heat-dissipation problems. |
| |
| That said, absent some fundamental advances in the field of physics, |
| any exponential increases in the speed of data flow |
| will be sharply limited by the actual speed of light in a vacuum. |
| |
| \subsection{Special-Purpose Accelerators} |
| \label{sec:cpu:Special-Purpose Accelerators} |
| |
| A general-purpose CPU working on a specialized problem is often spending |
| significant time and energy doing work that is only tangentially related |
| to the problem at hand. |
| For example, when taking the dot product of a pair of vectors, a |
| general-purpose CPU will normally use a loop (possibly unrolled) |
| with a loop counter. |
| Decoding the instructions, incrementing the loop counter, testing this |
| counter, and branching back to the |
| top of the loop are in some sense wasted effort: the real goal is |
| instead to multiply corresponding elements of the two vectors. |
| Therefore, a specialized piece of hardware designed specifically to |
| multiply vectors could get the job done more quickly and with less |
| energy consumed. |
| |
| This is in fact the motivation for the vector instructions present in |
| many commodity microprocessors. |
| Because these instructions operate on multiple data items simultaneously, |
| they permit a dot product to be computed with less instruction-decode |
| and loop overhead. |
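| |
| The following plain-C sketch illustrates the overhead in question. |
| The function names \co{dot_scalar()} and \co{dot_unrolled4()} are |
| hypothetical, and the four-way unrolled loop only stands in for a |
| four-lane vector instruction: whether a given compiler actually emits |
| vector instructions for it is not guaranteed. |

```c
#include <stddef.h>

/*
 * Scalar dot product: one multiply-add per iteration, so every useful
 * operation also pays for a counter increment, a test, and a branch.
 */
double dot_scalar(const double *a, const double *b, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += a[i] * b[i];
	return sum;
}

/*
 * Four multiply-adds per iteration, using four independent accumulators.
 * The loop overhead is now spread over four elements, which is roughly
 * the effect that a four-lane vector instruction achieves in hardware.
 */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
	double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	size_t i;

	for (i = 0; i + 4 <= n; i += 4) {
		s0 += a[i] * b[i];
		s1 += a[i + 1] * b[i + 1];
		s2 += a[i + 2] * b[i + 2];
		s3 += a[i + 3] * b[i + 3];
	}
	/* Handle leftover elements when n is not a multiple of four. */
	for (; i < n; i++)
		s0 += a[i] * b[i];
	return (s0 + s1) + (s2 + s3);
}
```

| Because the unrolled version adds the products in a different order, |
| its floating-point result can differ slightly from that of the scalar |
| version for inputs that are not exactly representable. |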
| |
| Similarly, specialized hardware can more efficiently encrypt and decrypt, |
| compress and decompress, encode and decode, and carry out many other tasks besides. |
| Unfortunately, this efficiency does not come for free. |
| A computer system incorporating this specialized hardware will contain |
| more transistors, which will consume some power even when not in use. |
| Software must be modified to take advantage of this specialized hardware, |
| and this specialized hardware must be sufficiently generally useful |
| that the high up-front hardware-design costs can be spread over enough |
| users to make the specialized hardware affordable. |
| In part due to these sorts of economic considerations, specialized |
| hardware has thus far appeared only for a few application areas, |
| including graphics processing (GPUs), vector processing (MMX, SSE, |
| and VMX instructions), and, to a lesser extent, encryption. |
| |
| Unlike servers and PCs, smartphones have long used a wide |
| variety of hardware accelerators. |
| These hardware accelerators are often used for media decoding, |
| so much so that a high-end MP3 player might be able to play audio |
| for several minutes---with its CPU fully powered off the entire time. |
| The purpose of these accelerators is to improve energy efficiency |
| and thus extend battery life: special-purpose hardware can often |
| compute more efficiently than can a general-purpose CPU. |
| This is another example of the principle called out in |
| Section~\ref{sec:intro:Generality}: Generality is almost never free. |
| |
| Nevertheless, given the end of Moore's-Law-induced single-threaded |
| performance increases, it seems safe to predict that there will |
| be an increasing variety of special-purpose hardware going forward. |
| |
| \subsection{Existing Parallel Software} |
| \label{sec:cpu:Existing Parallel Software} |
| |
| Although multicore CPUs seem to have taken the computing industry |
| by surprise, the fact remains that shared-memory parallel computer |
| systems have been commercially available for more than a quarter |
| century. |
| This is more than enough time for significant parallel software |
| to make its appearance, and it indeed has. |
| Parallel operating systems are quite commonplace, as are parallel |
| threading libraries, parallel relational database management systems, |
| and parallel numerical software. |
| Use of existing parallel software can go a long way toward solving any |
| parallel-software crisis we might encounter. |
| |
| Perhaps the most common example is the parallel relational database |
| management system. |
| It is not unusual for single-threaded programs, often written in |
| high-level scripting languages, to access a central relational |
| database concurrently. |
| In the resulting highly parallel system, only the database need actually |
| deal directly with parallelism. |
| A very nice trick when it works! |