| % cpu/overview.tex |
| |
| \section{Overview} |
| \label{sec:cpu:Overview} |
| |
| Careless reading of computer-system specification sheets might lead one |
| to believe that CPU performance is a footrace on a clear track, as |
| illustrated in Figure~\ref{fig:cpu:CPU Performance at its Best}, |
| where the race always goes to the swiftest. |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/trackmeet}} |
| \end{center} |
| \caption{CPU Performance at its Best} |
| \ContributedBy{Figure}{fig:cpu:CPU Performance at its Best}{Melissa McKenney} |
| \end{figure} |
| |
| Although there are a few CPU-bound benchmarks that approach the ideal |
| shown in Figure~\ref{fig:cpu:CPU Performance at its Best}, |
| the typical program more closely resembles an obstacle course than |
| a race track. |
| This is because the internal architecture of CPUs has changed dramatically |
| over the past few decades, courtesy of Moore's Law. |
| These changes are described in the following sections. |
| |
| \subsection{Pipelined CPUs} |
| \label{sec:cpu:Pipelined CPUs} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/whippersnapper}} |
| \end{center} |
| \caption{CPUs Old and New} |
| \ContributedBy{Figure}{fig:cpu:CPUs Old and New}{Melissa McKenney} |
| \end{figure} |
| |
| In the early 1980s, the typical microprocessor fetched an instruction, |
| decoded it, and executed it, typically taking \emph{at least} three |
| clock cycles to complete one instruction before proceeding to the next. |
| In contrast, the CPU of the late 1990s and early 2000s will be executing |
| many instructions simultaneously, using a deep ``pipeline'' to control |
| the flow of instructions internally to the CPU, this difference being |
| illustrated by Figure~\ref{fig:cpu:CPUs Old and New}. |
| |
| Achieving full performance with a CPU having a long pipeline requires |
| highly predictable control flow through the program. |
| Suitable control flow can be provided by a program that executes primarily |
| in tight loops, for example, programs doing arithmetic on large matrices |
| or vectors. |
| The CPU can then correctly predict that the branch at the end of the loop |
| will be taken in almost all cases. |
| In such programs, the pipeline can be kept full and the CPU can execute |
| at full speed. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/pipeline}} |
| \end{center} |
| \caption{CPU Meets a Pipeline Flush} |
| \ContributedBy{Figure}{fig:cpu:CPU Meets a Pipeline Flush}{Melissa McKenney} |
| \end{figure} |
| |
| If, on the other hand, the program has many loops with small loop counts, |
| or if the program is object oriented with many virtual objects that |
| can reference many different real objects, all with different implementations |
| for frequently invoked member functions, then it is difficult or even |
| impossible for the CPU to predict where a given branch might lead. |
| The CPU must then either stall waiting for execution to proceed far enough |
| to know for certain where the branch will lead, or guess --- and, in the |
| face of programs with unpredictable control flow, frequently guess wrong. |
| In either case, the pipeline will empty and have to be refilled, leading |
| to stalls that can drastically reduce performance, |
| as fancifully depicted in Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}. |
| |
| Unfortunately, pipeline flushes are not the only hazards in the obstacle |
| course that modern CPUs must run. |
| The next section covers the hazards of referencing memory. |
| |
| \subsection{Memory References} |
| \label{sec:cpu:Memory References} |
| |
| In the 1980s, it often took less time for a microprocessor to load a value |
| from memory than it did to execute an instruction. |
| In 2006, a microprocessor might be capable of executing hundreds or even |
| thousands of instructions in the time required to access memory. |
| This disparity is due to the fact that Moore's Law has increased CPU |
| performance at a much greater rate than it has increased memory |
| performance, in part due to the rate at which memory sizes have |
| grown. |
| For example, a typical 1970s minicomputer might have 4KB (yes, kilobytes, |
| not megabytes, let alone gigabytes) of main memory, with |
| single-cycle access.\footnote{ |
| It is only fair to add that each of these single cycles |
| consumed no less than 1.6\emph{microseconds}.} |
| In 2008, CPU designers still can construct a 4KB memory with single-cycle |
| access, even on systems with multi-GHz clock frequencies. |
| And in fact they frequently do construct such memories, but they now |
| call them ``level-0 caches''. |
| |
| Although the large caches found on modern microprocessors can do quite |
| a bit to help combat memory-access latencies, |
| these caches require highly predictable data-access patterns to |
| successfully hide memory latencies. |
| Unfortunately, common operations, such as traversing a linked list, |
| have extremely unpredictable memory-access patterns --- after all, |
| if the pattern was predictable, us software types would not bother |
| with the pointers, right? |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/ref}} |
| \end{center} |
| \caption{CPU Meets a Memory Reference} |
| \ContributedBy{Figure}{fig:cpu:CPU Meets a Memory Reference}{Melissa McKenney} |
| \end{figure} |
| |
| Therefore, as shown in |
| Figure~\ref{fig:cpu:CPU Meets a Memory Reference}, |
| memory references are often severe obstacles for modern CPUs. |
| |
| Thus far, we have only been considering obstacles that can arise during |
| a given CPU's execution of single-threaded code. |
| Multi-threading presents additional obstacles to the CPU, as |
| described in the following sections. |
| |
| \subsection{Atomic Operations} |
| \label{sec:cpu:Atomic Operations} |
| |
| One such obstacle is atomic operations. |
| The whole idea of an atomic operation in some sense conflicts with |
| the piece-at-a-time assembly-line operation of a CPU pipeline. |
| To hardware designers' credit, modern CPUs use a number of extremely clever |
| tricks to make such operations \emph{look} atomic even though they |
| are in fact being executed piece-at-a-time, but even so, there are |
| cases where the pipeline must be delayed or even flushed in order to |
| permit a given atomic operation to complete correctly. |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Atomic-reference}} |
| \end{center} |
| \caption{CPU Meets an Atomic Operation} |
| \ContributedBy{Figure}{fig:cpu:CPU Meets an Atomic Operation}{Melissa Broussard} |
| \end{figure} |
| |
| The resulting effect on performance is depicted in |
| Figure~\ref{fig:cpu:CPU Meets an Atomic Operation}. |
| |
| Unfortunately, atomic operations usually apply only to single elements |
| of data. |
| Because many parallel algorithms require that ordering constraints |
| be maintained between updates of multiple data elements, most CPUs |
| provide memory barriers. |
| These memory barriers also serve as performance-sapping obstacles, |
| as described in the next section. |
| |
| \QuickQuiz{} |
| What types of machines would allow atomic operations on |
| multiple data elements? |
| \QuickQuizAnswer{ |
| One answer to this question is that it is often possible to |
| pack multiple elements of data into a single machine word, |
| which can then be manipulated atomically. |
| |
| A more trendy answer would be machines supporting transactional |
| memory~\cite{DBLomet1977SIGSOFT}. |
| However, such machines are still research curiosities, |
| although as of early 2012 it appears that commodity systems |
| supporting limited forms of hardware transactional memory |
| will be commercially available within a couple of years. |
| The jury is still out on the applicability of software transactional |
| memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT, |
| ChistopherJRossbach2007a,CalinCascaval2008tmtoy, |
| AleksandarDragovejic2011STMnotToy,AlexanderMatveev2012PessimisticTM}. |
| } \QuickQuizEnd |
| |
| Fortunately, CPU designers have focused heavily on atomic operations, |
| so that as of early 2012 they have greately reduced (but by no |
| means eliminated) their overhead. |
| |
| \subsection{Memory Barriers} |
| \label{sec:cpu:Memory Barriers} |
| |
| Memory barriers will be considered in more detail in |
| Section~\ref{sec:advsync:Memory Barriers} and |
| Appendix~\ref{chp:app:whymb:Why Memory Barriers?}. |
| In the meantime, consider the following simple lock-based critical |
| section: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 spin_lock(&mylock); |
| 2 a = a + 1; |
| 3 spin_unlock(&mylock); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Memory-barrier}} |
| \end{center} |
| \caption{CPU Meets a Memory Barrier} |
| \ContributedBy{Figure}{fig:cpu:CPU Meets a Memory Barrier}{Melissa Broussard} |
| \end{figure} |
| |
| If the CPU were not constrained to execute these statements in the |
| order shown, the effect would be that the variable ``a'' would be |
| incremented without the protection of ``mylock'', which would certainly |
| defeat the purpose of acquiring it. |
| To prevent such destructive reordering, locking primitives contain |
| either explicit or implicit memory barriers. |
| Because the whole purpose of these memory barriers is to prevent reorderings |
| that the CPU would otherwise undertake in order to increase performance, |
| memory barriers almost always reduce performance, as depicted in |
| Figure~\ref{fig:cpu:CPU Meets a Memory Barrier}. |
| |
| As with atomic operations, CPU designers have been working hard to |
| reduce memory-barrier overhead, and have made substantial progress. |
| |
| \subsection{Cache Misses} |
| \label{sec:cpu:Cache Misses} |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/tollbooth}} |
| \end{center} |
| \caption{CPU Meets a Cache Miss} |
| \ContributedBy{Figure}{fig:cpu:CPU Meets a Cache Miss}{Melissa McKenney} |
| \end{figure} |
| |
| An additional multi-threading obstacle to CPU performance is |
| the ``cache miss''. |
| As noted earlier, modern CPUs sport large caches in order to reduce the |
| performance penalty that would otherwise be incurred due to high memory |
| latencies. |
| However, these caches are actually counter-productive for variables that |
| are frequently shared among CPUs. |
| This is because when a given CPU wishes to modify the variable, it is |
| most likely the case that some other CPU has modified it recently. |
| In this case, the variable will be in that other CPU's cache, but not |
| in this CPU's cache, which will therefore incur an expensive cache miss |
| (see Section~\ref{sec:app:whymb:Cache Structure} for more detail). |
| Such cache misses form a major obstacle to CPU performance, as shown |
| in Figure~\ref{fig:cpu:CPU Meets a Cache Miss}. |
| |
| \QuickQuiz{} |
| So have CPU designers also greatly reduced the overhead of |
| cache misses? |
| \QuickQuizAnswer{ |
| Unfortunately, not so much. |
| There has been some reduction given constant numbers of CPUs, |
| but the finite speed of light and the atomic nature of |
| matter limits their ability to reduce cache-miss overhead |
| for larger systems. |
| Section~\ref{sec:cpu:Hardware Free Lunch?} |
| discusses some possible avenues for possible future progress. |
| } \QuickQuizEnd |
| |
| \subsection{I/O Operations} |
| \label{sec:cpu:I/O Operations} |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/PhoneBooth}} |
| \end{center} |
| \caption{CPU Waits for I/O Completion} |
| \ContributedBy{Figure}{fig:cpu:CPU Waits for I/O Completion}{Melissa McKenney} |
| \end{figure} |
| |
| A cache miss can be thought of as a CPU-to-CPU I/O operation, and as |
| such is one of the cheapest I/O operations available. |
| I/O operations involving networking, mass storage, or (worse yet) human |
| beings pose much greater obstacles than the internal obstacles called |
| out in the prior sections, |
| as illustrated by |
| Figure~\ref{fig:cpu:CPU Waits for I/O Completion}. |
| |
| This is one of the differences between shared-memory and distributed-system |
| parallelism: shared-memory parallel programs must normally deal with no |
| obstacle worse than a cache miss, while a distributed parallel program |
| will typically incur the larger network communication latencies. |
| In both cases, the relevant latencies can be thought of as a cost of |
| communication---a cost that would be absent in a sequential program. |
| Therefore, the ratio between the overhead of the communication to |
| that of the actual work being performed is a key design parameter. |
| A major goal of parallel hardware design is to reduce this ratio as |
| needed to achieve the relevant performance and scalability goals. |
| In turn, as will be seen in |
| Chapter~\ref{cha:Partitioning and Synchronization Design}, |
| a major goal of parallel software design is to reduce the |
| frequency of expensive operations like communications cache misses. |
| |
| Of course, it is one thing to say that a given operation is an obstacle, |
| and quite another to show that the operation is a \emph{significant} |
| obstacle. |
| This distinction is discussed in the following sections. |