| % cpu/overheads.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Overheads} |
| \label{sec:cpu:Overheads} |
| |
| This section presents actual overheads of the obstacles to performance |
| listed out in the previous section. |
| However, it is first necessary to get a rough view of hardware system |
| architecture, which is the subject of the next section. |
| |
| \subsection{Hardware System Architecture} |
| \label{sec:cpu:Hardware System Architecture} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cpu/SystemArch}} |
| \caption{System Hardware Architecture} |
| \label{fig:cpu:System Hardware Architecture} |
| \end{figure} |
| |
| Figure~\ref{fig:cpu:System Hardware Architecture} |
| shows a rough schematic of an eight-core computer system. |
| Each die has a pair of CPU cores, each with its cache, as well as an |
| interconnect allowing the pair of CPUs to communicate with each other. |
| The system interconnect in the middle of the diagram allows the |
| four dies to communicate, and also connects them to main memory. |
| |
| Data moves through this system in units of ``cache lines'', which |
| are power-of-two fixed-size aligned blocks of memory, usually ranging |
| from 32 to 256 bytes in size. |
| When a CPU loads a variable from memory to one of its registers, it must |
| first load the cacheline containing that variable into its cache. |
| Similarly, when a CPU stores a value from one of its registers into |
| memory, it must also load the cacheline containing that variable into |
| its cache, but must also ensure that no other CPU has a copy of that |
| cacheline. |
| |
| For example, if CPU~0 were to perform a compare-and-swap (CAS) operation on a |
| variable whose cacheline resided in CPU~7's cache, the following |
| over-simplified sequence of events might ensue: |
| |
| \begin{enumerate} |
| \item CPU~0 checks its local cache, and does not find the cacheline. |
| \item The request is forwarded to CPU~0's and 1's interconnect, |
| which checks CPU~1's local cache, and does not find the cacheline. |
| \item The request is forwarded to the system interconnect, which |
| checks with the other three dies, learning that the cacheline |
| is held by the die containing CPU~6 and 7. |
| \item The request is forwarded to CPU~6's and 7's interconnect, which |
| checks both CPUs' caches, finding the value in CPU~7's cache. |
| \item CPU~7 forwards the cacheline to its interconnect, and also |
| flushes the cacheline from its cache. |
| \item CPU~6's and 7's interconnect forwards the cacheline to the |
| system interconnect. |
| \item The system interconnect forwards the cacheline to CPU~0's and 1's |
| interconnect. |
| \item CPU~0's and 1's interconnect forwards the cacheline to CPU~0's |
| cache. |
| \item CPU~0 can now perform the CAS operation on the value in its cache. |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| This is a \emph{simplified} sequence of events? |
| How could it \emph{possibly} be any more complex? |
| \QuickQuizAnswer{ |
| This sequence ignored a number of possible complications, |
| including: |
| |
| \begin{enumerate} |
| \item Other CPUs might be concurrently attempting to perform |
| CAS operations involving this same cacheline. |
| \item The cacheline might have been replicated read-only in |
| several CPUs' caches, in which case, it would need to |
| be flushed from their caches. |
| \item CPU~7 might have been operating on the cache line when |
| the request for it arrived, in which case CPU~7 might |
| need to hold off the request until its own operation |
| completed. |
| \item CPU~7 might have ejected the cacheline from its cache |
| (for example, in order to make room for other data), |
| so that by the time that the request arrived, the |
| cacheline was on its way to memory. |
| \item A correctable error might have occurred in the cacheline, |
| which would then need to be corrected at some point before |
| the data was used. |
| \end{enumerate} |
| |
| Production-quality cache-coherence mechanisms are extremely |
| complicated due to these sorts of |
| considerations~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel}. |
| % |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why is it necessary to flush the cacheline from CPU~7's cache? |
| \QuickQuizAnswer{ |
| If the cacheline was not flushed from CPU~7's cache, then |
| CPUs~0 and 7 might have different values for the same set |
| of variables in the cacheline. |
| This sort of incoherence would greatly complicate parallel |
| software, and so hardware architects have been convinced to |
| avoid it. |
| } \QuickQuizEnd |
| |
| This simplified sequence is just the beginning of a discipline called |
| \emph{cache-coherency protocols}~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel}. |
| |
| \subsection{Costs of Operations} |
| \label{sec:cpu:Costs of Operations} |
| |
| \begin{table} |
| \small |
| \centering |
| \begin{tabular}{l||r|r} |
| & & Ratio \\ |
| Operation & Cost (ns) & (cost/clock) \\ |
| \hline |
| \hline |
| Clock period & 0.6 & 1.0 \\ |
| \hline |
| Best-case CAS & 37.9 & 63.2 \\ |
| \hline |
| Best-case lock & 65.6 & 109.3 \\ |
| \hline |
| Single cache miss & 139.5 & 232.5 \\ |
| \hline |
| CAS cache miss & 306.0 & 510.0 \\ |
| \hline |
| Comms Fabric & 5,000\textcolor{white}{.0} |
| & 8,330\textcolor{white}{.0} |
| \\ |
| \hline |
| Global Comms & 195,000,000\textcolor{white}{.0} |
| & 325,000,000\textcolor{white}{.0} \\ |
| \\ |
| \end{tabular} |
| \caption{Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System} |
| \label{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System} |
| \end{table} |
| |
| The overheads of some common operations important to parallel programs are |
| displayed in |
| Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}. |
| This system's clock period rounds to 0.6ns. |
| Although it is not unusual for modern microprocessors to be able to |
| retire multiple instructions per clock period, the operations's costs are |
| nevertheless normalized to a clock period in the third column, labeled |
| ``Ratio''. |
| The first thing to note about this table is the large values of many of |
| the ratios. |
| |
| The best-case compare-and-swap (CAS) operation consumes almost forty |
| nanoseconds, a duration more than sixty times that of the clock period. |
| Here, ``best case'' means that the same CPU now performing the CAS operation |
| on a given variable was the last CPU to operate on this variable, so |
| that the corresponding cache line is already held in that CPU's cache. |
| Similarly, the best-case lock operation (a ``round trip'' pair consisting |
| of a lock acquisition followed by a lock release) consumes more than |
| sixty nanoseconds, or more than one hundred clock cycles. |
| Again, ``best case'' means that the data structure representing the |
| lock is already in the cache belonging to the CPU acquiring and releasing |
| the lock. |
| The lock operation is more expensive than CAS because it requires two |
| atomic operations on the lock data structure. |
| |
| An operation that misses the cache consumes almost one hundred and forty |
| nanoseconds, or more than two hundred clock cycles. |
| The code used for this cache-miss measurement passes the cache line |
| back and forth between a pair of CPUs, so this cache miss is satisfied |
| not from memory, but rather from the other CPU's cache. |
| A CAS operation, which must look at the old value of the variable as |
| well as store a new value, consumes over three hundred nanoseconds, or |
| more than five hundred clock cycles. |
| Think about this a bit. |
| In the time required to do \emph{one} CAS operation, the CPU could have |
| executed more than \emph{five hundred} normal instructions. |
| This should demonstrate the limitations not only of fine-grained locking, |
| but of any other synchronization mechanism relying on fine-grained |
| global agreement. |
| |
| \QuickQuiz{} |
| Surely the hardware designers could be persuaded to improve |
| this situation! |
| Why have they been content with such abysmal performance |
| for these single-instruction operations? |
| \QuickQuizAnswer{ |
| The hardware designers \emph{have} been working on this |
| problem, and have consulted with no less a luminary than |
| the physicist Stephen Hawking. |
| Hawking's observation was that the hardware designers have |
| two basic problems~\cite{BryanGardiner2007}: |
| |
| \begin{enumerate} |
| \item the finite speed of light, and |
| \item the atomic nature of matter. |
| \end{enumerate} |
| |
| \begin{table} |
| \small |
| \centering |
| \begin{tabular}{l||r|r} |
| & & Ratio \\ |
| Operation & Cost (ns) & (cost/clock) \\ |
| \hline |
| \hline |
| Clock period & 0.4 & 1.0 \\ |
| \hline |
| ``Best-case'' CAS & 12.2 & 33.8 \\ |
| \hline |
| Best-case lock & 25.6 & 71.2 \\ |
| \hline |
| Single cache miss & 12.9 & 35.8 \\ |
| \hline |
| CAS cache miss & 7.0 & 19.4 \\ |
| \hline |
| Off-Core & & \\ |
| \hline |
| Single cache miss & 31.2 & 86.6 \\ |
| \hline |
| CAS cache miss & 31.2 & 86.5 \\ |
| \hline |
| Off-Socket & & \\ |
| \hline |
| Single cache miss & 92.4 & 256.7 \\ |
| \hline |
| CAS cache miss & 95.9 & 266.4 \\ |
| \hline |
| Comms Fabric & 2,600\textcolor{white}{.0} |
| & 7,220\textcolor{white}{.0} \\ |
| \hline |
| Global Comms & 195,000,000\textcolor{white}{.0} |
| & 542,000,000\textcolor{white}{.0} \\ |
| \end{tabular} |
| \caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
| \label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
| \end{table} |
| |
| The first problem limits raw speed, and the second limits |
| miniaturization, which in turn limits frequency. |
| And even this sidesteps the power-consumption issue that |
| is currently holding production frequencies to well below |
| 10 GHz. |
| |
| Nevertheless, some progress is being made, as may be seen |
| by comparing |
| Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
| with |
| Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System} |
| on |
| page~\pageref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}. |
| Integration of hardware threads in a single core and multiple |
| cores on a die have improved latencies greatly, at least within the |
| confines of a single core or single die. |
| There has been some improvement in overall system latency, |
| but only by about a factor of two. |
| Unfortunately, neither the speed of light nor the atomic nature |
| of matter has changed much in the past few years. |
| |
| Section~\ref{sec:cpu:Hardware Free Lunch?} |
| looks at what else hardware designers might be |
| able to do to ease the plight of parallel programmers. |
| } \QuickQuizEnd |
| |
| I/O operations are even more expensive. |
| As shown in the ``Comms Fabric'' row, |
| high performance (and expensive!) communications fabric, such as |
| InfiniBand or any number of proprietary interconnects, has a latency |
| of roughly five microseconds for an end-to-end round trip, during which |
| time more than eight \emph{thousand} instructions might have been executed. |
| Standards-based communications networks often require some sort of |
| protocol processing, which further increases the latency. |
| Of course, geographic distance also increases latency, with the |
| speed-of-light through optical fiber latency around the world coming to |
| roughly 195 \emph{milliseconds}, or more than 300 million clock |
| cycles, as shown in the ``Global Comms'' row. |
| |
| % Reference of Infiniband latency: |
| % http://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/1_Mellanox.pdf |
| % page 6/76 'Leading Interconnect, Leading Performance' |
| |
| \QuickQuiz{} |
| These numbers are insanely large! |
| How can I possibly get my head around them? |
| \QuickQuizAnswer{ |
| Get a roll of toilet paper. |
| In the USA, each roll will normally have somewhere around 350-500 |
| sheets. |
| Tear off one sheet to represent a single clock cycle, setting it aside. |
| Now unroll the rest of the roll. |
| |
| The resulting pile of toilet paper will likely represent a single |
| CAS cache miss. |
| |
| For the more-expensive inter-system communications latencies, |
| use several rolls (or multiple cases) of toilet paper to represent |
| the communications latency. |
| |
| Important safety tip: make sure to account for the needs of |
| those you live with when appropriating toilet paper! |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/Data-chasing-light-wave}} |
| \caption{Hardware and Software: On Same Side} |
| \ContributedBy{Figure}{fig:cpu:Hardware and Software: On Same Side}{Melissa Broussard} |
| \end{figure} |
| |
| In short, hardware and software engineers are really fighting on the same |
| side, trying to make computers go fast despite the best efforts of the |
| laws of physics, as fancifully depicted in |
| Figure~\ref{fig:cpu:Hardware and Software: On Same Side} |
| where our data stream is trying its best to exceed the speed of light. |
| The next section discusses some of the things that the hardware engineers |
| might (or might not) be able to do. |
| Software's contribution to this fight is outlined in the remaining chapters |
| of this book. |