| % cpu/overheads.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Overheads} |
| \label{sec:cpu:Overheads} |
| % |
| \epigraph{Don't design bridges in ignorance of materials, and don't design |
| low-level software in ignorance of the underlying hardware.} |
| {Unknown} |
| |
This section presents actual \IXpl{overhead} of the obstacles to performance
listed in the previous section.
| However, it is first necessary to get a rough view of hardware system |
| architecture, which is the subject of the next section. |
| |
| \subsection{Hardware System Architecture} |
| \label{sec:cpu:Hardware System Architecture} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cpu/SystemArch}} |
| \caption{System Hardware Architecture} |
| \label{fig:cpu:System Hardware Architecture} |
| \end{figure} |
| |
| \Cref{fig:cpu:System Hardware Architecture} |
| shows a rough schematic of an eight-core computer system. |
| Each die has a pair of CPU cores, each with its cache, as well as an |
| interconnect allowing the pair of CPUs to communicate with each other. |
| The system interconnect allows the four dies to communicate with each |
| other and with main memory. |
| |
| Data moves through this system in units of ``\IXpl{cache line}'', which |
| are power-of-two fixed-size aligned blocks of memory, usually ranging |
| from 32 to 256 bytes in size. |
| When a CPU loads a variable from memory to one of its registers, it must |
| first load the cacheline containing that variable into its cache, |
| which may be thought of as a hardware hash table. |
Similarly, when a CPU stores a value from one of its registers into
memory, it too must first load the cacheline containing that variable
into its cache, but must additionally ensure that no other CPU has a
copy of that cacheline.
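
Cacheline sizes vary from system to system, so software that depends
on them should query the hardware rather than hard-coding a guess.
As a minimal sketch, on Linux systems running glibc the level-1
data-cacheline size can be obtained as follows, keeping in mind that
\co{_SC_LEVEL1_DCACHE_LINESIZE} is a glibc extension that is not
reported on all systems:

\begin{verbatim}
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* glibc extension: L1 data-cacheline size in bytes. */
        long linesize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

        if (linesize <= 0)
                printf("Cacheline size not reported.\n");
        else
                printf("L1 data cacheline: %ld bytes\n", linesize);
        return 0;
}
\end{verbatim}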
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cpu/SimpleWrite}} |
| \caption{Lifetime of a ``Simple'' Store} |
| \label{fig:cpu:Lifetime of a Simple Store} |
| \end{figure} |
| |
| For example, suppose that CPU~1 wrote to a variable \co{x} whose cacheline |
| was in CPU~6's cache. |
| An over-simplified view of this process is illustrated by the following |
| sequence of steps in conjunction with |
| \cref{fig:cpu:Lifetime of a Simple Store}, |
| each row of which condenses |
| \cref{fig:cpu:System Hardware Architecture} |
| to show only CPUs~1 and~6, their store buffers and local caches, along |
| with the system interconnect denoted by the grey rectangle connecting |
| the pair of CPUs. |
| |
| \begin{enumerate} |
| \item CPU~1 checks its local cache, and does not find the cacheline. |
| It therefore records the write in its store buffer as shown |
| in row~A of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item A request for this cacheline is forwarded to CPU~0's and~1's |
	interconnect, which checks CPU~0's local cache, and does not
| find the cacheline. |
| Because nothing has changed, the system state is still as shown |
| in row~A of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item This request is forwarded to the system interconnect, as shown |
| in row~B of |
| \cref{fig:cpu:Lifetime of a Simple Store}, |
| which checks with the other three dies, learning that the |
| cacheline is held by the die containing CPU~6 and~7. |
| \item This request is forwarded to CPU~6's and~7's interconnect, which |
| checks both CPUs' caches, finding the value in CPU~6's cache, |
| as shown in row~C of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item CPU~6 forwards the cacheline to its interconnect, and also |
| flushes the cacheline from its cache, and then |
| CPU~6's and~7's interconnect forwards the cacheline to the |
| system interconnect, as shown in row~D of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item The system interconnect forwards the cacheline to CPU~0's and~1's |
| interconnect, as shown in row~E of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item CPU~0's and~1's interconnect forwards the cacheline to CPU~1, |
| as shown in row~F of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item The cache line is deposited into CPU~1's cache, as shown in |
| row~G of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \item CPU~1 can now complete the write, updating the relevant portions |
| of the newly arrived cacheline from the value previously |
| recorded in the store buffer, as shown in row~H of |
| \cref{fig:cpu:Lifetime of a Simple Store}. |
| \end{enumerate} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| This is a \emph{simplified} sequence of events? |
| How could it \emph{possibly} be any more complex? |
| }\QuickQuizAnswerB{ |
| This sequence ignored a number of possible complications, |
| including: |
| |
| \begin{enumerate} |
| \item Other CPUs might be concurrently attempting to perform |
| memory-reference operations involving this same cacheline. |
| \item The cacheline might have been replicated read-only in |
| several CPUs' caches, in which case, it would need to |
| be flushed from their caches. |
| \item CPU~6 might have been operating on the cache line when |
| the request for it arrived, in which case CPU~6 might |
| need to hold off the request until its own operation |
| completed. |
| \item CPU~6 might have ejected the cacheline from its cache |
| (for example, in order to make room for other data), |
| so that by the time that the request arrived, the |
| cacheline was on its way to memory. |
| \item A correctable error might have occurred in the cacheline, |
| which would then need to be corrected at some point before |
| the data was used. |
| \end{enumerate} |
| |
| Production-quality cache-coherence mechanisms are extremely |
| complicated due to these sorts of |
| considerations~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel}. |
| % |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Why is it necessary to flush the cacheline from CPU~6's cache? |
| }\QuickQuizAnswerE{ |
	If the cacheline were not flushed from CPU~6's cache, then
| CPUs~1 and~6 might have different values for the same set |
| of variables in the cacheline. |
| This sort of incoherence greatly complicates parallel software, |
| which is why wise hardware architects avoid it. |
| }\QuickQuizEndE |
| } |
| |
| This simplified sequence is just the beginning of a discipline called |
| \emph{cache-coherency protocols}~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel}, |
| which is discussed in more detail in |
| \cref{chp:app:whymb:Why Memory Barriers?}\@. |
| As can be seen in the sequence of events triggered by a simple memory |
| write operation, |
| a single instruction can cause considerable protocol traffic, which |
| can significantly degrade your parallel program's performance. |
| |
| Fortunately, if a given variable is being frequently read during a time |
| interval during which it is never updated, that variable can be replicated |
| across all CPUs' caches. |
| This replication permits all CPUs to enjoy extremely fast access to |
| this \emph{read-mostly} variable. |
| \Cref{chp:Deferred Processing} presents synchronization |
| mechanisms that take full advantage of this important hardware read-mostly |
| optimization. |
| |
| \subsection{Costs of Operations} |
| \label{sec:cpu:Costs of Operations} |
| |
| \begin{table} |
| %\rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \centering\small |
| \tcresizewidth{ |
| \begin{tabular} |
| { |
| ll |
| S[table-format = 9.1] |
| S[table-format = 9.1] |
| r |
| } |
| \toprule |
| \multicolumn{2}{l}{Operation} |
| & \multicolumn{1}{r}{Cost (ns)} |
| & {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}} |
| & CPUs \\ |
| \midrule |
| \multicolumn{2}{l}{Clock period} |
| & 0.5 & 1.0 & \\ |
| \midrule |
| \multicolumn{2}{l}{Same-CPU} |
| & & & 0 \\ |
| & CAS & 7.0 & 14.6 & \\ |
| & lock & 15.4 & 32.3 & \\ |
| \midrule |
| \multicolumn{2}{l}{On-Core} |
| & & & 224 \\ |
| & Blind CAS& 7.2 & 15.2 & \\ |
| & CAS & 18.0 & 37.7 & \\ |
| \midrule |
| \multicolumn{2}{l}{Off-Core} |
| & & & 1--27 \\ |
| & Blind CAS& 47.5 & 99.8 & 225--251 \\ |
| & CAS & 101.9 & 214.0 & \\ |
| \midrule |
| \multicolumn{2}{l}{Off-Socket} |
| & & & 28--111 \\ |
| & Blind CAS& 148.8 & 312.5 & 252--335 \\ |
| & CAS & 442.9 & 930.1 & \\ |
| \midrule |
| \multicolumn{2}{l}{Cross-Interconnect} |
| & & & 112--223 \\ |
| & Blind CAS& 336.6 & 706.8 & 336--447 \\ |
| & CAS & 944.8 & 1984.2 & \\ |
| \midrule |
| \multicolumn{2}{l}{Off-System} |
| & & & \\ |
| & Comms Fabric & 5 000 & 10 500 & \\ |
| & Global Comms & 195 000 000 & 409 500 000 & \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10\,GHz} |
| \label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| \end{table} |
| |
| The overheads of some common operations important to parallel programs are |
| displayed in |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}. |
| This system's clock period rounds to 0.5\,ns. |
| Although it is not unusual for modern microprocessors to be able to |
| retire multiple instructions per clock period, the operations' costs are |
| nevertheless normalized to a clock period in the third column, labeled |
| ``Ratio''. |
| The first thing to note about this table is the large values of many of |
| the ratios. |
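For example, the nominal 2.10\,GHz clock corresponds to a period of
$1/2.10\,\mbox{GHz} \approx 0.48$\,ns, so that the table's 7.0\,ns
same-CPU entry normalizes to a ratio of roughly
$7.0/0.48 \approx 14.6$.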
| |
| The same-CPU \acrmf{cas} operation consumes about seven |
| nanoseconds, a duration more than ten times that of the clock period. |
| CAS is an atomic operation in which the hardware compares the contents |
| of the specified memory location to a specified ``old'' value, and if |
| they compare equal, stores a specified ``new'' value, in which case the |
| CAS operation succeeds. |
| If they compare unequal, the memory location keeps its (unexpected) value, |
| and the CAS operation fails. |
| The operation is atomic in that the hardware guarantees that the memory |
| location will not be changed between the compare and the store. |
| CAS functionality is provided by the \co{lock;cmpxchg} instruction on x86. |
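
In C, one way to request this functionality is through the C11
\co{stdatomic.h} API, as in the following minimal sketch, in which
the variable and function names are purely illustrative:

\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int x;   /* The shared memory location. */

/* Atomically change x from oldval to newval; return true on success. */
static bool cas(int oldval, int newval)
{
        /* On failure, oldval is overwritten with the value actually
         * observed in x, which this sketch simply discards. */
        return atomic_compare_exchange_strong(&x, &oldval, newval);
}
\end{verbatim}

On x86, compilers typically implement this call using the
aforementioned \co{lock;cmpxchg} instruction.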
| |
| The ``same-CPU'' prefix means that the CPU now performing the CAS operation |
| on a given variable was also the last CPU to access this variable, so |
| that the corresponding cacheline is already held in that CPU's cache. |
| Similarly, the same-CPU lock operation (a ``round trip'' pair consisting |
| of a lock acquisition and release) consumes more than fifteen nanoseconds, |
| or more than thirty clock cycles. |
| The lock operation is more expensive than CAS because it requires two |
| atomic operations on the lock data structure, one for acquisition and |
| the other for release. |
| |
| On-core operations involving interactions between the hardware threads |
| sharing a single core are about the same cost as same-CPU operations. |
| This should not be too surprising, given that these two hardware threads |
| also share the full cache hierarchy. |
| |
| In the case of the blind CAS, the software specifies the old value |
| without looking at the memory location. |
| This approach is appropriate when attempting to acquire a lock. |
| If the unlocked state is represented by zero and the locked state |
| is represented by the value one, then a CAS operation on the lock |
| that specifies zero for the old value and one for the new value |
| will acquire the lock if it is not already held. |
| The key point is that there is only one access to the memory |
| location, namely the CAS operation itself. |
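
For example, a minimal test-and-set spinlock might use a blind CAS for
acquisition as in the following sketch, which is for illustration only
and omits the many refinements found in production-quality locks:

\begin{verbatim}
#include <stdatomic.h>

static _Atomic int lck;   /* 0: unlocked, 1: locked. */

static void spin_acquire(void)
{
        int expected;

        for (;;) {
                expected = 0;   /* Blindly assume the lock is free. */
                if (atomic_compare_exchange_strong(&lck,
                                                   &expected, 1))
                        return; /* One access: the CAS itself. */
        }
}

static void spin_release(void)
{
        atomic_store(&lck, 0);
}
\end{verbatim}

One such refinement is to spin on a plain load, attempting the CAS
only once that load sees the lock free, which reduces cacheline
traffic when the lock is contended.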
| |
| In contrast, a normal CAS operation's old value is derived from |
| some earlier load. |
| For example, to implement an atomic increment, the current value in |
| a shared variable is loaded into one machine register and then incremented |
| to produce the new value in another machine register. |
| Then in the CAS operation, the first register is specified as the old |
| value and the second register as the new value. |
| If the shared variable's value did not change in the meantime, the |
| CAS operation would store the new value, thus incrementing that shared |
| variable. |
| However, if the shared variable's value did change, then the old value |
| would not match, causing a miscompare that would result in the CAS |
| operation failing and thus making no further change to that shared |
| variable. |
| The key point is that there are now two accesses to the memory location, |
| the load and the CAS\@. |
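
This load-then-CAS pattern might be expressed as in the following
sketch of an atomic increment, in which the names are again
illustrative:

\begin{verbatim}
#include <stdatomic.h>

static _Atomic long ctr;

static void inc_via_cas(void)
{
        long oldval, newval;

        do {
                oldval = atomic_load(&ctr); /* 1st access: the load. */
                newval = oldval + 1;
                /* 2nd access: the CAS, which fails on a miscompare. */
        } while (!atomic_compare_exchange_strong(&ctr, &oldval,
                                                 newval));
}
\end{verbatim}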
| |
| Thus, it is not surprising that on-core blind CAS consumes only about |
| seven nanoseconds, while on-core CAS consumes about 18 nanoseconds. |
| The non-blind case's extra load does not come for free. |
That said, the overheads of these operations are similar to those of
same-CPU CAS and lock, respectively.
| |
| \QuickQuiz{ |
| \Cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| shows CPU~0 sharing a core with CPU~224. |
	However, isn't it more logical for CPU~0
	to share a core with CPU~1 instead of CPU~224???
| }\QuickQuizAnswer{ |
| It is easy to be sympathetic to this view, but the file |
| \path{/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list} |
| really does contain the string \co{0,224}. |
| Therefore, CPU~0's hyperthread twin really is CPU~224. |
| Some people speculate that this numbering allows naive applications |
| and schedulers to perform better, citing the fact that on many |
| workloads the second hyperthread does not provide a huge |
| amount of additional performance. |
| This speculation assumes that naive applications and schedulers |
| would utilize CPUs in numerical order, leaving aside the weaker |
| hyperthread twin CPUs until all cores are in use. |
| }\QuickQuizEnd |
| |
| A blind CAS involving CPUs in different cores but on the same socket |
| consumes almost fifty nanoseconds, or almost one hundred clock cycles. |
| The code used for this cache-miss measurement passes the cache line |
| back and forth between a pair of CPUs, so this cache miss is satisfied |
| not from memory, but rather from the other CPU's cache. |
| A non-blind CAS operation, which as noted earlier must look at the old |
| value of the variable as well as store a new value, consumes over one |
| hundred nanoseconds, or more than two hundred clock cycles. |
| Think about this a bit. |
| In the time required to do \emph{one} CAS operation, the CPU could have |
| executed more than \emph{two hundred} normal instructions. |
| This should demonstrate the limitations not only of fine-grained locking, |
| but of any other synchronization mechanism relying on fine-grained |
| global agreement. |
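
The general shape of such a ping-pong measurement is sketched below.
This is an illustrative sketch rather than the benchmark used to
generate the table, and the choice of CPUs, the iteration count, and
the omitted timing code are all placeholders:

\begin{verbatim}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NSAMPLES 1000000

static _Atomic int token;  /* Cacheline bounced between the CPUs. */

static void bind_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Wait for the token to equal myval, then pass it back. */
static void *pingpong(void *arg)
{
        int myval = (int)(long)arg;
        int expected;
        long i;

        bind_to_cpu(myval);  /* Placeholder: CPUs 0 and 1. */
        for (i = 0; i < NSAMPLES; i++) {
                do {
                        expected = myval;  /* Blind CAS. */
                } while (!atomic_compare_exchange_strong(&token,
                                                         &expected,
                                                         !myval));
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;

        /* Timing (e.g., clock_gettime()) is omitted. */
        pthread_create(&tid, NULL, pingpong, (void *)1);
        pingpong((void *)0);
        pthread_join(tid, NULL);
        return 0;
}
\end{verbatim}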
| |
| If the pair of CPUs are on different sockets, the operations are considerably |
| more expensive. |
| A blind CAS operation consumes almost 150~nanoseconds, or more than |
| three hundred clock cycles. |
| A normal CAS operation consumes more than 400~nanoseconds, or almost |
| \emph{one thousand} clock cycles. |
| |
| Worse yet, not all pairs of sockets are created equal. |
| This particular system appears to be constructed as a pair of four-socket |
| components, with additional latency penalties when the CPUs reside |
| in different components. |
| In this case, a blind CAS operation consumes more than three hundred |
| nanoseconds, or more than seven hundred clock cycles. |
| A CAS operation consumes almost a full microsecond, or almost two |
| thousand clock cycles. |
| |
\QuickQuizLabelRel{\QspeedOfLightAtoms}{1} % can't put label inside QQSeries
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Surely the hardware designers could be persuaded to improve |
| this situation! |
| Why have they been content with such abysmal performance |
| for these single-instruction operations? |
| }\QuickQuizAnswerB{ |
| The hardware designers \emph{have} been working on this |
| problem, and have consulted with no less a luminary than |
| the late physicist Stephen Hawking. |
| Hawking's observation was that the hardware designers have |
| two basic problems~\cite{BryanGardiner2007}: |
| |
| \begin{enumerate} |
| \item The finite speed of light, and |
| \item The atomic nature of matter. |
| \end{enumerate} |
| |
| \begin{table} |
| %\rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \centering\small |
| \begin{tabular} |
| { |
| ll |
| S[table-format = 9.1] |
| S[table-format = 9.1] |
| } |
| \toprule |
| \multicolumn{2}{l}{Operation} |
| & \multicolumn{1}{r}{Cost (ns)} |
| & {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}} \\ |
| \midrule |
| \multicolumn{2}{l}{Clock period} |
| & 0.4 & 1.0 \\ |
| \midrule |
| \multicolumn{2}{l}{Same-CPU} |
| & & \\ |
| & CAS & 12.2 & 33.8 \\ |
| & lock & 25.6 & 71.2 \\ |
| \midrule |
| \multicolumn{2}{l}{On-Core} |
| & & \\ |
| & Blind CAS & 12.9 & 35.8 \\ |
| & CAS & 7.0 & 19.4 \\ |
| \midrule |
| \multicolumn{2}{l}{Off-Core} |
| & & \\ |
| & Blind CAS & 31.2 & 86.6 \\ |
| & CAS & 31.2 & 86.5 \\ |
| \midrule |
| \multicolumn{2}{l}{Off-Socket} |
| & & \\ |
| & Blind CAS & 92.4 & 256.7 \\ |
| & CAS & 95.9 & 266.4 \\ |
| \midrule |
| \multicolumn{2}{l}{Off-System} |
| & & \\ |
| & Comms Fabric & 2 600 & 7 220 \\ |
| & Global Comms & 195 000 000 & 542 000 000 \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Performance of Synchronization Mechanisms on 16-CPU 2.8\,GHz Intel X5550 (Nehalem) System} |
| \label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
| \end{table} |
| |
| The first problem limits raw speed, and the second limits |
| miniaturization, which in turn limits frequency. |
| And even this sidesteps the power-consumption issue that |
| is currently limiting production frequencies to well below |
| 10\,GHz. |
| |
| In addition, |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| on |
| \cpageref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| represents a reasonably large system with no fewer than 448~hardware |
| threads. |
| Smaller systems often achieve better latency, as may be seen in |
| \cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}, |
| which represents a much smaller system with only 16~hardware threads. |
| A similar view is provided by the rows of |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| down to and including the two ``Off-Core'' rows. |
| |
| \begin{table} |
| %\rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \centering\small |
| \tcresizewidth{ |
| \begin{tabular} |
| { |
| ll |
| S[table-format = 9.1] |
| S[table-format = 9.1] |
| r |
| } |
| \toprule |
| \multicolumn{2}{l}{Operation} |
| & \multicolumn{1}{r}{Cost (ns)} |
| & {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}} |
| & CPUs \\ |
| \midrule |
| \multicolumn{2}{l}{Clock period} |
| & 0.5 & 1.0 & \\ |
| \midrule |
| \multicolumn{2}{l}{Same-CPU} & & & 0 \\ |
| & CAS & 6.2 & 13.6 & \\ |
| & lock & 13.5 & 29.6 & \\ |
| \midrule |
| \multicolumn{2}{l}{On-Core} & & & 6 \\ |
| & Blind CAS & 6.5 & 14.3 & \\ |
| & CAS & 16.2 & 35.6 & \\ |
| \midrule |
| \multicolumn{2}{l}{Off-Core} & & & 1--5 \\ |
| & Blind CAS & 22.2 & 48.8 & 7--11 \\ |
| & CAS & 53.6 & 117.9 & \\ |
| \midrule |
| \multicolumn{2}{l}{Off-System}& & & \\ |
| & Comms Fabric & 5 000 & 11 000 & \\ |
| & Global Comms & 195 000 000 & 429 000 000 & \\ |
| \bottomrule |
| \end{tabular} |
| } |
| \caption{CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20\,GHz} |
| \label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz} |
| \end{table} |
| |
| Furthermore, newer small-scale single-socket systems such |
| as the laptop on which I am typing this also have more |
| reasonable latencies, as can be seen in |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}. |
| |
	Alternatively, a 64-CPU system in the mid-1990s had
| cross-interconnect latencies in excess of five microseconds, |
| so even the eight-socket 448-hardware-thread monster shown in |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| represents more than a five-fold improvement over its |
| 25-years-prior counterparts. |
| |
	Integration of hardware threads into a single core and of
	multiple cores onto a die has improved latencies greatly,
	at least within the confines of a single core or single die.
| There has been some improvement in overall system latency, |
| but only by about a factor of two. |
| Unfortunately, neither the speed of light nor the atomic nature |
| of matter has changed much in the past few |
| years~\cite{NoBugsHare2016CPUoperations}. |
| Therefore, spatial and temporal locality are first-class concerns |
| for concurrent software, even when running on relatively |
| small systems. |
| |
| \Cref{sec:cpu:Hardware Free Lunch?} |
| looks at what else hardware designers might be |
| able to do to ease the plight of parallel programmers. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| \Cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
| in the answer to \QuickQuizARef{\QspeedOfLightAtoms} on |
| \cpageref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System} |
	says that on-core CAS is faster than both same-CPU CAS and
| on-core blind CAS\@. |
| What is happening there? |
| }\QuickQuizAnswerE{ |
	I \emph{was} surprised by the data I obtained, and therefore
	checked its validity carefully, persistently obtaining the
	same results.
	One theory that might explain this observation is that the
	two hardware threads in the core are able to overlap their
	accesses, while the single CPU must do everything sequentially.
| Unfortunately, there seems to be no public documentation explaining |
| why the Intel X5550 (Nehalem) system behaved like that. |
| }\QuickQuizEndE |
| } % End of \QuickQuizSeries |
| |
| \begin{table} |
| \rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \centering\small |
| \begin{tabular}{lrrrrr} |
| \toprule |
| Level & Scope & Line Size & Sets & Ways & Size \\ |
| \midrule |
| L0 & Core & 64 & 64 & 8 & 32K \\ |
| L1 & Core & 64 & 64 & 8 & 32K \\ |
| L2 & Core & 64 & 1024 & 16 & 1024K \\ |
| L3 & Socket & 64 & 57,344 & 11 & 39,424K \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10\,GHz} |
| \label{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz} |
| \end{table} |
| |
| Unfortunately, the high speed of within-core and within-socket communication |
| does not come for free. |
| First, there are only two CPUs within a given core and only 56 within |
| a given socket, compared to 448 across the system. |
| Second, as shown in |
| \cref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz}, |
| the on-core caches are quite small compared to the on-socket caches, which |
| are in turn quite small compared to the 1.4\,TB of memory configured on |
| this system. |
Third, again referring to the table, the caches are organized as
hardware hash tables with a limited number of items per bucket.
For example, the raw size of the L3 cache (``Size'') is almost 40\,MB,
but each of its 57,344 buckets (``Sets'') can hold only 11 cachelines
(``Ways''), each of which can be at most 64 bytes (``Line Size'').
| This means that only 12 bytes of memory (admittedly at carefully chosen |
| addresses) are required to overflow this 40\,MB cache. |
| On the other hand, equally careful choice of addresses might make good |
| use of the entire 40\,MB. |
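
To see how carefully chosen addresses can defeat the cache, note that
hardware typically selects the bucket using address bits immediately
above the offset within the cacheline.
A minimal sketch of this mapping for the L3 geometry in the table
follows, bearing in mind that real hardware is free to hash or
interleave the address bits in undocumented ways:

\begin{verbatim}
#include <stdint.h>

/* Which L3 bucket ("set") does a given address map to? */
static unsigned long l3_set(uintptr_t addr)
{
        unsigned long line_size = 64;     /* "Line Size" column. */
        unsigned long nsets = 57344;      /* "Sets" column.      */

        return (addr / line_size) % nsets;
}
\end{verbatim}

Twelve one-byte accesses whose addresses lie in distinct cachelines
but map to the same set exceed the 11 available ways, forcing an
eviction even though the cache is otherwise nearly empty.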
| |
| Spatial locality of reference is clearly extremely important, as is |
| spreading the data across memory. |
| |
| I/O operations are even more expensive. |
| As shown in the ``Comms Fabric'' row, |
high-performance (and expensive!\@) communications fabric, such as
| InfiniBand or any number of proprietary interconnects, has a latency |
| of roughly five microseconds for an end-to-end round trip, during which |
| time more than \emph{ten thousand} instructions might have been executed. |
| Standards-based communications networks often require some sort of |
| protocol processing, which further increases the \IX{latency}. |
Of course, geographic distance also increases latency:
The speed-of-light latency through optical fiber for a single trip
around the world comes to roughly 195 \emph{milliseconds}, or more
than 400 million clock cycles, as shown in the ``Global Comms'' row of
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}. |
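
This global figure is consistent with back-of-the-envelope arithmetic:
Given a trip of roughly $40\,000$\,km (the Earth's circumference) and
light propagating through optical fiber at about two-thirds of its
vacuum speed, the latency comes to roughly
$40\,000\,\mbox{km} / 200\,000\,\mbox{km/s} = 0.2$\,s,
in reasonable agreement with the 195\,ms figure.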
| |
| % Reference for Infiniband latency: |
| % http://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/1_Mellanox.pdf |
| % page 6/76 'Leading Interconnect, Leading Performance' |
| % Needs updating... |
| |
| \QuickQuiz{ |
| These numbers are insanely large! |
| How can I possibly get my head around them? |
| }\QuickQuizAnswer{ |
| Get a roll of toilet paper. |
| In the USA, each roll will normally have somewhere around |
| 350--500 sheets. |
| Tear off one sheet to represent a single clock cycle, setting it aside. |
| Now unroll the rest of the roll. |
| |
| The resulting pile of toilet paper will likely represent a single |
| \IXacr{cas} cache miss. |
| |
| For the more-expensive inter-system communications latencies, |
| use several rolls (or multiple cases) of toilet paper to represent |
| the communications latency. |
| |
| Important safety tip: |
| Make sure to account for the needs of those you live with when |
| appropriating toilet paper, especially |
| in 2020 or during a similar time when store shelves are free of |
| toilet paper and much else besides. |
| |
| Furthermore, for those working on kernel code, a CPU disabling |
| interrupts across a cache miss is analogous to you holding your |
| breath while unrolling a roll of toilet paper. |
| How many rolls of toilet paper can \emph{you} unroll while holding |
| your breath? |
| You might wish to avoid disabling interrupts across that many |
| cache misses.\footnote{ |
| Kudos to Matthew Wilcox for this holding-breath analogy.} |
| }\QuickQuizEnd |
| |
| \subsection{Hardware Optimizations} |
| \label{sec:cpu:Hardware Optimizations} |
| |
| It is only natural to ask how the hardware is helping, and the answer |
| is ``Quite a bit!'' |
| |
| One hardware optimization is large cachelines. |
| This can provide a big performance boost, especially when software is |
| accessing memory sequentially. |
For example, given a 64-byte cacheline and software accessing 64-bit
variables, the first access will still be slow due to speed-of-light
delays (if nothing else), but the remaining seven accesses can be
quite fast.
| However, this optimization has a dark side, namely \IX{false sharing}, |
| which happens when different variables in the same cacheline are |
| being updated by different CPUs, resulting in a high cache-miss rate.\footnote{ |
| This situation is sometimes referred to as ``cache thrashing'' |
| or ``cacheline bouncing''.} |
| Software can use the alignment directives available in many compilers |
| to avoid false sharing, and adding such directives is a common step |
| in tuning parallel software. |
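
For example, with GCC and Clang, per-CPU counters might be padded out
to an assumed 64-byte cacheline as in the following sketch, so that no
two CPUs' counters share a cacheline.
The 64-byte figure and the \co{NR_CPUS} constant are assumptions that
must be matched to the actual system:

\begin{verbatim}
#define NR_CPUS 448   /* Placeholder: CPUs in the system at hand. */

/* One counter per CPU, each on its own (assumed 64-byte) cacheline. */
struct percpu_counter {
        unsigned long count;
} __attribute__((__aligned__(64)));

static struct percpu_counter counters[NR_CPUS];
\end{verbatim}

The alignment attribute rounds each element up to 64 bytes, trading a
modest amount of memory for the elimination of false sharing among
these counters.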
| |
| A second related hardware optimization is cache prefetching, in which |
| the hardware reacts to consecutive accesses by prefetching subsequent |
| cachelines, thereby evading speed-of-light delays for these |
| subsequent cachelines. |
| Of course, the hardware must use simple heuristics to determine when |
| to prefetch, and these heuristics can be fooled by the complex data-access |
| patterns in many applications. |
Fortunately, some CPU families make it possible for software to
compensate by providing special prefetch instructions.
| Unfortunately, the effectiveness of these instructions in the general |
| case is the subject of some dispute. |
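
For example, GCC and Clang expose explicit prefetching via the
\co{__builtin_prefetch()} built-in function, as in the following
sketch of a linked-list traversal.
Whether such hints help or hurt on a given system and workload can
only be determined by measurement:

\begin{verbatim}
struct node {
        struct node *next;
        long data;
};

static long sum_list(struct node *head)
{
        struct node *p;
        long sum = 0;

        for (p = head; p; p = p->next) {
                if (p->next)
                        __builtin_prefetch(p->next);  /* Hint only. */
                sum += p->data;
        }
        return sum;
}
\end{verbatim}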
| |
| A third hardware optimization is the store buffer, which allows a string |
| of store instructions to execute quickly even when the stores are to |
| non-consecutive addresses and when none of the needed cachelines are |
| present in the CPU's cache. |
| The dark side of this optimization is memory misordering, for which see |
| \cref{chp:Advanced Synchronization: Memory Ordering}. |
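
For example, in the following classic store-buffering sketch, with
\co{cpu0()} and \co{cpu1()} running concurrently on different CPUs,
the outcome in which both \co{r1} and \co{r2} are zero is possible on
real hardware because each CPU's store can still be sitting in its
store buffer when the other CPU's load executes.
(The plain C accesses shown here are, strictly speaking, data races;
the aforementioned chapter shows how to control such orderings
properly.)

\begin{verbatim}
int x, y;       /* Both initially zero. */
int r1, r2;

void cpu0(void)
{
        x = 1;
        r1 = y;   /* Can be 0: cpu1()'s store may still be buffered. */
}

void cpu1(void)
{
        y = 1;
        r2 = x;   /* Can also be 0, even in the same run. */
}
\end{verbatim}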
| |
| A fourth hardware optimization is speculative execution, which can |
| allow the hardware to make good use of the store buffers without |
| resulting in memory misordering. |
| The dark side of this optimization can be energy inefficiency and |
| lowered performance if the speculative execution goes awry and must |
| be rolled back and retried. |
| Worse yet, the advent of |
| Spectre and Meltdown~\cite{JannHorn2018MeltdownSpectre} |
| made it apparent that hardware speculation can also enable side-channel |
| attacks that defeat memory-protection hardware so as to allow unprivileged |
| processes to read memory that they should not have access to. |
| It is clear that the combination of speculative execution and cloud |
| computing needs more than a bit of rework! |
| |
| One could argue that if people would act reasonably, mitigations for |
| side-channel attacks would not be necessary. |
| However, remotely accessible computer systems really are often under |
| attack by organized crime and by nation states, to say nothing of by |
| bored teenagers. |
| There is an old saying ``It only takes a few to spoil things for |
| everyone'', but the reality is that remotely accessible computer systems |
| must be actively defended from attack. |
| |
| A fifth hardware optimization is large caches, allowing individual |
| CPUs to operate on larger datasets without incurring expensive cache |
| misses. |
Although large caches can degrade both \IXh{energy}{efficiency} and
\IXh{cache-miss}{latency}, the ever-growing cache sizes on production
microprocessors attest to the power of this optimization.
| |
| A final hardware optimization is read-mostly replication, in which |
| data that is frequently read but rarely updated is present in all |
| CPUs' caches. |
| This optimization allows the read-mostly data to be accessed |
| exceedingly efficiently, and is the subject of |
| \cref{chp:Deferred Processing}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/Data-chasing-light-wave}} |
| \caption{Hardware and Software: |
| On Same Side} |
| \ContributedBy{fig:cpu:Hardware and Software: On Same Side}{Melissa Broussard} |
| \end{figure} |
| |
| In short, hardware and software engineers are really on the same side, |
| with both trying to make computers go fast despite the best efforts of |
| the laws of physics, as fancifully depicted in |
| \cref{fig:cpu:Hardware and Software: On Same Side} |
| where our data stream is trying its best to exceed the speed of light, |
| further hindered by the non-zero sizes of atoms. |
| The next section discusses some additional things that the hardware engineers |
| might (or might not) be able to do, depending on how well recent |
| research translates to practice. |
| Software's contribution to this noble goal is outlined in the remaining |
| chapters of this book. |