% cpu/overview.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0

\section{Overview}
\label{sec:cpu:Overview}
%
\epigraph{Mechanical Sympathy:
	Hardware and software working together in harmony.}
	{Martin Thompson}

Careless reading of computer-system specification sheets might lead one
to believe that CPU performance is a footrace on a clear track, as
illustrated in \cref{fig:cpu:CPU Performance at its Best},
where the race always goes to the swiftest.

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-CPU-track-meet}}
\caption{CPU Performance at its Best}
\ContributedBy{fig:cpu:CPU Performance at its Best}{Melissa Broussard}
\end{figure}

Although there are a few CPU-bound benchmarks that approach the ideal case
shown in \cref{fig:cpu:CPU Performance at its Best},
the typical program more closely resembles an obstacle course than
a race track.
This is because the internal architecture of CPUs has changed dramatically
over the past few decades, courtesy of a collision of \IXr{Moore's Law}
with certain laws of physics.
A number of these architectural changes are described in the following
sections.

\subsection{Pipelined CPUs}
\label{sec:cpu:Pipelined CPUs}

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Old-man-and-Brat}}
\caption{CPUs Old and New}
\ContributedBy{fig:cpu:CPUs Old and New}{Melissa Broussard}
\end{figure}

In the 1980s, the typical microprocessor fetched an instruction, decoded
it, and executed it, taking \emph{at least} three clock cycles
to complete each instruction before even starting the next.
In contrast, CPUs of the late 1990s and of the 2000s execute many
instructions simultaneously, using \emph{pipelines}, \emph{superscalar}
techniques, \emph{out-of-order} instruction and data handling,
\emph{speculative execution}, and
more~\cite{Hennessy2017}
in order to optimize the flow of instructions and data through the CPU\@.
Some cores have more than one hardware thread, a configuration variously
called \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
(HT)~\cite{JFennel1973SMT},
with each such hardware thread appearing as
an independent CPU to software, at least from a functional viewpoint.
These modern hardware features can greatly improve performance, as
illustrated by \cref{fig:cpu:CPUs Old and New}.

Achieving full performance with a CPU having a long pipeline requires
highly predictable control flow through the program.
Suitable control flow is provided by programs that run in tight loops,
for example, those doing arithmetic on large matrices or vectors.
Such loops allow the CPU to correctly predict that the end-of-loop branch
will be taken in almost all cases, allowing the pipeline to be kept full
and the CPU to execute at full speed.
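For concreteness, consider the following sketch of such a loop, which
sums the elements of a floating-point vector.
(This example is illustrative rather than drawn from any particular
benchmark.)
The loop's closing branch is taken on every iteration but the last, so
even a simple branch predictor will guess correctly almost every time:

\begin{VerbatimN}[samepage=true]
/* Sum a vector.  The loop branch is taken on all but
 * the final iteration, keeping the pipeline full. */
double vecsum(const double *v, long n)
{
	double sum = 0.0;
	long i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum;
}
\end{VerbatimN}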

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-branch-error}}
\caption{CPU Meets a Pipeline Flush}
\ContributedBy{fig:cpu:CPU Meets a Pipeline Flush}{Melissa Broussard}
\end{figure}

However, branch prediction is not always so easy.
For example, consider a program with many loops, each of which iterates
a small but random number of times.
For another example, consider a museum-piece object-oriented program with
many virtual objects that can reference many different real objects, all
with different implementations for frequently invoked member functions,
resulting in many calls through pointers.
In these cases, it is difficult or even
impossible for the CPU to predict where the next branch might lead.
Then either the CPU must stall waiting for execution to proceed far
enough to be certain where that branch leads, or it must guess and
then proceed using speculative execution.
Although guessing works extremely well for programs with predictable
control flow, for unpredictable branches (such as those in binary search)
the guesses will frequently be wrong.
A wrong guess can be expensive because the CPU must discard any
speculatively executed instructions following the corresponding
branch, resulting in a pipeline flush.
If pipeline flushes appear too frequently, they drastically reduce
overall performance, as fancifully depicted in
\cref{fig:cpu:CPU Meets a Pipeline Flush}.
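
In contrast with the earlier vector-summing loop, the branch in the
following sketch (again, illustrative only) depends on the data itself,
so that, given random data and a mid-range limit, the branch predictor
will guess wrong nearly half the time:

\begin{VerbatimN}[samepage=true]
/* Count the elements exceeding a limit.  For random
 * data, this branch defeats prediction about half
 * the time, causing frequent pipeline flushes. */
long count_above(const double *v, long n, double lim)
{
	long i, cnt = 0;

	for (i = 0; i < n; i++)
		if (v[i] > lim)	/* Data-dependent branch. */
			cnt++;
	return cnt;
}
\end{VerbatimN}

Of course, a given compiler might transform this particular branch into
a conditional-move instruction, but data-dependent branches in general
remain difficult to predict.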

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cpu/microarch}}
\caption{Rough View of Modern Micro-Architecture}
\label{fig:cpu:Rough View of Modern Micro-Architecture}
\end{figure}

This gets even worse for hyperthreading (or SMT, if you prefer),
especially on pipelined superscalar out-of-order CPUs featuring
speculative execution.
In this increasingly common case, all the hardware threads sharing
a core also share that core's resources, including registers, cache,
execution units, and so on.
The instructions are often decoded into micro-operations, and use of the
shared execution units and the hundreds of hardware registers is often
coordinated by a micro-operation scheduler.
A rough diagram of such a two-threaded core is shown in
\cref{fig:cpu:Rough View of Modern Micro-Architecture},
and more accurate (and thus more complex) diagrams are available in
textbooks and scholarly papers.\footnote{
	Here is one example for a late-2010s Intel core:
	\url{https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)}.}
Therefore, the execution of one hardware thread can be and often is
perturbed by the actions of other hardware threads sharing that core.

Even if only one hardware thread is active (for example, in old-school
CPU designs where there is only one thread), counterintuitive results
are quite common.
Execution units often have overlapping capabilities, so that a CPU's
choice of execution unit can result in pipeline stalls due to contention
for that execution unit from later instructions.
In theory, this contention is avoidable, but in practice CPUs must choose
very quickly and without the benefit of clairvoyance.
In particular, adding an instruction to a tight loop can sometimes
actually cause execution to \emph{speed up}.

Unfortunately, pipeline flushes and shared-resource contention are not
the only hazards in the obstacle course that modern CPUs must run.
The next section covers the hazards of referencing memory.

\subsection{Memory References}
\label{sec:cpu:Memory References}

In the 1980s, it often took less time for a microprocessor to load a value
from memory than it did to execute an instruction.
More recently, microprocessors might execute hundreds or even thousands
of instructions in the time required to access memory.
This disparity is due to the fact that \IXr{Moore's Law} has increased CPU
performance at a much greater rate than it has decreased memory \IX{latency},
in part due to the rate at which memory sizes have grown.
For example, a typical 1970s minicomputer might have 4\,KB (yes, kilobytes,
not megabytes, let alone gigabytes or terabytes) of main memory, with
single-cycle access.\footnote{
	It is only fair to add that each of these single cycles
	lasted no less than 1.6 \emph{microseconds}.}
Present-day CPU designers can still construct a 4\,KB memory with
single-cycle access, even on systems with multi-GHz clock frequencies.
And in fact they frequently do construct such memories, but they now
call them ``level-0 caches'', plus they can be a bit larger than 4\,KB.
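
A rough back-of-the-envelope calculation shows where those hundreds or
thousands of instructions come from.
Assuming, purely for illustration, a 3\,GHz CPU that can retire four
instructions per cycle and a 100\,ns main-memory access time, a single
access stalls the CPU for
\begin{equation*}
3\,\mbox{GHz} \times 100\,\mbox{ns} = 300\ \mbox{cycles},
\end{equation*}
during which that CPU could have executed up to
$300 \times 4 = 1200$ instructions.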

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-memory-reference}}
\caption{CPU Meets a Memory Reference}
\ContributedBy{fig:cpu:CPU Meets a Memory Reference}{Melissa Broussard}
\end{figure}

Although the large caches found on modern microprocessors can do quite
a bit to help combat memory-access latencies,
these caches require highly predictable data-access patterns to
successfully hide those latencies.
Unfortunately, common operations such as traversing a linked list
have extremely unpredictable memory-access patterns---after all,
if the pattern were predictable, we software types would not bother
with the pointers, right?
Therefore, as shown in
\cref{fig:cpu:CPU Meets a Memory Reference},
memory references often pose severe obstacles to modern CPUs.
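
The following sketch (illustrative only) shows why linked-list traversal
is so cache-unfriendly:
Each iteration's load address depends on the value loaded by the
previous iteration, so the hardware cannot begin fetching a given node
until the previous node has arrived from memory:

\begin{VerbatimN}[samepage=true]
struct node {
	struct node *next;
	long data;
};

/* Each pointer load depends on the previous one, so
 * the CPU cannot run ahead as it can with an array. */
long listsum(struct node *head)
{
	long sum = 0;

	for (; head != NULL; head = head->next)
		sum += head->data;
	return sum;
}
\end{VerbatimN}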

Thus far, we have only been considering obstacles that can arise during
a given CPU's execution of single-threaded code.
Multi-threading presents additional obstacles to the CPU, as
described in the following sections.

\subsection{Atomic Operations}
\label{sec:cpu:Atomic Operations}

One such obstacle is \IX{atomic} operations.
The problem here is that the whole idea of an atomic operation conflicts with
the piece-at-a-time assembly-line operation of a CPU pipeline.
To hardware designers' credit, modern CPUs use a number of extremely clever
tricks to make such operations \emph{look} atomic even though they
are in fact being executed piece-at-a-time,
with one common trick being to identify all the cachelines containing the
data to be atomically operated on,
ensure that these cachelines are owned by the CPU executing the
atomic operation, and only then proceed with the atomic operation
while ensuring that these cachelines remain owned by this CPU\@.
Because all the data is private to this CPU, other CPUs are unable to
interfere with the atomic operation despite the piece-at-a-time nature
of the CPU's pipeline.
Needless to say, this sort of trick can require that
the pipeline be delayed or even flushed in order to
perform the cacheline-setup operations that
permit a given atomic operation to complete correctly.
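
For example, the following atomic increment (a sketch using standard
C11 atomics rather than any particular kernel's primitives) forces the
CPU to obtain exclusive ownership of the variable's cacheline and to
retain that ownership until the increment completes:

\begin{VerbatimN}[samepage=true]
#include <stdatomic.h>

atomic_long counter;

/* The CPU must gain exclusive ownership of counter's
 * cacheline and hold it for the entire operation. */
void inc_counter(void)
{
	atomic_fetch_add(&counter, 1);
}
\end{VerbatimN}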

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Atomic-reference}}
\caption{CPU Meets an Atomic Operation}
\ContributedBy{fig:cpu:CPU Meets an Atomic Operation}{Melissa Broussard}
\end{figure}

In contrast, when executing a non-atomic operation, the CPU can load
values from cachelines as they appear and place the results in the
store buffer, without the need to wait for cacheline ownership.
Although there are a number of hardware optimizations that can sometimes
hide cache latencies, an atomic operation's effect on performance is
all too often as depicted in
\cref{fig:cpu:CPU Meets an Atomic Operation}.

Unfortunately, atomic operations usually apply only to single elements
of data.
Because many parallel algorithms require that ordering constraints
be maintained between updates of multiple data elements, most CPUs
provide memory barriers.
These memory barriers also serve as performance-sapping obstacles,
as described in the next section.

\QuickQuiz{
	What types of machines would allow atomic operations on
	multiple data elements?
}\QuickQuizAnswer{
	One answer to this question is that it is often possible to
	pack multiple elements of data into a single machine word,
	which can then be manipulated atomically.

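	For example, here is a sketch (using C11 atomics and a
	hypothetical pair of packed 32-bit counters) that updates
	two data elements with a single compare-and-swap loop:

\begin{VerbatimN}[samepage=true]
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t pair;	/* Two packed 32-bit counters. */

/* Atomically increment both halves of the pair. */
void pair_inc(void)
{
	uint64_t oldv = atomic_load(&pair);
	uint64_t newv;

	do {
		uint32_t hi = (uint32_t)(oldv >> 32) + 1;
		uint32_t lo = (uint32_t)oldv + 1;

		newv = ((uint64_t)hi << 32) | lo;
	} while (!atomic_compare_exchange_weak(&pair,
						&oldv, newv));
}
\end{VerbatimN}
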
	A more trendy answer would be machines supporting transactional
	memory~\cite{DBLomet1977SIGSOFT,Knight:1986:AMF:319838.319854,Herlihy93a}.
	By early 2014, several mainstream systems provided limited
	hardware transactional memory implementations, the ups and
	downs of which are covered in more detail in
	\cref{sec:future:Hardware Transactional Memory}.
	The jury is also still out on the applicability of software
	transactional
	memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT,
	ChistopherJRossbach2007a,CalinCascaval2008tmtoy,
	AleksandarDragovejic2011STMnotToy,AlexanderMatveev2012PessimisticTM},
	which is covered in \cref{sec:future:Transactional Memory}.
}\QuickQuizEnd

\subsection{Memory Barriers}
\label{sec:cpu:Memory Barriers}

\IXpl{Memory barrier} will be considered in more detail in
\cref{chp:Advanced Synchronization: Memory Ordering} and
\cref{chp:app:whymb:Why Memory Barriers?}\@.
In the meantime, consider the following simple lock-based \IX{critical
section}:

\begin{VerbatimN}[samepage=true]
spin_lock(&mylock);
a = a + 1;
spin_unlock(&mylock);
\end{VerbatimN}

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2023-Memory-barrier}}
\caption{CPU Meets a Memory Barrier}
\ContributedBy{fig:cpu:CPU Meets a Memory Barrier}{Melissa Broussard}
\end{figure}

If the CPU were not constrained to execute these statements in the order
shown, the effect would be that the variable ``a'' would be incremented
without the protection of ``mylock'', which would defeat the purpose of
acquiring it.
To prevent such destructive reordering, locking primitives contain
either explicit or implicit memory barriers.
Because the whole purpose of these memory barriers is to prevent
reorderings that the CPU (to say nothing of the compiler) would otherwise
undertake in order to increase performance, memory barriers almost always
reduce performance, as depicted in
\cref{fig:cpu:CPU Meets a Memory Barrier}.
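
To illustrate, a toy spinlock (a sketch only, not the actual
\co{spin_lock()} implementation) might be built from C11 atomics using
acquire and release memory ordering, which supply exactly the implicit
barriers mentioned above:

\begin{VerbatimN}[samepage=true]
#include <stdatomic.h>

atomic_flag mylock = ATOMIC_FLAG_INIT;

void toy_spin_lock(atomic_flag *lp)
{
	/* Acquire ordering: the critical section's accesses
	 * cannot be reordered before the lock is held. */
	while (atomic_flag_test_and_set_explicit(lp,
			memory_order_acquire))
		continue;	/* Spin. */
}

void toy_spin_unlock(atomic_flag *lp)
{
	/* Release ordering: the critical section's accesses
	 * cannot be reordered after the lock is released. */
	atomic_flag_clear_explicit(lp, memory_order_release);
}
\end{VerbatimN}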

As with atomic operations, CPU designers have been working hard to
reduce \IXh{memory-barrier}{overhead}, and have made substantial progress.

\subsection{Functional Unit Failings}
\label{sec:cpu:Functional Unit Failings}

Modern superscalar CPUs have numerous functional units with varying
purposes and capabilities.
Each CPU is likely to have several arithmetic-logic units (ALUs)
for integer and boolean arithmetic, a few vector units, a couple of
floating-point units (FPUs), and at least one branch unit, one load
unit, and one store unit.
Different CPUs will of course have different combinations of functional
units.
However, not only will different applications need different combinations
of functional units, but a given application is also likely to need
different combinations at different phases of its execution.

\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{cartoons/CPU-track-meet-functional-units}}
\caption{CPU Functional-Unit Mismatch}
\ContributedBy{fig:cpu:CPU Functional-Unit Mismatch}{Melissa Broussard, remixed}
\end{figure}

This means that it does not make much sense to think in terms of a given
CPU having a perfect combination of functional units.
Sometimes a CPU instead must ``hop along on one leg'' using only a
few of its impressive array of functional units, as fancifully
depicted by the unfortunate CPU losing the race in
\cref{fig:cpu:CPU Functional-Unit Mismatch}.

\QuickQuiz{
	But what does this have to do with scaling workloads across
	a multi-core CPU???
}\QuickQuizAnswer{
	In many cases, quite a bit!
	For example, on a system with hyperthreaded cores, the hardware
	threads sharing a core also share its functional units, which
	can result in yet another form of contention that can limit
	scalability.
	For another example, the whole point of having an impressive array
	of functional units is to support instruction-level parallelism,
	that is, intra-CPU concurrency and thus scalability.
}\QuickQuizEnd

Unfortunately, a workload could make perfectly efficient use of each and
every one of a given CPU's functional units and still lose the race,
as described in the next section.

\subsection{Thermal Throttling}
\label{sec:cpu:Thermal Throttling}

One increasingly common frustrating experience is to carefully
micro-optimize a critical code path, greatly reducing the number of
clock cycles consumed by that code path, only to find that the
wall-clock time consumed by that code has actually \emph{increased}.

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2022-Thermal-throttling}}
\caption{CPU Encounters Thermal Throttling}
\ContributedBy{fig:cpu:Encounters Thermal Throttling}{Melissa Broussard, remixed}
\end{figure}

Welcome to modern \IX{thermal throttling}.

If you reduced the number of clock cycles by making more effective
use of the CPU's functional units, you will have increased the
power consumed by that CPU\@.
This will in turn increase the amount of heat dissipated by that CPU\@.
If this heat dissipation exceeds the cooling system's capacity, the
system will thermally throttle that CPU, for example, by reducing
its clock frequency, as fancifully depicted by the snow penguin in
\cref{fig:cpu:Encounters Thermal Throttling}.

If performance is of the essence, the proper fix is improved cooling,
an approach loved by serious gamers, by overclockers (some of whom make
good use of liquid nitrogen), and by performance-first systems vendors
(one of which used per-CPU aluminum heat sinks, each of which weighed
one pound, that is, 0.454~kilograms).
But if you cannot modify your computer's cooling system, perhaps because
you are renting it from a cloud provider, then you will need to take
some other optimization approach.
For example, you might need to apply algorithmic optimizations instead
of hardware-centric micro-optimizations.
Alternatively, perhaps you can parallelize your code, spreading the
work (and thus the heat) over multiple CPU cores.

Some recommend normalizing the results of a given test based on
the various CPU core clock frequencies, which can help.
However, the latencies of cache misses and memory accesses do not
depend on CPU core clock frequency.
In addition, if the code under test features tight interaction among
multiple CPUs, then the performance of that code running on one of the
CPUs will depend on the core clock frequencies of the other CPUs.

Finally, if the goal is instead to obtain repeatable benchmark
measurements, one approach is to extend traditional cache warmup to
include thermal warmup, with the required warm-up duration depending on
the size and thus the thermal inertia of the system under test.
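
A sketch of this approach, in which both the warmup duration and the
\co{run_workload()} function are hypothetical, simply runs the workload
untimed until temperatures have had a chance to stabilize:

\begin{VerbatimN}[samepage=true]
#include <time.h>

void run_workload(void);	/* Defined elsewhere. */

/* Run the workload untimed until CPU temperatures
 * stabilize, then take the real measurements. */
void thermal_warmup(long seconds)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		run_workload();
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < seconds);
}
\end{VerbatimN}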

Another approach for repeatable benchmarks applies to event-based
applications that run in short but widely spaced bursts of
latency-critical execution.
In this case, the previous approach's thermal warmup interval should be
replaced by a cooldown interval during which the system remains idle.

Even given careful thermal warmup or cooldown, results can vary with
changes in ambient temperature, which can depend strongly on time of day,
to say nothing of time of year in certain parts of the world.

\QuickQuiz{
	Given a long thermal warmup or cooldown period and a fixed
	workload, why would ambient temperature matter?
}\QuickQuizAnswer{
	It is true that a fixed workload running at a fixed CPU core
	clock frequency might well generate a constant flow of heat,
	give or take the many challenges one faces when creating
	thermal models of CPUs.
	However, the lower the ambient temperature, the more quickly
	heat flows from the CPU to the surrounding environment.
	This in turn, coupled with a constant flow of heat, means that
	lower ambient temperatures will result in lower CPU temperatures,
	and thus less thermal throttling.
	Therefore, a given benchmark might well yield higher performance
	at night or during winter, that is, when ambient temperatures
	tend to be lower.
}\QuickQuizEnd

\subsection{Cache Misses}
\label{sec:cpu:Cache Misses}

An additional multi-threading obstacle to CPU performance is
the ``cache miss''.
As noted earlier, modern CPUs sport large caches in order to reduce the
performance penalty that would otherwise be incurred due to high memory
latencies.
However, these caches are actually counter-productive for variables that
are frequently shared among CPUs.
This is because when a given CPU wishes to modify the variable, it is
most likely the case that some other CPU has modified it recently.
In this case, the variable will be in that other CPU's cache, but not
in this CPU's cache, which will therefore incur an expensive cache miss
(see \cref{sec:app:whymb:Cache Structure} for more detail).
Such cache misses form a major obstacle to CPU performance, as shown
in \cref{fig:cpu:CPU Meets a Cache Miss}.
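
For example, consider the following sketch (using GCC's \co{__atomic}
builtins), in which each thread repeatedly increments a shared variable.
Each increment will likely pull the variable's cacheline away from
whichever CPU wrote it last, incurring a cache miss nearly every time:

\begin{VerbatimN}[samepage=true]
long shared_counter;	/* Updated by all threads. */

/* Each thread runs this loop.  Nearly every iteration
 * misses in the cache because some other CPU probably
 * modified shared_counter's cacheline most recently. */
void *count_thread(void *arg)
{
	long i;

	for (i = 0; i < 1000000; i++)
		__atomic_fetch_add(&shared_counter, 1,
				__ATOMIC_RELAXED);
	return NULL;
}
\end{VerbatimN}

Keeping frequently updated data CPU-private, for example, by giving
each thread its own counter, avoids these misses, a theme that recurs
throughout this book.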

\QuickQuiz{
	So have CPU designers also greatly reduced the overhead of
	cache misses?
}\QuickQuizAnswer{
	Unfortunately, not so much.
	There has been some reduction given constant numbers of CPUs,
	but the finite speed of light and the atomic nature of
	matter limit their ability to reduce cache-miss overhead
	for larger systems.
	\Cref{sec:cpu:Hardware Free Lunch?}
	discusses some avenues for possible future progress.
}\QuickQuizEnd

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-CPU-track-meet-cache-miss-toll-booth}}
\caption{CPU Meets a Cache Miss}
\ContributedBy{fig:cpu:CPU Meets a Cache Miss}{Melissa Broussard}
\end{figure}

\subsection{I/O Operations}
\label{sec:cpu:I/O Operations}

\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-CPU-track-meet-phone-booth}}
\caption{CPU Waits for I/O Completion}
\ContributedBy{fig:cpu:CPU Waits for I/O Completion}{Melissa Broussard}
\end{figure}

A cache miss can be thought of as a CPU-to-CPU I/O operation, and as
such is one of the cheapest I/O operations available.
I/O operations involving networking, mass storage, or (worse yet) human
beings pose much greater obstacles than the internal obstacles called
out in the prior sections,
as illustrated by
\cref{fig:cpu:CPU Waits for I/O Completion}.

This is one of the differences between shared-memory and distributed-system
parallelism:
Shared-memory parallel programs must normally deal with no
obstacle worse than a cache miss, while a distributed parallel program
will typically incur the larger network communication latencies.
In both cases, the relevant latencies can be thought of as a cost of
communication---a cost that would be absent in a sequential program.
Therefore, the ratio of the overhead of the communication to
that of the actual work being performed is a key design parameter.
A major goal of parallel hardware design is to reduce this ratio as
needed to achieve the relevant performance and scalability goals.
In turn, as will be seen in
\cref{chp:Partitioning and Synchronization Design},
a major goal of parallel software design is to reduce the
frequency of expensive operations such as communication cache misses.
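
One simple (and admittedly idealized) way to quantify this ratio:
If a program spends $T_{\mathrm{work}}$ units of time on actual work for
each $T_{\mathrm{comm}}$ units of time spent communicating, the fraction
of time devoted to useful work is at best
\begin{equation*}
\frac{T_{\mathrm{work}}}{T_{\mathrm{work}} + T_{\mathrm{comm}}},
\end{equation*}
so that both hardware and software designers strive to keep
$T_{\mathrm{comm}}$ small compared to $T_{\mathrm{work}}$.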

Of course, it is one thing to say that a given operation is an obstacle,
and quite another to show that the operation is a \emph{significant}
obstacle.
This distinction is discussed in the following sections.