% cpu/overheads.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Overheads}
\label{sec:cpu:Overheads}
%
\epigraph{Don't design bridges in ignorance of materials, and don't design
low-level software in ignorance of the underlying hardware.}
{Unknown}
This section presents the actual \IXpl{overhead} of the obstacles to performance
listed in the previous section.
However, it is first necessary to get a rough view of hardware system
architecture, which is the subject of the next section.
\subsection{Hardware System Architecture}
\label{sec:cpu:Hardware System Architecture}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cpu/SystemArch}}
\caption{System Hardware Architecture}
\label{fig:cpu:System Hardware Architecture}
\end{figure}
\Cref{fig:cpu:System Hardware Architecture}
shows a rough schematic of an eight-core computer system.
Each die has a pair of CPU cores, each with its cache, as well as an
interconnect allowing the pair of CPUs to communicate with each other.
The system interconnect allows the four dies to communicate with each
other and with main memory.
Data moves through this system in units of ``\IXpl{cache line}'', which
are power-of-two fixed-size aligned blocks of memory, usually ranging
from 32 to 256 bytes in size.
When a CPU loads a variable from memory to one of its registers, it must
first load the cacheline containing that variable into its cache,
which may be thought of as a hardware hash table.
Similarly, when a CPU stores a value from one of its registers into
memory, it must also load the cacheline containing that variable into
its cache, but must also ensure that no other CPU has a copy of that
cacheline.
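
For concreteness, the following sketch computes the bounds of the
cacheline holding a given variable, assuming a 64-byte cacheline,
which is typical of the systems discussed later in this section:

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: assumes 64-byte cachelines. */
#define CACHE_LINE_SIZE 64UL

int x;

int main(void)
{
    uintptr_t addr = (uintptr_t)&x;
    uintptr_t start = addr & ~(CACHE_LINE_SIZE - 1);

    printf("&x = %#lx, cacheline [%#lx, %#lx)\n",
           (unsigned long)addr, (unsigned long)start,
           (unsigned long)(start + CACHE_LINE_SIZE));
    return 0;
}
\end{verbatim}

Because cachelines are power-of-two sized and aligned, the containing
cacheline is found by simply masking off the low-order address bits.
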
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cpu/SimpleWrite}}
\caption{Lifetime of a ``Simple'' Store}
\label{fig:cpu:Lifetime of a Simple Store}
\end{figure}
For example, suppose that CPU~1 wrote to a variable \co{x} whose cacheline
was in CPU~6's cache.
An over-simplified view of this process is illustrated by the following
sequence of steps in conjunction with
\cref{fig:cpu:Lifetime of a Simple Store},
each row of which condenses
\cref{fig:cpu:System Hardware Architecture}
to show only CPUs~1 and~6, their store buffers and local caches, along
with the system interconnect denoted by the grey rectangle connecting
the pair of CPUs.
\begin{enumerate}
\item CPU~1 checks its local cache, and does not find the cacheline.
It therefore records the write in its store buffer as shown
in row~A of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item A request for this cacheline is forwarded to CPU~0's and~1's
interconnect, which checks CPU~0's local cache, and does not
find the cacheline.
Because nothing has changed, the system state is still as shown
in row~A of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item This request is forwarded to the system interconnect, as shown
in row~B of
\cref{fig:cpu:Lifetime of a Simple Store},
which checks with the other three dies, learning that the
cacheline is held by the die containing CPU~6 and~7.
\item This request is forwarded to CPU~6's and~7's interconnect, which
checks both CPUs' caches, finding the value in CPU~6's cache,
as shown in row~C of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item CPU~6 forwards the cacheline to its interconnect, and also
flushes the cacheline from its cache, and then
CPU~6's and~7's interconnect forwards the cacheline to the
system interconnect, as shown in row~D of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item The system interconnect forwards the cacheline to CPU~0's and~1's
interconnect, as shown in row~E of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item CPU~0's and~1's interconnect forwards the cacheline to CPU~1,
as shown in row~F of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item The cache line is deposited into CPU~1's cache, as shown in
row~G of
\cref{fig:cpu:Lifetime of a Simple Store}.
\item CPU~1 can now complete the write, updating the relevant portions
of the newly arrived cacheline from the value previously
recorded in the store buffer, as shown in row~H of
\cref{fig:cpu:Lifetime of a Simple Store}.
\end{enumerate}
\QuickQuizSeries{%
\QuickQuizB{
This is a \emph{simplified} sequence of events?
How could it \emph{possibly} be any more complex?
}\QuickQuizAnswerB{
This sequence ignored a number of possible complications,
including:
\begin{enumerate}
\item Other CPUs might be concurrently attempting to perform
memory-reference operations involving this same cacheline.
\item The cacheline might have been replicated read-only in
several CPUs' caches, in which case, it would need to
be flushed from their caches.
\item CPU~6 might have been operating on the cache line when
the request for it arrived, in which case CPU~6 might
need to hold off the request until its own operation
completed.
\item CPU~6 might have ejected the cacheline from its cache
(for example, in order to make room for other data),
so that by the time that the request arrived, the
cacheline was on its way to memory.
\item A correctable error might have occurred in the cacheline,
which would then need to be corrected at some point before
the data was used.
\end{enumerate}
Production-quality cache-coherence mechanisms are extremely
complicated due to these sorts of
considerations~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel}.
%
}\QuickQuizEndB
%
\QuickQuizE{
Why is it necessary to flush the cacheline from CPU~6's cache?
}\QuickQuizAnswerE{
If the cacheline was not flushed from CPU~6's cache, then
CPUs~1 and~6 might have different values for the same set
of variables in the cacheline.
This sort of incoherence greatly complicates parallel software,
which is why wise hardware architects avoid it.
}\QuickQuizEndE
}
This simplified sequence is just the beginning of a discipline called
\emph{cache-coherency protocols}~\cite{Hennessy95a,DavidECuller1999,MiloMKMartin2012scale,DanielJSorin2011MemModel},
which is discussed in more detail in
\cref{chp:app:whymb:Why Memory Barriers?}\@.
As can be seen in the sequence of events triggered by a simple memory
write operation,
a single instruction can cause considerable protocol traffic, which
can significantly degrade your parallel program's performance.
Fortunately, if a given variable is being frequently read during a time
interval in which it is never updated, that variable can be replicated
across all CPUs' caches.
This replication permits all CPUs to enjoy extremely fast access to
this \emph{read-mostly} variable.
\Cref{chp:Deferred Processing} presents synchronization
mechanisms that take full advantage of this important hardware read-mostly
optimization.
\subsection{Costs of Operations}
\label{sec:cpu:Costs of Operations}
\begin{table}
%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
\tcresizewidth{
\begin{tabular}
{
ll
S[table-format = 9.1]
S[table-format = 9.1]
r
}
\toprule
\multicolumn{2}{l}{Operation}
& \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}}
& CPUs \\
\midrule
\multicolumn{2}{l}{Clock period}
& 0.5 & 1.0 & \\
\midrule
\multicolumn{2}{l}{Same-CPU}
& & & 0 \\
& CAS & 7.0 & 14.6 & \\
& lock & 15.4 & 32.3 & \\
\midrule
\multicolumn{2}{l}{On-Core}
& & & 224 \\
& Blind CAS& 7.2 & 15.2 & \\
& CAS & 18.0 & 37.7 & \\
\midrule
\multicolumn{2}{l}{Off-Core}
& & & 1--27 \\
& Blind CAS& 47.5 & 99.8 & 225--251 \\
& CAS & 101.9 & 214.0 & \\
\midrule
\multicolumn{2}{l}{Off-Socket}
& & & 28--111 \\
& Blind CAS& 148.8 & 312.5 & 252--335 \\
& CAS & 442.9 & 930.1 & \\
\midrule
\multicolumn{2}{l}{Cross-Interconnect}
& & & 112--223 \\
& Blind CAS& 336.6 & 706.8 & 336--447 \\
& CAS & 944.8 & 1984.2 & \\
\midrule
\multicolumn{2}{l}{Off-System}
& & & \\
& Comms Fabric & 5 000 & 10 500 & \\
& Global Comms & 195 000 000 & 409 500 000 & \\
\bottomrule
\end{tabular}
}
\caption{CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10\,GHz}
\label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
\end{table}
The overheads of some common operations important to parallel programs are
displayed in
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}.
This system's clock period rounds to 0.5\,ns.
Although it is not unusual for modern microprocessors to be able to
retire multiple instructions per clock period, the operations' costs are
nevertheless normalized to a clock period in the third column, labeled
``Ratio''.
The first thing to note about this table is the large values of many of
the ratios.
The same-CPU \acrmf{cas} operation consumes about seven
nanoseconds, a duration more than ten times that of the clock period.
CAS is an atomic operation in which the hardware compares the contents
of the specified memory location to a specified ``old'' value, and if
they compare equal, stores a specified ``new'' value, in which case the
CAS operation succeeds.
If they compare unequal, the memory location keeps its (unexpected) value,
and the CAS operation fails.
The operation is atomic in that the hardware guarantees that the memory
location will not be changed between the compare and the store.
CAS functionality is provided by the \co{lock;cmpxchg} instruction on x86.
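
In C, the same operation can be expressed using the GCC-style
\co{__atomic} builtins, as in the following sketch (which is
illustrative only, and not necessarily the code used to generate the
table):

\begin{verbatim}
#include <stdbool.h>

/* Sketch of CAS semantics using the __atomic builtins; on x86,
 * this compiles down to lock;cmpxchg. */
static inline bool cas(unsigned long *p, unsigned long old,
                       unsigned long new)
{
    /* Succeed (return true) only if *p still equals old, in
     * which case *p is atomically set to new. */
    return __atomic_compare_exchange_n(p, &old, new, false,
                                       __ATOMIC_SEQ_CST,
                                       __ATOMIC_SEQ_CST);
}
\end{verbatim}
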
The ``same-CPU'' prefix means that the CPU now performing the CAS operation
on a given variable was also the last CPU to access this variable, so
that the corresponding cacheline is already held in that CPU's cache.
Similarly, the same-CPU lock operation (a ``round trip'' pair consisting
of a lock acquisition and release) consumes more than fifteen nanoseconds,
or more than thirty clock cycles.
The lock operation is more expensive than CAS because it requires two
atomic operations on the lock data structure, one for acquisition and
the other for release.
On-core operations involving interactions between the hardware threads
sharing a single core are about the same cost as same-CPU operations.
This should not be too surprising, given that these two hardware threads
also share the full cache hierarchy.
In the case of the blind CAS, the software specifies the old value
without looking at the memory location.
This approach is appropriate when attempting to acquire a lock.
If the unlocked state is represented by zero and the locked state
is represented by the value one, then a CAS operation on the lock
that specifies zero for the old value and one for the new value
will acquire the lock if it is not already held.
The key point is that there is only one access to the memory
location, namely the CAS operation itself.
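
A sketch of such a lock-acquisition primitive (a hypothetical
\co{my_spin_trylock()}, not any particular locking library's
implementation) appears below:

\begin{verbatim}
/* Hypothetical sketch: acquire a lock using a blind CAS.  The old
 * value (zero, meaning "unlocked") is specified without first
 * loading the lock word, so the CAS is the only memory access. */
static int my_spin_trylock(unsigned long *lockp)
{
    unsigned long expected = 0;  /* Blindly assume "unlocked". */

    return __atomic_compare_exchange_n(lockp, &expected, 1, false,
                                       __ATOMIC_ACQUIRE,
                                       __ATOMIC_RELAXED);
}

/* Releasing the lock is a simple store, but recall that a lock
 * "round trip" includes both the acquisition and this release. */
static void my_spin_unlock(unsigned long *lockp)
{
    __atomic_store_n(lockp, 0, __ATOMIC_RELEASE);
}
\end{verbatim}
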
In contrast, a normal CAS operation's old value is derived from
some earlier load.
For example, to implement an atomic increment, the current value in
a shared variable is loaded into one machine register and then incremented
to produce the new value in another machine register.
Then in the CAS operation, the first register is specified as the old
value and the second register as the new value.
If the shared variable's value did not change in the meantime, the
CAS operation would store the new value, thus incrementing that shared
variable.
However, if the shared variable's value did change, then the old value
would not match, causing a miscompare that would result in the CAS
operation failing and thus making no further change to that shared
variable.
The key point is that there are now two accesses to the memory location,
the load and the CAS\@.
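
The following sketch shows one way that such a CAS-based atomic
increment might be coded, with a retry loop added to handle the
failure case just described:

\begin{verbatim}
/* Sketch of atomic increment built from a load plus a (non-blind)
 * CAS, retrying if some other CPU changed the variable meanwhile. */
static void atomic_inc(unsigned long *p)
{
    unsigned long old, new;

    do {
        old = __atomic_load_n(p, __ATOMIC_RELAXED);  /* First access. */
        new = old + 1;
    } while (!__atomic_compare_exchange_n(p, &old, new, false,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_RELAXED));  /* Second. */
}
\end{verbatim}
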
Thus, it is not surprising that on-core blind CAS consumes only about
seven nanoseconds, while on-core CAS consumes about 18 nanoseconds.
The non-blind case's extra load does not come for free.
That said, the overheads of these operations are similar to those of
same-CPU CAS and lock, respectively.
\QuickQuiz{
\Cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
shows CPU~0 sharing a core with CPU~224.
However, isn't it more logical for CPU 0
to share a core with CPU 1 instead of CPU 224???
}\QuickQuizAnswer{
It is easy to be sympathetic to this view, but the file
\path{/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list}
really does contain the string \co{0,224}.
Therefore, CPU~0's hyperthread twin really is CPU~224.
Some people speculate that this numbering allows naive applications
and schedulers to perform better, citing the fact that on many
workloads the second hyperthread does not provide a huge
amount of additional performance.
This speculation assumes that naive applications and schedulers
would utilize CPUs in numerical order, leaving aside the weaker
hyperthread twin CPUs until all cores are in use.
}\QuickQuizEnd
A blind CAS involving CPUs in different cores but on the same socket
consumes almost fifty nanoseconds, or almost one hundred clock cycles.
The code used for this cache-miss measurement passes the cache line
back and forth between a pair of CPUs, so this cache miss is satisfied
not from memory, but rather from the other CPU's cache.
A non-blind CAS operation, which as noted earlier must look at the old
value of the variable as well as store a new value, consumes over one
hundred nanoseconds, or more than two hundred clock cycles.
Think about this a bit.
In the time required to do \emph{one} CAS operation, the CPU could have
executed more than \emph{two hundred} normal instructions.
This should demonstrate the limitations not only of fine-grained locking,
but of any other synchronization mechanism relying on fine-grained
global agreement.
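
The measurement code itself is not shown here, but a rough sketch of
such a cacheline ping-pong test might look as follows.
This is a simplified illustration rather than the program used to
generate the table:
The CPU numbers, iteration count, and use of
\co{pthread_setaffinity_np()} are assumptions made for the sake of
the example.

\begin{verbatim}
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define NSAMPLES 1000000UL

static unsigned long token;  /* Even: thread 0's turn, odd: thread 1's. */

struct pp_arg {
    int cpu;     /* Which CPU to run on (system-dependent!). */
    int parity;  /* 0 or 1. */
};

static void *pingpong(void *vp)
{
    struct pp_arg *ap = vp;
    cpu_set_t mask;
    unsigned long old, i;

    CPU_ZERO(&mask);
    CPU_SET(ap->cpu, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

    /* Pass the token's cacheline back and forth via CAS. */
    for (i = 0; i < NSAMPLES; i++) {
        do {
            old = 2 * i + ap->parity;  /* Wait for our turn. */
        } while (!__atomic_compare_exchange_n(&token, &old, old + 1,
                                              false, __ATOMIC_SEQ_CST,
                                              __ATOMIC_SEQ_CST));
    }
    return NULL;
}

int main(void)
{
    struct pp_arg args[2] = { { .cpu = 0, .parity = 0 },
                              { .cpu = 1, .parity = 1 } };
    pthread_t tid[2];
    struct timespec t1, t2;
    double ns;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, pingpong, &args[i]);
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
    printf("~%.1f ns per successful CAS\n", ns / (2.0 * NSAMPLES));
    return 0;
}
\end{verbatim}

Whether CPUs~0 and~1 share a core, a socket, or neither depends on
the system's CPU numbering (see the Quick Quiz above), so the
reported latency must be interpreted accordingly.
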
If the pair of CPUs are on different sockets, the operations are considerably
more expensive.
A blind CAS operation consumes almost 150~nanoseconds, or more than
three hundred clock cycles.
A normal CAS operation consumes more than 400~nanoseconds, or almost
\emph{one thousand} clock cycles.
Worse yet, not all pairs of sockets are created equal.
This particular system appears to be constructed as a pair of four-socket
components, with additional latency penalties when the CPUs reside
in different components.
In this case, a blind CAS operation consumes more than three hundred
nanoseconds, or more than seven hundred clock cycles.
A CAS operation consumes almost a full microsecond, or almost two
thousand clock cycles.
\QuickQuizLabelRel{\QspeedOfLightAtoms}{1} % can't put label inside QQSeries
\QuickQuizSeries{%
\QuickQuizB{
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
}\QuickQuizAnswerB{
The hardware designers \emph{have} been working on this
problem, and have consulted with no less a luminary than
the late physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems~\cite{BryanGardiner2007}:
\begin{enumerate}
\item The finite speed of light, and
\item The atomic nature of matter.
\end{enumerate}
\begin{table}
%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
\begin{tabular}
{
ll
S[table-format = 9.1]
S[table-format = 9.1]
}
\toprule
\multicolumn{2}{l}{Operation}
& \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}} \\
\midrule
\multicolumn{2}{l}{Clock period}
& 0.4 & 1.0 \\
\midrule
\multicolumn{2}{l}{Same-CPU}
& & \\
& CAS & 12.2 & 33.8 \\
& lock & 25.6 & 71.2 \\
\midrule
\multicolumn{2}{l}{On-Core}
& & \\
& Blind CAS & 12.9 & 35.8 \\
& CAS & 7.0 & 19.4 \\
\midrule
\multicolumn{2}{l}{Off-Core}
& & \\
& Blind CAS & 31.2 & 86.6 \\
& CAS & 31.2 & 86.5 \\
\midrule
\multicolumn{2}{l}{Off-Socket}
& & \\
& Blind CAS & 92.4 & 256.7 \\
& CAS & 95.9 & 266.4 \\
\midrule
\multicolumn{2}{l}{Off-System}
& & \\
& Comms Fabric & 2 600 & 7 220 \\
& Global Comms & 195 000 000 & 542 000 000 \\
\bottomrule
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8\,GHz Intel X5550 (Nehalem) System}
\label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\end{table}
The first problem limits raw speed, and the second limits
miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently limiting production frequencies to well below
10\,GHz.
In addition,
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
on
\cpageref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
represents a reasonably large system with no fewer than 448~hardware
threads.
Smaller systems often achieve better latency, as may be seen in
\cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System},
which represents a much smaller system with only 16~hardware threads.
A similar view is provided by the rows of
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
down to and including the two ``Off-Core'' rows.
\begin{table}
%\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
\tcresizewidth{
\begin{tabular}
{
ll
S[table-format = 9.1]
S[table-format = 9.1]
r
}
\toprule
\multicolumn{2}{l}{Operation}
& \multicolumn{1}{r}{Cost (ns)}
& {\parbox[b]{.7in}{\raggedleft Ratio\\(cost/clock)}}
& CPUs \\
\midrule
\multicolumn{2}{l}{Clock period}
& 0.5 & 1.0 & \\
\midrule
\multicolumn{2}{l}{Same-CPU} & & & 0 \\
& CAS & 6.2 & 13.6 & \\
& lock & 13.5 & 29.6 & \\
\midrule
\multicolumn{2}{l}{On-Core} & & & 6 \\
& Blind CAS & 6.5 & 14.3 & \\
& CAS & 16.2 & 35.6 & \\
\midrule
\multicolumn{2}{l}{Off-Core} & & & 1--5 \\
& Blind CAS & 22.2 & 48.8 & 7--11 \\
& CAS & 53.6 & 117.9 & \\
\midrule
\multicolumn{2}{l}{Off-System}& & & \\
& Comms Fabric & 5 000 & 11 000 & \\
& Global Comms & 195 000 000 & 429 000 000 & \\
\bottomrule
\end{tabular}
}
\caption{CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20\,GHz}
\label{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}
\end{table}
Furthermore, newer small-scale single-socket systems such
as the laptop on which I am typing this also have more
reasonable latencies, as can be seen in
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}.
Alternatively, a 64-CPU system in the mid 1990s had
cross-interconnect latencies in excess of five microseconds,
so even the eight-socket 448-hardware-thread monster shown in
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}
represents more than a five-fold improvement over its
25-years-prior counterparts.
Integration of hardware threads in a single core and of multiple
cores on a die has improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few
years~\cite{NoBugsHare2016CPUoperations}.
Therefore, spatial and temporal locality are first-class concerns
for concurrent software, even when running on relatively
small systems.
\Cref{sec:cpu:Hardware Free Lunch?}
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
}\QuickQuizEndB
%
\QuickQuizE{
\Cref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
in the answer to \QuickQuizARef{\QspeedOfLightAtoms} on
\cpageref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
says that on-core CAS is faster than both of same-CPU CAS and
on-core blind CAS\@.
What is happening there?
}\QuickQuizAnswerE{
I \emph{was} surprised by the data I obtained and did a rigorous
check of its validity, obtaining the same result persistently.
One theory that might explain the observation would be:
The two threads in the core are able to overlap their accesses,
while the single CPU must do everything sequentially.
Unfortunately, there seems to be no public documentation explaining
why the Intel X5550 (Nehalem) system behaved like that.
}\QuickQuizEndE
} % End of \QuickQuizSeries
\begin{table}
\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\centering\small
\begin{tabular}{lrrrrr}
\toprule
Level & Scope & Line Size & Sets & Ways & Size \\
\midrule
L0 & Core & 64 & 64 & 8 & 32K \\
L1 & Core & 64 & 64 & 8 & 32K \\
L2 & Core & 64 & 1024 & 16 & 1024K \\
L3 & Socket & 64 & 57,344 & 11 & 39,424K \\
\bottomrule
\end{tabular}
\caption{Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10\,GHz}
\label{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz}
\end{table}
Unfortunately, the high speed of within-core and within-socket communication
does not come for free.
First, there are only two CPUs within a given core and only 56 within
a given socket, compared to 448 across the system.
Second, as shown in
\cref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz},
the on-core caches are quite small compared to the on-socket caches, which
are in turn quite small compared to the 1.4\,TB of memory configured on
this system.
Third, again referring to that table, the caches are organized as
a hardware hash table with a limited number of items per bucket.
For example, the raw size of the L3 cache (``Size'') is almost 40\,MB,
but each of its 57,344 buckets (``Sets'') can hold only 11 blocks of
memory (``Ways''), each of which can be at most 64 bytes (``Line Size'').
This means that only 12 bytes of memory (admittedly at carefully chosen
addresses) are required to overflow this 40\,MB cache.
On the other hand, equally careful choice of addresses might make good
use of the entire 40\,MB.
Spatial locality of reference is clearly extremely important, as is
spreading the data across memory.
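
The following sketch illustrates this point for the L3 geometry shown
in the table, assuming a simple modular mapping from addresses to sets.
(Real hardware might instead hash the address bits, but the principle
is the same.)

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64UL      /* From the "Line Size" column. */
#define L3_SETS   57344UL   /* From the "Sets" column. */

/* Assumed simple modular mapping from address to L3 set. */
static unsigned long l3_set(uintptr_t addr)
{
    return (addr / LINE_SIZE) % L3_SETS;
}

int main(void)
{
    uintptr_t a = 0;
    int i;

    /* Addresses spaced LINE_SIZE * L3_SETS bytes apart all land in
     * set 0, so touching twelve of them overflows an 11-way set
     * despite the cache's nearly 40 MB capacity. */
    for (i = 0; i < 12; i++, a += LINE_SIZE * L3_SETS)
        printf("addr %#lx -> set %lu\n",
               (unsigned long)a, l3_set(a));
    return 0;
}
\end{verbatim}
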
I/O operations are even more expensive.
As shown in the ``Comms Fabric'' row,
a high-performance (and expensive!\@) communications fabric, such as
InfiniBand or any number of proprietary interconnects, has a latency
of roughly five microseconds for an end-to-end round trip, during which
time more than \emph{ten thousand} instructions might have been executed.
Standards-based communications networks often require some sort of
protocol processing, which further increases the \IX{latency}.
Of course, geographic distance also increases latency, with the
speed-of-light through optical fiber latency around the world coming to
roughly 195 \emph{milliseconds}, or more than 400 million clock
cycles, as shown in the ``Global Comms'' row of
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}.
% Reference for Infiniband latency:
% http://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/1_Mellanox.pdf
% page 6/76 'Leading Interconnect, Leading Performance'
% Needs updating...
\QuickQuiz{
These numbers are insanely large!
How can I possibly get my head around them?
}\QuickQuizAnswer{
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around
350--500 sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single
\IXacr{cas} cache miss.
For the more-expensive inter-system communications latencies,
use several rolls (or multiple cases) of toilet paper to represent
the communications latency.
Important safety tip:
Make sure to account for the needs of those you live with when
appropriating toilet paper, especially
in 2020 or during a similar time when store shelves are free of
toilet paper and much else besides.
Furthermore, for those working on kernel code, a CPU disabling
interrupts across a cache miss is analogous to you holding your
breath while unrolling a roll of toilet paper.
How many rolls of toilet paper can \emph{you} unroll while holding
your breath?
You might wish to avoid disabling interrupts across that many
cache misses.\footnote{
Kudos to Matthew Wilcox for this holding-breath analogy.}
}\QuickQuizEnd
\subsection{Hardware Optimizations}
\label{sec:cpu:Hardware Optimizations}
It is only natural to ask how the hardware is helping, and the answer
is ``Quite a bit!''
One hardware optimization is large cachelines.
This can provide a big performance boost, especially when software is
accessing memory sequentially.
For example, given a 64-byte cacheline and software accessing 64-bit
variables, the first access will still be slow due to speed-of-light
delays (if nothing else), but the remaining seven can be quite fast.
However, this optimization has a dark side, namely \IX{false sharing},
which happens when different variables in the same cacheline are
being updated by different CPUs, resulting in a high cache-miss rate.\footnote{
This situation is sometimes referred to as ``cache thrashing''
or ``cacheline bouncing''.}
Software can use the alignment directives available in many compilers
to avoid false sharing, and adding such directives is a common step
in tuning parallel software.
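
For example, a sketch of a per-CPU counter padded out to its own
cacheline using a GCC-style alignment attribute might look as follows
(the structure and the 448-CPU array size are of course illustrative):

\begin{verbatim}
#define CACHE_LINE_SIZE 64   /* Assumed cacheline size. */
#define MY_NR_CPUS      448  /* Illustrative; matches the system above. */

/* Aligning (and thus padding) each element to a cacheline boundary
 * prevents counters updated by different CPUs from falsely sharing
 * a cacheline. */
struct percpu_counter {
    unsigned long count;
} __attribute__((__aligned__(CACHE_LINE_SIZE)));

static struct percpu_counter counters[MY_NR_CPUS];
\end{verbatim}
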
A second related hardware optimization is cache prefetching, in which
the hardware reacts to consecutive accesses by prefetching subsequent
cachelines, thereby evading speed-of-light delays for these
subsequent cachelines.
Of course, the hardware must use simple heuristics to determine when
to prefetch, and these heuristics can be fooled by the complex data-access
patterns in many applications.
Fortunately, some CPU families allow for this by providing special
prefetch instructions.
Unfortunately, the effectiveness of these instructions in the general
case is the subject of some dispute.
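
For example, GCC and Clang provide \co{__builtin_prefetch()}, which
the following sketch uses to request the data a fixed (and purely
illustrative) distance ahead of the current array element:

\begin{verbatim}
/* Sketch of software prefetching via __builtin_prefetch().  The
 * second argument (0) requests a read and the third (3) requests
 * high temporal locality. */
static unsigned long sum_array(const unsigned long *a, unsigned long n)
{
    unsigned long i, sum = 0;

    for (i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 3);
        sum += a[i];
    }
    return sum;
}
\end{verbatim}

Of course, for a simple sequential scan like this one, the hardware
prefetcher would likely do just as well, which is one reason that the
benefits of explicit prefetch instructions remain in dispute.
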
A third hardware optimization is the store buffer, which allows a string
of store instructions to execute quickly even when the stores are to
non-consecutive addresses and when none of the needed cachelines are
present in the CPU's cache.
The dark side of this optimization is memory misordering, for which see
\cref{chp:Advanced Synchronization: Memory Ordering}.
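
The classic ``store buffering'' pattern sketched below illustrates
this misordering:
With store buffers in play (and no memory barriers), both \co{r1} and
\co{r2} can end up equal to zero, an outcome that no simple
interleaving of the four statements permits.
(Plain C accesses are used here for brevity; the cited chapter treats
this example with the care it deserves.)

\begin{verbatim}
int x, y;
int r1, r2;

void cpu0(void)  /* Runs on one CPU... */
{
    x = 1;       /* May sit in CPU 0's store buffer... */
    r1 = y;      /* ...while this load executes. */
}

void cpu1(void)  /* ...concurrently with this on another CPU. */
{
    y = 1;
    r2 = x;
}
\end{verbatim}
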
A fourth hardware optimization is speculative execution, which can
allow the hardware to make good use of the store buffers without
resulting in memory misordering.
The dark side of this optimization can be energy inefficiency and
lowered performance if the speculative execution goes awry and must
be rolled back and retried.
Worse yet, the advent of
Spectre and Meltdown~\cite{JannHorn2018MeltdownSpectre}
made it apparent that hardware speculation can also enable side-channel
attacks that defeat memory-protection hardware so as to allow unprivileged
processes to read memory that they should not have access to.
It is clear that the combination of speculative execution and cloud
computing needs more than a bit of rework!
One could argue that if people would act reasonably, mitigations for
side-channel attacks would not be necessary.
However, remotely accessible computer systems really are often under
attack by organized crime and by nation states, to say nothing of by
bored teenagers.
There is an old saying ``It only takes a few to spoil things for
everyone'', but the reality is that remotely accessible computer systems
must be actively defended from attack.
A fifth hardware optimization is large caches, allowing individual
CPUs to operate on larger datasets without incurring expensive cache
misses.
Although large caches can degrade both \IXh{energy}{efficiency} and
\IXh{cache-miss}{latency}, the ever-growing cache sizes on production
microprocessors attest to the power of this optimization.
A final hardware optimization is read-mostly replication, in which
data that is frequently read but rarely updated is present in all
CPUs' caches.
This optimization allows the read-mostly data to be accessed
exceedingly efficiently, and is the subject of
\cref{chp:Deferred Processing}.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/Data-chasing-light-wave}}
\caption{Hardware and Software:
On Same Side}
\ContributedBy{fig:cpu:Hardware and Software: On Same Side}{Melissa Broussard}
\end{figure}
In short, hardware and software engineers are really on the same side,
with both trying to make computers go fast despite the best efforts of
the laws of physics, as fancifully depicted in
\cref{fig:cpu:Hardware and Software: On Same Side}
where our data stream is trying its best to exceed the speed of light,
further hindered by the non-zero sizes of atoms.
The next section discusses some additional things that the hardware engineers
might (or might not) be able to do, depending on how well recent
research translates to practice.
Software's contribution to this noble goal is outlined in the remaining
chapters of this book.