| % future/cpu.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{The Future of CPU Technology Ain't What it Used to Be} |
| \label{sec:future:The Future of CPU Technology Ain't What it Used to Be} |
| % |
| \epigraph{A great future behind him.}{David Maraniss} |
| |
| Years past always seem so simple and innocent when viewed through the |
| lens of many years of experience. |
| And the early 2000s were for the most part innocent of the impending |
| failure of \IXr{Moore's Law} to continue delivering the then-traditional |
| increases in CPU clock frequency. |
| Oh, there were the occasional warnings about the limits of technology, |
| but such warnings had been sounded for decades. |
| With that in mind, consider the following scenarios: |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-CPU-future-uniprocessor-uber-alles}} |
| \caption{Uniprocessor \"Uber Alles} |
| \ContributedBy{fig:future:Uniprocessor \"Uber Alles}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.6in}{!}{\includegraphics{cartoons/r-2014-CPU-Future-Multithreaded-Mania}} |
| \caption{Multithreaded Mania} |
| \ContributedBy{fig:future:Multithreaded Mania}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{cartoons/r-2014-CPU-Future-More-of-the-Same}} |
| \caption{More of the Same} |
| \ContributedBy{fig:future:More of the Same}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-CPU-Future-Crash-dummies}} |
| \caption{Crash Dummies Slamming into the Memory Wall} |
| \ContributedBy{fig:future:Crash Dummies Slamming into the Memory Wall}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2021-CPU-future-astounding-accelerator}} |
| \caption{Astounding Accelerators} |
| \ContributedBy{fig:future:Astounding Accelerators}{Melissa Broussard, remixed} |
| \end{figure} |
| |
| \begin{enumerate} |
| \item Uniprocessor \"Uber Alles |
| (\cref{fig:future:Uniprocessor \"Uber Alles}), |
| \item Multithreaded Mania |
| (\cref{fig:future:Multithreaded Mania}), |
| \item More of the Same |
	(\cref{fig:future:More of the Same}),
| \item Crash Dummies Slamming into the Memory Wall |
	(\cref{fig:future:Crash Dummies Slamming into the Memory Wall}), and
| \item Astounding Accelerators |
| (\cref{fig:future:Astounding Accelerators}). |
| \end{enumerate} |
| |
| Each of these scenarios is covered in the following sections. |
| |
| \subsection{Uniprocessor \"Uber Alles} |
| \label{sec:future:Uniprocessor \"Uber Alles} |
| |
| As was said in 2004~\cite{PaulEdwardMcKenneyPhD}: |
| |
| \begin{quote} |
| In this scenario, the combination of \IXaltr{Moore's-Law}{Moore's Law} |
| increases in CPU |
| clock rate and continued progress in horizontally scaled computing |
| render SMP systems irrelevant. |
| This scenario is therefore dubbed ``Uniprocessor \"Uber |
| Alles'', literally, uniprocessors above all else. |
| |
| These uniprocessor systems would be subject only to instruction |
| overhead, since \IXpl{memory barrier}, cache thrashing, and contention |
| do not affect single-CPU systems. |
| In this scenario, RCU is useful only for niche applications, such |
| as interacting with \IXacrpl{nmi}. |
| It is not clear that an operating system lacking RCU would see |
| the need to adopt it, although operating |
| systems that already implement RCU might continue to do so. |
| |
| However, recent progress with multithreaded CPUs seems to indicate |
| that this scenario is quite unlikely. |
| \end{quote} |
| |
| Unlikely indeed! |
But the larger software community was reluctant to accept the fact
that it would need to embrace parallelism, and so it was some time
before this community concluded that the ``free lunch'' of
| \IXaltr{Moore's-Law}{Moore's Law}-induced |
| CPU core-clock frequency increases was well and truly finished. |
| Never forget: |
| Belief is an emotion, not necessarily the result of a rational technical |
| thought process! |
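
Even so, the NMI use case called out in the above quote remains
instructive.
The following minimal sketch shows how RCU might permit NMI handlers
to be replaced at runtime, despite the fact that NMIs cannot be
masked.
The \co{nmi_handler_t} type and the \co{handle_nmi()} and
\co{set_nmi_handler()} functions are hypothetical illustrations
rather than extracts from any particular kernel:

\begin{verbatim}
typedef void (*nmi_handler_t)(void);

static nmi_handler_t nmi_handler; /* hypothetical */

void handle_nmi(void) /* invoked at NMI time */
{
        nmi_handler_t fp;

        rcu_read_lock();
        fp = rcu_dereference(nmi_handler);
        if (fp)
                fp(); /* run current handler, if any */
        rcu_read_unlock();
}

void set_nmi_handler(nmi_handler_t fp)
{
        rcu_assign_pointer(nmi_handler, fp);

        /* Wait for pre-existing handlers to finish
         * before (say) freeing the old handler. */
        synchronize_rcu();
}
\end{verbatim}

Because \co{rcu_read_lock()} and \co{rcu_read_unlock()} add
negligible read-side overhead, this pattern would have remained
attractive even in a Uniprocessor \"Uber Alles world.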
| |
| \subsection{Multithreaded Mania} |
| \label{sec:future:Multithreaded Mania} |
| |
| Also from 2004~\cite{PaulEdwardMcKenneyPhD}: |
| |
| \begin{quote} |
| A less-extreme variant of Uniprocessor \"Uber Alles features |
| uniprocessors with hardware multithreading, and in fact |
| multithreaded CPUs are now standard for many desktop and laptop |
| computer systems. |
| The most aggressively multithreaded CPUs share all levels of |
| cache hierarchy, thereby eliminating CPU-to-CPU \IXh{memory}{latency}, |
| in turn greatly reducing the performance penalty for traditional |
| synchronization mechanisms. |
| However, a multithreaded CPU would still incur overhead due to |
| contention and to pipeline stalls caused by memory barriers. |
| Furthermore, because all hardware threads share all levels |
| of cache, the cache available to a given hardware thread is a |
| fraction of what it would be on an equivalent single-threaded |
| CPU, which can degrade performance for applications with large |
| cache footprints. |
| There is also some possibility that the restricted amount of cache |
| available will cause RCU-based algorithms to incur performance |
| penalties due to their grace-period-induced additional memory |
| consumption. |
| Investigating this possibility is future work. |
| |
| However, in order to avoid such performance degradation, a number |
| of multithreaded CPUs and multi-CPU chips partition at least |
| some of the levels of cache on a per-hardware-thread basis. |
| This increases the amount of cache available to each hardware |
| thread, but re-introduces memory latency for cachelines that |
| are passed from one hardware thread to another. |
| \end{quote} |
| |
And we all know how this story has played out: multiple
multithreaded cores on a single die plugged into a single socket,
with varying degrees of optimization for smaller numbers of active
threads per core.
The question then becomes whether or not future shared-memory
systems will always fit into a single socket.
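
The cost of passing cachelines from one hardware thread to another
can be observed even from user space.
The following purely illustrative sketch, which assumes GCC atomic
builtins, POSIX threads, and a 64-byte cacheline (none of which
appear in the 2004 thesis), times a pair of threads hammering first
on two counters sharing a cacheline and then on two counters each
owning a full cacheline:

\begin{verbatim}
/* gcc -O2 -pthread cacheline.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NITER (100 * 1000 * 1000UL)
#define CL_SIZE 64 /* assumed cacheline size */

unsigned long same_line[2]; /* likely share a cacheline */
struct {
        unsigned long v;
        char pad[CL_SIZE - sizeof(unsigned long)];
} __attribute__((aligned(CL_SIZE))) own_line[2];

static void *incr(void *arg)
{
        unsigned long *p = arg;
        unsigned long i;

        for (i = 0; i < NITER; i++)
                __atomic_fetch_add(p, 1, __ATOMIC_RELAXED);
        return NULL;
}

static double run_pair(unsigned long *p0, unsigned long *p1)
{
        pthread_t t0, t1;
        struct timespec b, e;

        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_create(&t0, NULL, incr, p0);
        pthread_create(&t1, NULL, incr, p1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        clock_gettime(CLOCK_MONOTONIC, &e);
        return (e.tv_sec - b.tv_sec) +
               (e.tv_nsec - b.tv_nsec) / 1e9;
}

int main(void)
{
        printf("shared cacheline:    %.2fs\n",
               run_pair(&same_line[0], &same_line[1]));
        printf("separate cachelines: %.2fs\n",
               run_pair(&own_line[0].v, &own_line[1].v));
        return 0;
}
\end{verbatim}

On typical hardware, the shared-cacheline pass runs several times
slower than the separate-cacheline pass, and the size of that gap
tracks the memory-latency ratios taken up again in
\cref{sec:future:Crash Dummies Slamming into the Memory Wall}.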
| |
| \subsection{More of the Same} |
\label{sec:future:More of the Same}
| |
| Again from 2004~\cite{PaulEdwardMcKenneyPhD}: |
| |
| \begin{quote} |
| The More-of-the-Same scenario assumes that the memory-latency |
| ratios will remain roughly where they are today. |
| |
| This scenario actually represents a change, since to have more |
| of the same, interconnect performance must begin keeping up |
| with the \IXaltr{Moore's-Law}{Moore's Law} increases in core CPU performance. |
| In this scenario, overhead due to pipeline stalls, memory latency, |
| and contention remains significant, and RCU retains the high |
| level of applicability that it enjoys today. |
| \end{quote} |
| |
| And the change has been the ever-increasing levels of integration |
| that \IXr{Moore's Law} is still providing. |
| But longer term, which will it be? |
| More CPUs per die? |
| Or more I/O, cache, and memory? |
| |
| Servers seem to be choosing the former, while embedded systems on a chip |
| (SoCs) continue choosing the latter. |
| |
| \subsection{Crash Dummies Slamming into the Memory Wall} |
| \label{sec:future:Crash Dummies Slamming into the Memory Wall} |
| |
| \begin{figure} |
| \centering |
| \epsfxsize=3in |
| \epsfbox{future/latencytrend} |
| % from Ph.D. thesis: related/latencytrend.eps |
| \caption{Instructions per Local Memory Reference for Sequent Computers} |
| \label{fig:future:Instructions per Local Memory Reference for Sequent Computers} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \epsfxsize=3in |
| \epsfbox{future/be-lb-n4-rf-all} |
| % from Ph.D. thesis: an/plots/be-lb-n4-rf-all.eps |
| \caption{Breakevens vs.\@ $r$, $\lambda$ Large, Four CPUs} |
| \label{fig:future:Breakevens vs. r; lambda Large; Four CPUs} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \epsfxsize=3in |
| \epsfbox{future/be-lw-n4-rf-all} |
| % from Ph.D. thesis: an/plots/be-lw-n4-rf-all.eps |
| \caption{Breakevens vs.\@ $r$, $\lambda$ Small, Four CPUs} |
| \label{fig:future:Breakevens vs. r; Worst-Case lambda; Four CPUs} |
| \end{figure} |
| |
| And one more quote from 2004~\cite{PaulEdwardMcKenneyPhD}: |
| |
| \begin{quote} |
| If the memory-latency trends shown in |
| \cref{fig:future:Instructions per Local Memory Reference for Sequent Computers} |
| continue, then memory latency will continue to grow relative |
| to instruction-execution overhead. |
| Systems such as Linux that have significant use of RCU will find |
| additional use of RCU to be profitable, as shown in |
| \cref{fig:future:Breakevens vs. r; lambda Large; Four CPUs}. |
| As can be seen in this figure, if RCU is heavily used, increasing |
| memory-latency ratios give RCU an increasing advantage over other |
| synchronization mechanisms. |
| In contrast, systems with minor |
| use of RCU will require increasingly high degrees of read intensity |
| for use of RCU to pay off, as shown in |
| \cref{fig:future:Breakevens vs. r; Worst-Case lambda; Four CPUs}. |
| As can be seen in this figure, if RCU is lightly used, |
| increasing memory-latency ratios |
| put RCU at an increasing disadvantage compared to other synchronization |
| mechanisms. |
| Since Linux has been observed with over 1,600 callbacks per \IX{grace |
| period} under heavy load~\cite{Sarma04c}, |
| it seems safe to say that Linux falls into the former category. |
| \end{quote} |
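
To see why the callback rate matters, consider a rough and purely
illustrative model in which each grace period costs $g$ units of
CPU time and is shared by $n$ updates.
Each update then pays roughly
\[
c_{\mathrm{update}} \approx c_{\mathrm{base}} + \frac{g}{n}
\]
in synchronization overhead, where $c_{\mathrm{base}}$ is the cost
of the update itself.
With $n$ on the order of the 1,600 callbacks per grace period cited
above, even an expensive grace period adds little per-update
overhead, while the typically far more numerous readers continue to
enjoy RCU's nearly free read-side primitives.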
| |
On the one hand, the quoted passage failed to anticipate the
cache-warmth issues that RCU can suffer from in workloads with
significant update intensity, in part because it seemed unlikely
that RCU would really be used for such workloads.
In the event, the \co{SLAB_TYPESAFE_BY_RCU} slab-allocator flag has
been pressed into service in a number of situations where these
cache-warmth issues would otherwise be problematic, as has sequence
locking.
On the other hand, the passage also failed to anticipate that RCU
would be used to reduce scheduling latency or for security.
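
The \co{SLAB_TYPESAFE_BY_RCU} flag preserves cache warmth by
allowing a freed object's memory to be recycled immediately for a
new object of the same type, possibly before a grace period has
elapsed.
The price is that readers must revalidate their references, as in
the following minimal sketch, in which \co{struct foo},
\co{foo_hash_find()}, and \co{foo_put()} are hypothetical stand-ins
rather than Linux-kernel APIs:

\begin{verbatim}
struct foo *foo_lookup(unsigned long key)
{
        struct foo *p;

        rcu_read_lock();
        p = foo_hash_find(key); /* RCU-protected lookup */
        if (p && !atomic_inc_not_zero(&p->refcnt))
                p = NULL;       /* object is being freed */
        rcu_read_unlock();

        /* The slab may already have recycled this
         * memory for a new foo, so recheck identity. */
        if (p && READ_ONCE(p->key) != key) {
                foo_put(p);     /* drop the reference */
                p = NULL;
        }
        return p;
}
\end{verbatim}

Here \co{atomic_inc_not_zero()} refuses to acquire a reference to
an object whose reference count has already reached zero, and the
final \co{->key} check rejects objects whose memory was recycled
between the lookup and the reference acquisition.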
| |
| Much of the data generated for this book was collected on an eight-socket |
| system with 28 cores per socket and two hardware threads per core, for |
| a total of 448 hardware threads. |
The idle-system memory latencies are less than one microsecond,
and are thus no worse than those of similar-sized systems of 2004.
| Some claim that these latencies approach a microsecond only because of |
| the x86 CPU family's relatively strong memory ordering, but it may be |
| some time before that particular argument is settled. |
| |
| \subsection{Astounding Accelerators} |
| \label{sec:future:Astounding Accelerators} |
| |
| The potential of hardware accelerators was not quite as clear in 2004 |
| as it is in 2021, so this section has no quote. |
| However, the November 2020 Top 500 list~\cite{Top500} features a great |
| many accelerators, so one could argue that this section is a view of |
| the present rather than of the future. |
| The same could be said of most of the preceding sections. |
| |
Beyond their appearance in the Top 500 list, hardware accelerators
are being put to many other uses, including encryption,
compression, and machine learning.
| |
| In short, beware of prognostications, including those in the remainder |
| of this chapter. |