| % cpu/swdesign.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Software Design Implications} |
| \label{sec:cpu:Software Design Implications} |
| % |
| \epigraph{One ship drives east and another west \\ |
| While the self-same breezes blow; \\ |
| 'Tis the set of the sail and not the gail \\ |
| That bids them where to go.} |
| {Ella Wheeler Wilcox} |
| |
| The values of the ratios in |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| are critically important, as they limit the |
| efficiency of a given parallel application. |
| To see this, suppose that the parallel application uses \IXacr{cas} |
| operations to communicate among threads. |
| These CAS operations will typically involve a cache miss, that is, assuming |
| that the threads are communicating primarily with each other rather than |
| with themselves. |
| Suppose further that the unit of work corresponding to each CAS communication |
| operation takes 300\,ns, which is sufficient time to compute several |
| floating-point transcendental functions. |
| Then about half of the execution time will be consumed by the CAS |
| communication operations! |
| This in turn means that a two-CPU system running such a parallel program |
| would run no faster than a sequential implementation running on a |
| single CPU\@. |
| |
| The situation is even worse in the distributed-system case, where the |
| latency of a single communications operation might take as long as |
| thousands or even millions of floating-point operations. |
| This illustrates how important it is for communications operations to |
| be extremely infrequent and to enable very large quantities of processing. |
| |
| \QuickQuiz{ |
| Given that distributed-systems communication is so horribly |
| expensive, why does anyone bother with such systems? |
| }\QuickQuizAnswer{ |
| There are a number of reasons: |
| |
| \begin{enumerate} |
| \item Shared-memory multiprocessor systems have strict size limits. |
| If you need more than a few thousand CPUs, you have no |
| choice but to use a distributed system. |
| \item Large shared-memory systems tend to be more expensive |
| per unit computation than their smaller counterparts. |
| \item Large shared-memory systems tend to have much longer |
| cache-miss latencies than do smaller system. |
| To see this, compare |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| on \cpageref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz} |
| with |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 12-CPU Intel Core i7-8750H CPU @ 2.20GHz}. |
| \item The distributed-systems communications operations do |
| not necessarily use much CPU, so that computation can |
| proceed in parallel with message transfer. |
| \item Many important problems are ``embarrassingly parallel'', |
| so that extremely large quantities of processing may |
| be enabled by a very small number of messages. |
| SETI@HOME~\cite{SETIatHOME2008} |
| was but one example of such an application. |
| These sorts of applications can make good use of networks |
| of computers despite extremely long communications |
| latencies. |
| \end{enumerate} |
| |
| Thus, large shared-memory systems tend to be used for applications |
| that benefit from faster latencies than can be provided by |
| distributed computing, and particularly for those applications |
| that benefit from a large shared memory. |
| |
| It is likely that continued work on parallel applications will |
| increase the number of embarrassingly parallel applications that |
| can run well on machines and/or clusters having long communications |
| latencies, reductions in cost being the driving force that it is. |
| That said, greatly reduced hardware latencies would be an |
| extremely welcome development, both for single-system and |
| for distributed computing. |
| }\QuickQuizEnd |
| |
| The lesson should be quite clear: |
| Parallel algorithms must be explicitly designed with these hardware |
| properties firmly in mind. |
| One approach is to run nearly independent threads. |
| The less frequently the threads communicate, whether by \IX{atomic} operations, |
| locks, or explicit messages, the better the application's performance |
| and scalability will be. |
| This approach will be touched on in \cref{chp:Counting}, |
| explored in \cref{chp:Partitioning and Synchronization Design}, |
| and taken to its logical extreme in \cref{chp:Data Ownership}. |
| |
| Another approach is to make sure that any sharing be read-mostly, which |
| allows the CPUs' caches to replicate the read-mostly data, in turn |
| allowing all CPUs fast access. |
| This approach is touched on in |
| \cref{sec:count:Eventually Consistent Implementation}, |
| and explored more deeply in \cref{chp:Deferred Processing}. |
| |
| In short, achieving excellent parallel performance and scalability means |
| striving for embarrassingly parallel algorithms and implementations, |
| whether by careful choice of data structures and algorithms, use of |
| existing parallel applications and environments, or transforming the |
| problem into an embarrassingly parallel form. |
| |
| \QuickQuiz{ |
| OK, if we are going to have to apply distributed-programming |
| techniques to shared-memory parallel programs, why not just |
| always use these distributed techniques and dispense with |
| shared memory? |
| }\QuickQuizAnswer{ |
| Because it is often the case that only a small fraction of |
| the program is performance-critical. |
| Shared-memory parallelism allows us to focus distributed-programming |
| techniques on that small fraction, allowing simpler shared-memory |
| techniques to be used on the non-performance-critical bulk of |
| the program. |
| }\QuickQuizEnd |