| % cpu/swdesign.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Software Design Implications} |
| \label{sec:cpu:Software Design Implications} |
| % |
| \epigraph{One ship drives east and another west \\ |
| While the self-same breezes blow; \\ |
'Tis the set of the sail and not the gale \\
| That bids them where to go.} |
| {\emph{Ella Wheeler Wilcox}} |
| |
| The values of the ratios in |
| Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System} |
| are critically important, as they limit the |
| efficiency of a given parallel application. |
| To see this, suppose that the parallel application uses CAS |
| operations to communicate among threads. |
These CAS operations will typically involve a cache miss, at least
assuming that the threads are communicating primarily with each other
rather than with themselves.
| Suppose further that the unit of work corresponding to each CAS communication |
| operation takes 300\,ns, which is sufficient time to compute several |
| floating-point transcendental functions. |
Then, given that the table shows a cache-missing CAS itself consuming
roughly the same 300\,ns, about half of the execution time will be
consumed by the CAS communication operations!
Because each CPU would then spend only half of its time doing useful
work, a two-CPU system running such a parallel program would run no
faster than a communication-free sequential implementation running on
a single CPU.
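
To make the arithmetic concrete, the following minimal C11 sketch,
which is purely illustrative and not taken from any later chapter,
has each thread alternate between one unit of work and one CAS on a
shared variable.
The \co{do_unit_of_work()} function is a hypothetical stand-in for
the application's 300\,ns of computation:

\begin{verbatim}
#include <stdatomic.h>
#include <stddef.h>

atomic_ulong shared_counter; /* Contended by all threads. */

static void do_unit_of_work(void)
{
	/* Hypothetical stand-in for ~300 ns of computation. */
	for (volatile int i = 0; i < 100; i++)
		continue;
}

void *worker(void *arg)
{
	unsigned long old;
	long i;

	(void)arg;
	for (i = 0; i < 1000 * 1000; i++) {
		do_unit_of_work(); /* ~300 ns of useful work. */
		old = atomic_load(&shared_counter);
		/* This CAS misses the cache whenever another
		 * thread has updated shared_counter since the
		 * load above. */
		while (!atomic_compare_exchange_weak(&shared_counter,
		                                     &old, old + 1))
			continue; /* old was refreshed; retry. */
	}
	return NULL;
}
\end{verbatim}

Each pass through the loop performs roughly 300\,ns of useful work
followed by a roughly 300\,ns cache-missing CAS, so only about half
of each CPU's time produces results.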
| |
The situation is even worse in the distributed-system case, where
a single communications operation might consume the time required
for thousands or even millions of floating-point operations.
This illustrates how important it is for communications operations
to be extremely infrequent and for each to enable very large
quantities of processing.
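
For example, assuming, purely for the sake of illustration, a CPU
that completes one floating-point operation per nanosecond and a
network round trip costing 100\,$\mu$s, a thread that waits
synchronously for a single round trip gives up

\begin{displaymath}
\frac{100\,\mu\mathrm{s}}{1\,\mathrm{ns/operation}} = 100{,}000
\end{displaymath}

floating-point operations' worth of computation, and slower networks
or deeper software stacks can easily push this ratio into the
millions.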
| |
| \QuickQuiz{} |
| Given that distributed-systems communication is so horribly |
| expensive, why does anyone bother with such systems? |
| \QuickQuizAnswer{ |
| There are a number of reasons: |
| |
| \begin{enumerate} |
| \item Shared-memory multiprocessor systems have strict size limits. |
| If you need more than a few thousand CPUs, you have no |
| choice but to use a distributed system. |
| \item Extremely large shared-memory systems tend to be |
| quite expensive and to have even longer cache-miss |
| latencies than does the small four-CPU system |
| shown in |
| Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}. |
\item	The latencies of distributed-systems communications
	operations do not necessarily consume CPU time, often
	allowing computation to proceed in parallel with message
	transfer.
| \item Many important problems are ``embarrassingly parallel'', |
| so that extremely large quantities of processing may |
| be enabled by a very small number of messages. |
| SETI@HOME~\cite{SETIatHOME2008} |
| is but one example of such an application. |
| These sorts of applications can make good use of networks |
| of computers despite extremely long communications |
| latencies. |
| \end{enumerate} |
| |
| It is likely that continued work on parallel applications will |
| increase the number of embarrassingly parallel applications that |
| can run well on machines and/or clusters having long communications |
| latencies. |
| That said, greatly reduced hardware latencies would be an |
| extremely welcome development. |
| } \QuickQuizEnd |
| |
| The lesson should be quite clear: |
| parallel algorithms must be explicitly designed with these hardware |
| properties firmly in mind. |
| One approach is to run nearly independent threads. |
| The less frequently the threads communicate, whether by atomic operations, |
| locks, or explicit messages, the better the application's performance |
| and scalability will be. |
| This approach will be touched on in |
| Chapter~\ref{chp:Counting}, |
| explored in |
| Chapter~\ref{cha:Partitioning and Synchronization Design}, |
| and taken to its logical extreme in |
| Chapter~\ref{chp:Data Ownership}. |
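
As a first taste of the nearly-independent-threads approach, the
following C sketch uses strictly per-thread counters, in the spirit
of, though much simpler than, the statistical counters of
Chapter~\ref{chp:Counting}.
The fixed \co{NR_THREADS} bound and all of the names are assumptions
made purely for illustration:

\begin{verbatim}
#define NR_THREADS 4

__thread unsigned long my_count;     /* Owned by one thread. */
unsigned long *counterp[NR_THREADS]; /* For rare read-outs.  */

void register_thread(int id)
{
	counterp[id] = &my_count; /* Publish counter's address. */
}

static inline void inc_count(void)
{
	my_count++; /* Thread-local update: no communication. */
}

unsigned long read_count(void) /* Infrequent, cheap overall. */
{
	int t;
	unsigned long sum = 0;

	for (t = 0; t < NR_THREADS; t++)
		if (counterp[t])
			sum += *counterp[t]; /* Racy, approximate. */
	return sum;
}
\end{verbatim}

Because \co{inc_count()} never writes to a cacheline shared with
other threads, it involves no communication whatsoever; the racy and
therefore approximate \co{read_count()} pays the entire communication
cost, which is acceptable precisely because it is called rarely.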
| |
Another approach is to make sure that any sharing is read-mostly,
which allows each CPU's cache to retain its own replica of the
read-mostly data, in turn allowing all CPUs fast access.
| This approach is touched on in |
| Section~\ref{sec:count:Eventually Consistent Implementation}, |
| and explored more deeply in |
| Chapter~\ref{chp:Deferred Processing}. |
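
For example, the following illustrative C11 sketch shows a
read-mostly configuration variable.
The \co{max_batch_size} name and the relaxed-atomics approach are
assumptions made for the sake of the example, not a preview of the
far more capable mechanisms presented in
Chapter~\ref{chp:Deferred Processing}:

\begin{verbatim}
#include <stdatomic.h>

/* Read-mostly value: updated rarely, read constantly. */
atomic_int max_batch_size = 64;

int get_batch_size(void)
{
	/* Each CPU's cache retains a read-only copy, so this
	 * load almost never misses. */
	return atomic_load_explicit(&max_batch_size,
	                            memory_order_relaxed);
}

void set_batch_size(int n)
{
	/* This rare store invalidates the other CPUs' copies,
	 * but its cost is amortized over many fast reads. */
	atomic_store_explicit(&max_batch_size, n,
	                      memory_order_relaxed);
}
\end{verbatim}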
| |
In short, achieving excellent parallel performance and scalability
means striving for embarrassingly parallel algorithms and
implementations, whether by carefully choosing data structures and
algorithms, by making use of existing parallel applications and
environments, or by transforming the problem into one for which an
embarrassingly parallel solution exists.
| |
| \QuickQuiz{} |
| OK, if we are going to have to apply distributed-programming |
| techniques to shared-memory parallel programs, why not just |
| always use these distributed techniques and dispense with |
| shared memory? |
| \QuickQuizAnswer{ |
| Because it is often the case that only a small fraction of |
| the program is performance-critical. |
| Shared-memory parallelism allows us to focus distributed-programming |
| techniques on that small fraction, allowing simpler shared-memory |
| techniques to be used on the non-performance-critical bulk of |
| the program. |
| } \QuickQuizEnd |