| % intro/intro.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Introduction}{Introduction}{qqzintro} |
| % |
| \Epigraph{If parallel programming is so hard, why are there so many |
| parallel programs?}{Unknown} |
| |
| Parallel programming has earned a reputation as one of the most |
| difficult areas a hacker can tackle. |
| Papers and textbooks warn of the perils of \IX{deadlock}, \IX{livelock}, |
| \IXpl{race condition}, non-determinism, |
| \IXaltr{Amdahl's-Law}{Amdahl's Law} limits to scaling, |
| and excessive realtime latencies. |
| And these perils are quite real; we authors have accumulated uncounted |
| % 2020: |
| % 30 for Paul E. McKenney |
| years of experience along with the resulting emotional scars, |
| grey hairs, and hair loss. |
| |
| However, new technologies that are difficult to use at introduction |
| invariably become easier over time. |
| For example, the once-rare ability to drive a car is now |
| commonplace in many countries. |
| This dramatic change came about for two basic reasons: |
| \begin{enumerate*}[(1)] |
| \item Cars became cheaper and more readily available, so that |
| more people had the opportunity to learn to drive, and |
| \item Cars became easier to operate due to automatic transmissions, |
| automatic chokes, automatic starters, greatly improved reliability, |
| and a host of other technological improvements. |
| \end{enumerate*} |
| |
| The same is true for many other technologies, including computers. |
| It is no longer necessary to operate a keypunch in order to program. |
| Spreadsheets allow most non-programmers to get results from their computers |
| that would have required a team of specialists a few decades ago. |
| Perhaps the most compelling example is web-surfing and content creation, |
which since the early 2000s have been easily done by
| untrained, uneducated people using various now-commonplace |
| social-networking tools. |
| As recently as 1968, such content creation was a far-out research |
| project~\cite{DouglasEngelbart1968}, described at |
| the time as |
| ``like a UFO landing on the White House lawn''~\cite{ScottGriffen2000}. |
| % http://www.ibiblio.org/pioneers/engelbart.html |
| % http://www.histech.rwth-aachen.de/www/quellen/engelbart/ahi62index.html |
| |
| Therefore, if you wish to argue that parallel programming will remain |
| as difficult as it is currently perceived by many to be, it is you |
| who bears the burden of proof, keeping in mind the many centuries of |
| counter-examples in many fields of endeavor. |
| |
| \section{Historic Parallel Programming Difficulties} |
| \label{sec:intro:Historic Parallel Programming Difficulties} |
| % |
| \epigraph{Not the power to remember, but its very opposite, the power to |
| forget, is a necessary condition for our existence.} |
| {Sholem Asch} |
| |
| As indicated by its title, this book takes a different approach. |
| Rather than complain about the difficulty of parallel programming, |
| it instead examines the reasons why parallel programming is |
| difficult, and then works to help the reader to overcome these |
| difficulties. |
| As will be seen, these difficulties have historically fallen into several |
| categories, including: |
| |
| \begin{enumerate} |
| \item The historic high cost and relative rarity of parallel systems. |
| \item The typical researcher's and practitioner's lack of experience |
| with parallel systems. |
| \item The paucity of publicly accessible parallel code. |
| \item The lack of a widely understood engineering discipline of |
| parallel programming. |
| \item The high \IX{overhead} of communication relative to that of processing, |
| even in tightly coupled shared-memory computers. |
| \end{enumerate} |
| |
| Many of these historic difficulties are well on the way to being overcome. |
| First, over the past few decades, the cost of parallel systems |
| has decreased from many multiples of that of a house to that of a |
| modest meal, courtesy of \IXr{Moore's Law}~\cite{GordonMoore1965MooresLaw}. |
| Papers calling out the advantages of multicore CPUs were published |
| as early as 1996~\cite{Olukotun96}. |
| IBM introduced simultaneous multi-threading |
| into its high-end \Power{} family in 2000, and multicore in 2001. |
| Intel introduced hyperthreading into its commodity Pentium line in |
| November 2000, and both AMD and Intel introduced |
| dual-core CPUs in 2005. |
| Sun followed with the multicore/multi-threaded Niagara in late 2005. |
| In fact, by 2008, it was becoming difficult |
| to find a single-CPU desktop system, with single-core CPUs being |
| relegated to netbooks and embedded devices. |
| By 2012, even smartphones were starting to sport multiple CPUs. |
| By 2020, safety-critical software standards started addressing |
| concurrency. |
| |
| Second, the advent of low-cost and readily available multicore systems |
| means that the once-rare experience of parallel programming is |
| now available to almost all researchers and practitioners. |
| In fact, parallel systems have long been within the budget of students |
| and hobbyists. |
| We can therefore expect greatly increased levels of invention and |
| innovation surrounding parallel systems, and that increased familiarity |
| will over time make the once prohibitively expensive field of parallel |
| programming much more friendly and commonplace. |
| |
| Third, in the 20\textsuperscript{th} century, large systems of |
| highly parallel software were almost always closely guarded proprietary |
| secrets. |
| In happy contrast, the 21\textsuperscript{st} century has seen numerous |
| open-source (and thus publicly available) parallel software projects, |
| including the Linux kernel~\cite{Torvalds2.6kernel}, |
| database systems~\cite{PostgreSQL2008,MySQL2008}, |
| and message-passing systems~\cite{OpenMPI2008,BOINC2008}. |
| This book will draw primarily from the Linux kernel, but will |
| provide much material suitable for user-level applications. |
| |
| Fourth, even though the large-scale parallel\-/programming projects of |
| the 1980s and 1990s were almost all proprietary projects, these |
| projects have seeded other communities with cadres of developers who |
| understand the engineering discipline required to develop production-quality |
| parallel code. |
| A major purpose of this book is to present this engineering discipline. |
| |
| Unfortunately, the fifth difficulty, the high cost of communication |
| relative to that of processing, remains largely in force. |
| This difficulty has been receiving increasing attention during |
| the new millennium. |
| However, according to \ppl{Stephen}{Hawking}, |
| the finite speed of light and the atomic |
| nature of matter will limit progress in this |
| area~\cite{BryanGardiner2007,GordonMoore03a}. |
| Fortunately, this difficulty has been in force since the late 1980s, |
| so that the aforementioned engineering discipline has evolved practical |
| and effective strategies for handling it. |
| In addition, hardware designers are increasingly aware of these issues, |
| so perhaps future hardware will be more friendly to parallel software, |
| as discussed in \cref{sec:cpu:Hardware Free Lunch?}. |
| |
| \QuickQuiz{ |
| Come on now!!! |
| Parallel programming has been known to be exceedingly |
| hard for many decades. |
| You seem to be hinting that it is not so hard. |
| What sort of game are you playing? |
| }\QuickQuizAnswer{ |
| If you really believe that parallel programming is exceedingly |
| hard, then you should have a ready answer to the question |
| ``Why is parallel programming hard?'' |
| One could list any number of reasons, ranging from deadlocks to |
| race conditions to testing coverage, but the real answer is that |
| {\em it is not really all that hard}. |
	After all, if parallel programming were really so horribly difficult,
| how could a large number of open-source projects, ranging from Apache |
| to MySQL to the Linux kernel, have managed to master it? |
| |
| A better question might be: |
| ``Why is parallel programming \emph{perceived} to be so difficult?'' |
| To see the answer, let's go back to the year 1991. |
| Paul McKenney was walking across the parking lot to Sequent's |
| benchmarking center carrying six dual-80486 Sequent Symmetry CPU |
| boards, when he suddenly realized that he was carrying several |
| times the price of the house he had just purchased.\footnote{ |
| Yes, this sudden realization {\em did} cause him to walk quite |
| a bit more carefully. |
| Why do you ask?} |
| This high cost of parallel systems meant that |
| parallel programming was restricted to a privileged few who |
| worked for an employer who either manufactured or could afford to |
	purchase machines costing upwards of \$100,000---in 1991 US dollars.
| |
| In contrast, in 2020, Paul finds himself typing these words on a |
| six-core x86 laptop. |
| Unlike the dual-80486 CPU boards, this laptop also contains |
| 64\,GB of main memory, a 1\,TB solid-state disk, a display, Ethernet, |
| USB ports, wireless, and Bluetooth. |
| And the laptop is more than an order of magnitude cheaper than |
| even one of those dual-80486 CPU boards, even before taking inflation |
| into account. |
| |
| Parallel systems have truly arrived. |
| They are no longer the sole domain of a privileged few, but something |
| available to almost everyone. |
| |
| The earlier restricted availability of parallel hardware is |
| the \emph{real} reason that parallel programming is considered |
| so difficult. |
| After all, it is quite difficult to learn to program even the simplest |
| machine if you have no access to it. |
| Since the age of rare and expensive parallel machines is for the most |
| part behind us, the age during which |
| parallel programming is perceived to be mind-crushingly difficult is |
| coming to a close.\footnote{ |
| Parallel programming is in some ways more difficult than |
| sequential programming, for example, parallel validation |
| is more difficult. |
| But no longer mind-crushingly difficult.} |
| }\QuickQuizEnd |
| |
| However, even though parallel programming might not be as hard as |
| is commonly advertised, it is often more work than is sequential |
| programming. |
| |
| \QuickQuiz{ |
| How could parallel programming \emph{ever} be as easy |
| as sequential programming? |
| }\QuickQuizAnswer{ |
| It depends on the programming environment. |
| SQL~\cite{DIS9075SQL92} is an underappreciated success |
| story, as it permits programmers who know nothing about parallelism |
| to keep a large parallel system productively busy. |
| We can expect more variations on this theme as parallel |
| computers continue to become cheaper and more readily available. |
| For example, one possible contender in the scientific and |
| technical computing arena is MATLAB*P, |
| which is an attempt to automatically parallelize common |
| matrix operations. |
| |
| Finally, on Linux and UNIX systems, consider the following |
| shell command: |
| |
| \begin{VerbatimU} |
| get_input | grep "interesting" | sort |
| \end{VerbatimU} |
| |
| This shell pipeline runs the \co{get_input}, \co{grep}, |
| and \co{sort} processes in parallel. |
| There, that wasn't so hard, now was it? |
| |
| In short, parallel programming is just as easy as sequential |
| programming---at least in those environments that hide the parallelism |
| from the user! |
| }\QuickQuizEnd |
| |
| It therefore makes sense to consider alternatives to parallel programming. |
| However, it is not possible to reasonably consider parallel-programming |
| alternatives without understanding parallel-programming goals. |
| This topic is addressed in the next section. |
| |
| \section{Parallel Programming Goals} |
| \label{sec:intro:Parallel Programming Goals} |
| % |
| \epigraph{If you don't know where you are going, you will end up somewhere |
| else.}{Yogi Berra} |
| |
| The three major goals of parallel programming (over and above those |
| of sequential programming) are as follows: |
| |
| \begin{enumerate} |
| \item \IX{Performance}. |
| \item \IX{Productivity}. |
| \item \IX{Generality}. |
| \end{enumerate} |
| |
| Unfortunately, given the current state of the art, it is possible to |
| achieve at best two of these three goals for any given parallel program. |
| These three goals therefore form the \emph{iron triangle of parallel |
| programming}, |
| a triangle upon which overly optimistic hopes all too often come to |
| grief.\footnote{ |
| Kudos to Michael Wong for naming the iron triangle.} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Oh, really??? |
| What about correctness, maintainability, robustness, and so on? |
| }\QuickQuizAnswerB{ |
| These are important goals, but they are just as important for |
| sequential programs as they are for parallel programs. |
| Therefore, important though they are, they do not belong on |
| a list specific to parallel programming. |
| }\QuickQuizEndB |
| % |
| \QuickQuizM{ |
| And if correctness, maintainability, and robustness don't |
| make the list, why do productivity and generality? |
| }\QuickQuizAnswerM{ |
| Given that parallel programming is perceived to be much harder |
	than sequential programming, productivity is paramount and
| therefore must not be omitted. |
| Furthermore, high-productivity parallel-programming environments |
| such as SQL serve a specific purpose, hence generality must |
| also be added to the list. |
| }\QuickQuizEndM |
| % |
| \QuickQuizM{ |
| Given that parallel programs are much harder to prove |
| correct than are sequential programs, again, shouldn't |
| correctness \emph{really} be on the list? |
| }\QuickQuizAnswerM{ |
| From an engineering standpoint, the difficulty in proving |
| correctness, either formally or informally, would be important |
| insofar as it impacts the primary goal of productivity. |
| So, in cases where correctness proofs are important, they |
| are subsumed under the ``productivity'' rubric. |
| }\QuickQuizEndM |
| % |
| \QuickQuizE{ |
| What about just having fun? |
| }\QuickQuizAnswerE{ |
| Having fun is important as well, but, unless you are a hobbyist, |
| would not normally be a \emph{primary} goal. |
| On the other hand, if you \emph{are} a hobbyist, go wild! |
| }\QuickQuizEndE |
| } |
| |
| Each of these goals is elaborated upon in the following sections. |
| |
| \subsection{Performance} |
| \label{sec:intro:Performance} |
| |
| Performance is the primary goal behind most parallel-programming effort. |
| After all, if performance is not a concern, why not do yourself a favor: |
| Just write sequential code, and be happy? |
| It will very likely be easier |
| and you will probably get done much more quickly. |
| |
| \QuickQuiz{ |
| Are there no cases where parallel programming is about something |
| other than performance? |
| }\QuickQuizAnswer{ |
| There certainly are cases where the problem to be solved is |
| inherently parallel, for example, Monte Carlo methods and |
| some numerical computations. |
| Even in these cases, however, there will be some amount of |
| extra work managing the parallelism. |
| |
| Parallelism is also sometimes used for reliability. |
| For but one example, |
	triple-modular redundancy has three systems run in parallel
| and vote on the result. |
| In extreme cases, the three systems will be independently |
| implemented using different algorithms and technologies. |
| }\QuickQuizEnd |
| |
| Note that ``performance'' is interpreted broadly here, |
| including for example \IX{scalability} (performance per CPU) and \IX{efficiency} |
| (performance per watt). |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}} |
| \caption{MIPS/Clock-Frequency Trend for Intel CPUs} |
| \label{fig:intro:Clock-Frequency Trend for Intel CPUs} |
| \end{figure} |
| |
| That said, the focus of performance has shifted from hardware to |
| parallel software. |
| This change in focus is due to the fact that, although \IXr{Moore's Law} |
| continues to deliver increases in transistor density, it has ceased to |
| provide the traditional single-threaded performance increases. |
| This can be seen in |
| \cref{fig:intro:Clock-Frequency Trend for Intel CPUs},\footnote{ |
| This plot shows clock frequencies for newer CPUs theoretically |
| capable of retiring one or more instructions per clock, and MIPS |
| (millions of instructions per second, usually from the old |
| Dhrystone benchmark) |
| for older CPUs requiring multiple clocks to execute even the |
| simplest instruction. |
| The reason for shifting between these two measures is that the |
| newer CPUs' ability to retire multiple instructions per clock is |
| typically limited by memory-system performance. |
| Furthermore, the benchmarks commonly used on the older CPUs |
| are obsolete, and it is difficult to run the newer benchmarks |
| on systems containing the old CPUs, in part because it is hard |
| to find working instances of the old CPUs.} |
| which shows that writing single-threaded code and simply waiting |
| a year or two for the CPUs to catch up may no longer be an option. |
| Given the recent trends on the part of all major manufacturers towards |
| multicore/multithreaded systems, parallelism is the way to go for |
| those wanting to avail themselves of the full performance of their |
| systems. |
| |
| \QuickQuiz{ |
| Why not instead rewrite programs from inefficient scripting |
| languages to C or C++? |
| }\QuickQuizAnswer{ |
	If the developers, budget, and time are available for such a
| rewrite, and if the result will attain the required levels |
| of performance on a single CPU, this can be a reasonable |
| approach. |
| }\QuickQuizEnd |
| |
| Even so, the first goal is performance rather than scalability, |
| especially given that the easiest way to attain linear scalability |
| is to reduce the performance of each CPU~\cite{LinusTorvalds2001a}. |
| Given a four-CPU system, which would you prefer? |
| A program that provides 100 transactions per second on a single CPU, |
| but does not scale at all? |
| Or a program that provides 10 transactions per second on a single CPU, |
| but scales perfectly? |
| The first program seems like a better bet, though the answer might |
| change if you happened to have a 32-CPU system. |
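A quick back-of-the-envelope check, assuming (perhaps optimistically)
that the second program scales perfectly with the number of CPUs:
On the four-CPU system, it delivers only $4 \times 10 = 40$ transactions
per second, well short of the first program's 100, but on a 32-CPU
system it would deliver $32 \times 10 = 320$ transactions per second,
more than triple the first program's throughput.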
| |
| That said, just because you have multiple CPUs is not necessarily |
| in and of itself a reason to use them all, especially given the |
| recent decreases in price of multi-CPU systems. |
| The key point to understand is that parallel programming is primarily |
| a performance optimization, and, as such, it is one potential optimization |
| of many. |
| If your program is fast enough as currently written, there is no |
| reason to optimize, either by parallelizing it or by applying any |
| of a number of potential sequential optimizations.\footnote{ |
| Of course, if you are a hobbyist whose primary interest is |
| writing parallel software, that is more than enough reason to |
| parallelize whatever software you are interested in.} |
| By the same token, if you are looking to apply parallelism as an |
| optimization to a sequential program, then you will need to compare |
| parallel algorithms to the best sequential algorithms. |
| This may require some care, as far too many publications ignore the |
| sequential case when analyzing the performance of parallel algorithms. |
| |
| \subsection{Productivity} |
| \label{sec:intro:Productivity} |
| |
| \EQuickQuiz{ |
| Why all this prattling on about non-technical issues??? |
| And not just \emph{any} non-technical issue, but \emph{productivity} |
| of all things? |
| Who cares? |
| }\EQuickQuizAnswer{ |
| If you are a pure hobbyist, perhaps you don't need to care. |
| But even pure hobbyists will often care about how much they |
| can get done, and how quickly. |
| After all, the most popular hobbyist tools are usually those |
| that are the best suited for the job, and an important part of |
| the definition of ``best suited'' involves productivity. |
| And if someone is paying you to write parallel code, they will |
| very likely care deeply about your productivity. |
| And if the person paying you cares about something, you would |
| be most wise to pay at least some attention to it! |
| |
| Besides, if you \emph{really} didn't care about productivity, |
| you would be doing it by hand rather than using a computer! |
| }\EQuickQuizEnd |
| |
| \IX{Productivity} has been becoming increasingly important in recent decades. |
| To see this, consider that the price of early computers was tens |
| of millions of dollars at |
| a time when engineering salaries were but a few thousand dollars a year. |
| If dedicating a team of ten engineers to such a machine would improve |
| its performance, even by only 10\pct, then their salaries |
| would be repaid many times over. |
| |
| One such machine was the CSIRAC, the oldest still-intact stored-program |
| computer, which was put into operation in |
| 1949~\cite{CSIRACMuseumVictoria,CSIRACUniversityMelbourne}. |
| Because this machine was built before the transistor era, it was constructed |
| of 2,000 vacuum tubes, ran with a clock frequency of 1\,kHz, |
| consumed 30\,kW of power, and weighed more than three metric tons. |
| Given that this machine had but 768 words of RAM, it is safe to say that |
| it did not suffer from the productivity issues that often plague |
| today's large-scale software projects. |
| |
| Today, it would be quite difficult to purchase a machine with so |
| little computing power. |
| Perhaps the closest equivalents |
| are 8-bit embedded microprocessors exemplified by the venerable |
| Z80~\cite{z80Wikipedia}, but even the old Z80 had a CPU clock |
| frequency more than 1,000 times faster than the CSIRAC\@. |
| The Z80 CPU had 8,500 transistors, and could be purchased in 2008 |
| for less than \$2 US per unit in 1,000-unit quantities. |
| In stark contrast to the CSIRAC, software-development costs are |
| anything but insignificant for the Z80. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/mipsperbuck}} |
| \caption{MIPS per Die for Intel CPUs} |
| \label{fig:intro:MIPS per Die for Intel CPUs} |
| \end{figure} |
| |
| The CSIRAC and the Z80 are two points in a long-term trend, as can be |
| seen in |
| \cref{fig:intro:MIPS per Die for Intel CPUs}. |
| This figure plots an approximation to computational power per die |
| over the past four decades, showing an impressive six-order-of-magnitude |
| increase over a period of forty years. |
| Note that the advent of multicore CPUs has permitted this increase to |
| continue apace despite the clock-frequency wall encountered in 2003, |
| albeit courtesy of dies supporting more than 50 hardware threads each. |
| |
| One of the inescapable consequences of the rapid decrease in |
| the cost of hardware is that software productivity becomes increasingly |
| important. |
| It is no longer sufficient merely to make efficient use of the hardware: |
| It is now necessary to make extremely efficient use of software |
| developers as well. |
| This has long been the case for sequential hardware, but |
| parallel hardware has become a low-cost commodity only recently. |
| Therefore, only recently has high productivity become critically important |
| when creating parallel software. |
| |
| \QuickQuiz{ |
| Given how cheap parallel systems have become, how can anyone |
| afford to pay people to program them? |
| }\QuickQuizAnswer{ |
| There are a number of answers to this question: |
| \begin{enumerate} |
| \item Given a large computational cluster of parallel machines, |
| the aggregate cost of the cluster can easily justify |
| substantial developer effort, because the development |
| cost can be spread over the large number of machines. |
| \item Popular software that is run by tens of millions of users |
| can easily justify substantial developer effort, |
| as the cost of this development can be spread over the tens |
| of millions of users. |
| Note that this includes things like kernels and system |
| libraries. |
| \item If the low-cost parallel machine is controlling the operation |
| of a valuable piece of equipment, then the cost of this |
| piece of equipment might easily justify substantial |
| developer effort. |
| \item If the software for the low-cost parallel machine produces an |
| extremely valuable result (e.g., energy savings), |
| then this valuable result might again justify substantial |
| developer cost. |
| \item Safety-critical systems protect lives, which can clearly |
| justify very large developer effort. |
| \item Hobbyists and researchers might instead seek knowledge, |
| experience, fun, or glory. |
| \end{enumerate} |
| So it is not the case that the decreasing cost of hardware renders |
| software worthless, but rather that it is no longer possible to |
| ``hide'' the cost of software development within the cost of |
| the hardware, at least not unless there are extremely large |
| quantities of hardware. |
| }\QuickQuizEnd |
| |
| Perhaps at one time, the sole purpose of parallel software was performance. |
| Now, however, productivity is gaining the spotlight. |
| |
| \subsection{Generality} |
| \label{sec:intro:Generality} |
| |
| One way to justify the high cost of developing parallel software |
| is to strive for maximal \IX{generality}. |
| All else being equal, the cost of a more-general software artifact |
| can be spread over more users than that of a less-general one. |
| In fact, this economic force explains much of the maniacal focus |
| on portability, which can be seen as an important special case |
| of generality.\footnote{ |
| Kudos to Michael Wong for pointing this out.} |
| |
| Unfortunately, generality often comes at the cost of performance, |
| productivity, or both. |
| For example, portability is often achieved via adaptation layers, |
| which inevitably exact a performance penalty. |
| To see this more generally, consider the following popular parallel |
| programming environments: |
| |
| \begin{description} |
| \item[C/C++ ``Locking Plus Threads'':] This category, which includes |
| POSIX Threads (pthreads)~\cite{OpenGroup1997pthreads}, |
| Windows Threads, and numerous |
| operating-system kernel environments, offers excellent performance |
| (at least within the confines of a single SMP system) |
| and also offers good generality. |
| Pity about the relatively low productivity. |
| \item[Java:] This general purpose and inherently multithreaded |
| programming environment is widely believed to offer much higher |
| productivity than C or C++, courtesy of the automatic garbage collector |
| and the rich set of class libraries. |
| However, its performance, though greatly improved in the early |
| 2000s, lags that of C and C++. |
| \item[MPI:] This Message Passing Interface~\cite{MPIForum2008} powers |
| the largest scientific and technical computing clusters in |
| the world and offers unparalleled performance and scalability. |
| In theory, it is general purpose, but it is mainly used |
| for scientific and technical computing. |
| Its productivity is believed by many to be even lower than that |
| of C/C++ ``locking plus threads'' environments. |
| \item[OpenMP:] This set of compiler directives can be used |
| to parallelize loops. |
| It is thus quite specific to this |
| task, and this specificity often limits its performance. |
| It is, however, much easier to use than MPI or C/C++ |
| ``locking plus threads.'' |
| \item[SQL:] Structured Query Language~\cite{DIS9075SQL92} is |
| specific to relational database queries. |
| However, its performance is quite good as measured by the |
| Transaction Processing Performance Council (TPC) |
| benchmark results~\cite{TPC}. |
| Productivity is excellent; in fact, this parallel programming |
| environment enables people to make good use of a large parallel |
| system despite having little or no knowledge of parallel |
| programming concepts. |
| \end{description} |
| |
| \QuickQuiz{ |
| SQL??? |
| That is ancient history. |
| Here in the 2020s, if you really want to make good use of a |
| large parallel system, why not use machine learning? |
| }\QuickQuizAnswer{ |
| It certainly is the case that machine learning can keep many |
| large parallel systems busy! |
| However, the jury is still out as to whether this constitutes |
| \emph{good} use of those systems. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{intro/PPGrelation}} |
| \caption{Software Layers and Performance, Productivity, and Generality} |
| \label{fig:intro:Software Layers and Performance; Productivity; and Generality} |
| \end{figure} |
| |
| The nirvana of parallel programming environments, one that offers |
| world-class performance, productivity, and generality, simply does |
| not yet exist. |
| Until such a nirvana appears, it will be necessary to make engineering |
| tradeoffs among performance, productivity, and generality. |
| One such tradeoff is depicted by the green ``iron triangle''\footnote{ |
| Kudos to Michael Wong for coining ``iron triangle.''} |
| shown in |
| \cref{fig:intro:Software Layers and Performance; Productivity; and Generality}, |
| which shows how productivity becomes increasingly important at the upper layers |
| of the system stack, |
| while performance and generality become increasingly important at the |
| lower layers of the system stack. |
| The huge development costs incurred at the lower layers |
| must be spread over equally huge numbers of users |
| (hence the importance of generality), and |
| performance lost in lower layers cannot easily be |
| recovered further up the stack. |
| In the upper layers of the stack, there might be very few users for a given |
| specific application, in which case productivity concerns are paramount. |
| This explains the tendency towards ``bloatware'' further up the stack: |
| Extra hardware is often cheaper than extra developers. |
| This book is intended for developers working near the bottom |
| of the stack, where performance and generality are of greatest concern. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{intro/Generality}} |
| \caption{Tradeoff Between Productivity and Generality} |
| \label{fig:intro:Tradeoff Between Productivity and Generality} |
| \end{figure} |
| |
| It is important to note that a tradeoff between productivity and |
| generality has existed for centuries in many fields. |
| For but one example, a nailgun is more productive than a hammer for |
| driving nails, but in contrast to the nailgun, a hammer can be used for |
| many things besides driving nails. |
| It should therefore be no surprise to see similar tradeoffs |
| appear in the field of parallel computing. |
| This tradeoff is shown schematically in |
| \cref{fig:intro:Tradeoff Between Productivity and Generality}. |
| Here, users~1, 2, 3, and~4 have specific jobs that they need the computer |
| to help them with. |
| The most productive possible language or environment for a given user is one |
| that simply does that user's job, without requiring any programming, |
| configuration, or other setup. |
| |
| \QuickQuiz{ |
| This is a ridiculously unachievable ideal! |
| Why not focus on something that is achievable in practice? |
| }\QuickQuizAnswer{ |
| This is eminently achievable. |
| The cellphone is a computer that can be used to make phone |
| calls and to send and receive text messages with little or |
| no programming or configuration on the part of the end user. |
| |
| This might seem to be a trivial example at first glance, |
| but if you consider it carefully you will see that it is |
| both simple and profound. |
| When we are willing to sacrifice generality, we can achieve |
| truly astounding increases in productivity. |
| Those who indulge in excessive generality will therefore fail to set |
| the productivity bar high enough to succeed near the top of the |
| software stack. |
| This fact of life even has its own acronym: |
| YAGNI, or ``You Ain't Gonna Need It.'' |
| }\QuickQuizEnd |
| |
| Unfortunately, a system that does the job required by user~1 is |
| unlikely to do user~2's job. |
| In other words, the most productive languages and environments are |
| domain-specific, and thus by definition lacking generality. |
| |
| Another option is to tailor a given programming language or environment |
| to the hardware system (for example, low-level languages such as |
| assembly, C, C++, or Java) or to some abstraction (for example, |
| Haskell, Prolog, or Snobol), as is shown by the circular region near |
| the center of |
| \cref{fig:intro:Tradeoff Between Productivity and Generality}. |
| These languages can be considered to be general in the sense that they |
| are equally ill-suited to the jobs required by users~1, 2, 3, and~4. |
| In other words, their generality comes at the expense of |
| decreased productivity when compared to domain-specific languages |
| and environments. |
| Worse yet, a language that is tailored to a given abstraction |
| is likely to suffer from performance and scalability problems |
| unless and until it can be efficiently mapped to real hardware. |
| |
Is there no escape from the iron triangle's three conflicting goals of
| performance, productivity, and generality? |
| |
| It turns out that there often is an escape, for example, |
| using the alternatives to parallel programming discussed in the next section. |
| After all, parallel programming can be a great deal of fun, but |
| it is not always the best tool for the job. |
| |
| \section{Alternatives to Parallel Programming} |
| \label{sec:intro:Alternatives to Parallel Programming} |
| % |
| \epigraph{Experiment is folly when experience shows the way.} |
| {Roger M. Babson} |
| |
| In order to properly consider alternatives to parallel programming, |
| you must first decide on what exactly you expect the parallelism |
| to do for you. |
| As seen in \cref{sec:intro:Parallel Programming Goals}, |
| the primary goals of parallel programming are performance, productivity, |
| and generality. |
| Because this book is intended for developers working on |
| performance-critical code near the bottom of the software stack, |
| the remainder of this section focuses primarily on performance improvement. |
| |
| It is important to keep in mind that parallelism is but one way to |
| improve performance. |
| Other well-known approaches include the following, in roughly increasing |
| order of difficulty: |
| |
| \begin{enumerate} |
| \item Run multiple instances of a sequential application. |
| \item Make the application use existing parallel software. |
| \item Optimize the serial application. |
| \end{enumerate} |
| |
| These approaches are covered in the following sections. |
| |
| \subsection{Multiple Instances of a Sequential Application} |
| \label{sec:intro:Multiple Instances of a Sequential Application} |
| |
| Running multiple instances of a sequential application can allow you |
| to do parallel programming without actually doing parallel programming. |
| There are a large number of ways to approach this, depending on the |
| structure of the application. |
| |
| If your program is analyzing a large number of different scenarios, |
| or is analyzing a large number of independent data sets, one easy |
| and effective approach is to create a single sequential program that |
| carries out a single analysis, then use any of a number of scripting |
| environments (for example the \co{bash} shell) to run a number of |
| instances of that sequential program in parallel. |
| In some cases, this approach can be easily extended to a cluster of |
| machines. |
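Where a compiled driver is preferred over a shell script, even a few
lines of C suffice, as in the following minimal sketch, in which the
\co{./analyze} program and the \co{.dat} files mentioned in the comment
are hypothetical stand-ins for your sequential analysis and its data
sets:

\begin{VerbatimU}
/* Shell equivalent: for f in *.dat; do ./analyze "$f" & done; wait */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  /* Fork one analysis per command-line argument. */
  for (int i = 1; i < argc; i++) {
    pid_t pid = fork();

    if (pid < 0) {
      perror("fork");
      exit(EXIT_FAILURE);
    }
    if (pid == 0) {            /* Child: run one instance. */
      execlp("./analyze", "analyze", argv[i], (char *)NULL);
      perror("execlp");        /* Reached only if exec fails. */
      exit(EXIT_FAILURE);
    }
  }
  while (wait(NULL) > 0)       /* Parent: reap all children. */
    continue;
  return 0;
}
\end{VerbatimU}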
| |
| This approach may seem like cheating, and in fact some denigrate such |
| programs as ``\IX{embarrassingly parallel}''. |
This approach does have some potential disadvantages,
| including increased memory consumption, waste of CPU cycles recomputing |
| common intermediate results, and increased copying of data. |
However, it is often extremely productive, garnering impressive performance
| gains with little or no added effort. |
| |
| \subsection{Use Existing Parallel Software} |
| \label{sec:intro:Use Existing Parallel Software} |
| |
| There is no longer any shortage of parallel software environments that |
| can present a single-threaded programming environment, |
| including relational |
| databases~\cite{Date82}, |
| web-application servers, and map-reduce environments. |
| For example, a common design provides a separate process for each |
| user, each of which generates SQL from user queries. |
| This per-user SQL is run against a common relational database, which |
| automatically runs the users' queries concurrently. |
| The per-user programs are responsible only for the user interface, |
| with the relational database taking full responsibility for the |
| difficult issues surrounding parallelism and persistence. |
| |
| In addition, there are a growing number of parallel library functions, |
| particularly for numeric computation. |
| Even better, some libraries take advantage of special\-/purpose |
| hardware such as vector units and general\-/purpose graphical processing |
| units (GPGPUs). |
| |
| Taking this approach often sacrifices some performance, at least when |
| compared to carefully hand-coding a fully parallel application. |
| However, such sacrifice is often well repaid by a huge reduction in |
| development effort. |
| |
| \QuickQuiz{ |
| Wait a minute! |
| Doesn't this approach simply shift the development effort from |
| you to whoever wrote the existing parallel software you are using? |
| }\QuickQuizAnswer{ |
| Exactly! |
| And that is the whole point of using existing software. |
| One team's work can be used by many other teams, resulting in a |
| large decrease in overall effort compared to all teams |
| needlessly reinventing the wheel. |
| }\QuickQuizEnd |
| |
| \subsection{Performance Optimization} |
| \label{sec:intro:Performance Optimization} |
| |
| Up through the early 2000s, CPU clock frequencies doubled every 18 months. |
| It was therefore usually more important to create new functionality than to |
| carefully optimize performance. |
| Now that \IXr{Moore's Law} is ``only'' increasing transistor density instead |
| of increasing both transistor density and per-transistor performance, |
| it might be a good time to rethink the importance of performance |
| optimization. |
| After all, new hardware generations no longer bring significant |
| single-threaded performance improvements. |
| Furthermore, many performance optimizations can also conserve energy. |
| |
| From this viewpoint, parallel programming is but another performance |
| optimization, albeit one that is becoming much more attractive |
| as parallel systems become cheaper and more readily available. |
| However, it is wise to keep in mind that the speedup available from |
| parallelism is limited to roughly the number of CPUs |
| (but see \cref{sec:SMPdesign:Beyond Partitioning} |
| for an interesting exception). |
| In contrast, the speedup available from traditional single-threaded |
| software optimizations can be much larger. |
| For example, replacing a long linked list with a hash table |
| or a search tree can improve performance by many orders of magnitude. |
| This highly optimized single-threaded program might run much |
| faster than its unoptimized parallel counterpart, making parallelization |
| unnecessary. |
| Of course, a highly optimized parallel program would be even better, |
| aside from the added development effort required. |
| |
| Furthermore, different programs might have different performance |
| bottlenecks. |
| For example, if your program spends most of its time |
| waiting on data from your disk drive, |
| using multiple CPUs will probably just increase the time wasted waiting |
| for the disks. |
| In fact, if the program was reading from a single large file laid out |
| sequentially on a rotating disk, parallelizing your program might |
| well make it a lot slower due to the added seek overhead. |
| You should instead optimize the data layout so that |
| the file can be smaller (thus faster to read), split the file into chunks |
| which can be accessed in parallel from different drives, |
| cache frequently accessed data in main memory, |
| or, if possible, |
| reduce the amount of data that must be read. |
| |
| \QuickQuiz{ |
| What other bottlenecks might prevent additional CPUs from |
| providing additional performance? |
| }\QuickQuizAnswer{ |
| There are any number of potential bottlenecks: |
| \begin{enumerate} |
| \item Main memory. |
| If a single thread consumes all available |
| memory, additional threads will simply page themselves |
| silly. |
| \item Cache. |
| If a single thread's cache footprint completely |
| fills any shared CPU cache(s), then adding more threads |
| will simply thrash those affected caches, as will be |
| seen in \cref{chp:Data Structures}. |
| \item Memory bandwidth. |
| If a single thread consumes all available |
| memory bandwidth, additional threads will simply |
| result in additional queuing on the system interconnect. |
| \item I/O bandwidth. |
| If a single thread is I/O bound, |
| adding more threads will simply result in them all |
| waiting in line for the affected I/O resource. |
| \end{enumerate} |
| |
| Specific hardware systems might have any number of additional |
| bottlenecks. |
| The fact is that every resource which is shared between |
| multiple CPUs or threads is a potential bottleneck. |
| }\QuickQuizEnd |
| |
| Parallelism can be a powerful optimization technique, but |
| it is not the only such technique, nor is it appropriate for all |
| situations. |
| Of course, the easier it is to parallelize your program, the |
| more attractive parallelization becomes as an optimization. |
| Parallelization has a reputation of being quite difficult, |
| which leads to the question ``exactly what makes parallel |
| programming so difficult?'' |
| |
| \section{What Makes Parallel Programming Hard?} |
| \label{sec:intro:What Makes Parallel Programming Hard?} |
| % |
| \epigraph{Real difficulties can be overcome; it is only the imaginary |
| ones that are unconquerable.}{Theodore N.~Vail} |
| |
| \OriginallyPublished{sec:intro:What Makes Parallel Programming Hard?}{What Makes Parallel Programming Hard?}{a Portland State University Technical Report}{PaulEMcKenney2009ProgrammingHard} |
| |
| It is important to note that the difficulty of parallel programming |
| is as much a human-factors issue as it is a set of technical properties of the |
| parallel programming problem. |
| We do need human beings to be able to tell parallel |
| systems what to do, otherwise known as programming. |
| But parallel programming involves two-way communication, with |
| a program's performance and scalability being the communication from |
| the machine to the human. |
| In short, the human writes a program telling the computer what to do, |
| and the computer critiques this program via the resulting performance and |
| scalability. |
| Therefore, appeals to abstractions or to mathematical analyses will |
| often be of severely limited utility. |
| |
| In the Industrial Revolution, the interface between human and machine |
| was evaluated by human-factor studies, then called time-and-motion |
| studies. |
| Although there have been a few human-factor studies examining parallel |
| programming~\cite{RyanEccles2005HPCSNovice,RyanEccles2006HPCSNoviceNeeds, |
| LorinHochstein2005SC,DuaneSzafron1994PEMPDS}, these studies have |
| been extremely narrowly focused, and hence unable to demonstrate any |
| general results. |
| Furthermore, given that the normal range of programmer productivity |
| spans more than an order of magnitude, it is unrealistic to expect |
| an affordable study to be capable of detecting (say) a 10\pct\ difference |
| in productivity. |
| Although the multiple-order-of-magnitude differences that such studies |
| \emph{can} reliably detect are extremely valuable, the most impressive |
| improvements tend to be based on a long series of 10\pct\ improvements. |
| |
| We must therefore take a different approach. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{intro/FourTaskCategories}} |
| \caption{Categories of Tasks Required of Parallel Programmers} |
| \label{fig:intro:Categories of Tasks Required of Parallel Programmers} |
| \end{figure} |
| |
| One such approach is to carefully consider the tasks that parallel |
| programmers must undertake that are not required of sequential programmers. |
| We can then evaluate how well a given programming language or environment |
| assists the developer with these tasks. |
| These tasks fall into the four categories shown in |
| \cref{fig:intro:Categories of Tasks Required of Parallel Programmers}, |
| each of which is covered in the following sections. |
| |
| \subsection{Work Partitioning} |
| \label{sec:intro:Work Partitioning} |
| |
| Work partitioning is absolutely required for parallel execution: |
| If there is but one ``glob'' of work, then it can be executed by at |
| most one CPU at a time, which is by definition sequential execution. |
| However, partitioning the work requires great care. |
| For example, uneven partitioning can result in sequential execution |
| once the small partitions have completed~\cite{GeneAmdahl1967AmdahlsLaw}. |
| In less extreme cases, load balancing can be used to fully utilize |
| available hardware and restore performance and scalability. |
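As a minimal sketch of such partitioning (the names \co{sum_chunk},
\co{N}, and \co{NTHREADS} are arbitrary, and this fragment is not drawn
from later chapters), the following POSIX-threads program statically
splits a summation into equal-sized chunks, one per thread; assigning
one thread a far larger chunk than the others would leave the remaining
CPUs idle while that chunk completed:

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static long partial[NTHREADS];

/* Sum one equal-sized chunk of the range [0, N). */
static void *sum_chunk(void *arg)
{
  long t = (long)arg;          /* Thread index, passed by value. */
  long lo = t * (N / NTHREADS);
  long hi = (t + 1) * (N / NTHREADS);
  long sum = 0;

  for (long i = lo; i < hi; i++)
    sum += i;
  partial[t] = sum;            /* Report back to the parent. */
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  long total = 0;

  for (long t = 0; t < NTHREADS; t++)
    pthread_create(&tid[t], NULL, sum_chunk, (void *)t);
  for (long t = 0; t < NTHREADS; t++) {
    pthread_join(tid[t], NULL);
    total += partial[t];       /* Combine the partial results. */
  }
  printf("total = %ld\n", total);
  return 0;
}
\end{VerbatimU}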
| |
| Although partitioning can greatly improve performance and scalability, |
| it can also increase complexity. |
| For example, partitioning can complicate handling of global |
| errors and events: |
| A parallel program may need to carry out non-trivial synchronization |
| in order to safely process such global events. |
| More generally, each partition requires some sort of communication: |
| After all, if |
| a given thread did not communicate at all, it would have no effect and |
| would thus not need to be executed. |
| However, because communication incurs overhead, careless partitioning choices |
| can result in severe performance degradation. |
| |
| Furthermore, the number of concurrent threads must often be controlled, |
| as each such thread occupies common resources, for example, |
| space in CPU caches. |
| If too many threads are permitted to execute concurrently, the |
CPU caches will overflow, resulting in a high cache-miss rate, which in
| turn degrades performance. |
| Conversely, large numbers of threads are often required to |
| overlap computation and I/O so as to fully utilize I/O devices. |
| |
| \QuickQuiz{ |
| Other than CPU cache capacity, what might require limiting the |
| number of concurrent threads? |
| }\QuickQuizAnswer{ |
| There are any number of potential limits on the number of |
| threads: |
| \begin{enumerate} |
| \item Main memory. |
| Each thread consumes some memory |
| (for its stack if nothing else), so that excessive |
| numbers of threads can exhaust memory, resulting |
| in excessive paging or memory-allocation failures. |
| \item I/O bandwidth. |
| If each thread initiates a given |
| amount of mass-storage I/O or networking traffic, |
| excessive numbers of threads can result in excessive |
| I/O queuing delays, again degrading performance. |
| Some networking protocols may be subject to timeouts |
| or other failures if there are so many threads that |
| networking events cannot be responded to in a timely |
| fashion. |
| \item Synchronization overhead. |
| For many synchronization protocols, excessive numbers |
| of threads can result in excessive spinning, blocking, |
| or rollbacks, thus degrading performance. |
| \end{enumerate} |
| |
| Specific applications and platforms may have any number of additional |
| limiting factors. |
| }\QuickQuizEnd |
| |
| Finally, permitting threads to execute concurrently greatly increases |
| the program's state space, which can make the program difficult to |
| understand and debug, degrading productivity. |
| All else being equal, smaller state spaces having more regular structure |
| are more easily understood, but this is a human-factors statement as much |
| as it is a technical or mathematical statement. |
| Good parallel designs might have extremely large state spaces, but |
| nevertheless be easy to understand due to their regular structure, |
| while poor designs can be impenetrable despite having a comparatively |
| small state space. |
| The best designs exploit embarrassing parallelism, or transform the |
| problem to one having an embarrassingly parallel solution. |
| In either case, ``embarrassingly parallel'' is in fact |
| an embarrassment of riches. |
| The current state of the art enumerates good designs; more work is |
| required to make more general judgments on |
| state-space size and structure. |
| |
| \subsection{Parallel Access Control} |
| \label{sec:Parallel Access Control} |
| |
| Given a single-threaded sequential program, that single |
| thread has full access to all of the program's resources. |
| These resources are most often in-memory data structures, but can be CPUs, |
| memory (including caches), I/O devices, computational accelerators, files, |
| and much else besides. |
| |
| The first parallel-access-control issue is whether the form of access to |
| a given resource depends on that resource's location. |
| For example, in many message-passing environments, local-variable |
| access is via expressions and assignments, |
| while remote-variable access uses an entirely different |
| syntax, usually involving messaging. |
| The POSIX Threads environment~\cite{OpenGroup1997pthreads}, |
| Structured Query Language (SQL)~\cite{DIS9075SQL92}, and |
| partitioned global address-space (PGAS) environments |
| such as Universal Parallel C (UPC)~\cite{ElGhazawi2003UPC,UPCConsortium2013} |
| offer implicit access, |
| while Message Passing Interface (MPI)~\cite{MPIForum2008} offers |
| explicit access because access to remote data requires explicit |
| messaging. |
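For example, in a POSIX-threads program, a shared variable is read and
written using exactly the same expressions and assignments as a
thread-local variable, as shown in the following minimal sketch (the
names \co{worker} and \co{shared_count} are arbitrary):

\begin{VerbatimU}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_count;      /* Shared, yet plain C syntax. */

static void *worker(void *arg)
{
  long local_count = 0;        /* Local: only this thread. */

  for (int i = 0; i < 1000; i++)
    local_count++;             /* No coordination needed. */

  pthread_mutex_lock(&lock);   /* Coordination is explicit... */
  shared_count += local_count; /* ...but the access is not. */
  pthread_mutex_unlock(&lock);
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];

  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, worker, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);
  printf("shared_count = %ld\n", shared_count);
  return 0;
}
\end{VerbatimU}

The \co{pthread_mutex_t} in this sketch is a preview of the second
parallel-access-control issue, coordination, which is taken up next.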
| |
| The other parallel-access-control issue is how threads coordinate |
| access to the resources. |
| This coordination is carried out by |
| the very large number of synchronization mechanisms |
| provided by various parallel languages and environments, |
| including message passing, locking, transactions, |
| reference counting, explicit timing, shared atomic variables, and data |
| ownership. |
| Many traditional parallel-programming concerns such as \IX{deadlock}, |
| \IX{livelock}, and transaction rollback stem from this coordination. |
| This framework can be elaborated to include comparisons |
| of these synchronization mechanisms, for example locking vs.\@ transactional |
| memory~\cite{McKenney2007PLOSTM}, but such elaboration is beyond the |
| scope of this section. |
| (See |
| \cref{sec:future:Transactional Memory,% |
| sec:future:Hardware Transactional Memory} |
| for more information on transactional memory.) |
| |
| \QuickQuiz{ |
| Just what is ``explicit timing''??? |
| }\QuickQuizAnswer{ |
| Where each thread is given access to some set of resources during |
| an agreed-to slot of time. |
| For example, a parallel program with eight threads might be |
| organized into eight-millisecond time intervals, so that the |
| first thread is given access during the first millisecond of |
| each interval, the second thread during the second millisecond, |
| and so on. |
| This approach clearly requires carefully synchronized clocks |
| and careful control of execution times, and therefore should |
| be used with considerable caution. |
| |
| In fact, outside of hard realtime environments, you almost |
| certainly want to use something else instead. |
| Explicit timing is nevertheless worth a mention, as it is |
| always there when you need it. |
| }\QuickQuizEnd |
| |
| \subsection{Resource Partitioning and Replication} |
| \label{sec:Resource Partitioning and Replication} |
| |
| The most effective parallel algorithms and systems exploit resource |
| parallelism, so much so that it is |
| usually wise to begin parallelization by partitioning your write-intensive |
| resources and replicating frequently accessed read-mostly resources. |
| The resource in question is most frequently data, which might be |
| partitioned over computer systems, mass-storage devices, \IXplr{NUMA node}, |
| CPU cores (or dies or hardware threads), pages, cache lines, instances |
| of synchronization primitives, or critical sections of code. |
| For example, partitioning over locking primitives is termed |
| ``\IXh{data}{locking}''~\cite{Beck85}. |
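As a minimal sketch of data locking (a compilable fragment rather than
a complete program, with an arbitrary bucket count and arbitrary names
such as \co{table_insert}), each hash bucket below carries its own lock,
so that threads operating on different buckets need never wait for one
another:

\begin{VerbatimU}
#include <pthread.h>

#define NBUCKETS 64

struct node {
  struct node *next;
  unsigned long key;
};

struct bucket {
  pthread_mutex_t lock;        /* Guards only this bucket's list. */
  struct node *head;
};

static struct bucket table[NBUCKETS];

/* One-time setup: give each bucket its own lock. */
static void table_init(void)
{
  for (int i = 0; i < NBUCKETS; i++) {
    pthread_mutex_init(&table[i].lock, NULL);
    table[i].head = NULL;
  }
}

/* Insert while holding only the target bucket's lock, so that
 * threads working on different buckets proceed in parallel. */
static void table_insert(struct node *np)
{
  struct bucket *bp = &table[np->key % NBUCKETS];

  pthread_mutex_lock(&bp->lock);
  np->next = bp->head;
  bp->head = np;
  pthread_mutex_unlock(&bp->lock);
}
\end{VerbatimU}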
| |
| Resource partitioning is frequently application dependent. |
| For example, numerical applications frequently partition matrices |
| by row, column, or sub-matrix, while commercial applications frequently |
| partition write-intensive data structures and replicate |
| read-mostly data structures. |
| Thus, a commercial application might assign the data for a |
| given customer to a given few computers out of a large cluster. |
| An application might statically partition data, or dynamically |
| change the partitioning over time. |
| |
| Resource partitioning is extremely effective, but |
| it can be quite challenging for complex multilinked data |
| structures. |
| |
| \subsection{Interacting With Hardware} |
| \label{sec:Interacting With Hardware} |
| |
| Hardware interaction is normally the domain of the operating system, |
| the compiler, libraries, or other software-environment infrastructure. |
| However, developers working with novel hardware features and components |
| will often need to work directly with such hardware. |
| In addition, direct access to the hardware can be required when squeezing |
| the last drop of performance out of a given system. |
| In this case, the developer may need to tailor or configure the application |
| to the cache geometry, system topology, or interconnect protocol of the |
| target hardware. |
| |
| In some cases, hardware may be considered to be a resource which |
| is subject to partitioning or access control, as described in |
| the previous sections. |
| |
| \subsection{Composite Capabilities} |
| \label{sec:Composite Capabilities} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{intro/FourTaskOrder}} |
| \caption{Ordering of Parallel-Programming Tasks} |
| \label{fig:intro:Ordering of Parallel-Programming Tasks} |
| \end{figure} |
| |
| Although these four capabilities are fundamental, |
| good engineering practice uses composites of |
| these capabilities. |
| For example, the data-parallel approach first |
| partitions the data so as to minimize the need for |
| inter-partition communication, partitions the code accordingly, |
| and finally maps data partitions and threads so as to maximize |
| throughput while minimizing inter-thread communication, |
| as shown in |
| \cref{fig:intro:Ordering of Parallel-Programming Tasks}. |
| The developer can then |
| consider each partition separately, greatly reducing the size |
| of the relevant state space, in turn increasing productivity. |
| Even though some problems are non-partitionable, |
| clever transformations into forms permitting partitioning can |
| sometimes greatly enhance |
| both performance and scalability~\cite{PanagiotisMetaxas1999PDCS}. |
| |
| \subsection{How Do Languages and Environments Assist With These Tasks?} |
| \label{sec:intro:How Do Languages and Environments Assist With These Tasks?} |
| |
| Although many environments require the developer to deal manually |
| with these tasks, there are long-standing environments that bring |
| significant automation to bear. |
| The poster child for these environments is SQL, many implementations |
| of which automatically parallelize single large queries and also |
| automate concurrent execution of independent queries and updates. |
| |
| These four categories of tasks must be carried out in all parallel |
| programs, but that of course does not necessarily mean that the developer |
| must manually carry out these tasks. |
| We can expect to see ever-increasing automation of these four tasks |
| as parallel systems continue to become cheaper and more readily available. |
| |
| \QuickQuiz{ |
| Are there any other obstacles to parallel programming? |
| }\QuickQuizAnswer{ |
| There are a great many other potential obstacles to parallel |
| programming. |
| Here are a few of them: |
| \begin{enumerate} |
| \item The only known algorithms for a given project might |
| be inherently sequential in nature. |
| In this case, either avoid parallel programming |
| (there being no law saying that your project \emph{has} |
| to run in parallel) or invent a new parallel algorithm. |
| \item The project allows binary-only plugins that share the same |
| address space, such that no one developer has access to |
| all of the source code for the project. |
| Because many parallel bugs, including deadlocks, are |
| global in nature, such binary-only plugins pose a severe |
| challenge to current software development methodologies. |
| This might well change, but for the time being, all |
| developers of parallel code sharing a given address space |
| need to be able to see \emph{all} of the code running in |
| that address space. |
| \item The project contains heavily used APIs that were designed |
| without regard to |
| parallelism~\cite{HagitAttiya2011LawsOfOrder,Clements:2013:SCR:2517349.2522712}. |
| Some of the more ornate features of the System V |
| message-queue API form a case in point. |
| Of course, if your project has been around for a few |
| decades, and its developers did not have access to |
| parallel hardware, it undoubtedly has at least |
| its share of such APIs. |
| \item The project was implemented without regard to parallelism. |
| Given that there are a great many techniques that work |
| extremely well in a sequential environment, but that |
| fail miserably in parallel environments, if your project |
| ran only on sequential hardware for most of its lifetime, |
	then your project undoubtedly has at least its share of
| parallel-unfriendly code. |
| \item The project was implemented without regard to good |
| software-development practice. |
| The cruel truth is that shared-memory parallel |
| environments are often much less forgiving of sloppy |
| development practices than are sequential environments. |
| You may be well-served to clean up the existing design |
| and code prior to attempting parallelization. |
| \item The people who originally did the development on your |
| project have since moved on, and the people remaining, |
| while well able to maintain it or add small features, |
| are unable to make ``big animal'' changes. |
| In this case, unless you can work out a very simple |
| way to parallelize your project, you will probably |
| be best off leaving it sequential. |
| That said, there are a number of simple approaches that |
| you might use |
| to parallelize your project, including running multiple |
| instances of it, using a parallel implementation of |
| some heavily used library function, or making use of |
| some other parallel project, such as a database. |
| \end{enumerate} |
| |
| One can argue that many of these obstacles are non-technical |
| in nature, but that does not make them any less real. |
| In short, parallelization of a large body of code |
| can be a large and complex effort. |
| As with any large and complex effort, it makes sense to |
| do your homework beforehand. |
| }\QuickQuizEnd |
| |
| \section{Discussion} |
| \label{sec:intro:Discussion} |
| % |
| \epigraph{Until you try, you don't know what you can't do.} |
| {Henry James} |
| |
| This section has given an overview of the difficulties with, goals of, |
| and alternatives to parallel programming. |
| This overview was followed by a discussion of |
| what can make parallel programming hard, along with a high-level |
| approach for dealing with parallel programming's difficulties. |
| Those who still insist that parallel programming is impossibly difficult |
| should review some of the older guides to parallel |
programming~\cite{SQNTParallel,AndrewDBirrell1989Threads,Beck85,Inman85}.
| The following quote from Andrew Birrell's |
| monograph~\cite{AndrewDBirrell1989Threads} is especially telling: |
| |
| \begin{quote} |
| Writing concurrent programs has a reputation for being exotic |
| and difficult. |
| I~believe it is neither. |
| You need a system that provides you with good primitives |
| and suitable libraries, |
| you need a basic caution and carefulness, you need an armory of |
| useful techniques, and you need to know of the common pitfalls. |
| I~hope that this paper has helped you towards sharing my belief. |
| \end{quote} |
| |
| The authors of these older guides were well up to the parallel programming |
| challenge back in the 1980s. |
| As such, there are simply no excuses for refusing to step up to the |
| parallel-programming challenge here in the 21\textsuperscript{st} century! |
| |
| We are now ready to proceed to the next chapter, which dives into the |
| relevant properties of the parallel hardware underlying our parallel |
| software. |
| |
| \QuickQuizAnswersChp{qqzintro} |