| % intro/intro.tex |
| |
| \QuickQuizChapter{chp:Introduction}{Introduction} |
| |
| Parallel programming has earned a reputation as one of the most |
| difficult areas a hacker can tackle. |
| Papers and textbooks warn of the perils of deadlock, livelock, |
| race conditions, non-determinism, Amdahl's-Law limits to scaling, |
| and excessive realtime latencies. |
| And these perils are quite real; we authors have accumulated uncounted |
| % 2009: |
| % 19 for Paul E. McKenney |
| years of experience dealing with them, and all of the emotional scars, |
| grey hairs, and hair loss that go with such experiences. |
| |
However, while new technologies tend to be difficult to use when first
introduced, they invariably become easier over time.
| For example, there was a time when the ability to drive a car was a rare |
| skill, but in many developed countries, this skill is now commonplace. |
| This dramatic change came about for two basic reasons: (1)~cars became |
| cheaper and more readily available, so that more people had the |
| opportunity to learn to drive, and (2)~cars became easier to operate, |
| due to automatic transmissions, automatic chokes, automatic starters, |
| greatly improved reliability, |
| and a host of other technological improvements. |
| |
| The same is true of a host of other technologies, including computers. |
| It is no longer necessary to operate a keypunch in order to program. |
| Spreadsheets allow most non-programmers to get results from their computers |
| that would have required a team of specialists a few decades ago. |
| Perhaps the most compelling example is web-surfing and content creation, |
which since the early 2000s have been easily done by
| untrained, uneducated people using various now-commonplace |
| social-networking tools. |
| As recently as 1968, such content creation was a far-out research |
| project~\cite{DouglasEngelbart1968}, described at |
| the time as |
``like a UFO landing on the White House lawn''~\cite{ScottGriffen2000}.
| % http://www.ibiblio.org/pioneers/engelbart.html |
| % http://www.histech.rwth-aachen.de/www/quellen/engelbart/ahi62index.html |
| |
| Therefore, if you wish to argue that parallel programming will remain |
| as difficult as it is currently perceived by many to be, it is you |
| who bears the burden of proof, keeping in mind the many centuries of |
| counter-examples in a variety of fields of endeavor. |
| |
| \section{Historic Parallel Programming Difficulties} |
| \label{sec:intro:Historic Parallel Programming Difficulties} |
| |
| As indicated by its title, this book takes a different approach. |
| Rather than complain about the difficulty of parallel programming, |
| it instead examines the reasons why parallel programming is |
| difficult, and then works to help the reader to overcome these |
| difficulties. |
| As will be seen, these difficulties have fallen into several categories, |
| including: |
| |
| \begin{enumerate} |
| \item The historic high cost and relative rarity of parallel systems. |
| \item The typical researcher's and practitioner's lack of experience |
| with parallel systems. |
| \item The paucity of publicly accessible parallel code. |
| \item The lack of a widely understood engineering discipline of |
| parallel programming. |
| \item The high overhead of communication relative to that of processing, |
| even in tightly coupled shared-memory computers. |
| \end{enumerate} |
| |
| Many of these historic difficulties are well on the way to being overcome. |
| First, over the past few decades, the cost of parallel systems |
| has decreased from many multiples of that of a house to a fraction of |
| that of a bicycle, thanks to the advent of multicore systems. |
| Papers calling out the advantages of multicore CPUs were published |
| as early as 1996~\cite{Olukotun96}, IBM introduced simultaneous multi-threading |
| into its high-end POWER family in 2000, and multicore in 2001. |
| Intel introduced hyperthreading into its commodity Pentium line in |
| November 2000, and both AMD and Intel introduced |
| dual-core CPUs in 2005. |
| Sun followed with the multicore/multi-threaded Niagara in late 2005. |
| In fact, by 2008, it was becoming difficult |
| to find a single-CPU desktop system, with single-core CPUs being |
| relegated to netbooks and embedded devices. |
| By 2012, even smartphones were starting to sport multiple CPUs. |
| |
Second, the advent of low-cost and readily available multicore systems
| means that the once-rare experience of parallel programming is |
| now available to almost all researchers and practitioners. |
| In fact, parallel systems are now well within the budget of students |
| and hobbyists. |
| We can therefore expect greatly increased levels of invention and |
| innovation surrounding parallel systems, and that increased familiarity |
will over time make the once-forbidding field of parallel programming
| much more friendly and commonplace. |
| |
| Third, in the 20\textsuperscript{th} century, large systems of |
| highly parallel software were almost always closely guarded proprietary |
| secrets. |
| In happy contrast, the 21\textsuperscript{st} century has seen numerous |
| open-source (and thus publicly available) parallel software projects, |
| including the Linux kernel~\cite{Torvalds2.6kernel}, |
| database systems~\cite{PostgreSQL2008,MySQL2008}, |
| and message-passing systems~\cite{OpenMPI2008,BOINC2008}. |
| This book will draw primarily from the Linux kernel, but will |
| provide much material suitable for user-level applications. |
| |
| Fourth, even though the large-scale parallel-programming projects of |
| the 1980s and 1990s were almost all proprietary projects, these |
| projects have seeded the community with a cadre of developers who |
| understand the engineering discipline required to develop production-quality |
| parallel code. |
| A major purpose of this book is to present this engineering discipline. |
| |
| Unfortunately, the fifth difficulty, the high cost of communication |
| relative to that of processing, remains largely in force. |
| Although this difficulty has been receiving increasing attention during |
| the new millennium, according to Stephen Hawking, |
| the finite speed of light and the atomic |
nature of matter are likely to limit progress in this
| area~\cite{BryanGardiner2007,GordonMoore03a}. |
| Fortunately, this difficulty has been in force since the late 1980s, |
| so that the aforementioned engineering discipline has evolved practical |
| and effective strategies for handling it. |
| In addition, hardware designers are increasingly aware of these issues, |
| so perhaps future hardware will be more friendly to parallel software |
| as discussed in Section~\ref{sec:cpu:Hardware Free Lunch?}. |
| |
| \QuickQuiz{} |
| Come on now!!! |
| Parallel programming has been known to be exceedingly |
| hard for many decades. |
| You seem to be hinting that it is not so hard. |
| What sort of game are you playing? |
| \QuickQuizAnswer{ |
| If you really believe that parallel programming is exceedingly |
| hard, then you should have a ready answer to the question |
| ``Why is parallel programming hard?'' |
| One could list any number of reasons, ranging from deadlocks to |
| race conditions to testing coverage, but the real answer is that |
| {\em it is not really all that hard}. |
| After all, if parallel programming was really so horribly difficult, |
| how could a large number of open-source projects, ranging from Apache |
| to MySQL to the Linux kernel, have managed to master it? |
| |
A better question might be: ``Why is parallel programming {\em
| perceived} to be so difficult?'' |
| To see the answer, let's go back to the year 1991. |
| Paul McKenney was walking across the parking lot to Sequent's |
| benchmarking center carrying six dual-80486 Sequent Symmetry CPU |
| boards, when he suddenly realized that he was carrying several |
| times the price of the house he had just purchased.\footnote{ |
| Yes, this sudden realization {\em did} cause him to walk quite |
| a bit more carefully. |
| Why do you ask?} |
| This high cost of parallel systems meant that |
| parallel programming was restricted to a privileged few who |
| worked for an employer who either manufactured or could afford to |
purchase machines costing upwards of \$100,000 --- in 1991 US dollars.
| |
| In contrast, in 2006, Paul finds himself typing these words on a |
| dual-core x86 laptop. |
| Unlike the dual-80486 CPU boards, this laptop also contains |
| 2GB of main memory, a 60GB disk drive, a display, Ethernet, |
| USB ports, wireless, and Bluetooth. |
| And the laptop is more than an order of magnitude cheaper than |
| even one of those dual-80486 CPU boards, even before taking inflation |
| into account. |
| |
| Parallel systems have truly arrived. |
| They are no longer the sole domain of a privileged few, but something |
| available to almost everyone. |
| |
| The earlier restricted availability of parallel hardware is |
| the \emph{real} reason that parallel programming is considered |
| so difficult. |
| After all, it is quite difficult to learn to program even the simplest |
| machine if you have no access to it. |
| Since the age of rare and expensive parallel machines is for the most |
| part behind us, the age during which |
| parallel programming is perceived to be mind-crushingly difficult is |
| coming to a close.\footnote{ |
| Parallel programming is in some ways more difficult than |
| sequential programming, for example, parallel validation |
| is more difficult. |
| But no longer mind-crushingly difficult.} |
| } \QuickQuizEnd |
| |
| However, even though parallel programming might not be as hard as |
| is commonly advertised, it is often more work than is sequential |
| programming. |
| |
| \QuickQuiz{} |
| How could parallel programming \emph{ever} be as easy |
| as sequential programming? |
| \QuickQuizAnswer{ |
| It depends on the programming environment. |
| SQL~\cite{DIS9075SQL92} is an underappreciated success |
| story, as it permits programmers who know nothing about parallelism |
| to keep a large parallel system productively busy. |
| We can expect more variations on this theme as parallel |
| computers continue to become cheaper and more readily available. |
| For example, one possible contender in the scientific and |
| technical computing arena is MATLAB*P, |
| which is an attempt to automatically parallelize common |
| matrix operations. |
| |
| Finally, on Linux and UNIX systems, consider the following |
| shell command: |
| |
| {\small \tt get\_input | grep "interesting" | sort} |
| |
| This shell pipeline runs the \co{get_input}, \co{grep}, |
| and \co{sort} processes in parallel. |
| There, that wasn't so hard, now was it? |
| } \QuickQuizEnd |
| |
| It therefore makes sense to consider alternatives to parallel programming. |
| However, it is not possible to reasonably consider parallel-programming |
| alternatives without understanding parallel-programming goals. |
| This topic is addressed in the next section. |
| |
| \section{Parallel Programming Goals} |
| \label{sec:intro:Parallel Programming Goals} |
| |
| The three major goals of parallel programming (over and above those |
| of sequential programming) are as follows: |
| |
| \begin{enumerate} |
| \item Performance. |
| \item Productivity. |
| \item Generality. |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| Oh, really??? |
| What about correctness, maintainability, robustness, and so on? |
| \QuickQuizAnswer{ |
| These are important goals, but they are just as important for |
| sequential programs as they are for parallel programs. |
| Therefore, important though they are, they do not belong on |
| a list specific to parallel programming. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| And if correctness, maintainability, and robustness don't |
| make the list, why do productivity and generality? |
| \QuickQuizAnswer{ |
| Given that parallel programming is perceived to be much harder |
than is sequential programming, productivity is paramount and
| therefore must not be omitted. |
| Furthermore, high-productivity parallel-programming environments |
| such as SQL have been special purpose, hence generality must |
| also be added to the list. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Given that parallel programs are much harder to prove |
| correct than are sequential programs, again, shouldn't |
| correctness \emph{really} be on the list? |
| \QuickQuizAnswer{ |
| From an engineering standpoint, the difficulty in proving |
| correctness, either formally or informally, would be important |
| insofar as it impacts the primary goal of productivity. |
| So, in cases where correctness proofs are important, they |
| are subsumed under the ``productivity'' rubric. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| What about just having fun? |
| \QuickQuizAnswer{ |
| Having fun is important as well, but, unless you are a hobbyist, |
| would not normally be a \emph{primary} goal. |
| On the other hand, if you \emph{are} a hobbyist, go wild! |
| } \QuickQuizEnd |
| |
| Each of these goals is elaborated upon in the following sections. |
| |
| \subsection{Performance} |
| \label{sec:intro:Performance} |
| |
| Performance is the primary goal behind most parallel-programming effort. |
| After all, if performance is not a concern, why not do yourself |
| a favor, just write sequential code, and be happy? |
| It will very likely be easier, |
| and you will probably get done much more quickly. |
| |
| \QuickQuiz{} |
| Are there no cases where parallel programming is about something |
| other than performance? |
| \QuickQuizAnswer{ |
| There are certainly cases where the problem to be solved is |
| inherently parallel, for example, Monte Carlo methods and |
| some numerical computations. |
| Even in these cases, however, there will be some amount of |
| extra work managing the parallelism. |
| } \QuickQuizEnd |
| |
| Note that ``performance'' is interpreted quite broadly here, |
| including scalability (performance per CPU) and efficiency |
| (for example, performance per watt). |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/clockfreq}} |
| \end{center} |
| \caption{MIPS/Clock-Frequency Trend for Intel CPUs} |
| \label{fig:intro:Clock-Frequency Trend for Intel CPUs} |
| \end{figure} |
| |
| That said, the focus of performance has shifted from hardware to |
| parallel software. |
| This change in focus is due to the fact that although Moore's Law |
| continues to deliver increases in transistor density, it has ceased to |
| provide the traditional single-threaded performance |
| increases, as can be seen in |
| Figure~\ref{fig:intro:Clock-Frequency Trend for Intel CPUs}.\footnote{ |
| This plot shows clock frequencies for newer CPUs theoretically |
| capable of retiring one or more instructions per clock, and MIPS for |
| older CPUs requiring multiple clocks to execute even the |
| simplest instruction. |
| The reason for taking this approach is that the newer CPUs' |
| ability to retire multiple instructions per clock is typically |
| limited by memory-system performance.} |
| This means that writing single-threaded code and simply waiting |
| a year or two for the CPUs to catch up may no longer be an option. |
| Given the recent trends on the part of all major manufacturers towards |
| multicore/multithreaded systems, parallelism is the way to go for |
those wanting to avail themselves of the full performance of their
| systems. |
| |
| Even so, the first goal is performance rather than scalability, |
| especially given that the easiest way to attain linear scalability |
| is to reduce the performance of each CPU~\cite{LinusTorvalds2001a}. |
| Given a four-CPU system, which would you prefer? |
| A program that provides 100 transactions per second on a single CPU, |
| but does not scale at all? |
| Or a program that provides 10 transactions per second on a single CPU, |
| but scales perfectly? |
The first program seems like a better bet: on the four-CPU system it
still delivers 100 transactions per second, while the perfectly scaling
program delivers only 40.
That said, the answer might change if you happened to be one of the
lucky few with access to a 32-CPU system, on which the scalable program
would deliver 320 transactions per second.
| |
| That said, just because you have multiple CPUs is not necessarily |
| in and of itself a reason to use them all, especially given the |
| recent decreases in price of multi-CPU systems. |
| The key point to understand is that parallel programming is primarily |
| a performance optimization, and, as such, it is one potential optimization |
| of many. |
| If your program is fast enough as currently written, there is no |
| reason to optimize, either by parallelizing it or by applying any |
| of a number of potential sequential optimizations.\footnote{ |
| Of course, if you are a hobbyist whose primary interest is |
| writing parallel software, that is more than enough reason to |
| parallelize whatever software you are interested in.} |
| By the same token, if you are looking to apply parallelism as an |
| optimization to a sequential program, then you will need to compare |
| parallel algorithms to the best sequential algorithms. |
| This may require some care, as far too many publications ignore the |
| sequential case when analyzing the performance of parallel algorithms. |
| |
| \subsection{Productivity} |
| \label{sec:intro:Productivity} |
| |
| \QuickQuiz{} |
| Why all this prattling on about non-technical issues??? |
| And not just \emph{any} non-technical issue, but \emph{productivity} |
| of all things? |
| Who cares? |
| \QuickQuizAnswer{ |
| If you are a pure hobbyist, perhaps you don't need to care. |
| But even pure hobbyists will often care about how much they |
| can get done, and how quickly. |
| After all, the most popular hobbyist tools are usually those |
| that are the best suited for the job, and an important part of |
| the definition of ``best suited'' involves productivity. |
| And if someone is paying you to write parallel code, they will |
| very likely care deeply about your productivity. |
| And if the person paying you cares about something, you would |
| be most wise to pay at least some attention to it! |
| |
| Besides, if you \emph{really} didn't care about productivity, |
| you would be doing it by hand rather than using a computer! |
| } \QuickQuizEnd |
| |
Productivity has become increasingly important through the decades.
| To see this, consider that early computers cost millions of dollars at |
| a time when engineering salaries were a few thousand dollars a year. |
| If dedicating a team of ten engineers to such a machine would improve |
| its performance by 10\%, their salaries would be repaid many times over. |
| |
| One such machine was the CSIRAC, the oldest still-intact stored-program |
computer, put into operation in
| 1949~\cite{CSIRACMuseumVictoria,CSIRACUniversityMelbourne}. |
| Given that the machine had but 768 words of RAM, it is safe to say |
| that the productivity issues that arise in large-scale software projects |
| were not an issue for this machine. |
| Because this machine was built before the transistor era, it was constructed |
| of 2,000 vacuum tubes, ran with a clock frequency of 1kHz, |
| consumed 30kW of power, and weighed more than three metric tons. |
| |
| It would be difficult to purchase a machine with this little compute |
| power roughly sixty years later (2008), with the closest equivalents |
| being 8-bit embedded microprocessors exemplified by the venerable |
| Z80~\cite{z80Wikipedia}. |
This CPU had 8,500 transistors, and could still be purchased in 2008
| for less than \$2 US per unit in 1,000-unit quantities. |
| In stark contrast to the CSIRAC, software-development costs are |
| anything but insignificant for the Z80. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{SMPdesign/mipsperbuck}} |
| \end{center} |
| \caption{MIPS per Die for Intel CPUs} |
| \label{fig:intro:MIPS per Die for Intel CPUs} |
| \end{figure} |
| |
| The CSIRAC and the Z80 are two points in a long-term trend, as can be |
| seen in |
| Figure~\ref{fig:intro:MIPS per Die for Intel CPUs}. |
| This figure plots an approximation to computational power per die |
| over the past three decades, showing a consistent four-order-of-magnitude |
| increase. |
| Note that the advent of multicore CPUs has permitted this increase to |
| continue unabated despite the clock-frequency wall encountered in 2003. |
| |
| One of the inescapable consequences of the rapid decrease in |
| the cost of hardware is that software productivity grows increasingly |
| important. |
It is no longer sufficient merely to make efficient use of the hardware;
it is now also necessary to make extremely efficient use of software
| developers. |
| This has long been the case for sequential hardware, but only recently |
| has parallel hardware become a low-cost commodity. |
| Therefore, the need for high productivity in creating parallel software |
| has only recently become hugely important. |
| |
| \QuickQuiz{} |
| Given how cheap parallel hardware has become, how can anyone |
| afford to pay people to program it? |
| \QuickQuizAnswer{ |
| There are a number of answers to this question: |
| \begin{enumerate} |
| \item Given a large computational cluster of parallel machines, |
| the aggregate cost of the cluster can easily justify |
| substantial developer effort, because the development |
| cost can be spread over the large number of machines. |
| \item Popular software that is run by tens of millions of users |
| can easily justify substantial developer effort, |
| as the cost of this development can be spread over the tens |
| of millions of users. |
| Note that this includes things like kernels and system |
| libraries. |
| \item If the low-cost parallel machine is controlling the operation |
| of a valuable piece of equipment, then the cost of this |
| piece of equipment might easily justify substantial |
| developer effort. |
\item If the software for the low-cost parallel machine produces an
| extremely valuable result (e.g., mineral exploration), |
| then the valuable result might again justify substantial |
| developer cost. |
| \item Safety-critical systems protect lives, which can clearly |
| justify very large developer effort. |
| \item Hobbyists and researchers might seek knowledge, experience, |
| fun, or glory rather than mere money. |
| \end{enumerate} |
| So it is not the case that the decreasing cost of hardware renders |
| software worthless, but rather that it is no longer possible to |
| ``hide'' the cost of software development within the cost of |
| the hardware, at least not unless there are extremely large |
| quantities of hardware. |
| } \QuickQuizEnd |
| |
| Perhaps at one time, the sole purpose of parallel software was performance. |
| Now, however, productivity is increasingly important. |
| |
| \subsection{Generality} |
| \label{sec:intro:Generality} |
| |
| One way to justify the high cost of developing parallel software |
| is to strive for maximal generality. |
| All else being equal, the cost of a more-general software artifact |
| can be spread over more users than can a less-general artifact. |
| |
| Unfortunately, generality often comes at the cost of performance, |
| productivity, or both. |
| To see this, consider the following popular parallel programming |
| environments: |
| |
| \begin{description} |
| \item[C/C++ ``Locking Plus Threads'']: This category, which includes |
| POSIX Threads (pthreads)~\cite{OpenGroup1997pthreads}, |
| Windows Threads, and numerous |
| operating-system kernel environments, offers excellent performance |
| (at least within the confines of a single SMP system) |
| and also offers good generality. |
| Pity about the relatively low productivity. |
| \item[Java]: This programming environment, which is inherently |
| multithreaded, is widely believed to be much more productive |
| than C or C++, courtesy of the automatic garbage collector |
| and the rich set of class libraries, and is reasonably |
| general purpose. |
| However, its performance, though greatly improved in the early |
| 2000s, is generally considered to be less than that of C and C++. |
| \item[MPI]: This Message Passing Interface~\cite{MPIForum2008} powers |
| the largest scientific and technical computing clusters in |
| the world, so offers unparalleled performance and scalability. |
| It is in theory general purpose, but has generally been used |
| for scientific and technical computing. |
| Its productivity is believed by many to be even lower than that |
| of C/C++ ``locking plus threads'' environments. |
| \item[OpenMP]: This set of compiler directives can be used |
| to parallelize loops. It is thus quite specific to this |
| task, and this specificity often limits its performance. |
| It is, however, much easier to use than MPI or C/C++ |
| ``locking plus threads.'' |
| \item[SQL]: Structured Query Language~\cite{DIS9075SQL92} is extremely |
| specific, applying only to relational database queries. |
| However, its performance is quite good, doing quite well |
| in Transaction Processing Performance Council (TPC) |
| benchmarks~\cite{TPC}. |
Productivity is excellent; in fact, this parallel programming
| environment enables parallel-programming novices |
| to make good use of a large parallel system. |
| \end{description} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{intro/PPGrelation}} |
| \end{center} |
| \caption{Software Layers and Performance, Productivity, and Generality} |
| \label{fig:intro:Software Layers and Performance, Productivity, and Generality} |
| \end{figure} |
| |
| The nirvana of parallel programming environments, one that offers |
| world-class performance, productivity, and generality, simply does |
| not yet exist. |
| Until such a nirvana appears, it will be necessary to make engineering |
| tradeoffs among performance, productivity, and generality. |
| One such tradeoff is shown in |
| Figure~\ref{fig:intro:Software Layers and Performance, Productivity, and Generality}, |
| which shows how productivity becomes increasingly important at the upper layers |
| of the system stack, |
| while performance and generality become increasingly important at the |
| lower layers of the system stack. |
The huge development costs incurred near the bottom of the stack must
be spread over equally huge numbers of users (hence the importance of
generality), while performance lost near the bottom of the stack cannot
easily be recovered further up the stack.
| Near the top of the stack, there might be very few users for a given |
| specific application, in which case productivity concerns are paramount. |
| This explains the tendency towards ``bloatware'' further up the stack: |
| extra hardware is often cheaper than the extra developers. |
| This book is intended for developers working near the bottom |
| of the stack, where performance and generality are of great concern. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{intro/Generality}} |
| \end{center} |
| \caption{Tradeoff Between Productivity and Generality} |
| \label{fig:intro:Tradeoff Between Productivity and Generality} |
| \end{figure} |
| |
| It is important to note that a tradeoff between productivity and |
| generality has existed for centuries in many fields. |
| For but one example, a nailgun is far more productive than is a |
| hammer, but in contrast to the nailgun, a hammer can be used for |
| many things besides driving nails. |
| It should therefore be absolutely no surprise to see similar tradeoffs |
| appear in the field of parallel computing. |
| This tradeoff is shown schematically in |
| Figure~\ref{fig:intro:Tradeoff Between Productivity and Generality}. |
| Here, users~1, 2, 3, and 4 have specific jobs that they need the computer |
| to help them with. |
| The most productive possible language or environment for a given user is one |
| that simply does that user's job, without requiring any programming, |
| configuration, or other setup. |
| |
| \QuickQuiz{} |
| This is a ridiculously unachievable ideal! |
| Why not focus on something that is achievable in practice? |
| \QuickQuizAnswer{ |
| This is eminently achievable. |
| The cellphone is a computer that can be used to make phone |
| calls and to send and receive text messages with little or |
| no programming or configuration on the part of the end user. |
| |
| This might seem to be a trivial example at first glance, |
| but if you consider it carefully you will see that it is |
| both simple and profound. |
| When we are willing to sacrifice generality, we can achieve |
| truly astounding increases in productivity. |
| Those who cling to generality will therefore fail to set |
| the productivity bar high enough to succeed in production |
| environments. |
| } \QuickQuizEnd |
| |
| Unfortunately, a system that does the job required by user~1 is |
| unlikely to do user~2's job. |
| In other words, the most productive languages and environments are |
| domain-specific, and thus by definition lacking generality. |
| |
| Another option is to tailor a given programming language or environment |
| to the hardware system (for example, low-level languages such as |
| assembly, C, C++, or Java) or to some abstraction (for example, |
| Haskell, Prolog, or Snobol), as is shown by the circular region near |
| the center of |
| Figure~\ref{fig:intro:Tradeoff Between Productivity and Generality}. |
| These languages can be considered to be general in the sense that they |
| are equally ill-suited to the jobs required by users~1, 2, 3, and 4. |
| In other words, their generality is purchased at the expense of |
| decreased productivity when compared to domain-specific languages |
| and environments. |
| Worse yet, a language that is tailored to a given abstraction |
| is also likely to suffer from performance and scalability problems |
| unless and until someone figures out how to efficiently map that |
| abstraction to real hardware. |
| |
| With the three often-conflicting parallel-programming goals of |
| performance, productivity, |
| and generality in mind, it is now time to look into avoiding these |
| conflicts by considering alternatives to |
| parallel programming. |
| |
| \section{Alternatives to Parallel Programming} |
| \label{sec:intro:Alternatives to Parallel Programming} |
| |
| In order to properly consider alternatives to parallel programming, |
| you must first have thought through what you expect the parallelism |
| to do for you. |
| As seen in Section~\ref{sec:intro:Parallel Programming Goals}, |
| the primary goals of parallel programming are performance, productivity, |
| and generality. |
| Because this book is intended for developers working on |
| performance-critical code near the bottom of the software stack, |
| the remainder of this section focuses primarily on performance improvement. |
| |
| It is important to keep in mind that parallelism is but one way to |
| improve performance. |
| Other well-known approaches include the following, in roughly increasing |
| order of difficulty: |
| |
| \begin{enumerate} |
| \item Run multiple instances of a sequential application. |
| \item Make the application use existing parallel software. |
| \item Apply performance optimization to the serial application. |
| \end{enumerate} |
| |
These approaches are covered in the following sections.
| |
| \subsection{Multiple Instances of a Sequential Application} |
| \label{sec:intro:Multiple Instances of a Sequential Application} |
| |
| Running multiple instances of a sequential application can allow you |
| to do parallel programming without actually doing parallel programming. |
| There are a large number of ways to approach this, depending on the |
| structure of the application. |
| |
| If your program is analyzing a large number of different scenarios, |
| or is analyzing a large number of independent data sets, one easy |
| and effective approach is to create a single sequential program that |
| carries out a single analysis, then use any of a number of scripting |
| environments (for example the \co{bash} shell) to run a number of |
| instances of this sequential program in parallel. |
| In some cases, this approach can be easily extended to a cluster of |
| machines. |
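
Although a short script often suffices, the same effect can be obtained
directly from C, as in the following rough sketch, which forks one child
per data set named on its command line and runs a hypothetical sequential
\co{analyze} program on each; both the program name and its arguments are
of course placeholders:

{\small
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  int i;

  /* Fork one child per data set named on the command line. */
  for (i = 1; i < argc; i++) {
    pid_t pid = fork();

    if (pid < 0) {
      perror("fork");
      exit(EXIT_FAILURE);
    }
    if (pid == 0) {
      /* Child: run the (hypothetical) sequential analysis. */
      execlp("./analyze", "analyze", argv[i], (char *)NULL);
      perror("execlp");
      exit(EXIT_FAILURE);
    }
  }

  /* Parent: wait for all analyses to complete. */
  while (wait(NULL) > 0)
    continue;
  return 0;
}
\end{verbatim}
}

Each child runs the same sequential code, so no locking or other
synchronization is required within \co{analyze} itself.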
| |
| This approach may seem like cheating, and in fact some denigrate such |
| programs as ``embarrassingly parallel''. |
This approach does indeed have some potential disadvantages,
| including increased memory consumption, waste of CPU cycles recomputing |
| common intermediate results, and increased copying of data. |
| However, it is often extremely productive, garnering extreme performance |
| gains with little or no added effort. |
| |
| \subsection{Use Existing Parallel Software} |
| \label{sec:intro:Use Existing Parallel Software} |
| |
| There is no longer any shortage of parallel software environments that |
| can present a single-threaded programming environment, |
| including relational |
| databases~\cite{Date82}, |
| web-application servers, and map-reduce environments. |
| For example, a common design provides a separate program for each |
| user, each of which generates SQL that is run concurrently against a common |
| relational database. |
| The per-user programs are responsible only for the user interface, |
| with the relational database taking full responsibility for the |
| difficult issues surrounding parallelism and persistence. |
| |
| Taking this approach often sacrifices some performance, at least when |
| compared to carefully hand-coding a fully parallel application. |
| However, such sacrifice is often justified given the huge reduction in |
| development effort required. |
| |
| \subsection{Performance Optimization} |
| \label{sec:intro:Performance Optimization} |
| |
| Up through the early 2000s, CPU performance was doubling every 18 months. |
In such an environment, it was often much more important to create new
functionality than to do careful performance optimization.
| Now that Moore's Law is ``only'' increasing transistor density instead |
| of increasing both transistor density and per-transistor performance, |
| it might be a good time to rethink the importance of performance |
| optimization. |
| After all, performance optimization can reduce power consumption as |
| well as increasing performance. |
| |
| From this viewpoint, parallel programming is but another performance |
| optimization, albeit one that is becoming much more attractive |
| as parallel systems become cheaper and more readily available. |
| However, it is wise to keep in mind that the speedup available from |
| parallelism is limited to roughly the number of CPUs, while the |
| speedup potentially available from straight software optimization |
| can be multiple orders of magnitude. |
| |
| Furthermore, different programs might have different performance |
| bottlenecks. |
| Parallel programming will only help with some bottlenecks. |
| For example, if your program spends most of its time |
| waiting on data from your disk drive, |
| using multiple CPUs is not likely to gain much performance. |
| In fact, if the program was reading from a large file laid out |
| sequentially on a rotating disk, parallelizing your program might |
| well make it a lot slower. |
| You should instead add more disk drives, optimize the data so that |
| the file can be smaller (thus faster to read), or, if possible, |
| avoid the need to read quite so much of the data. |
| |
| \QuickQuiz{} |
| What other bottlenecks might prevent additional CPUs from |
| providing additional performance? |
| \QuickQuizAnswer{ |
| There are any number of potential bottlenecks: |
| \begin{enumerate} |
| \item Main memory. If a single thread consumes all available |
| memory, additional threads will simply page themselves |
| silly. |
| \item Cache. If a single thread's cache footprint completely |
| fills any shared CPU cache(s), then adding more threads |
| will simply thrash the affected caches. |
| \item Memory bandwidth. If a single thread consumes all available |
| memory bandwidth, additional threads will simply |
| result in additional queuing on the system interconnect. |
| \item I/O bandwidth. If a single thread is I/O bound, |
| adding more threads will simply result in them all |
| waiting in line for the affected I/O resource. |
| \end{enumerate} |
| |
| Specific hardware systems might have any number of additional |
| bottlenecks. |
| } \QuickQuizEnd |
| |
| Parallelism can be a powerful optimization technique, but |
| it is not the only such technique, nor is it appropriate for all |
| situations. |
| Of course, the easier it is to parallelize your program, the |
| more attractive parallelization becomes as an optimization. |
| Parallelization has a reputation of being quite difficult, |
| which leads to the question ``exactly what makes parallel |
| programming so difficult?'' |
| |
| \section{What Makes Parallel Programming Hard?} |
| \label{sec:intro:What Makes Parallel Programming Hard?} |
| |
| \OriginallyPublished{Section}{sec:intro:What Makes Parallel Programming Hard?}{What Makes Parallel Programming Hard?}{a Portland State University Technical Report}{PaulEMcKenney2009ProgrammingHard} |
| |
| It is important to note that the difficulty of parallel programming |
| is as much a human-factors issue as it is a set of technical properties of the |
| parallel programming problem. |
| This is the case because we need human beings to be able to tell parallel |
| systems what to do, and this two-way communication between human and |
| computer is as much a function of the human as it is of the computer. |
| Therefore, appeals to abstractions or to mathematical analyses will |
| necessarily be of severely limited utility. |
| |
| In the Industrial Revolution, the interface between human and machine |
| was evaluated by human-factor studies, then called time-and-motion |
| studies. |
| Although there have been a few human-factor studies examining parallel |
| programming~\cite{RyanEccles2005HPCSNovice,RyanEccles2006HPCSNoviceNeeds, |
| LorinHochstein2005SC,DuaneSzafron1994PEMPDS}, these studies have |
| been extremely narrowly focused, and hence unable to demonstrate any |
| general results. |
| Furthermore, given that the normal range of programmer productivity |
| spans more than an order of magnitude, it is unrealistic to expect |
| an affordable study to be capable of detecting (say) a 10\% difference |
| in productivity. |
| Although the multiple-order-of-magnitude differences that such studies |
| \emph{can} reliably detect are extremely valuable, the most impressive |
| improvements tend to be based on a long series of 10\% improvements. |
| |
| We must therefore take a different approach. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{intro/FourTaskCategories}} |
| \end{center} |
| \caption{Categories of Tasks Required of Parallel Programmers} |
| \label{fig:intro:Categories of Tasks Required of Parallel Programmers} |
| \end{figure} |
| |
| One such approach is to carefully consider the tasks that parallel |
| programmers must undertake that are not required of sequential programmers. |
| We can then evaluate how well a given programming language or environment |
| assists the developer with these tasks. |
| These tasks fall into the four categories shown in |
| Figure~\ref{fig:intro:Categories of Tasks Required of Parallel Programmers}, |
| each of which is covered in the following sections. |
| |
| \subsection{Work Partitioning} |
| \label{sec:intro:Work Partitioning} |
| |
| Work partitioning is absolutely required for parallel execution: |
| if there is but one ``glob'' of work, then it can be executed by at |
| most one CPU at a time, which is by definition sequential execution. |
| However, partitioning the code requires great care. |
| For example, uneven partitioning can result in sequential execution |
| once the small partitions have completed~\cite{GeneAmdahl1967AmdahlsLaw}. |
| In less extreme cases, load balancing can be used to fully utilize |
available hardware, thus improving performance and scalability.
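
Quantitatively, if a fraction $f$ of the program's execution is spread
perfectly across $n$ CPUs while the remaining $1 - f$ runs on but a
single CPU, Amdahl's Law bounds the speedup as follows:

\[
S(n) = \frac{1}{(1 - f) + f / n} \le \frac{1}{1 - f}
\]

For example, if 10\% of the execution remains sequential ($f = 0.9$),
no number of CPUs, however large, can provide a speedup greater than
ten.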
| |
| In addition, partitioning of work can complicate handling of global |
| errors and events: a parallel |
| program may need to carry out non-trivial synchronization in order |
| to safely process such global events. |
| |
| Each partition requires some sort of communication: after all, if |
| a given thread did not communicate at all, it would have no effect and |
| would thus not need to be executed. |
| However, because communication incurs overhead, careless partitioning choices |
| can result in severe performance degradation. |
| |
| Furthermore, the number of concurrent threads must often be controlled, |
| as each such thread occupies common resources, for example, |
| space in CPU caches. |
| If too many threads are permitted to execute concurrently, the |
CPU caches will overflow, resulting in high cache miss rates, which in
| turn degrades performance. |
| On the other hand, large numbers of threads are often required to |
| overlap computation and I/O. |
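
One common way of limiting concurrency is to have each thread acquire
a slot from a counting semaphore before doing its work, as in the
following minimal pthreads sketch, in which the limit of four, the
number of threads, and the \co{do_work()} function are all
illustrative placeholders:

{\small
\begin{verbatim}
#include <pthread.h>
#include <semaphore.h>

#define NTHREADS   16  /* Placeholder thread count. */
#define MAX_ACTIVE  4  /* Placeholder concurrency limit. */

static sem_t active;   /* Counts available execution slots. */

static void do_work(void *arg)
{
  /* Application-specific processing would go here. */
}

static void *worker(void *arg)
{
  sem_wait(&active);  /* Block while MAX_ACTIVE threads are busy. */
  do_work(arg);
  sem_post(&active);  /* Release the slot for another thread. */
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  long i;

  sem_init(&active, 0, MAX_ACTIVE);
  for (i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, worker, (void *)i);
  for (i = 0; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);
  sem_destroy(&active);
  return 0;
}
\end{verbatim}
}

The same effect can instead be obtained by simply creating no more than
the desired number of threads in the first place, for example, via a
thread pool.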
| |
| \QuickQuiz{} |
| What besides CPU cache capacity might require limiting the |
| number of concurrent threads? |
| \QuickQuizAnswer{ |
| There are any number of potential limits on the number of |
| threads: |
| \begin{enumerate} |
| \item Main memory. Each thread consumes some memory |
| (for its stack if nothing else), so that excessive |
| numbers of threads can exhaust memory, resulting |
| in excessive paging or memory-allocation failures. |
| \item I/O bandwidth. If each thread initiates a given |
| amount of mass-storage I/O or networking traffic, |
| excessive numbers of threads can result in excessive |
| I/O queuing delays, again degrading performance. |
| Some networking protocols may be subject to timeouts |
| or other failures if there are so many threads that |
| networking events cannot be responded to in a timely |
| fashion. |
| \item Synchronization overhead. |
| For many synchronization protocols, excessive numbers |
| of threads can result in excessive spinning, blocking, |
| or rollbacks, thus degrading performance. |
| \end{enumerate} |
| |
| Specific applications and platforms may have any number of additional |
| limiting factors. |
| } \QuickQuizEnd |
| |
| Finally, permitting threads to execute concurrently greatly increases |
| the program's state space, which can make the program difficult to |
| understand, degrading productivity. |
| All else being equal, smaller state spaces having more regular structure |
| are more easily understood, but this is a human-factors statement as much |
| as it is a technical or mathematical statement. |
| Good parallel designs might have extremely large state spaces, but |
| nevertheless be easy to understand due to their regular structure, |
| while poor designs can be impenetrable despite having a comparatively |
| small state space. |
| The best designs exploit embarrassing parallelism, or transform the |
| problem to one having an embarrassingly parallel solution. |
| In either case, ``embarrassingly parallel'' is in fact |
| an embarrassment of riches. |
| The current state of the art enumerates good designs; more work is |
| required to make more general judgments on |
| state-space size and structure. |
| |
| \subsection{Parallel Access Control} |
| \label{sec:Parallel Access Control} |
| |
| Given a sequential program with only a single thread, that single |
| thread has full access to all of the program's resources. |
| These resources are most often in-memory data structures, but can be CPUs, |
| memory (including caches), I/O devices, computational accelerators, files, |
| and much else besides. |
| |
| The first parallel-access-control issue is whether the form of the access to |
| a given resource depends on that resource's location. |
| For example, in many message-passing environments, local-variable |
| access is via expressions and assignments, |
| while remote-variable access uses an entirely different |
| syntax, usually involving messaging. |
| The POSIX Threads environment~\cite{OpenGroup1997pthreads}, |
| Structured Query Language (SQL)~\cite{DIS9075SQL92}, and |
| partitioned global address-space (PGAS) environments |
such as Unified Parallel C (UPC)~\cite{ElGhazawi2003UPC}
| offer implicit access, |
| while Message Passing Interface (MPI)~\cite{MPIForum2008} offers |
| explicit access because access to remote data requires explicit |
| messaging. |
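
The following rough MPI sketch illustrates this difference in syntax:
rank~0 assigns a local variable with an ordinary C assignment, but must
send an explicit message in order for rank~1 to see the value (error
checking omitted for brevity):

{\small
\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, x = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    x = 42;  /* Local access: an ordinary assignment. */
    /* Remote access: explicit message to rank 1. */
    MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("rank 1 received %d\n", x);
  }

  MPI_Finalize();
  return 0;
}
\end{verbatim}
}

This program must of course be run with at least two ranks, for example,
via \co{mpirun -np 2}.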
| |
| The other parallel-access-control issue is how threads coordinate |
| access to the resources. |
| This coordination is carried out by |
| the very large number of synchronization mechanisms |
| provided by various parallel languages and environments, |
| including message passing, locking, transactions, |
| reference counting, explicit timing, shared atomic variables, and data |
| ownership. |
| Many traditional parallel-programming concerns such as deadlock, |
| livelock, and transaction rollback stem from this coordination. |
| This framework can be elaborated to include comparisons |
| of these synchronization mechanisms, for example locking vs. transactional |
| memory~\cite{McKenney2007PLOSTM}, but such elaboration is beyond the |
| scope of this section. |
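
As a trivial example of such coordination, the following pthreads sketch
uses a lock to serialize updates to a shared counter; the variable names
are purely illustrative:

{\small
\begin{verbatim}
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;  /* Shared resource. */

void increment_counter(void)
{
  pthread_mutex_lock(&counter_lock);    /* Exclude other threads. */
  counter++;                            /* Critical section. */
  pthread_mutex_unlock(&counter_lock);  /* Let others proceed. */
}
\end{verbatim}
}

Forgetting the lock in even one code path accessing \co{counter}
re-introduces the very race conditions that this coordination is
intended to prevent.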
| |
| \subsection{Resource Partitioning and Replication} |
| \label{sec:Resource Partitioning and Replication} |
| |
| The most effective parallel algorithms and systems exploit resource |
| parallelism, so much so that it is |
| usually wise to begin parallelization by partitioning your write-intensive |
| resources and replicating frequently accessed read-mostly resources. |
| The resource in question is most frequently data, which might be |
| partitioned over computer systems, mass-storage devices, NUMA nodes, |
| CPU cores (or dies or hardware threads), pages, cache lines, instances |
| of synchronization primitives, or critical sections of code. |
| For example, partitioning over locking primitives is termed |
| ``data locking''~\cite{Beck85}. |
| |
Resource partitioning is frequently application dependent.
For example, numerical applications often partition matrices
by row, column, or sub-matrix, while commercial applications frequently
| partition write-intensive data structures and replicate |
| read-mostly data structures. |
| For example, a commercial application might assign the data for a |
| given customer to a given few computer systems out of a large cluster. |
| An application might statically partition data, or dynamically |
| change the partitioning over time. |
| |
| Resource partitioning is extremely effective, but |
| it can be quite challenging for complex multilinked data |
| structures. |
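
As a rough sketch of data locking, the following hash table gives each
bucket its own lock, so that operations on keys hashing to different
buckets can proceed in parallel; the types and sizes are illustrative
only:

{\small
\begin{verbatim}
#include <pthread.h>

#define NBUCKETS 1024  /* Illustrative table size. */

struct element {       /* Illustrative element type. */
  struct element *next;
  unsigned long key;
};

struct bucket {
  pthread_mutex_t lock;  /* Protects only this bucket's chain. */
  struct element *head;
};

static struct bucket hashtab[NBUCKETS];
/* Each lock must be initialized, for example via
 * pthread_mutex_init() in a setup loop, before first use. */

static struct bucket *get_bucket(unsigned long key)
{
  return &hashtab[key % NBUCKETS];
}
\end{verbatim}
}

An insertion or lookup then acquires only \co{get_bucket(key)->lock},
leaving the other 1,023 buckets available to other threads.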
| |
| \subsection{Interacting With Hardware} |
| \label{sec:Interacting With Hardware} |
| |
| Hardware interaction is normally the domain of the operating system, |
| the compiler, libraries, or other software-environment infrastructure. |
| However, developers working with novel hardware features and components |
| will often need to work directly with such hardware. |
| In addition, direct access to the hardware can be required when squeezing |
| the last drop of performance out of a given system. |
| In this case, the developer may need to tailor or configure the application |
| to the cache geometry, system topology, or interconnect protocol of the |
| target hardware. |
| |
| In some cases, hardware may be considered to be a resource which |
| may be subject to partitioning or access control, as described in |
| the previous sections. |
| |
| \subsection{Composite Capabilities} |
| \label{sec:Composite Capabilities} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{intro/FourTaskOrder}} |
| \end{center} |
| \caption{Ordering of Parallel-Programming Tasks} |
| \label{fig:intro:Ordering of Parallel-Programming Tasks} |
| \end{figure} |
| |
| Although these four capabilities are fundamental, |
| good engineering practice uses composites of |
| these capabilities. |
| For example, the data-parallel approach first |
| partitions the data so as to minimize the need for |
| inter-partition communication, partitions the code accordingly, |
| and finally maps data partitions and threads so as to maximize |
| throughput while minimizing inter-thread communication, |
| as shown in |
| Figure~\ref{fig:intro:Ordering of Parallel-Programming Tasks}. |
| The developer can then |
| consider each partition separately, greatly reducing the size |
| of the relevant state space, in turn increasing productivity. |
Of course, some problems are non-partitionable, but, on the other hand,
| clever transformations into forms permitting partitioning can greatly |
| enhance |
| both performance and scalability~\cite{PanagiotisMetaxas1999PDCS}. |
| |
| \subsection{How Do Languages and Environments Assist With These Tasks?} |
| \label{sec:intro:How Do Languages and Environments Assist With These Tasks?} |
| |
| Although many environments require that the developer deal manually |
| with these tasks, there are long-standing environments that bring |
| significant automation to bear. |
| The poster child for these environments is SQL, many implementations |
| of which automatically parallelize single large queries and also |
| automate concurrent execution of independent queries and updates. |
| |
| These four categories of tasks must be carried out in all parallel |
| programs, but that of course does not necessarily mean that the developer |
| must manually carry out these tasks. |
| We can expect to see ever-increasing automation of these four tasks |
| as parallel systems continue to become cheaper and more readily available. |
| |
| \QuickQuiz{} |
| Are there any other obstacles to parallel programming? |
| \QuickQuizAnswer{ |
| There are a great many other potential obstacles to parallel |
| programming. |
| Here are a few of them: |
| \begin{enumerate} |
| \item The only known algorithms for a given project might |
| be inherently sequential in nature. |
| In this case, either avoid parallel programming |
| (there being no law saying that your project \emph{has} |
| to run in parallel) or invent a new parallel algorithm. |
| \item The project allows binary-only plugins that share the same |
| address space, such that no one developer has access to |
| all of the source code for the project. |
| Because many parallel bugs, including deadlocks, are |
| global in nature, such binary-only plugins pose a severe |
| challenge to current software development methodologies. |
| This might well change, but for the time being, all |
| developers of parallel code sharing a given address space |
| need to be able to see \emph{all} of the code running in |
| that address space. |
| \item The project contains heavily used APIs that were designed |
| without regard to |
| parallelism~\cite{HagitAttiya2011LawsOfOrder}. |
| Some of the more ornate features of the System V |
| message-queue API form a case in point. |
| Of course, if your project has been around for a few |
| decades, and if its developers did not have access to |
| parallel hardware, your project undoubtedly has at least |
| its share of such APIs. |
| \item The project was implemented without regard to parallelism. |
| Given that there are a great many techniques that work |
| extremely well in a sequential environment, but that |
| fail miserably in parallel environments, if your project |
| ran only on sequential hardware for most of its lifetime, |
then your project undoubtedly has at least its share of
| parallel-unfriendly code. |
| \item The project was implemented without regard to good |
| software-development practice. |
| The cruel truth is that shared-memory parallel |
| environments are often much less forgiving of sloppy |
| development practices than are sequential environments. |
| You may be well-served to clean up the existing design |
| and code prior to attempting parallelization. |
| \item The people who originally did the development on your |
| project have since moved on, and the people remaining, |
| while well able to maintain it or add small features, |
| are unable to make ``big animal'' changes. |
| In this case, unless you can work out a very simple |
| way to parallelize your project, you will probably |
| be best off leaving it sequential. |
| That said, there are a number of simple approaches that |
| you might use |
| to parallelize your project, including running multiple |
| instances of it, using a parallel implementation of |
| some heavily used library function, or making use of |
| some other parallel project, such as a database. |
| \end{enumerate} |
| |
| One can argue that many of these obstacles are non-technical |
| in nature, but that does not make them any less real. |
| In short, parallelization of a large body of code |
| can be a large and complex effort. |
| As with any large and complex effort, it makes sense to |
| do your homework beforehand. |
| } \QuickQuizEnd |
| |
| \section{Guide to This Book} |
| \label{sec:intro:Guide to This Book} |
| |
| This book is not a collection of optimal algorithms with tiny areas of |
| applicability; instead, it is a handbook of widely applicable and heavily |
| used techniques. |
| We of course could not resist the urge to include some of our favorites |
| that have not (yet!) passed the test of time (what author could?), but |
| we have nonetheless gritted our teeth and banished our darlings to |
| appendices. |
| Perhaps in time, some of them will see enough use that we can promote |
| them into the main body of the text. |
| |
| % \emph{@@@ More here. Sections. Layered Approach. Appendices. |
| % Glossary. Bibliography.} |
| |
| \subsection{Quick Quizzes} |
| |
| ``Quick quizzes'' appear throughout this book. |
Some of these quizzes are based on the section in which the quick quiz
appears, but others require you to think beyond that section, and,
| in some cases, beyond the entire book. |
| As with most endeavors, what you get out of this book is largely |
| determined by what you are willing to put into it. |
| Therefore, readers who invest some time into these quizzes will |
| find their effort repaid handsomely with increased understanding |
| of parallel programming. |
| |
| Answers to the quizzes may be found |
| in |
| Appendix~\ref{chp:Answers to Quick Quizzes} starting on |
| page~\pageref{chp:Answers to Quick Quizzes}. |
| |
| \QuickQuiz{} |
| Where are the answers to the Quick Quizzes found? |
| \QuickQuizAnswer{ |
| In Appendix~\ref{chp:Answers to Quick Quizzes} starting on |
| page~\pageref{chp:Answers to Quick Quizzes}. |
| |
| Hey, I thought I owed you an easy one! |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Some of the Quick Quiz questions seem to be from the viewpoint |
| of the reader rather than the author. |
| Is that really the intent? |
| \QuickQuizAnswer{ |
| Indeed it is! |
| Many are questions that Paul E. McKenney would probably have |
| asked if he was a novice student in a class covering this material. |
| It is worth noting that Paul was taught most of this material by |
| parallel hardware and software, not by professors. |
| In Paul's experience, professors are much more likely to provide |
| answers to verbal questions than are parallel systems, |
| Watson notwithstanding. |
| Of course, we could have a lengthy debate over which of professors |
| or parallel systems provide the most useful answers to these sorts |
| of questions, |
| but for the time being let's just agree that usefulness of |
| answers varies widely across the population both of professors |
| and of parallel systems. |
| |
| Other quizzes are quite similar to actual questions that have been |
| asked during conference presentations and lectures covering the |
| material in this book. |
| A few others are from the viewpoint of the author. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| These Quick Quizzes just are not my cup of tea. |
| What do you recommend? |
| \QuickQuizAnswer{ |
| There are a number of alternatives available to you: |
| \begin{enumerate} |
| \item Just ignore the Quick Quizzes and read the rest of |
| the book. |
| You might miss out on the interesting material in |
| some of the Quick Quizzes, but the rest of the book |
| has lots of good material as well. |
| This is an eminently reasonable approach if your main |
| goal is to gain a general understanding of the material |
or if you are skimming through the book to find a
| solution to a specific problem. |
| \item If you find the Quick Quizzes distracting but impossible |
| to ignore, you can always clone the \LaTeX{} source for |
| this book from the git archive. |
| Then modify \co{Makefile} and \co{qqz.sty} to eliminate |
| the Quick Quizzes from the PDF output. |
| Alternatively, you could modify these two files so as
| to pull the answers inline, immediately following
| the questions.
| (A hypothetical sketch of this sort of change appears
| just after this Quick Quiz in
| Section~\ref{sec:intro:Guide to This Book}.)
| \item Look at the answer immediately rather than investing |
| a large amount of time in coming up with your own |
| answer. |
| This approach is reasonable when a given Quick Quiz's |
| answer holds the key to a specific problem you are |
| trying to solve. |
| This approach is also reasonable if you want a somewhat
| deeper understanding of the material, but do not
| expect to be called upon to generate parallel solutions given
| only a blank sheet of paper.
| \item If you prefer a more academic and rigorous treatment of |
| parallel programming, |
| you might like Herlihy and Shavit's
| textbook~\cite{HerlihyShavit2008Textbook}. |
| This book starts with an interesting combination |
| of low-level primitives at high levels of abstraction |
| from the hardware, and works its way through locking |
| and simple data structures including lists, queues, |
| hash tables, and counters, culminating with transactional |
| memory. |
| Michael Scott's textbook~\cite{MichaelScott2013Textbook} |
| approaches similar material with more of a |
| software-engineering focus, and, as far as I know, is
| the first formally published textbook to include a |
| section devoted to RCU. |
| \item If you would like an academic treatment of parallel |
| programming from a programming-language-pragmatics viewpoint, |
| you might be interested in the concurrency chapter from Scott's |
| textbook~\cite{MichaelScott2006Textbook} |
| on programming languages. |
| \item If you are interested in an object-oriented patternist |
| treatment of parallel programming focusing on C++,
| you might try Volumes~2 and 4 of Schmidt's POSA |
| series~\cite{SchmidtStalRohnertBuschmann2000v2Textbook, |
| BuschmannHenneySchmidt2007v4Textbook}. |
| Volume 4 in particular has some interesting chapters |
| applying this work to a warehouse application. |
| The realism of this example is attested to by |
| the section entitled ``Partitioning the Big Ball of Mud'', |
| wherein the problems inherent in parallelism often |
| take a back seat to the problems inherent in getting |
| one's head around a real-world application. |
| \item If your primary focus is scientific and technical computing, |
| and you prefer a patternist approach, |
| you might try Mattson et al.'s |
| textbook~\cite{Mattson2005Textbook}. |
| It covers Java, C/C++, OpenMP, and MPI. |
| Its patterns are admirably focused first on design, |
| then on implementation. |
| \item If you are interested in POSIX Threads, you might take |
| a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}. |
| \item If you are interested in C++, but in a Windows environment, |
| you might try Herb Sutter's ``Effective Concurrency'' |
| series in |
| Dr.~Dobb's Journal~\cite{HerbSutter2008EffectiveConcurrency}.
| This series does a reasonable job of presenting a |
| commonsense approach to parallelism. |
| \item If you want to try out Intel Threading Building Blocks, |
| then perhaps James Reinders's book~\cite{Reinders2007Textbook} |
| is what you are looking for. |
| \item Finally, those preferring to work in Java might be |
| well-served by Doug Lea's |
| textbooks~\cite{DougLea1997Textbook,Goetz2007Textbook}. |
| \end{enumerate} |
| In contrast, this book meshes real-world machines with real-world |
| algorithms. |
| If your sole goal is to find (say) an optimal parallel queue, you |
| might be better served by one of the above books. |
| However, if you are interested in principles of parallel design |
| that allow multiple such queues to operate in parallel, read on! |
| |
| Coming back to the topic of Quick Quizzes, if you need a deep |
| understanding of the material, then you might well need to |
| learn to tolerate the Quick Quizzes. |
| Don't get me wrong: passively reading the material can be quite
| valuable, but gaining full problem-solving capability really |
| does require that you practice solving problems. |
| |
| I learned this the hard way during coursework for my late-in-life |
| Ph.D. |
| I was studying a familiar topic, and was surprised at how few of |
| the chapter's exercises I could solve off the top of my head. |
| Forcing myself to answer the questions greatly increased my |
| retention of the material. |
| So with these Quick Quizzes I am not asking you to do anything |
| that I have not been doing myself! |
| } \QuickQuizEnd |
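| 
| The answer to the preceding Quick Quiz mentions the possibility of
| modifying \co{Makefile} and \co{qqz.sty} so as to pull the Quick Quiz
| answers inline.
| The following is a hypothetical sketch of such a change, which assumes
| that each answer is supplied as the argument of a single macro.
| This assumption is made purely for purposes of illustration, and the
| actual contents of \co{qqz.sty} may well differ:
| {\scriptsize
| \begin{verbatim}
| % Hypothetical sketch only: the real macros in qqz.sty may differ.
| % Typeset each answer immediately after its question rather than
| % deferring it to the appendix.
| \renewcommand{\QuickQuizAnswer}[1]{%
|   \par\noindent\emph{Answer:} #1\par}
| \end{verbatim}
| }
| The remaining Quick Quiz macros, along with any relevant \co{Makefile}
| rules, would likely need corresponding adjustments.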
| |
| \subsection{Sample Source Code} |
| |
| This book discusses its fair share of source code, and in many cases |
| this source code may be found in the \co{CodeSamples} directory |
| of this book's git tree. |
| For example, on UNIX systems, you should be able to type: |
| {\scriptsize |
| \begin{verbatim} |
| find CodeSamples -name rcu_rcpls.c -print |
| \end{verbatim} |
| } |
| to locate the file \co{rcu_rcpls.c}, which is called out in |
| Section~\ref{sec:defer:``Toy'' RCU Implementations}. |
| Other types of systems have well-known ways of locating files by |
| filename. |
| |
| \begin{figure*}[tbp] |
| { |
| \begin{verbatim} |
| 1 git clone git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git |
| 2 cd perfbook |
| 3 make |
| 4 evince perfbook.pdf |
| \end{verbatim} |
| } |
| \caption{Creating an Up-To-Date PDF} |
| \label{fig:intro:Creating a Up-To-Date PDF} |
| \end{figure*} |
| |
| \begin{figure*}[tbp] |
| { |
| \begin{verbatim} |
| 1 git remote update |
| 2 git checkout origin/master |
| 3 make |
| 4 evince perfbook.pdf |
| \end{verbatim} |
| } |
| \caption{Generating an Updated PDF} |
| \label{fig:intro:Generating an Updated PDF} |
| \end{figure*} |
| |
| The source to this book may be found in the \co{git} archive at |
| \url{git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git}, |
| and \co{git} itself is available as part of most mainstream Linux |
| distributions. |
| To clone a current \LaTeX{} source tree of this book and build and
| display the corresponding PDF,
| use the sequence of Linux commands shown in
| Figure~\ref{fig:intro:Creating a Up-To-Date PDF}.
| In some environments, the \co{evince} that displays \co{perfbook.pdf} |
| may need to be replaced, for example, with \co{acroread}. |
| The \co{git clone} command need only be used the first time you
| create a PDF; subsequently, you can run the commands shown in
| Figure~\ref{fig:intro:Generating an Updated PDF} to pull in any updates |
| and generate an updated PDF. |
| The commands in |
| Figure~\ref{fig:intro:Generating an Updated PDF} |
| must be run within the \co{perfbook} directory created by the commands |
| shown in |
| Figure~\ref{fig:intro:Creating a Up-To-Date PDF}. |
| |
| PDFs of this book are sporadically posted at |
| \url{http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html} |
| and at |
| \url{http://www.rdrop.com/users/paulmck/perfbook/}. |