| % toolsoftrade/toolsoftrade.tex |
| |
| \QuickQuizChapter{chp:Tools of the Trade}{Tools of the Trade} |
| |
| This chapter provides a brief introduction to some basic tools of the |
| parallel-programming trade, focusing mainly on those available to |
| user applications running on operating systems similar to Linux. |
| Section~\ref{sec:toolsoftrade:Scripting Languages} begins with |
| scripting languages, |
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}
describes the multi-process parallelism supported by the POSIX API,
Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}
touches on POSIX threads,
| and finally, |
| Section~\ref{sec:toolsoftrade:Atomic Operations} |
| describes atomic operations. |
| |
| Please note that this chapter provides but a brief introduction. |
| More detail is available from the references cited, and more information |
| on how best to use these tools will be provided in later chapters. |
| |
| \section{Scripting Languages} |
| \label{sec:toolsoftrade:Scripting Languages} |
| |
| The Linux shell scripting languages provide simple but effective ways |
| of managing parallelism. |
| For example, suppose that you had a program \co{compute_it} |
| that you needed to run twice with two different sets of arguments. |
| This can be accomplished using UNIX shell scripting as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 compute_it 1 > compute_it.1.out & |
| 2 compute_it 2 > compute_it.2.out & |
| 3 wait |
| 4 cat compute_it.1.out |
| 5 cat compute_it.2.out |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{toolsoftrade/shellparallel}} |
| \end{center} |
| \caption{Execution Diagram for Parallel Shell Execution} |
\label{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}
\end{figure}

Lines~1 and 2 launch two instances of this program, redirecting their
output to two separate files, with the \co{&} character directing the
shell to run the two instances of the program in the background.
Line~3 waits for both instances to complete, and lines~4 and 5
display their output.
The resulting execution is as shown in
Figure~\ref{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}:
| the two instances of \co{compute_it} execute in parallel, |
| \co{wait} completes after both of them do, and then the two instances |
| of \co{cat} execute sequentially. |
| % @@@ Maui scheduler, load balancing, BOINC, and so on. |
| % @@@ Diagram showing parallel execution. |
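The same pattern extends to any number of instances.
The following sketch substitutes a small shell function for the
hypothetical \co{compute_it} program, runs four instances in the
background, waits for all of them, and then displays their output in
order:

```shell
#!/bin/sh
# Run N background instances of a job, wait for all of them, then
# collect their output in order.  compute_it is simulated here by a
# shell function; substitute the real program as needed.
compute_it() {
  sleep 1
  echo "result for argument $1"
}

N=4
i=1
while [ "$i" -le "$N" ]
do
  compute_it "$i" > "compute_it.$i.out" &  # run in the background
  i=$((i + 1))
done
wait                                       # wait for all children
nfiles=$(ls compute_it.*.out | wc -l)
i=1
while [ "$i" -le "$N" ]
do
  cat "compute_it.$i.out"                  # display output in order
  i=$((i + 1))
done
rm -f compute_it.*.out
```

Here the \co{wait} command again serves as the join operation of this
simple fork-join execution model.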
| |
| \QuickQuiz{} |
| But this silly shell script isn't a \emph{real} parallel program! |
| Why bother with such trivia??? |
| \QuickQuizAnswer{ |
| Because you should \emph{never} forget the simple stuff! |
| |
| Please keep in mind that the title of this book is |
| ``Is Parallel Programming Hard, And, If So, What Can You Do About It?''. |
| One of the most effective things you can do about it is to |
| avoid forgetting the simple stuff! |
| After all, if you choose to do parallel programming the hard |
| way, you have no one but yourself to blame. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Is there a simpler way to create a parallel shell script? |
| If so, how? If not, why not? |
| \QuickQuizAnswer{ |
| One straightforward approach is the shell pipeline: |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| grep $pattern1 | sed -e 's/a/b/' | sort |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| For a sufficiently large input file, |
| \co{grep} will pattern-match in parallel with \co{sed} |
| editing and with the input processing of \co{sort}. |
| See the file \co{parallel.sh} for a demonstration of |
| shell-script parallelism and pipelining. |
| } \QuickQuizEnd |
| |
| For another example, the \co{make} software-build scripting language |
| provides a \co{-j} option that specifies how much parallelism should be |
| introduced into the build process. |
| For example, typing \co{make -j4} when building a Linux kernel |
| specifies that up to four parallel compiles be carried out concurrently. |
| |
| It is hoped that these simple examples convince you that parallel |
| programming need not always be complex or difficult. |
| |
| \QuickQuiz{} |
| But if script-based parallel programming is so easy, why |
| bother with anything else? |
| \QuickQuizAnswer{ |
| In fact, it is quite likely that a very large fraction of |
| parallel programs in use today are script-based. |
| However, script-based parallelism does have its limitations: |
| \begin{enumerate} |
| \item Creation of new processes is usually quite heavyweight, |
| involving the expensive \co{fork()} and \co{exec()} |
| system calls. |
| \item Sharing of data, including pipelining, typically involves |
| expensive file I/O. |
| \item The reliable synchronization primitives available to |
| scripts also typically involve expensive file I/O. |
| \end{enumerate} |
| These limitations require that script-based parallelism use |
| coarse-grained parallelism, with each unit of work having |
| execution time of at least tens of milliseconds, and preferably |
| much longer. |
| |
| Those requiring finer-grained parallelism are well advised to |
| think hard about their problem to see if it can be expressed |
| in a coarse-grained form. |
| If not, they should consider using other parallel-programming |
| environments, such as those discussed in |
| Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}. |
| } \QuickQuizEnd |
| |
| \section{POSIX Multiprocessing} |
| \label{sec:toolsoftrade:POSIX Multiprocessing} |
| |
| This section scratches the surface of the |
| POSIX environment, including pthreads~\cite{OpenGroup1997pthreads}, |
| as this environment is readily available and widely implemented. |
| Section~\ref{sec:toolsoftrade:POSIX Process Creation and Destruction} |
| provides a glimpse of the POSIX \co{fork()} and related primitives, |
| Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction} |
| touches on thread creation and destruction, |
| Section~\ref{sec:toolsoftrade:POSIX Locking} gives a brief overview |
| of POSIX locking, and, finally, |
| Section~\ref{sec:toolsoftrade:Linux-Kernel Equivalents to POSIX Operations} |
| presents the analogous operations within the Linux kernel. |
| |
| \subsection{POSIX Process Creation and Destruction} |
| \label{sec:toolsoftrade:POSIX Process Creation and Destruction} |
| |
Processes are created using the \co{fork()} primitive, may be
destroyed using the \co{kill()} primitive, and may destroy
themselves using the \co{exit()} primitive.
| A process executing a \co{fork()} primitive is said to be the ``parent'' |
| of the newly created process. |
| A parent may wait on its children using the \co{wait()} primitive. |
| |
| Please note that the examples in this section are quite simple. |
| Real-world applications using these primitives might need to manipulate |
| signals, file descriptors, shared memory segments, and any number of |
| other resources. |
| In addition, some applications need to take specific actions if a given |
| child terminates, and might also need to be concerned with the reason |
| that the child terminated. |
| These concerns can of course add substantial complexity to the code. |
| For more information, see any of a number of textbooks on the |
| subject~\cite{WRichardStevens1992}. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 pid = fork(); |
| 2 if (pid == 0) { |
| 3 /* child */ |
| 4 } else if (pid < 0) { |
| 5 /* parent, upon error */ |
| 6 perror("fork"); |
| 7 exit(-1); |
| 8 } else { |
| 9 /* parent, pid == child ID */ |
| 10 } |
| \end{verbatim} |
| } |
| \caption{Using the fork() Primitive} |
| \label{fig:toolsoftrade:Using the fork() Primitive} |
| \end{figure} |
| |
| If \co{fork()} succeeds, it returns twice, once for the parent |
| and again for the child. |
| The value returned from \co{fork()} allows the caller to tell |
| the difference, as shown in |
| Figure~\ref{fig:toolsoftrade:Using the fork() Primitive} |
| (\co{forkjoin.c}). |
| Line~1 executes the \co{fork()} primitive, and saves its return value |
| in local variable \co{pid}. |
| Line~2 checks to see if \co{pid} is zero, in which case, this is the |
| child, which continues on to execute line~3. |
| As noted earlier, the child may terminate via the \co{exit()} primitive. |
| Otherwise, this is the parent, which checks for an error return from |
| the \co{fork()} primitive on line~4, and prints an error and exits |
| on lines~5-7 if so. |
| Otherwise, the \co{fork()} has executed successfully, and the parent |
| therefore executes line 9 with the variable \co{pid} containing the |
| process ID of the child. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 void waitall(void) |
| 2 { |
| 3 int pid; |
| 4 int status; |
| 5 |
| 6 for (;;) { |
| 7 pid = wait(&status); |
| 8 if (pid == -1) { |
| 9 if (errno == ECHILD) |
| 10 break; |
| 11 perror("wait"); |
| 12 exit(-1); |
| 13 } |
| 14 } |
| 15 } |
| \end{verbatim} |
| } |
| \caption{Using the wait() Primitive} |
| \label{fig:toolsoftrade:Using the wait() Primitive} |
| \end{figure} |
| |
| The parent process may use the \co{wait()} primitive to wait for its children |
| to complete. |
| However, use of this primitive is a bit more complicated than its shell-script |
| counterpart, as each invocation of \co{wait()} waits for but one child |
| process. |
| It is therefore customary to wrap \co{wait()} into a function similar |
| to the \co{waitall()} function shown in |
| Figure~\ref{fig:toolsoftrade:Using the wait() Primitive} |
| (\co{api-pthread.h}), |
| this \co{waitall()} function having semantics similar to the |
| shell-script \co{wait} command. |
| Each pass through the loop spanning lines~6-15 waits on one child process. |
| Line~7 invokes the \co{wait()} primitive, which blocks until a child process |
| exits, and returns that child's process ID. |
| If the process ID is instead -1, this indicates that the \co{wait()} |
| primitive was unable to wait on a child. |
| If so, line~9 checks for the \co{ECHILD} errno, which indicates that there |
| are no more child processes, so that line~10 exits the loop. |
| Otherwise, lines~11 and 12 print an error and exit. |
| |
| \QuickQuiz{} |
| Why does this \co{wait()} primitive need to be so complicated? |
| Why not just make it work like the shell-script \co{wait} does? |
| \QuickQuizAnswer{ |
| Some parallel applications need to take special action when |
| specific children exit, and therefore need to wait for each |
| child individually. |
| In addition, some parallel applications need to detect the |
| reason that the child died. |
| As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive}, |
| it is not hard to build a \co{waitall()} function out of |
| the \co{wait()} function, but it would be impossible to |
| do the reverse. |
| Once the information about a specific child is lost, it is lost. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 int x = 0; |
| 2 int pid; |
| 3 |
| 4 pid = fork(); |
| 5 if (pid == 0) { /* child */ |
| 6 x = 1; |
| 7 printf("Child process set x=1\n"); |
| 8 exit(0); |
| 9 } |
| 10 if (pid < 0) { /* parent, upon error */ |
| 11 perror("fork"); |
| 12 exit(-1); |
| 13 } |
| 14 waitall(); |
| 15 printf("Parent process sees x=%d\n", x); |
| \end{verbatim} |
| } |
| \caption{Processes Created Via fork() Do Not Share Memory} |
| \label{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory} |
| \end{figure} |
| |
| It is critically important to note that the parent and child do \emph{not} |
| share memory. |
| This is illustrated by the program shown in |
| Figure~\ref{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory} |
| (\co{forkjoinvar.c}), |
| in which the child sets a global variable \co{x} to 1 on line~6, |
| prints a message on line~7, and exits on line~8. |
| The parent continues at line~14, where it waits on the child, |
| and on line~15 finds that its copy of the variable \co{x} is still zero. |
| The output is thus as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Child process set x=1 |
| Parent process sees x=0 |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| Isn't there a lot more to \co{fork()} and \co{wait()} |
| than discussed here? |
| \QuickQuizAnswer{ |
| Indeed there is, and |
| it is quite possible that this section will be expanded in |
| future versions to include messaging features (such as UNIX |
| pipes, TCP/IP, and shared file I/O) and memory mapping |
| (such as \co{mmap()} and \co{shmget()}). |
| In the meantime, there are any number of textbooks that cover |
| these primitives in great detail, |
| and the truly motivated can read manpages, existing parallel |
| applications using these primitives, as well as the |
| source code of the Linux-kernel implementations themselves. |
| } \QuickQuizEnd |
| |
| The finest-grained parallelism requires shared memory, and |
| this is covered in |
| Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}. |
| That said, shared-memory parallelism can be significantly more complex |
| than fork-join parallelism. |
| |
| \subsection{POSIX Thread Creation and Destruction} |
| \label{sec:toolsoftrade:POSIX Thread Creation and Destruction} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 int x = 0; |
| 2 |
| 3 void *mythread(void *arg) |
| 4 { |
| 5 x = 1; |
| 6 printf("Child process set x=1\n"); |
| 7 return NULL; |
| 8 } |
| 9 |
| 10 int main(int argc, char *argv[]) |
| 11 { |
| 12 pthread_t tid; |
| 13 void *vp; |
| 14 |
| 15 if (pthread_create(&tid, NULL, |
| 16 mythread, NULL) != 0) { |
| 17 perror("pthread_create"); |
| 18 exit(-1); |
| 19 } |
| 20 if (pthread_join(tid, &vp) != 0) { |
| 21 perror("pthread_join"); |
| 22 exit(-1); |
| 23 } |
| 24 printf("Parent process sees x=%d\n", x); |
| 25 return 0; |
| 26 } |
| \end{verbatim} |
| } |
| \caption{Threads Created Via {\tt pthread\_create()} Share Memory} |
| \label{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory} |
| \end{figure} |
| |
| To create a thread within an existing process, invoke the |
| \co{pthread_create()} primitive, for example, as shown on lines~15 |
| and~16 of |
| Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory} |
| (\co{pcreate.c}). |
| The first argument is a pointer to a \co{pthread_t} in which to store the |
| ID of the thread to be created, the second \co{NULL} argument is a pointer |
| to an optional \co{pthread_attr_t}, the third argument is the function |
(in this case, \co{mythread()})
that is to be invoked by the new thread, and the last \co{NULL} argument
is the argument that will be passed to \co{mythread()}.
| |
| In this example, \co{mythread()} simply returns, but it could instead |
| call \co{pthread_exit()}. |
| |
| \QuickQuiz{} |
| If the \co{mythread()} function in |
| Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory} |
| can simply return, why bother with \co{pthread_exit()}? |
| \QuickQuizAnswer{ |
| In this simple example, there is no reason whatsoever. |
| However, imagine a more complex example, where \co{mythread()} |
| invokes other functions, possibly separately compiled. |
| In such a case, \co{pthread_exit()} allows these other functions |
| to end the thread's execution without having to pass some sort |
| of error return all the way back up to \co{mythread()}. |
| } \QuickQuizEnd |
| |
| The \co{pthread_join()} primitive, shown on line~20, is analogous to |
| the fork-join \co{wait()} primitive. |
| It blocks until the thread specified by the \co{tid} variable completes |
| execution, either by invoking \co{pthread_exit()} or by returning from |
| the thread's top-level function. |
| The thread's exit value will be stored through the pointer passed as the |
| second argument to \co{pthread_join()}. |
| The thread's exit value is either the value passed to \co{pthread_exit()} |
| or the value returned by the thread's top-level function, depending on |
| how the thread in question exits. |
| |
| The program shown in |
| Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory} |
| produces output as follows, demonstrating that memory is in fact |
| shared between the two threads: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Child process set x=1 |
| Parent process sees x=1 |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Note that this program carefully makes sure that only one of the threads |
| stores a value to variable \co{x} at a time. |
| Any situation in which one thread might be storing a value to a given |
| variable while some other thread either loads from or stores to that |
| same variable is termed a ``data race''. |
| Because the C language makes no guarantee that the results of a data race |
| will be in any way reasonable, we need some way of safely accessing |
| and modifying data concurrently, such as the locking primitives discussed |
| in the following section. |
| |
| \QuickQuiz{} |
If the C language makes no guarantees in the presence of a data
| race, then why does the Linux kernel have so many data races? |
| Are you trying to tell me that the Linux kernel is completely |
| broken??? |
| \QuickQuizAnswer{ |
| Ah, but the Linux kernel is written in a carefully selected |
| superset of the C language that includes special gcc |
| extensions, such as asms, that permit safe execution even |
in the presence of data races.
| In addition, the Linux kernel does not run on a number of |
| platforms where data races would be especially problematic. |
| For an example, consider embedded systems with 32-bit pointers |
| and 16-bit busses. |
| On such a system, a data race involving a store to and a load |
| from a given pointer might well result in the load returning the |
| low-order 16 bits of the old value of the pointer concatenated |
| with the high-order 16 bits of the new value of the pointer. |
| } \QuickQuizEnd |
| |
| \subsection{POSIX Locking} |
| \label{sec:toolsoftrade:POSIX Locking} |
| |
| The POSIX standard allows the programmer to avoid data races via |
| ``POSIX locking''. |
| POSIX locking features a number of primitives, the most fundamental |
| of which are \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()}. |
| These primitives operate on locks, which are of type \co{pthread_mutex_t}. |
| These locks may be declared statically and initialized with |
| \co{PTHREAD_MUTEX_INITIALIZER}, or they may be allocated dynamically |
| and initialized using the \co{pthread_mutex_init()} primitive. |
| The demonstration code in this section will take the former course. |
| |
| The \co{pthread_mutex_lock()} primitive ``acquires'' the specified lock, |
| and the \co{pthread_mutex_unlock()} ``releases'' the specified lock. |
| Because these are ``exclusive'' locking primitives, |
only one thread may ``hold'' a given lock at a given time.
| For example, if a pair of threads attempt to acquire the same lock |
| concurrently, one of the pair will be ``granted'' the lock first, and |
| the other will wait until the first thread releases the lock. |
| |
| \QuickQuiz{} |
| What if I want several threads to hold the same lock at the |
| same time? |
| \QuickQuizAnswer{ |
| The first thing you should do is to ask yourself why you would |
| want to do such a thing. |
| If the answer is ``because I have a lot of data that is read |
| by many threads, and only occasionally updated'', then |
| POSIX reader-writer locks might be what you are looking for. |
| These are introduced in |
| Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}. |
| |
| Another way to get the effect of multiple threads holding |
| the same lock is for one thread to acquire the lock, and |
| then use \co{pthread_create()} to create the other threads. |
| The question of why this would ever be a good idea is left |
| to the reader. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; |
| 2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; |
| 3 int x = 0; |
| 4 |
| 5 void *lock_reader(void *arg) |
| 6 { |
| 7 int i; |
| 8 int newx = -1; |
| 9 int oldx = -1; |
| 10 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg; |
| 11 |
| 12 if (pthread_mutex_lock(pmlp) != 0) { |
| 13 perror("lock_reader:pthread_mutex_lock"); |
| 14 exit(-1); |
| 15 } |
| 16 for (i = 0; i < 100; i++) { |
| 17 newx = ACCESS_ONCE(x); |
| 18 if (newx != oldx) { |
| 19 printf("lock_reader(): x = %d\n", newx); |
| 20 } |
| 21 oldx = newx; |
| 22 poll(NULL, 0, 1); |
| 23 } |
| 24 if (pthread_mutex_unlock(pmlp) != 0) { |
| 25 perror("lock_reader:pthread_mutex_unlock"); |
| 26 exit(-1); |
| 27 } |
| 28 return NULL; |
| 29 } |
| 30 |
| 31 void *lock_writer(void *arg) |
| 32 { |
| 33 int i; |
| 34 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg; |
| 35 |
| 36 if (pthread_mutex_lock(pmlp) != 0) { |
| 37 perror("lock_reader:pthread_mutex_lock"); |
| 38 exit(-1); |
| 39 } |
| 40 for (i = 0; i < 3; i++) { |
| 41 ACCESS_ONCE(x)++; |
| 42 poll(NULL, 0, 5); |
| 43 } |
| 44 if (pthread_mutex_unlock(pmlp) != 0) { |
| 45 perror("lock_reader:pthread_mutex_unlock"); |
| 46 exit(-1); |
| 47 } |
| 48 return NULL; |
| 49 } |
| \end{verbatim} |
| } |
| \caption{Demonstration of Exclusive Locks} |
| \label{fig:toolsoftrade:Demonstration of Exclusive Locks} |
| \end{figure} |
| |
| This exclusive-locking property is demonstrated using the code shown in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks} |
| (\co{lock.c}). |
| Line~1 defines and initializes a POSIX lock named \co{lock_a}, while |
| line~2 similarly defines and initializes a lock named \co{lock_b}. |
Line~3 defines and initializes a shared variable \co{x}.
| |
Lines~5-28 define a function \co{lock_reader()} which repeatedly
| reads the shared variable \co{x} while holding |
| the lock specified by \co{arg}. |
| Line~10 casts \co{arg} to a pointer to a \co{pthread_mutex_t}, as |
| required by the \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()} |
| primitives. |
| |
| \QuickQuiz{} |
| Why not simply make the argument to \co{lock_reader()} |
| on line~5 of |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks} |
| be a pointer to a \co{pthread_mutex_t}? |
| \QuickQuizAnswer{ |
| Because we will need to pass \co{lock_reader()} to |
| \co{pthread_create()}. |
| Although we could cast the function when passing it to |
| \co{pthread_create()}, function casts are quite a bit |
| uglier and harder to get right than are simple pointer casts. |
| } \QuickQuizEnd |
| |
| Lines~12-15 acquire the specified \co{pthread_mutex_t}, checking |
| for errors and exiting the program if any occur. |
| Lines~16-23 repeatedly check the value of \co{x}, printing the new value |
| each time that it changes. |
| Line~22 sleeps for one millisecond, which allows this demonstration |
| to run nicely on a uniprocessor machine. |
Lines~24-27 release the \co{pthread_mutex_t}, again checking for
| errors and exiting the program if any occur. |
| Finally, line~28 returns \co{NULL}, again to match the function type |
| required by \co{pthread_create()}. |
| |
| \QuickQuiz{} |
| Writing four lines of code for each acquisition and release |
| of a \co{pthread_mutex_t} sure seems painful! |
| Isn't there a better way? |
| \QuickQuizAnswer{ |
| Indeed! |
| And for that reason, the \co{pthread_mutex_lock()} and |
| \co{pthread_mutex_unlock()} primitives are normally wrapped |
| in functions that do this error checking. |
Later on, we will wrap them with the Linux-kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
| } \QuickQuizEnd |
| |
| Lines~31-49 of |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks} |
show \co{lock_writer()}, which
periodically updates the shared variable \co{x} while holding the
specified \co{pthread_mutex_t}.
As with \co{lock_reader()}, line~34 casts \co{arg} to a pointer
to \co{pthread_mutex_t}, and lines~36-39 acquire the specified lock.
While holding the lock, lines~40-43 increment the shared variable \co{x},
sleeping for five milliseconds between each increment.
Finally, lines~44-47 release the lock.
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 printf("Creating two threads using same lock:\n"); |
| 2 if (pthread_create(&tid1, NULL, |
| 3 lock_reader, &lock_a) != 0) { |
| 4 perror("pthread_create"); |
| 5 exit(-1); |
| 6 } |
| 7 if (pthread_create(&tid2, NULL, |
| 8 lock_writer, &lock_a) != 0) { |
| 9 perror("pthread_create"); |
| 10 exit(-1); |
| 11 } |
| 12 if (pthread_join(tid1, &vp) != 0) { |
| 13 perror("pthread_join"); |
| 14 exit(-1); |
| 15 } |
| 16 if (pthread_join(tid2, &vp) != 0) { |
| 17 perror("pthread_join"); |
| 18 exit(-1); |
| 19 } |
| \end{verbatim} |
| } |
| \caption{Demonstration of Same Exclusive Lock} |
| \label{fig:toolsoftrade:Demonstration of Same Exclusive Lock} |
| \end{figure} |
| |
| Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock} |
| shows a code fragment that runs \co{lock_reader()} and |
\co{lock_writer()} as threads using the same lock, namely, \co{lock_a}.
Lines~2-6 create a thread running \co{lock_reader()}, and then
lines~7-11 create a thread running \co{lock_writer()}.
| Lines~12-19 wait for both threads to complete. |
| The output of this code fragment is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Creating two threads using same lock: |
| lock_reader(): x = 0 |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Because both threads are using the same lock, the \co{lock_reader()} |
| thread cannot see any of the intermediate values of \co{x} produced |
| by \co{lock_writer()} while holding the lock. |
| |
| \QuickQuiz{} |
| Is ``x = 0'' the only possible output from the code fragment |
| shown in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}? |
| If so, why? |
| If not, what other output could appear, and why? |
| \QuickQuizAnswer{ |
| No. |
| The reason that ``x = 0'' was output was that \co{lock_reader()} |
| acquired the lock first. |
| Had \co{lock_writer()} instead acquired the lock first, then |
| the output would have been ``x = 3''. |
| However, because the code fragment started \co{lock_reader()} first |
| and because this run was performed on a multiprocessor, |
| one would normally expect \co{lock_reader()} to acquire the |
| lock first. |
| However, there are no guarantees, especially on a busy system. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 printf("Creating two threads w/different locks:\n"); |
| 2 x = 0; |
| 3 if (pthread_create(&tid1, NULL, |
| 4 lock_reader, &lock_a) != 0) { |
| 5 perror("pthread_create"); |
| 6 exit(-1); |
| 7 } |
| 8 if (pthread_create(&tid2, NULL, |
| 9 lock_writer, &lock_b) != 0) { |
| 10 perror("pthread_create"); |
| 11 exit(-1); |
| 12 } |
| 13 if (pthread_join(tid1, &vp) != 0) { |
| 14 perror("pthread_join"); |
| 15 exit(-1); |
| 16 } |
| 17 if (pthread_join(tid2, &vp) != 0) { |
| 18 perror("pthread_join"); |
| 19 exit(-1); |
| 20 } |
| \end{verbatim} |
| } |
| \caption{Demonstration of Different Exclusive Locks} |
| \label{fig:toolsoftrade:Demonstration of Different Exclusive Locks} |
| \end{figure} |
| |
| Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks} |
| shows a similar code fragment, but this time using different locks: |
| \co{lock_a} for \co{lock_reader()} and \co{lock_b} for |
| \co{lock_writer()}. |
| The output of this code fragment is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| Creating two threads w/different locks: |
| lock_reader(): x = 0 |
| lock_reader(): x = 1 |
| lock_reader(): x = 2 |
| lock_reader(): x = 3 |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Because the two threads are using different locks, they do not exclude |
| each other, and can run concurrently. |
| The \co{lock_reader()} function can therefore see the intermediate |
| values of \co{x} stored by \co{lock_writer()}. |
| |
| \QuickQuiz{} |
| Using different locks could cause quite a bit of confusion, |
| what with threads seeing each others' intermediate states. |
| So should well-written parallel programs restrict themselves |
| to using a single lock in order to avoid this kind of confusion? |
| \QuickQuizAnswer{ |
| Although it is sometimes possible to write a program using a |
| single global lock that both performs and scales well, such |
| programs are exceptions to the rule. |
| You will normally need to use multiple locks to attain good |
| performance and scalability. |
| |
| One possible exception to this rule is ``transactional memory'', |
| which is currently a research topic. |
| Transactional-memory semantics can be loosely thought of as those |
| of a single global lock with optimizations permitted and |
| with the addition of rollback~\cite{HansJBoehm2009HOTPAR}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| In the code shown in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}, |
| is \co{lock_reader()} guaranteed to see all the values produced |
| by \co{lock_writer()}? |
| Why or why not? |
| \QuickQuizAnswer{ |
| No. |
| On a busy system, \co{lock_reader()} might be preempted |
| for the entire duration of \co{lock_writer()}'s execution, |
| in which case it would not see \emph{any} of \co{lock_writer()}'s |
| intermediate states for \co{x}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Wait a minute here!!! |
| Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock} |
| didn't initialize shared variable \co{x}, |
| so why does it need to be initialized in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}? |
| \QuickQuizAnswer{ |
| See line~3 of |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}. |
| Because the code in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock} |
| ran first, it could rely on the compile-time initialization of |
| \co{x}. |
| The code in |
| Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks} |
| ran next, so it had to re-initialize \co{x}. |
| } \QuickQuizEnd |
| |
| Although there is quite a bit more to POSIX exclusive locking, these |
| primitives provide a good start and are in fact sufficient in a great |
| many situations. |
| The next section takes a brief look at POSIX reader-writer locking. |
| |
| \subsection{POSIX Reader-Writer Locking} |
| \label{sec:toolsoftrade:POSIX Reader-Writer Locking} |
| |
| The POSIX API provides a reader-writer lock, which is represented by |
| a \co{pthread_rwlock_t}. |
| As with \co{pthread_mutex_t}, \co{pthread_rwlock_t} may be statically |
| initialized via \co{PTHREAD_RWLOCK_INITIALIZER} or dynamically |
| initialized via the \co{pthread_rwlock_init()} primitive. |
| The \co{pthread_rwlock_rdlock()} primitive read-acquires the |
| specified \co{pthread_rwlock_t}, the \co{pthread_rwlock_wrlock()} |
| primitive write-acquires it, and the \co{pthread_rwlock_unlock()} |
| primitive releases it. |
| Only a single thread may write-hold a given \co{pthread_rwlock_t} |
| at any given time, but multiple threads may read-hold a given |
| \co{pthread_rwlock_t}, at least while there is no thread |
| currently write-holding it. |
| |
| As you might expect, reader-writer locks are designed for read-mostly |
| situations. |
| In these situations, a reader-writer lock can provide greater scalability |
| than can an exclusive lock because the exclusive lock is by definition |
| limited to a single thread holding the lock at any given time, while |
| the reader-writer lock permits |
| an arbitrarily large number of readers to concurrently hold the lock. |
| However, in practice, we need to know how much additional scalability is |
| provided by reader-writer locks. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER; |
| 2 int holdtime = 0; |
| 3 int thinktime = 0; |
| 4 long long *readcounts; |
| 5 int nreadersrunning = 0; |
| 6 |
| 7 #define GOFLAG_INIT 0 |
| 8 #define GOFLAG_RUN 1 |
| 9 #define GOFLAG_STOP 2 |
| 10 char goflag = GOFLAG_INIT; |
| 11 |
| 12 void *reader(void *arg) |
| 13 { |
| 14 int i; |
| 15 long long loopcnt = 0; |
| 16 long me = (long)arg; |
| 17 |
| 18 __sync_fetch_and_add(&nreadersrunning, 1); |
| 19 while (ACCESS_ONCE(goflag) == GOFLAG_INIT) { |
| 20 continue; |
| 21 } |
| 22 while (ACCESS_ONCE(goflag) == GOFLAG_RUN) { |
| 23 if (pthread_rwlock_rdlock(&rwl) != 0) { |
| 24 perror("pthread_rwlock_rdlock"); |
| 25 exit(-1); |
| 26 } |
| 27 for (i = 1; i < holdtime; i++) { |
| 28 barrier(); |
| 29 } |
| 30 if (pthread_rwlock_unlock(&rwl) != 0) { |
| 31 perror("pthread_rwlock_unlock"); |
| 32 exit(-1); |
| 33 } |
| 34 for (i = 1; i < thinktime; i++) { |
| 35 barrier(); |
| 36 } |
| 37 loopcnt++; |
| 38 } |
| 39 readcounts[me] = loopcnt; |
| 40 return NULL; |
| 41 } |
| \end{verbatim} |
| } |
| \caption{Measuring Reader-Writer Lock Scalability} |
| \label{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability} |
| \end{figure} |
| |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability} |
| (\co{rwlockscale.c}) |
| shows one way of measuring reader-writer lock scalability. |
| Line~1 shows the definition and initialization of the reader-writer |
| lock, line~2 shows the \co{holdtime} argument controlling the |
| time each thread holds the reader-writer lock, |
| line~3 shows the \co{thinktime} argument controlling the time between |
| the release of the reader-writer lock and the next acquisition, |
| line~4 defines the \co{readcounts} array into which each reader thread |
| places the number of times it acquired the lock, and |
| line~5 defines the \co{nreadersrunning} variable, which |
| determines when all reader threads have started running. |
| |
| Lines~7-10 define \co{goflag}, which synchronizes the start and the |
| end of the test. |
| This variable is initially set to \co{GOFLAG_INIT}, then set to |
| \co{GOFLAG_RUN} after all the reader threads have started, and finally |
| set to \co{GOFLAG_STOP} to terminate the test run. |
| |
| Lines~12-41 define \co{reader()}, which is the reader thread. |
| Line~18 atomically increments the \co{nreadersrunning} variable |
| to indicate that this thread is now running, and |
| lines~19-21 wait for the test to start. |
| The \co{ACCESS_ONCE()} primitive forces the compiler to fetch \co{goflag} |
| on each pass through the loop---the compiler would otherwise be within its |
| rights to assume that the value of \co{goflag} would never change. |
| |
| \QuickQuiz{} |
| Instead of using \co{ACCESS_ONCE()} everywhere, why not just |
| declare \co{goflag} as \co{volatile} on line~10 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}? |
| \QuickQuizAnswer{ |
| A \co{volatile} declaration is in fact a reasonable alternative in |
| this particular case. |
| However, use of \co{ACCESS_ONCE()} has the benefit of clearly |
| flagging to the reader that \co{goflag} is subject to concurrent |
| reads and updates. |
| In addition, \co{ACCESS_ONCE()} is especially useful in cases where
| most of the accesses are protected by a lock (and thus \emph{not} |
| subject to change), but where a few of the accesses are made outside |
| of the lock. |
| Using a volatile declaration in this case would make it harder |
| for the reader to note the special accesses outside of the lock, |
| and would also make it harder for the compiler to generate good |
| code under the lock. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| \co{ACCESS_ONCE()} only affects the compiler, not the CPU. |
| Don't we also need memory barriers to make sure |
| that the change in \co{goflag}'s value propagates to the |
| CPU in a timely fashion in |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}? |
| \QuickQuizAnswer{ |
| No, memory barriers are not needed and won't help here. |
| Memory barriers only enforce ordering among multiple |
| memory references: They do absolutely nothing to expedite |
| the propagation of data from one part of the system to
| another. |
| This leads to a quick rule of thumb: You do not need |
| memory barriers unless you are using more than one |
| variable to communicate between multiple threads. |
| |
| But what about \co{nreadersrunning}? |
| Isn't that a second variable used for communication? |
| Indeed it is, and there really are the needed memory-barrier |
| instructions buried in \co{__sync_fetch_and_add()}, |
| which make sure that the thread proclaims its presence |
| before checking to see if it should start. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Would it ever be necessary to use \co{ACCESS_ONCE()} when accessing |
| a per-thread variable, for example, a variable declared using
| the \co{gcc} \co{__thread} storage class? |
| \QuickQuizAnswer{ |
| It depends. |
| If the per-thread variable was accessed only from its thread, |
| and never from a signal handler, then no.
| Otherwise, it is quite possible that \co{ACCESS_ONCE()} is needed. |
| We will see examples of both situations in |
| Section~\ref{sec:count:Signal-Theft Limit Counter Implementation}. |
| |
| This leads to the question of how one thread can gain access to |
| another thread's \co{__thread} variable, and the answer is that |
| the second thread must store a pointer to its \co{__thread} |
| variable somewhere that the first thread has access to.
| One common approach is to maintain a linked list with one |
| element per thread, and to store the address of each thread's |
| \co{__thread} variable in the corresponding element. |
| } \QuickQuizEnd |
| |
| The loop spanning lines~22-38 carries out the performance test. |
| Lines~23-26 acquire the lock, lines~27-29 hold the lock for the specified |
| duration (and the \co{barrier()} directive prevents the compiler from |
| optimizing the loop out of existence), lines~30-33 release the lock, |
| and lines~34-36 wait for the specified duration before re-acquiring the |
| lock. |
| Line~37 counts this lock acquisition. |
| |
| Line~39 moves the lock-acquisition count to this thread's element of the |
| \co{readcounts[]} array, and line~40 returns, terminating this thread. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{CodeSamples/toolsoftrade/rwlockscale}} |
| \end{center} |
| \caption{Reader-Writer Lock Scalability} |
| \label{fig:intro:Reader-Writer Lock Scalability} |
| \end{figure} |
| |
| Figure~\ref{fig:intro:Reader-Writer Lock Scalability} |
| shows the results of running this test on a 64-core Power-5 system |
| with two hardware threads per core for a total of 128 software-visible |
| CPUs. |
| The \co{thinktime} parameter was zero for all these tests, and the |
| \co{holdtime} parameter was set to values ranging from one thousand (``1K''
| on the graph) to 100 million (``100M'' on the graph). |
| The actual value plotted is: |
| \begin{equation} |
| \frac{L_N}{N L_1} |
| \end{equation} |
| where $N$ is the number of threads, |
| $L_N$ is the number of lock acquisitions by $N$ threads, and |
| $L_1$ is the number of lock acquisitions by a single thread. |
| Given ideal hardware and software scalability, this value will always |
| be 1.0. |
| |
| As can be seen in the figure, reader-writer locking scalability is |
| decidedly non-ideal, especially for smaller sizes of critical |
| sections. |
| To see why read-acquisition can be so slow, consider |
| that all the acquiring threads must update the \co{pthread_rwlock_t} |
| data structure. |
| Therefore, if all 128 executing threads attempt to |
| read-acquire the reader-writer lock concurrently, they must update |
| this underlying \co{pthread_rwlock_t} one at a time. |
| One lucky thread might do so almost immediately, but the least-lucky |
| thread must wait for all the other 127 threads to do their updates. |
| This situation will only get worse as you add CPUs. |
| |
| \QuickQuiz{} |
| Isn't comparing against single-CPU throughput a bit harsh? |
| \QuickQuizAnswer{ |
| Not at all. |
| In fact, this comparison was, if anything, overly lenient. |
| A more balanced comparison would be against single-CPU |
| throughput with the locking primitives commented out. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But 1,000 instructions is not a particularly small size for |
| a critical section. |
| What do I do if I need a much smaller critical section, for |
| example, one containing only a few tens of instructions? |
| \QuickQuizAnswer{ |
| If the data being read \emph{never} changes, then you do not |
| need to hold any locks while accessing it. |
| If the data changes sufficiently infrequently, you might be |
| able to checkpoint execution, terminate all threads, change |
| the data, then restart at the checkpoint. |
| |
| Another approach is to keep a single exclusive lock per |
| thread, so that a thread read-acquires the larger aggregate |
| reader-writer lock by acquiring its own lock, and write-acquires |
| by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}. |
| This can work quite well for readers, but causes writers |
| to incur increasingly large overheads as the number of threads |
| increases. |
| |
| Some other ways of handling very small critical sections are |
| described in Section~\ref{sec:defer:Read-Copy Update (RCU)}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| In |
| Figure~\ref{fig:intro:Reader-Writer Lock Scalability}, |
| all of the traces other than the 100M trace deviate gently |
| from the ideal line. |
| In contrast, the 100M trace breaks sharply from the ideal |
| line at 64 CPUs. |
| In addition, the spacing between the 100M trace and the 10M |
| trace is much smaller than that between the 10M trace and the |
| 1M trace. |
| Why does the 100M trace behave so much differently than the |
| other traces? |
| \QuickQuizAnswer{ |
| Your first clue is that 64 CPUs is exactly half of the 128 |
| CPUs on the machine. |
| The difference is an artifact of hardware threading. |
| This system has 64 cores with two hardware threads per core. |
| As long as fewer than 64 threads are running, each can run |
| in its own core. |
| But as soon as there are more than 64 threads, some of the threads |
| must share cores. |
| Because the pair of threads in any given core share some hardware |
| resources, the throughput of two threads sharing a core is not |
| quite as high as that of two threads each in their own core. |
| So the performance of the 100M trace is limited not by the |
| reader-writer lock, but rather by the sharing of hardware resources |
| between hardware threads in a single core. |
| |
| This can also be seen in the 10M trace, which deviates gently from |
| the ideal line up to 64 threads, then breaks sharply down, parallel |
| to the 100M trace. |
| Up to 64 threads, the 10M trace is limited primarily by reader-writer |
| lock scalability, and beyond that, also by sharing of hardware |
| resources between hardware threads in a single core. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Power-5 is several years old, and new hardware should |
| be faster. |
| So why should anyone worry about reader-writer locks being slow? |
| \QuickQuizAnswer{ |
| In general, newer hardware is improving. |
| However, it will need to improve more than two orders of magnitude |
| to permit reader-writer locks to achieve ideal performance on
| 128 CPUs. |
| Worse yet, the greater the number of CPUs, the larger the |
| required performance improvement. |
| The performance problems of reader-writer locking are therefore |
| very likely to be with us for quite some time to come. |
| } \QuickQuizEnd |
| |
| Despite these limitations, reader-writer locking is quite useful in many |
| cases, for example, when the readers must do high-latency file or network I/O.
| There are alternatives, some of which will be presented in |
| Chapters~\ref{chp:Counting} and \ref{chp:Deferred Processing}. |
| |
| \section{Atomic Operations} |
| \label{sec:toolsoftrade:Atomic Operations} |
| |
| Given that |
| Figure~\ref{fig:intro:Reader-Writer Lock Scalability} |
| shows that the overhead of reader-writer locking is most severe for the |
| smallest critical sections, it would be nice to have some other way |
| to protect the tiniest of critical sections. |
| One such way is to use atomic operations.
| We have seen one atomic operation already, in the form of the
| \co{__sync_fetch_and_add()} primitive on line~18 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}. |
| This primitive atomically adds the value of its second argument to |
| the value referenced by its first argument, returning the old value |
| (which was ignored in this case). |
| If a pair of threads concurrently execute \co{__sync_fetch_and_add()} on |
| the same variable, the resulting value of the variable will include |
| the result of both additions. |
| |
| The {\sf gcc} compiler offers a number of additional atomic operations, |
| including \co{__sync_fetch_and_sub()}, |
| \co{__sync_fetch_and_or()}, |
| \co{__sync_fetch_and_and()}, |
| \co{__sync_fetch_and_xor()}, and |
| \co{__sync_fetch_and_nand()}, all of which return the old value. |
| If you instead need the new value, you can use the
| \co{__sync_add_and_fetch()}, |
| \co{__sync_sub_and_fetch()}, |
| \co{__sync_or_and_fetch()}, |
| \co{__sync_and_and_fetch()}, |
| \co{__sync_xor_and_fetch()}, and |
| \co{__sync_nand_and_fetch()} primitives. |
| |
| \QuickQuiz{} |
| Is it really necessary to have both sets of primitives? |
| \QuickQuizAnswer{ |
| Strictly speaking, no. |
| One could implement any member of the second set using the |
| corresponding member of the first set. |
| For example, one could implement \co{__sync_nand_and_fetch()} |
| in terms of \co{__sync_fetch_and_nand()} as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| tmp = v; |
| ret = __sync_fetch_and_nand(p, tmp); |
| ret = ~ret & tmp; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is similarly possible to implement \co{__sync_fetch_and_add()}, |
| \co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()} |
| in terms of their post-value counterparts. |
| |
| However, the alternative forms can be quite convenient, both |
| for the programmer and for the compiler/library implementor. |
| } \QuickQuizEnd |
| |
| The classic compare-and-swap operation is provided by a pair of |
| primitives, \co{__sync_bool_compare_and_swap()} and |
| \co{__sync_val_compare_and_swap()}. |
| Both of these primitives atomically update a location to a new value,
| but only if its prior value was equal to the specified old value. |
| The first variant returns 1 if the operation succeeded and 0 if it |
| failed, for example, if the prior value was not equal to the specified |
| old value. |
| The second variant returns the prior value of the location, which, if |
| equal to the specified old value, indicates that the operation succeeded. |
| Either of these compare-and-swap operations is ``universal'' in the sense
| that any atomic operation on a single location can be implemented in |
| terms of compare-and-swap, though the earlier operations are often |
| more efficient where they apply. |
| The compare-and-swap operation is also capable of serving as the basis |
| for a wider set of atomic operations, though the more elaborate of |
| these often suffer from complexity, scalability, and performance |
| problems~\cite{MauriceHerlihy90a}. |
| |
| The \co{__sync_synchronize()} primitive issues a ``memory barrier'', |
| which constrains both the compiler's and the CPU's ability to reorder |
| operations, as discussed in Section~\ref{sec:advsync:Memory Barriers}. |
| In some cases, it is sufficient to constrain the compiler's ability |
| to reorder operations, while allowing the CPU free rein, in which |
| case the \co{barrier()} primitive may be used, as it in fact was |
| on line~28 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}. |
| In some cases, it is only necessary to ensure that the compiler |
| avoids optimizing away a given memory access, in which case the |
| \co{ACCESS_ONCE()} primitive may be used, as it was on line~17 of |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}. |
| These last two primitives are not provided directly by gcc, |
| but may be implemented straightforwardly as follows: |
| |
| \vspace{5pt} |
| { |
| \scriptsize |
| \begin{verbatim} |
| #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) |
| #define barrier() __asm__ __volatile__("": : :"memory") |
| \end{verbatim} |
| } |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| Given that these atomic operations will often be able to |
| generate single atomic instructions that are directly |
| supported by the underlying instruction set, shouldn't |
| they be the fastest possible way to get things done? |
| \QuickQuizAnswer{ |
| Unfortunately, no. |
| See Chapter~\ref{chp:Counting} for some stark counterexamples. |
| } \QuickQuizEnd |
| |
| \section{Linux-Kernel Equivalents to POSIX Operations} |
| \label{sec:toolsoftrade:Linux-Kernel Equivalents to POSIX Operations} |
| |
| \begin{table*} |
| \scriptsize |
| \begin{center} |
| \begin{tabular}{l|l|l} |
| Category & POSIX & Linux Kernel \\ |
| \hline |
| \hline |
| Thread Management |
| & \co{pthread_t} |
| & \co{struct task_struct} \\ |
| \cline{2-3} |
| & \co{pthread_create()} |
| & \co{kthread_create()} \\
| \cline{2-3} |
| & \co{pthread_exit()} |
| & \co{kthread_should_stop()}~~~~~(rough) \\ |
| \cline{2-3} |
| & \co{pthread_join()} |
| & \co{kthread_stop()}~~~(rough) \\ |
| \cline{2-3} |
| & \co{poll(NULL, 0, 5)} |
| & \co{schedule_timeout_interruptible()} ~~~ \\ |
| \hline |
| \hline |
| POSIX Locking |
| & \co{pthread_mutex_t} |
| & \co{spinlock_t}~~~(rough) \\ |
| & & \co{struct mutex} \\ |
| \cline{2-3} |
| & \co{PTHREAD_MUTEX_INITIALIZER} |
| & \co{DEFINE_SPINLOCK()} \\ |
| & & \co{DEFINE_MUTEX()} \\ |
| \cline{2-3} |
| & \co{pthread_mutex_lock()} |
| & \co{spin_lock()}~~~(and friends) \\ |
| & & \co{mutex_lock()}~~~(and friends) \\ |
| \cline{2-3} |
| & \co{pthread_mutex_unlock()} |
| & \co{spin_unlock()}~~~(and friends) \\ |
| & & \co{mutex_unlock()} \\ |
| \hline |
| \hline |
| POSIX Reader-Writer |
| & \co{pthread_rwlock_t} |
| & \co{rwlock_t}~~~(rough) \\ |
| Locking & & \co{struct rw_semaphore} \\ |
| \cline{2-3} |
| & \co{PTHREAD_RWLOCK_INITIALIZER} |
| & \co{DEFINE_RWLOCK()} \\ |
| & & \co{DECLARE_RWSEM()} \\ |
| \cline{2-3} |
| & \co{pthread_rwlock_rdlock()} |
| & \co{read_lock()}~~~(and friends) \\ |
| & & \co{down_read()}~~~(and friends) \\ |
| \cline{2-3} |
| & \co{pthread_rwlock_unlock()} |
| & \co{read_unlock()}~~~(and friends) \\ |
| & & \co{up_read()} \\ |
| \cline{2-3} |
| & \co{pthread_rwlock_wrlock()} |
| & \co{write_lock()}~~~(and friends) \\ |
| & & \co{down_write()}~~~(and friends) \\ |
| \cline{2-3} |
| & \co{pthread_rwlock_unlock()} |
| & \co{write_unlock()}~~~(and friends) \\ |
| & & \co{up_write()} \\ |
| \hline |
| \hline |
| Atomic Operations |
| & C Scalar Types |
| & \co{atomic_t} \\ |
| & & \co{atomic64_t} \\ |
| \cline{2-3} |
| & \co{__sync_fetch_and_add()} |
| & \co{atomic_add_return()} \\ |
| & & \co{atomic64_add_return()} \\ |
| \cline{2-3} |
| & \co{__sync_fetch_and_sub()} |
| & \co{atomic_sub_return()} \\ |
| & & \co{atomic64_sub_return()} \\ |
| \cline{2-3} |
| & \co{__sync_val_compare_and_swap()} ~~~~~~~~~~ |
| & \co{cmpxchg()} \\ |
| \cline{2-3} |
| & \co{__sync_lock_test_and_set()} |
| & \co{xchg()}~~~(rough) \\ |
| \cline{2-3} |
| & \co{__sync_synchronize()} |
| & \co{smp_mb()} \\ |
| \end{tabular} |
| \end{center} |
| \caption{Mapping from POSIX to Linux-Kernel Primitives} |
| \label{tab:advsync:Mapping from POSIX to Linux-Kernel Primitives} |
| \end{table*} |
| |
| Unfortunately, threading operations, locking primitives, and atomic |
| operations were in reasonably wide use long before the various standards |
| committees got around to them. |
| As a result, there is considerable variation in how these operations |
| are supported. |
| It is still quite common to find these operations implemented in |
| assembly language, either for historical reasons or to obtain better |
| performance in specialized circumstances. |
| For example, the gcc \co{__sync_} family of primitives all |
| provide memory-ordering semantics, motivating many developers to |
| create their own implementations for situations where the memory |
| ordering semantics are not required. |
| |
| Therefore, |
| Table~\ref{tab:advsync:Mapping from POSIX to Linux-Kernel Primitives} on |
| page~\pageref{tab:advsync:Mapping from POSIX to Linux-Kernel Primitives} |
| provides a rough mapping between the POSIX and |
| gcc primitives to those used in the Linux kernel. |
| Exact mappings are not always available; for example, the Linux kernel
| has a wide variety of locking primitives, while gcc has a number |
| of atomic operations that are not directly available in the Linux kernel. |
| Of course, on the one hand, user-level code does not need the Linux |
| kernel's wide array of locking primitives, while on the other hand, |
| gcc's atomic operations can be emulated reasonably straightforwardly |
| using \co{cmpxchg()}. |
| |
| \QuickQuiz{} |
| What happened to the Linux-kernel equivalents to \co{fork()} |
| and \co{join()}? |
| \QuickQuizAnswer{ |
| They don't really exist. |
| All tasks executing within the Linux kernel share memory, |
| at least unless you want to do a huge amount of memory-mapping |
| work by hand. |
| } \QuickQuizEnd |
| |
| \section{The Right Tool for the Job: How to Choose?} |
| \label{sec:toolsoftrade:The Right Tool for the Job: How to Choose?} |
| |
| As a rough rule of thumb, use the simplest tool that will get the job done. |
| If you can, simply program sequentially. |
| If that is insufficient, try using a shell script to mediate parallelism. |
| If the resulting shell-script \co{fork()}/\co{exec()} overhead |
| (about 480 microseconds for a minimal C program on an Intel Core Duo |
| laptop) is too |
| large, try using the C-language \co{fork()} and \co{wait()} primitives. |
| If the overhead of these primitives (about 80 microseconds for a minimal |
| child process) is still too large, then you |
| might need to use the POSIX threading primitives, choosing the appropriate |
| locking and/or atomic-operation primitives. |
| If the overhead of the POSIX threading primitives (typically sub-microsecond) |
| is too great, then the primitives introduced in |
| Chapter~\ref{chp:Deferred Processing} may be required. |
| Always remember that inter-process communication and message-passing |
| can be good alternatives to shared-memory multithreaded execution. |
| |
| \QuickQuiz{} |
| Wouldn't the shell normally use \co{vfork()} rather than |
| \co{fork()}? |
| \QuickQuizAnswer{ |
| It might well do that, however, checking is left as an exercise |
| for the reader. |
| But in the meantime, I hope that we can agree that \co{vfork()}
| is a variant of \co{fork()}, so that we can use \co{fork()} |
| as a generic term covering both. |
| } \QuickQuizEnd |
| |
| Of course, the actual overheads will depend not only on your hardware, |
| but most critically on the manner in which you use the primitives. |
| Therefore, it is necessary to make the right design choices as well as |
| the correct choice of individual primitives, |
| as is discussed at length in subsequent chapters. |